Hive Default Partitioning

With bucketing, the user chooses only the bucketing column and the number of buckets. Which bucket a row lands in is decided by the hash of its column value.
Bucketing is mainly used when we want to sample data. It is also used when a column has an open-ended or very large set of values and the data cannot sensibly be partitioned on it.
Bucketing uses the same idea MapReduce uses to create partitions: hash the key, then take the result modulo the number of partitions.
Bucketing on multiple columns is possible. The columns together are taken as the key.
Formula: bucket = hash(key) mod (number of buckets)
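
As a rough illustration of the formula, the queries below compute a bucket number with Hive's built-in hash() and pmod() functions. This is a sketch, not necessarily the exact computation Hive performs internally: the hash used for bucketing can differ by Hive version and column type. The result here is 0-based, while bucket numbers in TABLESAMPLE start at 1. The table name bucket_patient_multi is hypothetical and only shows a two-column bucketing key.

```sql
-- Approximate bucket (0..3) each drug value would map to with 4 buckets.
SELECT DISTINCT drug, pmod(hash(drug), 4) AS bucket_id
FROM patient;

-- Hypothetical table bucketed on two columns: (drug, gender) together form the key.
CREATE TABLE bucket_patient_multi(pid int, pname string, drug string, gender string, amt int)
CLUSTERED BY(drug, gender) INTO 4 BUCKETS;

-- Approximate bucket for each (drug, gender) pair; hash() accepts several arguments.
SELECT DISTINCT drug, gender, pmod(hash(drug, gender), 4) AS bucket_id
FROM patient;
```

Rows with the same key always get the same bucket_id, which is why sampling one bucket returns every row for the drugs that hash into it.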

INFO

Create the table in the Hive database from the Hive CLI.

Spark does not write Hive-compatible bucketed tables (Spark's own bucketing uses a different hash function), so create and load the bucketed table from the Hive CLI.

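-- Bucketed table: rows are distributed into 4 buckets by the hash of drug.
CREATE TABLE bucket_patient(pid int, pname string, drug string, gender string, amt int)
CLUSTERED BY(drug) INTO 4 BUCKETS;

-- Load from the unbucketed patient table. On Hive versions before 2.0,
-- run SET hive.enforce.bucketing = true; first so the insert honors the buckets.
INSERT OVERWRITE TABLE bucket_patient SELECT * FROM patient;

-- Read only the first of the 4 buckets.
SELECT * FROM bucket_patient TABLESAMPLE(BUCKET 1 OUT OF 4 ON drug);

-- Distinct drugs that hashed into bucket 1.
SELECT DISTINCT(drug) FROM bucket_patient TABLESAMPLE(BUCKET 1 OUT OF 4 ON drug);

-- Block sampling: read roughly 50 percent of the data by size.
SELECT * FROM bucket_patient TABLESAMPLE(50 PERCENT);

-- First 5 rows, no sampling.
SELECT * FROM bucket_patient LIMIT 5;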
