Spark Joins

Joins use more shuffles than any other process in Spark. It always leads to bad performance

Broadcast Hash Join (Map Side Join/ Replicated Join)

Used when we want to compare a small and large table
The entire small table is copied into each partition location of the large table
The table joining happens in the mapper stage itself
Default: 10 Mb (Max size of small table)

# Force broadcast an table
empDF.join(broadcast(deptDF), empDF.emp_dept_id == deptDF.dept_id, "inner").show()

Shuffle Hash Join

Used when we join 2 large tables
The partitions are shuffled to get similar data together (Done for both table)
And then this shuffled partitions are joined together
The values that going to be in a partition after shuffle is decided by hash function
Used for non-indexed data

If we try to join normally for similar data we need to access all the partitions and then are join and reduced into a new partition

Sort Merge Join

Used to join very large tables
Groups the same keys to the same partition
It is the default sort strategy in Spark since v2.3
The join key needs to be orderable (sortable) for this algorithm to work

Joins in Spark SQL- Shuffle Hash, Sort Merge, BroadCast -24 Tutorials