Shuffle is data redistribution across executors, triggered by joins, groupBy, distinct, and repartition. It is typically the most expensive operation in a distributed job.
[Diagram: Executor 1 ⇄ Network Transfer ⇄ Executor 2]
During a shuffle, every executor may need to send data to every other executor. With 200 map-side and 200 reduce-side partitions, that is 200 × 200 = 40,000 potential partition-to-partition transfers.
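The routing rule behind those transfers can be sketched in plain Python. This uses CRC32 as a stand-in for Spark's internal Murmur3 hash, and `partition_for` is an illustrative helper, not a Spark API:

```python
from zlib import crc32

def partition_for(key: str, num_partitions: int = 200) -> int:
    # Hash partitioning: the key's hash modulo the partition count
    # decides which reduce-side partition receives the record.
    return crc32(key.encode()) % num_partitions

# Every row with a given key lands in the same partition, so every
# executor holding rows for that key must ship them to one place.
print(partition_for("customer_42") == partition_for("customer_42"))  # True
```

Because the mapping is deterministic, co-locating matching keys requires physically moving rows between nodes, which is exactly what makes shuffle expensive.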
⚠️ Why Shuffle Is Expensive
Disk I/O — spills to local disk when memory is full.
Network I/O — cross-node data transfer.
Serialization — encode/decode overhead.
GC pressure — large intermediate buffers.
Reducing Shuffle in Spark
Every avoided shuffle saves compute, network, and time. Here are the strategies:
📡 Broadcast Small Tables
If one join side is smaller than spark.sql.autoBroadcastJoinThreshold (10 MB by default), Spark broadcasts it to every executor, so the large side is never shuffled.
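The cutoff can be raised when a dimension table is slightly over the default; a config sketch, assuming an active SparkSession named `spark` (the 50 MB figure is an illustrative choice, not a recommendation):

```python
# Raise the broadcast cutoff to 50 MB (Spark's default is 10 MB).
# Only do this if executors have memory headroom for the broadcast copy.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))
```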
🔽 Filter Before Join
Apply WHERE clauses before joins. Less input = less data to shuffle. Predicate pushdown helps here.
📊 Pre-Partition Data
If both tables are bucketed on the join key with the same bucket count, Spark skips the shuffle entirely.
🎯 Tune Partitions
Set spark.sql.shuffle.partitions based on data size. Default 200 is often too many or too few. Use AQE for auto-tuning.
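One way to pick a value is a back-of-envelope target of shuffle bytes per partition; the 128 MB target and the `shuffle_partitions` helper below are illustrative assumptions, not Spark defaults:

```python
import math

# Rough heuristic (an assumption): aim for ~128 MB of shuffle data per partition.
TARGET_PARTITION_BYTES = 128 * 1024 * 1024

def shuffle_partitions(total_shuffle_bytes: int) -> int:
    # Round up so no partition exceeds the target size.
    return max(1, math.ceil(total_shuffle_bytes / TARGET_PARTITION_BYTES))

# 50 GB of shuffle data suggests ~400 partitions, double the default 200.
print(shuffle_partitions(50 * 1024**3))  # 400
```

The result would feed spark.sql.shuffle.partitions, or simply be left to AQE when partition coalescing is enabled.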
# Broadcast join — eliminates shuffle on the large table
from pyspark.sql.functions import broadcast
result = large_df.join(
broadcast(small_df), # small_df sent to all executors
on="customer_id"
)
# AQE auto-coalesces shuffle partitions
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
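In Spark 3.x, AQE can also split oversized shuffle partitions at join time; a config sketch, again assuming an active SparkSession named `spark`:

```python
# AQE detects skewed shuffle partitions at runtime and splits them
# into smaller tasks, so one hot partition no longer stalls the stage.
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```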
Detecting Data Skew
Data skew happens when one partition holds 100x more data than the others. The whole stage then waits on a single straggler task while the slots that finished the other 199 partitions sit idle.
[Diagram: eight partitions P1–P8, with P5 far larger than the rest]
Partition 5 has 95% of the data — the entire job waits for it to finish.
💡 How to Detect Skew
Check the Spark UI: the stage's task list shows one task taking 10-100x longer than the median, and its Shuffle Read Size is far larger than the rest. You can also inspect key frequencies directly with df.groupBy("key").count().orderBy(desc("count")).
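The same check can be mimicked outside Spark; a minimal plain-Python sketch that flags a hot key from per-key counts (`skew_ratio` and the sample counts are illustrative, not a Spark API):

```python
from collections import Counter

# Simulated per-key row counts, as df.groupBy("key").count() would return.
key_counts = Counter({"null": 9_500_000, "a": 120_000, "b": 95_000, "c": 110_000})

def skew_ratio(counts: Counter) -> float:
    # Ratio of the hottest key to the average key: values far above 1 signal skew.
    avg = sum(counts.values()) / len(counts)
    return max(counts.values()) / avg

print(skew_ratio(key_counts) > 3)  # True: "null" dominates the distribution
```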
Fixing Skew: Salting
Salting breaks one hot key into N sub-keys, spreading data across partitions. After the join, you aggregate to remove the salt.
key = "null" (10M rows) → add salt 0-9 → null_0 … null_9 (~1M rows each)
# Salting a skewed join key
from pyspark.sql.functions import col, lit, rand, floor, concat, explode, array

salt_buckets = 10

# Add random salt to the skewed (large) side
large_salted = large_df.withColumn(
"salt", floor(rand() * salt_buckets).cast("int")
).withColumn(
"salted_key", concat(col("join_key"), lit("_"), col("salt"))
)
# Explode small side to match all salt values
small_exploded = small_df.withColumn(
"salt", explode(array([lit(i) for i in range(salt_buckets)]))
).withColumn(
"salted_key", concat(col("join_key"), lit("_"), col("salt"))
)
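The effect of the PySpark snippet above can be checked end to end in plain Python; this is a simulation with made-up row counts, not Spark code:

```python
import random
from collections import Counter

random.seed(42)
SALT_BUCKETS = 10

# Large side: one hot key ("null") carrying all 10,000 simulated rows.
large = [("null", i) for i in range(10_000)]

# Salt the large side: append a random 0-9 suffix to the hot key.
large_salted = [(f"{k}_{random.randrange(SALT_BUCKETS)}", v) for k, v in large]

# Small side: replicate its single row once per salt value so every
# salted key still finds its match in the join.
small = [("null", "dim_row")]
small_exploded = [(f"{k}_{s}", d) for k, d in small for s in range(SALT_BUCKETS)]

# The hot key is now spread over 10 sub-keys of roughly equal size,
# so no single partition carries all the rows.
dist = Counter(k for k, _ in large_salted)
print(len(dist))  # 10 sub-keys
```

After the real join on salted_key, you would drop the salt columns (or aggregate over them) to recover the original key.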
Join Strategies Compared
Choosing the right join strategy can mean the difference between a $50 job and a $5,000 job.
📡 Broadcast Hash Join
✅ No shuffle on large side
✅ Fastest for small + large joins
⚠️ Small side must fit in memory
Default threshold: 10 MB
Best: dimension lookups
🔀 Sort-Merge Join
✅ Handles any data size
✅ Memory-efficient (streaming)
⚠️ Requires shuffle + sort on both sides
Spark's default for large-large
Best: large + large equi-joins
🔄 Shuffle Hash Join
✅ Can beat SMJ when the sort cost dominates
✅ No sort required
⚠️ Builds hash table in memory
Enable via spark.sql.join.preferSortMergeJoin=false
Best: medium tables with high cardinality
🎯 Filter & Column Pruning
✅ Reduces data before any join
✅ Column pruning: read only needed cols
✅ Filter pushdown: fewer rows enter shuffle
Applies to ALL join types
Best: always — do this first
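The comparison can be condensed into a toy decision helper; the thresholds and `pick_join_strategy` are illustrative assumptions, not the planner's actual logic:

```python
BROADCAST_THRESHOLD = 10 * 1024 * 1024  # Spark's default: 10 MB

def pick_join_strategy(smaller_side_bytes: int, prefer_sort_merge: bool = True) -> str:
    # Broadcast wins whenever one side fits under the threshold;
    # otherwise default to sort-merge unless hash joins are preferred.
    if smaller_side_bytes <= BROADCAST_THRESHOLD:
        return "broadcast_hash_join"
    return "sort_merge_join" if prefer_sort_merge else "shuffle_hash_join"

print(pick_join_strategy(5 * 1024 * 1024))  # broadcast_hash_join
print(pick_join_strategy(500 * 1024**3))    # sort_merge_join
```

Whatever the strategy, filtering and column pruning happen first in a real plan, which is why the last card applies to all join types.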
Quiz: Test Yourself
Q1: Why is shuffle the most expensive operation in Spark?
Q2: You join a 500 GB fact table with a 5 MB dimension table. Best join strategy?
Q3: 80% of your join key values are NULL, causing extreme skew. Best fix?
Q4: What should you always do BEFORE optimizing join strategy?
Q5: What does Adaptive Query Execution (AQE) do for shuffle optimization?