The Bottleneck Triage Mindset

When a Spark job is slow, don't guess — follow a systematic triage process. The first step is always classification. Every performance problem falls into one of three buckets. Knowing which bucket you're in determines what you investigate next.

🧑‍💻 Code Issue

Excessive shuffle, data skew, wrong join strategies, repeated full-table scans, Python UDF serialization overhead. These are problems in YOUR code or query plan.

🖥️ Cluster Issue

High executor pending time, slow dynamic allocation, low cluster utilization, node-level memory or CPU pressure, YARN queue starvation. These are infrastructure problems — your code might be fine.

🔀 Mixed Issue

High shuffle combined with spill, GC pressure, or OOM errors. Code inefficiency amplified by insufficient cluster resources. Needs fixes on both sides.

💡 Aha!

In interviews, always classify the problem FIRST, then dive into metrics. Saying "Let me first determine if this is a code issue, cluster issue, or both" shows structured thinking and instantly sets you apart from candidates who jump straight to random config tuning.

Priority Checklist — What to Check First

When debugging a slow Spark job, check these metrics in this exact order. Each metric either confirms or eliminates an entire category of problems, letting you converge on the root cause fast.

| # | Metric to Check | Why Check This First |
| --- | --- | --- |
| 1 | Slowest stage runtime | Finds the real bottleneck — ignore fast stages, focus all effort on the stage consuming the most wall-clock time |
| 2 | Shuffle read/write volume | Tells you whether data movement across the network is the dominant cost — if shuffle is low, look elsewhere |
| 3 | Max vs median task time | Detects skew immediately — if max is 10x median, one partition has way more data |
| 4 | GC time percentage | Detects memory pressure — if GC > 10% of task time, the JVM is struggling to manage heap |
| 5 | Spill to disk | Detects memory shortage or oversized partitions — disk I/O is orders of magnitude slower than memory |
| 6 | Scheduler delay / pending executors | Separates cluster wait from actual execution — high delay means resources aren't available, not that code is slow |
| 7 | Join strategy in SQL plan | Finds bad query plans — a SortMergeJoin on a tiny table means a missed broadcast opportunity |
| 8 | Input size / records read | Shows whether too much data is being scanned — if you only need 1 day but read 365 days, that's the problem |

💡 Aha!

Memorize this priority list. When an interviewer asks "Your Spark job is slow — walk me through how you'd debug it," recite this checklist top to bottom. It demonstrates a methodical, production-tested debugging approach rather than random guessing.
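As a quick mental model, checks 3, 4, and 5 from the list above can be sketched as a small triage function over per-task metrics. This is a hypothetical helper, not a Spark API; in practice you read these numbers from the Spark UI Stages tab or its REST API.

```python
from statistics import median

def triage_stage(task_times_s, gc_times_s, spill_bytes):
    """Toy triage over one stage's per-task metrics.

    task_times_s: wall-clock seconds per task
    gc_times_s:   JVM GC seconds per task
    spill_bytes:  bytes spilled to disk per task
    """
    findings = []

    # Check 3: max vs median task time. 10x+ suggests skew.
    ratio = max(task_times_s) / median(task_times_s)
    if ratio >= 10:
        findings.append(f"skew: max/median task time = {ratio:.0f}x")

    # Check 4: GC percentage. Over 10% of task time means heap pressure.
    gc_pct = 100 * sum(gc_times_s) / sum(task_times_s)
    if gc_pct > 10:
        findings.append(f"memory pressure: GC = {gc_pct:.0f}% of task time")

    # Check 5: any spill at all points at a memory shortage.
    if sum(spill_bytes) > 0:
        findings.append("spill: tasks are overflowing execution memory")

    return findings
```

Feed it the 99-fast-tasks-1-slow-task scenario and it flags skew; feed it uniform timings and it returns nothing, which tells you to move down the checklist.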

Shuffle & Join Optimization

Shuffle is the #1 performance killer in Spark. Every shuffle means serializing data, writing to disk, transferring across the network, and deserializing on the other side. The join strategy you choose directly controls how much shuffle happens.

📡 Broadcast Hash Join

Small table (< 10MB) is broadcast to every executor. The large table stays put — zero shuffle. Fastest join when one side is small enough.

🔄 Sort-Merge Join

Default for large-large joins. Both sides are sorted and repartitioned by the join key. Reliable but expensive — requires a full shuffle of both datasets.

🗂️ Shuffle Hash Join

One side fits in memory. Data is shuffled by key, then a hash table is built in memory on the smaller side. Faster than sort-merge when one side is moderately small.

```python
from pyspark.sql.functions import broadcast

# Force a broadcast join: the broadcast() hint tells Spark
# "send df_small to every executor in full", so the join on
# customer_id needs no shuffle
result = df_large.join(
    broadcast(df_small),
    "customer_id"
)

# Alternatively, raise the auto-broadcast threshold so Spark
# broadcasts any table under 20MB automatically
spark.conf.set(
    "spark.sql.autoBroadcastJoinThreshold",
    "20m"  # 20MB
)
```
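To internalize the decision between the three join strategies, here is a toy sketch of the size-based choice. The shuffle-hash cutoff here is picked purely for illustration; real Catalyst planning also weighs hints, join type, and per-partition (not per-table) sizes.

```python
def pick_join_strategy(left_bytes, right_bytes,
                       broadcast_threshold=10 * 1024 * 1024):
    """Toy version of the size-based choice Spark's planner makes.

    Hypothetical simplification for intuition only.
    """
    smaller = min(left_bytes, right_bytes)
    if smaller <= broadcast_threshold:
        # One side is tiny: ship it to every executor, zero shuffle
        return "BroadcastHashJoin"
    if smaller <= 50 * broadcast_threshold:
        # Moderately small side: shuffle both, hash the smaller side
        return "ShuffledHashJoin"  # assumed cutoff, for illustration
    # Both sides large: sort and merge, full shuffle of both
    return "SortMergeJoin"
```

The point of the sketch: the planner only sees estimated sizes, and estimates go stale after filters, which is exactly why the explicit broadcast() hint exists.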
💡 Aha!

Rule of thumb: if one table is < 10MB, always broadcast it. The default spark.sql.autoBroadcastJoinThreshold is 10MB. If you're certain a table is small, use the broadcast() hint to force it — Catalyst's size estimates can be wrong, especially after filters.

Data Skew — Detection & Fixes

Data skew means a few partitions have vastly more data than others. The result: a few tasks run for 15 minutes while 99 others finish in 10 seconds. Your job is only as fast as its slowest task.

🔍 Detect

In the Spark UI Stages tab, compare Max task time vs Median. If max is 10x+ the median, you have skew. Also check "Shuffle Read Size" per task for uneven distribution.

🧂 Salt the Key

Add a random prefix (0-N) to the skewed join key on both sides, join on the salted key, then remove the prefix. Spreads one hot partition across N partitions.

⚡ AQE Skew Join

Set spark.sql.adaptive.skewJoin.enabled=true — Spark 3.0+ automatically detects and splits skewed partitions at runtime. No code changes needed.

🔀 Isolated Join

Filter out skewed keys, broadcast-join them separately (since they're few keys with lots of rows), then union the result back with the normal join. Surgical fix.

```python
from pyspark.sql.functions import (
    col, concat, lit, floor, rand, explode, array
)

# Step 1: Salt the skewed table. Each hot key gets split into
# N smaller partitions by appending a random number (0-9) to
# each join key: "hot_key" -> "hot_key_7"
N = 10  # number of salt buckets
df_big_salted = df_big.withColumn(
    "salted_key",
    concat(col("join_key"), lit("_"),
           floor(rand() * N).cast("string"))
)

# Step 2: Explode the small table. It must match ALL salt
# values, so duplicate each row 10x with salts 0 through 9,
# then create the matching salted keys
df_small_exploded = df_small.withColumn(
    "salt", explode(array([lit(i) for i in range(N)]))
).withColumn(
    "salted_key",
    concat(col("join_key"), lit("_"), col("salt"))
)

# Step 3: Join on the salted key. The join now distributes
# evenly; drop the temporary columns afterwards
result = df_big_salted.join(
    df_small_exploded, "salted_key"
).drop("salted_key", "salt")
```
💡 Aha!

AQE skew handling is the modern answer — it's automatic and requires zero code changes. But know the salting technique too — interviewers love to see you can solve it manually. It proves you understand WHY the fix works, not just which config to flip.

Memory: GC, Spill & OOM

Memory problems are the second most common Spark performance killer after shuffle. They manifest in three ways: high GC time, spill to disk, and outright OOM crashes. Here's how to diagnose and fix each one.

| Symptom | What's Happening | Fix |
| --- | --- | --- |
| High GC Time (>10%) | JVM is spending too much time collecting garbage instead of running your tasks. Objects are being created and destroyed faster than the heap can handle. | Increase spark.executor.memory, reduce cached data volume, switch cache level to MEMORY_AND_DISK |
| Spill to Disk | Data being processed in a task doesn't fit in execution memory, so Spark serializes it to local disk. Disk I/O is 100x slower than memory access. | Increase spark.executor.memory, reduce partition size via more partitions, run fewer concurrent tasks per executor |
| Executor OOM / Lost | JVM heap is completely exhausted. The executor crashes and its tasks must be retried elsewhere. Often caused by skew or container overhead. | Increase spark.executor.memoryOverhead, reduce partition size, check for skew concentrating data in one task |
| High Memory Fraction Used | Storage memory (cached DataFrames) and execution memory (shuffles, sorts, aggregations) are competing for the same unified pool. | Tune spark.memory.fraction (default 0.6) and spark.memory.storageFraction (default 0.5). Unpersist caches you no longer need. |
💡 Aha!

The memory formula interviewers expect you to know: Total Executor Memory = spark.executor.memory + spark.executor.memoryOverhead. Default overhead is max(384MB, 10% of executor memory). For PySpark jobs, also account for spark.executor.pyspark.memory — Python processes run outside the JVM heap and need their own allocation.
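That formula is easy to check with a few lines. The helper itself is hypothetical; sizes are in MB, and the default overhead follows the rule stated above.

```python
def executor_container_memory_mb(executor_memory_mb,
                                 overhead_mb=None,
                                 pyspark_memory_mb=0):
    """Total memory the cluster manager must grant per executor container.

    Default overhead = max(384MB, 10% of executor memory).
    PySpark worker processes live outside the JVM heap, so their
    allocation (spark.executor.pyspark.memory) is added on top.
    """
    if overhead_mb is None:
        overhead_mb = max(384, int(0.10 * executor_memory_mb))
    return executor_memory_mb + overhead_mb + pyspark_memory_mb
```

So an 8GB executor actually asks YARN for about 9GB, and the 384MB floor dominates for small executors. Forgetting the overhead is a classic cause of "container killed by YARN for exceeding memory limits".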

Predicate Pushdown & Partition Pruning

Predicate pushdown means filters are pushed to the data source — Spark tells Parquet/ORC "only give me rows where date = '2024-01-01'" so it skips entire row groups. Partition pruning goes further — only the relevant partition directories are read from disk at all.

```python
# Write a partitioned table: each date gets its own
# directory on disk
df.write.partitionBy("date") \
    .parquet("s3://bucket/sales/")

# Read the partitioned Parquet table with a pushdown-friendly
# filter for a single day (Spark pushes this to the Parquet
# reader level), selecting only the columns we need
result = spark.read.parquet(
    "s3://bucket/sales/"
).filter(
    "date = '2024-01-01'"
).select("amount", "customer_id")

# Verify the optimization with explain(). Look for:
#   PushedFilters: [EqualTo(date,2024-01-01)]
#     filter sent to the Parquet reader, so only matching
#     row groups are scanned
#   PartitionFilters: [date = 2024-01-01]
#     only the 2024-01-01 partition directory is read from disk
result.explain("formatted")
```

🚫 What Breaks Pushdown

UDFs in filter expressions, complex casts like cast(date as string), non-native data sources, filters applied after a shuffle, and expressions Parquet can't evaluate (e.g., regex).

✅ How to Verify

Run df.explain("formatted") and look for PushedFilters in the scan node. If it says PushedFilters: [] — pushdown failed and you're reading everything.
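If you check plans often, a tiny helper can flag failed pushdown in the text of df.explain("formatted"). This assumes the "PushedFilters:" line layout described above and is a convenience sketch, not an official API; plan output formats can vary between Spark versions.

```python
def pushdown_failed(formatted_plan: str) -> bool:
    """Return True if the scan node reports an empty PushedFilters list,
    meaning no filter reached the data source and a full read is likely.
    """
    for line in formatted_plan.splitlines():
        line = line.strip()
        if line.startswith("PushedFilters:"):
            # Empty bracket list means nothing was pushed down
            return line[len("PushedFilters:"):].strip() == "[]"
    return False  # no scan node with pushed filters reported
```

Wire it into a pre-deployment check and a UDF-blocked filter gets caught before it burns a full table scan in production.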

💡 Aha!

If you see a full table scan in explain(), check if a UDF or cast() is preventing pushdown. The fix: replace with native Spark SQL expressions. For example, instead of a UDF that extracts year, use year(col("date")) — Spark can push native functions down to the data source.

Adaptive Query Execution (AQE)

AQE (Spark 3.0+) is a game-changer: it optimizes queries at runtime based on actual data statistics collected between stages. Unlike the Catalyst optimizer which plans everything upfront using potentially stale or inaccurate stats, AQE adapts as it goes.

📦 Coalescing Partitions

spark.sql.adaptive.coalescePartitions.enabled — after a shuffle, AQE merges tiny partitions into larger ones. Eliminates the "too many small tasks" problem without manual repartitioning.
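The merging idea can be sketched in a few lines. This is a simplified greedy pass for intuition, not Spark's actual implementation, which works from shuffle map statistics and the advisory size in spark.sql.adaptive.advisoryPartitionSizeInBytes.

```python
def coalesce_partitions(sizes_bytes, target_bytes=64 * 1024 * 1024):
    """Sketch of AQE partition coalescing after a shuffle: greedily
    merge adjacent small partitions until each merged group reaches
    roughly the target size. Simplified for illustration.
    """
    merged, current = [], 0
    for size in sizes_bytes:
        current += size
        if current >= target_bytes:
            merged.append(current)
            current = 0
    if current:
        merged.append(current)  # leftover tail group
    return merged
```

The payoff: a shuffle that produced 200 mostly-tiny partitions collapses into a handful of right-sized tasks, without anyone writing a repartition() call.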

⚖️ Skew Join Optimization

spark.sql.adaptive.skewJoin.enabled — automatically detects partitions that are much larger than the median and splits them into smaller sub-partitions. No salting code needed.

🔄 Dynamic Join Strategy

If runtime statistics reveal one side of a join is actually small (after filters reduced it), AQE switches from sort-merge join to broadcast join on the fly.

⚙️ Config

spark.sql.adaptive.enabled=true — the master switch. On by default in Spark 3.2+. For Spark 3.0–3.1, you need to enable it explicitly. Almost always a performance win.

💡 Aha!

AQE is almost always a win. The key interview answer: "AQE re-optimizes the query plan between stages using runtime statistics, unlike the Catalyst optimizer which only uses static estimates at planning time." This one sentence shows you understand both the what and the why. Follow up with the three features: coalescing, skew handling, and dynamic join switching.

Decision Table — Symptom → Root Cause → Fix

This is your cheat sheet. When you see a symptom in the Spark UI, follow the chain: check the next metric, identify the root cause, apply the fix. Memorize the top 5 rows — they cover 80% of real-world performance issues.

| If This Is High | Check Next | Likely Root Cause | Quick Fix |
| --- | --- | --- | --- |
| Scheduler Delay | Active executors, pending resources | Cluster / resources — not enough executors available | Add executors, check YARN queue limits, adjust spark.dynamicAllocation |
| Shuffle Read/Write | Join type, repartition calls, groupBy | Code / query design — too much data being moved | Use broadcast joins, replace groupByKey with reduceByKey, minimize shuffles |
| Disk Spill | Executor memory, partition size | Memory / config — data doesn't fit in execution memory | Increase spark.executor.memory, increase partition count |
| GC Time | Executor memory, cached data volume | Memory pressure — JVM heap under stress | Tune memory, unpersist unused caches, use MEMORY_AND_DISK |
| Max >> Median task time | Key distribution, partition sizes | Data skew — a few partitions have most of the data | Salt keys, enable AQE skew join, isolated join for hot keys |
| Input Size | Filter pushdown, partition pruning in plan | Reading too much data — full table scan | Add filters early, use partitioned tables, verify pushdown with explain() |
| Output Write Time | File count, output partition count | Storage bottleneck — too many small files or slow sink | Coalesce output before write, use efficient formats (Parquet/ORC) |
| Runtime Varies Across Runs | Cluster load, concurrent jobs | Cluster contention — sharing resources with other jobs | Isolate workloads, schedule off-peak, use dedicated queues |
| Lost Executors / OOM | Memory overhead, partition size, skew | Config + code — container memory exceeded | Increase memoryOverhead, fix skew, reduce partition size |
| Too Many / Few Tasks | Partition count vs data size | Over- or under-partitioning | Tune spark.sql.shuffle.partitions (default 200), use AQE coalescing |
| Python UDF Stages Slow | UDF in hot path, ser/deser overhead | Serialization overhead — data moved between JVM and Python | Replace with native Spark functions or pandas_udf (Arrow-based, vectorized) |
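For the over/under-partitioning row, a common back-of-the-envelope sizing targets roughly 128MB per shuffle partition. This is a practitioner heuristic, not an official Spark formula, so treat the numbers as a starting point for tuning.

```python
def recommended_shuffle_partitions(total_shuffle_bytes,
                                   target_partition_mb=128):
    """Heuristic sizing for spark.sql.shuffle.partitions: aim for
    partitions of roughly 100-200MB instead of relying on the
    fixed default of 200 regardless of data volume.
    """
    target = target_partition_mb * 1024 * 1024
    return max(1, round(total_shuffle_bytes / target))
```

A 512GB shuffle wants about 4096 partitions, far above the default 200, while a 10MB shuffle wants 1; with AQE coalescing enabled you can set the count generously and let Spark merge the excess.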
💡 Aha!

In a real interview, you won't recite this entire table. But knowing it lets you instantly pattern-match any scenario they throw at you. Practice by picking a random symptom and tracing through: symptom → check → cause → fix. That's exactly how senior engineers debug in production.

Quiz: Optimization & Bottleneck Detection

Test your debugging instincts. Each question simulates a real interview scenario — pick the best answer, then check the explanation.

Q1: Your Spark job has 100 tasks in a stage. 99 finish in 10 seconds, but 1 takes 15 minutes. What's the most likely problem?

A) The cluster doesn't have enough executors
B) Data skew — one partition has far more data than the others
C) The shuffle partitions setting is too low
D) GC pressure on the driver node
✅ Data skew. When max task time is 90x the median, one partition has vastly more data. Fix with key salting, AQE skew join, or an isolated broadcast join for the hot keys.

Q2: You see 20GB of "Spill to Disk" in the Stages tab. What should you do FIRST?

A) Add more executors to the cluster
B) Switch to a broadcast join
C) Increase executor memory or reduce concurrent tasks to give each task more memory
D) Enable AQE coalescing
✅ Spill means data doesn't fit in execution memory. Increasing spark.executor.memory or reducing the number of partitions processed concurrently gives each task more memory to work with, eliminating the need to spill to disk.

Q3: Your explain() plan shows SortMergeJoin, but one side is only 5MB. What optimization is missing?

A) Predicate pushdown
B) Partition pruning
C) Broadcast join — the small table should be broadcast to avoid shuffling
D) AQE coalescing
✅ A 5MB table is well under the 10MB broadcast threshold. Check that spark.sql.autoBroadcastJoinThreshold isn't set to -1 (disabled), or use the broadcast() hint to force it. This eliminates the shuffle on both sides.

Q4: What does AQE do differently than the Catalyst optimizer?

A) AQE uses rule-based optimization instead of cost-based
B) AQE re-optimizes at runtime using actual data statistics, while Catalyst uses static estimates at planning time
C) AQE replaces Catalyst entirely in Spark 3.0+
D) AQE only handles skew; Catalyst handles everything else
✅ The key distinction: Catalyst plans upfront with static estimates (which can be wrong). AQE collects real statistics between stages and re-optimizes — switching join strategies, coalescing partitions, and splitting skewed partitions based on what actually happened.

Q5: A pipeline reads a 500GB Parquet table but only needs 1 day's data. The filter is applied AFTER the scan in the plan. What's wrong?

A) The table isn't cached
B) Too few shuffle partitions
C) Predicate pushdown isn't working — likely a UDF or cast preventing it
D) The executor memory is too low
✅ If the filter appears after the scan in explain(), pushdown failed. Common causes: UDFs in filter expressions, type casts, or complex expressions Parquet can't evaluate. Replace with native Spark SQL expressions so the filter gets pushed to the Parquet reader.

Q6: Your job's runtime doubles when the cluster is busy but stays consistent off-hours. Is this a code issue or cluster issue?

A) Code issue — the query plan changes under load
B) Cluster issue — the job is competing for resources with other workloads
C) Mixed issue — both code and cluster need fixing
D) Neither — this is expected behavior for any distributed system
✅ Cluster issue. If code were the problem, runtime would be consistently slow regardless of cluster load. Variable runtime that correlates with cluster utilization means resource contention. Check scheduler delay and executor allocation time. Fix: dedicated queues, off-peak scheduling, or guaranteed resource allocation.