The Spark UI is your X-ray machine for understanding what's happening inside a Spark application. It launches automatically on port 4040 whenever a SparkContext is active, and the History Server preserves it after jobs complete.
Every serious Spark interviewer expects you to navigate and interpret the UI fluently. Saying "I'd check the Spark UI" is good — explaining exactly which tab, which metric, and what the value means is what gets you hired.
- **Jobs tab:** Shows every action-triggered job, its duration, stage count, and success/failure status — your 30,000-foot view.
- **Stages tab:** The gold mine. Reveals shuffle sizes, spill metrics, GC time, and per-task distributions for each stage.
- **SQL tab:** Displays the physical query plan as a DAG — see exactly which joins, scans, and exchanges Catalyst chose.
- **Executors tab:** Per-executor health: GC time, memory usage, active/failed tasks, and input/shuffle read volumes.
When an interviewer asks "how would you debug a slow Spark job?", start with: "I'd open the Spark UI, go to the Stages tab, and compare Max vs Median task duration to check for skew." That single sentence signals deep operational experience.
Every Spark action triggers a Job. Each job is subdivided into Stages at shuffle boundaries. The Jobs tab gives you the bird's-eye view — how many jobs ran, how long each took, and whether any failed.
| What You See | What It Means | 🚩 Red Flag |
|---|---|---|
| Job Duration | Total wall-clock time from action trigger to completion | One job takes 10× longer than similar jobs — likely skew or bad partitioning |
| Number of Stages | How many shuffle boundaries exist in this job's lineage | More than 5-6 stages usually means excessive shuffles from chained joins/groupBys |
| Failed Stages | Stages that threw exceptions and had to be retried or abandoned | Any failed stage — check executor logs for OOM errors or data corruption |
| Skipped Stages | Stages whose output was already cached or persisted, so Spark skipped re-computation | Not a red flag! Skipped stages mean your cache() or persist() is working |
If a job has many stages, it means multiple shuffles are happening. Ask yourself: can any of these groupBy, join, or repartition calls be eliminated or combined? Fewer stages = fewer shuffles = faster jobs.
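The payoff of combining passes can be sketched without Spark at all: each extra grouping pass corresponds to another shuffle. A toy Python illustration (plain dictionaries standing in for partitioned data; the records are hypothetical) that computes sum and count in one grouping pass instead of two:

```python
from collections import defaultdict

# Toy records: (department, salary) pairs -- hypothetical data.
records = [("eng", 100), ("eng", 120), ("ops", 90), ("ops", 80), ("eng", 110)]

# Two separate groupBy-style passes would mean two shuffles in Spark.
# One pass that accumulates both metrics needs only one.
totals = defaultdict(lambda: [0, 0])  # dept -> [sum, count]
for dept, salary in records:
    totals[dept][0] += salary
    totals[dept][1] += 1

averages = {dept: s / c for dept, (s, c) in totals.items()}
print(averages)  # {'eng': 110.0, 'ops': 85.0}
```

In Spark terms, this is the difference between chaining two `groupBy` calls and computing multiple aggregates in a single `agg`.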
This is THE most important tab in the Spark UI. Click into any stage and you'll find a treasure trove of metrics: shuffle read/write sizes, spill metrics, GC time, and the full task distribution.
| Metric | Where to Find | What It Tells You | 🚨 Danger Zone |
|---|---|---|---|
| Shuffle Read Size | Stage detail → Aggregate Metrics | How much data this stage pulled from previous stages' shuffle output | Hundreds of GB — consider broadcast joins or pre-filtering |
| Shuffle Write Size | Stage detail → Aggregate Metrics | How much data this stage produced for downstream stages to read | Write ≫ Read means data is exploding (possible cartesian join) |
| Spill (Memory) | Stage detail → Aggregate Metrics | Data size in memory before it was spilled — indicates memory pressure | Any spill > 0 means an operation exceeded available memory |
| Spill (Disk) | Stage detail → Aggregate Metrics | Data actually written to disk due to memory exhaustion | Large disk spill = severe performance hit (10-100× slower than in-memory) |
| GC Time | Stage detail → Summary Metrics | Time each task spent in Java garbage collection | >10% of task time signals memory pressure or too many objects |
| Scheduler Delay | Stage detail → Summary Metrics | Time between task being scheduled and actually starting execution | >50ms consistently means the cluster is resource-starved |
| Task Deser. Time | Stage detail → Summary Metrics | Time to deserialize the task closure on the executor | >100ms means large closures — avoid broadcasting big objects in closures |
In an interview, if you're asked about debugging a slow stage, mention these three metrics in order: (1) Shuffle Read/Write — is too much data moving? (2) Spill — is data hitting disk? (3) GC Time — is the JVM struggling? This structured approach shows systematic thinking.
Inside each stage, the Summary Metrics section shows Min, 25th percentile, Median, 75th percentile, and Max for every task metric. This is your primary weapon for detecting data skew.
The key insight is simple: if Max ≫ Median, you have skew. One task is doing far more work than the rest, and the entire stage waits for it to finish.
- **Healthy (Max ≈ 2× Median):** Tasks are processing roughly equal amounts of data. No action needed — this is the target state.
- **Moderate skew (Max ≈ 5–10× Median):** One or a few partitions are heavier. Consider salting keys or increasing partition count to redistribute data.
- **Severe skew (Max ≈ 100×+ Median):** A single key dominates. Requires targeted fixes: salting, isolated processing of hot keys, or adaptive query execution (AQE).
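Salting, mentioned above, splits a hot key into several synthetic sub-keys so its rows spread across partitions. A minimal sketch (the key name and bucket count are illustrative; in a real join, the other side would need each record duplicated once per salt value):

```python
import random
from collections import Counter

SALT_BUCKETS = 4
rows = [("hot_key", i) for i in range(1000)]  # one dominant key -- toy data

# Append a random salt so "hot_key" becomes hot_key_0 .. hot_key_3,
# letting the shuffle spread its rows over 4 partitions instead of 1.
salted = [(f"{key}_{random.randrange(SALT_BUCKETS)}", value) for key, value in rows]

counts = Counter(key for key, _ in salted)
print(counts)  # roughly 250 rows per salted key
```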
Always mention checking Max vs Median task time as your first skew detection technique. Then follow up with: "I'd also look at Shuffle Read Size per task to confirm which partition is oversized." This two-step answer demonstrates real debugging experience.
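The Max vs Median check is easy to automate once you export task durations (for example from the Spark UI's REST API or event logs). A sketch with hypothetical values, in seconds:

```python
import statistics

# Task durations for one stage, in seconds -- hypothetical values.
task_durations = [12, 14, 13, 15, 11, 13, 14, 142]

median = statistics.median(task_durations)
ratio = max(task_durations) / median
print(f"max/median ratio: {ratio:.1f}")

# Rule of thumb from above: ~2x is healthy, 5-10x is moderate skew,
# 100x+ means a single hot key dominates the stage.
if ratio > 5:
    print("Likely skew: one task is doing far more work than the rest")
```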
When you run DataFrame or Spark SQL operations, the SQL tab visualizes the physical plan as a visual DAG. Each node represents an operator, and understanding these nodes lets you predict performance bottlenecks before they happen.
For a typical employees–departments join, the DAG might contain nodes like:

- **Exchange:** repartitions data on the dept key across 200 partitions.
- **BroadcastExchange:** the departments table is broadcast to all executors (no shuffle!).
- **Scan parquet:** reads the employees and departments parquet files with column pruning.

**Exchange = Shuffle.** Every Exchange node in the plan is a shuffle operation that redistributes data across the network. Fewer exchanges = faster query. In interviews, if you see an Exchange before a join, suggest a broadcast join to eliminate it.
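Why does broadcasting eliminate the Exchange? Instead of shuffling both sides by join key, the small table is copied to every executor and each partition joins locally. A toy plain-Python simulation of the idea (hypothetical data; a dict stands in for the broadcast hash table):

```python
# Small dimension table, "broadcast" (copied) to every executor.
departments = {"d1": "Engineering", "d2": "Operations"}

# Large fact table, already split into partitions across executors.
employee_partitions = [
    [("alice", "d1"), ("bob", "d2")],
    [("carol", "d1")],
]

# Each partition joins against its local copy of the small table --
# no rows move between partitions, i.e. no Exchange in the plan.
joined = [
    (name, departments[dept])
    for partition in employee_partitions
    for name, dept in partition
]
print(joined)
```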
The Executors tab gives you per-executor health metrics — think of it as the vital signs monitor for each worker in your cluster. The Storage tab tells you whether your cache() and persist() calls are actually working.
- **GC Time:** Watch for >10% of total task time spent in garbage collection. High GC = memory pressure. Fix with more executor memory or better partitioning.
- **Memory Used:** Shows storage memory and execution memory fractions per executor. Near-max usage signals risk of spill or OOM errors.
- **Failed Tasks:** Uneven failed task counts across executors suggest executor instability — possibly a bad node, disk issues, or network problems.
- **Storage tab:** Shows cached RDDs/DataFrames, what fraction is actually cached, and whether it's in memory or spilled to disk.
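If the Executors tab shows sustained high GC or near-max memory, the usual first knobs are executor memory sizing and partition count. A hedged example of spark-submit settings (the values are illustrative, not recommendations — right-size them against your workload):

```shell
spark-submit \
  --executor-memory 8g \
  --conf spark.executor.memoryOverhead=1g \
  --conf spark.sql.shuffle.partitions=400 \
  your_app.py
```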
If one executor has much higher GC time than others, it's almost certainly processing a skewed partition. The fix isn't more memory for that executor — it's fixing the skew so all executors get equal work. Cross-reference with the Stages tab to confirm which partition is oversized.
The live Spark UI disappears when your SparkContext stops. The History Server is a separate process that reads event logs and re-creates the UI for completed applications. It's essential for post-mortem debugging and comparing job runs over time.
- **Post-mortem debugging:** investigate why last night's batch job failed or was slow.
- **Comparing runs:** diff today's metrics against yesterday's to catch regressions.
- **Capacity planning:** review historical resource usage patterns.
- `spark.eventLog.enabled=true` — turns on event logging.
- `spark.eventLog.dir=hdfs:///spark-logs` — where logs are written.
- `spark.history.fs.logDirectory` — where the History Server reads from (same path).
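Putting those properties together, a minimal setup might look like this (the HDFS path is illustrative; S3 or a local directory works too). The History Server serves its UI on port 18080 by default:

```shell
# conf/spark-defaults.conf
#   spark.eventLog.enabled           true
#   spark.eventLog.dir               hdfs:///spark-logs
#   spark.history.fs.logDirectory    hdfs:///spark-logs

# Then start the History Server from the Spark install directory:
./sbin/start-history-server.sh
```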
If asked "how do you debug a Spark job that already finished?", say: "We have the Spark History Server enabled with event logs stored on HDFS/S3. I can pull up the full UI — Stages, SQL plan, executor metrics — for any completed application and compare it against previous successful runs." This shows production-level awareness.
Test your ability to navigate the Spark UI like a pro. These are the kinds of questions that separate senior candidates from the rest.