The Spark UI is your X-ray machine for understanding what's happening inside a Spark application. It launches automatically on port 4040 whenever a SparkContext is active, and the History Server preserves it after jobs complete.
Every serious Spark interviewer expects you to navigate and interpret the UI fluently. Saying "I'd check the Spark UI" is good — explaining exactly which tab, which metric, and what the value means is what gets you hired.
- **Jobs tab:** Shows every action-triggered job, its duration, stage count, and success/failure status — your 30,000-foot view.
- **Stages tab:** The gold mine. Reveals shuffle sizes, spill metrics, GC time, and per-task distributions for each stage.
- **SQL tab:** Displays the physical query plan as a DAG — see exactly which joins, scans, and exchanges Catalyst chose.
- **Executors tab:** Per-executor health: GC time, memory usage, active/failed tasks, and input/shuffle read volumes.
When an interviewer asks "how would you debug a slow Spark job?", start with: "I'd open the Spark UI, go to the Stages tab, and compare Max vs Median task duration to check for skew." That single sentence signals deep operational experience.
Every Spark action triggers a Job. Each job is subdivided into Stages at shuffle boundaries. The Jobs tab gives you the bird's-eye view — how many jobs ran, how long each took, and whether any failed.
| What You See | What It Means | 🚩 Red Flag |
|---|---|---|
| Job Duration | Total wall-clock time from action trigger to completion | One job takes 10× longer than similar jobs — likely skew or bad partitioning |
| Number of Stages | How many shuffle boundaries exist in this job's lineage | More than 5-6 stages usually means excessive shuffles from chained joins/groupBys |
| Failed Stages | Stages that threw exceptions and had to be retried or abandoned | Any failed stage — check executor logs for OOM errors or data corruption |
| Skipped Stages | Stages whose output was already cached or persisted, so Spark skipped re-computation | Not a red flag! Skipped stages mean your cache() or persist() is working |
If a job has many stages, it means multiple shuffles are happening. Ask yourself: can any of these groupBy, join, or repartition calls be eliminated or combined? Fewer stages = fewer shuffles = faster jobs.
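The payoff of combining passes can be sketched without Spark at all: each extra grouping pass corresponds to another shuffle. A toy Python illustration (plain dictionaries standing in for partitioned data; the records are hypothetical) that computes sum and count in one grouping pass instead of two:

```python
from collections import defaultdict

# Toy records: (department, salary) pairs -- hypothetical data.
records = [("eng", 100), ("eng", 120), ("ops", 90), ("ops", 80), ("eng", 110)]

# Two separate groupBy-style passes would mean two shuffles in Spark.
# One pass that accumulates both metrics needs only one.
totals = defaultdict(lambda: [0, 0])  # dept -> [sum, count]
for dept, salary in records:
    totals[dept][0] += salary
    totals[dept][1] += 1

averages = {dept: s / c for dept, (s, c) in totals.items()}
print(averages)  # {'eng': 110.0, 'ops': 85.0}
```

In Spark terms, this is the difference between chaining two `groupBy` calls and computing multiple aggregates in a single `agg`.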
This is THE most important tab in the Spark UI. Click into any stage and you'll find a treasure trove of metrics: shuffle read/write sizes, spill metrics, GC time, and the full task distribution.
| Metric | Where to Find | What It Tells You | 🚨 Danger Zone |
|---|---|---|---|
| Shuffle Read Size | Stage detail → Aggregate Metrics | How much data this stage pulled from previous stages' shuffle output | Hundreds of GB — consider broadcast joins or pre-filtering |
| Shuffle Write Size | Stage detail → Aggregate Metrics | How much data this stage produced for downstream stages to read | Write ≫ Read means data is exploding (possible cartesian join) |
| Spill (Memory) | Stage detail → Aggregate Metrics | Data size in memory before it was spilled — indicates memory pressure | Any spill > 0 means an operation exceeded available memory |
| Spill (Disk) | Stage detail → Aggregate Metrics | Data actually written to disk due to memory exhaustion | Large disk spill = severe performance hit (10-100× slower than in-memory) |
| GC Time | Stage detail → Summary Metrics | Time each task spent in Java garbage collection | >10% of task time signals memory pressure or too many objects |
| Scheduler Delay | Stage detail → Summary Metrics | Time between task being scheduled and actually starting execution | >50ms consistently means the cluster is resource-starved |
| Task Deser. Time | Stage detail → Summary Metrics | Time to deserialize the task closure on the executor | >100ms means large closures — avoid broadcasting big objects in closures |
In an interview, if you're asked about debugging a slow stage, mention these three metrics in order: (1) Shuffle Read/Write — is too much data moving? (2) Spill — is data hitting disk? (3) GC Time — is the JVM struggling? This structured approach shows systematic thinking.
Inside each stage, the Summary Metrics section shows Min, 25th percentile, Median, 75th percentile, and Max for every task metric. This is your primary weapon for detecting data skew.
The key insight is simple: if Max ≫ Median, you have skew. One task is doing far more work than the rest, and the entire stage waits for it to finish.
- **Healthy (Max ≈ 2× Median):** Tasks are processing roughly equal amounts of data. No action needed — this is the target state.
- **Moderate skew (Max ≈ 5–10× Median):** One or a few partitions are heavier. Consider salting keys or increasing partition count to redistribute data.
- **Severe skew (Max ≈ 100×+ Median):** A single key dominates. Requires targeted fixes: salting, isolated processing of hot keys, or adaptive query execution (AQE).
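Salting, mentioned above, splits a hot key into several synthetic sub-keys so its rows spread across partitions. A minimal sketch (the key name and bucket count are illustrative; in a real join, the other side would need each record duplicated once per salt value):

```python
import random
from collections import Counter

SALT_BUCKETS = 4
rows = [("hot_key", i) for i in range(1000)]  # one dominant key -- toy data

# Append a random salt so "hot_key" becomes hot_key_0 .. hot_key_3,
# letting the shuffle spread its rows over 4 partitions instead of 1.
salted = [(f"{key}_{random.randrange(SALT_BUCKETS)}", value) for key, value in rows]

counts = Counter(key for key, _ in salted)
print(counts)  # roughly 250 rows per salted key
```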
Always mention checking Max vs Median task time as your first skew detection technique. Then follow up with: "I'd also look at Shuffle Read Size per task to confirm which partition is oversized." This two-step answer demonstrates real debugging experience.
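The Max vs Median check is easy to automate once you export task durations (for example from the Spark UI's REST API or event logs). A sketch with hypothetical values, in seconds:

```python
import statistics

# Task durations for one stage, in seconds -- hypothetical values.
task_durations = [12, 14, 13, 15, 11, 13, 14, 142]

median = statistics.median(task_durations)
ratio = max(task_durations) / median
print(f"max/median ratio: {ratio:.1f}")

# Rule of thumb from above: ~2x is healthy, 5-10x is moderate skew,
# 100x+ means a single hot key dominates the stage.
if ratio > 5:
    print("Likely skew: one task is doing far more work than the rest")
```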
When you run DataFrame or Spark SQL operations, the SQL tab visualizes the physical plan as a visual DAG. Each node represents an operator, and understanding these nodes lets you predict performance bottlenecks before they happen.
For a typical employees–departments join, the DAG might contain nodes like:

- **Exchange:** repartitions data on the dept key across 200 partitions.
- **BroadcastExchange:** the departments table is broadcast to all executors (no shuffle!).
- **Scan parquet:** reads the employees and departments parquet files with column pruning.

**Exchange = Shuffle.** Every Exchange node in the plan is a shuffle operation that redistributes data across the network. Fewer exchanges = faster query. In interviews, if you see an Exchange before a join, suggest a broadcast join to eliminate it.
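Why does broadcasting eliminate the Exchange? Instead of shuffling both sides by join key, the small table is copied to every executor and each partition joins locally. A toy plain-Python simulation of the idea (hypothetical data; a dict stands in for the broadcast hash table):

```python
# Small dimension table, "broadcast" (copied) to every executor.
departments = {"d1": "Engineering", "d2": "Operations"}

# Large fact table, already split into partitions across executors.
employee_partitions = [
    [("alice", "d1"), ("bob", "d2")],
    [("carol", "d1")],
]

# Each partition joins against its local copy of the small table --
# no rows move between partitions, i.e. no Exchange in the plan.
joined = [
    (name, departments[dept])
    for partition in employee_partitions
    for name, dept in partition
]
print(joined)
```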
The Executors tab gives you per-executor health metrics — think of it as the vital signs monitor for each worker in your cluster. The Storage tab tells you whether your cache() and persist() calls are actually working.
- **GC Time:** Watch for >10% of total task time spent in garbage collection. High GC = memory pressure. Fix with more executor memory or better partitioning.
- **Memory Used:** Shows storage memory and execution memory fractions per executor. Near-max usage signals risk of spill or OOM errors.
- **Failed Tasks:** Uneven failed task counts across executors suggest executor instability — possibly a bad node, disk issues, or network problems.
- **Storage tab:** Shows cached RDDs/DataFrames, what fraction is actually cached, and whether it's in memory or spilled to disk.
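If the Executors tab shows sustained high GC or near-max memory, the usual first knobs are executor memory sizing and partition count. A hedged example of spark-submit settings (the values are illustrative, not recommendations — right-size them against your workload):

```shell
spark-submit \
  --executor-memory 8g \
  --conf spark.executor.memoryOverhead=1g \
  --conf spark.sql.shuffle.partitions=400 \
  your_app.py
```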
If one executor has much higher GC time than others, it's almost certainly processing a skewed partition. The fix isn't more memory for that executor — it's fixing the skew so all executors get equal work. Cross-reference with the Stages tab to confirm which partition is oversized.
The live Spark UI disappears when your SparkContext stops. The History Server is a separate process that reads event logs and re-creates the UI for completed applications. It's essential for post-mortem debugging and comparing job runs over time.
- **Post-mortem debugging:** investigate why last night's batch job failed or was slow.
- **Comparing runs:** diff today's metrics against yesterday's to catch regressions.
- **Capacity planning:** review historical resource usage patterns.
- `spark.eventLog.enabled=true` — turns on event logging.
- `spark.eventLog.dir=hdfs:///spark-logs` — where logs are written.
- `spark.history.fs.logDirectory` — where the History Server reads from (same path).
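Putting those properties together, a minimal setup might look like this (the HDFS path is illustrative; S3 or a local directory works too). The History Server serves its UI on port 18080 by default:

```shell
# conf/spark-defaults.conf
#   spark.eventLog.enabled           true
#   spark.eventLog.dir               hdfs:///spark-logs
#   spark.history.fs.logDirectory    hdfs:///spark-logs

# Then start the History Server from the Spark install directory:
./sbin/start-history-server.sh
```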
If asked "how do you debug a Spark job that already finished?", say: "We have the Spark History Server enabled with event logs stored on HDFS/S3. I can pull up the full UI — Stages, SQL plan, executor metrics — for any completed application and compare it against previous successful runs." This shows production-level awareness.
Test your ability to navigate the Spark UI like a pro. These are the kinds of questions that separate senior candidates from the rest.