The Orchestra Metaphor

Think of Spark as a symphony orchestra. You don't just throw 100 musicians on stage and hope for the best. There's a conductor (Driver), section leaders (Cluster Manager), and musicians (Executors), each with a specific role.

🎼 Driver (Conductor)

Runs your main() program. Creates SparkContext. Builds the DAG. Schedules tasks. Collects results back to you.

🎫 Cluster Manager (Stage Manager)

Allocates resources. Launches executors on worker nodes. Types: Standalone, YARN, Mesos, Kubernetes.

🎵 Executors (Musicians)

Run on worker nodes. Execute assigned tasks. Store data in memory/disk. Report results back to Driver.

💻 Worker Node (Music Stand)

A machine in the cluster. Can run one or more Executors. Does the actual data processing work.
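The division of labor above can be sketched as a toy simulation in plain Python (this is an analogy, not Spark code; names like run_task and driver_main are invented for the illustration). The "driver" splits a job into one task per partition, a thread pool stands in for executors on worker nodes, and the driver collects the results:

```python
from concurrent.futures import ThreadPoolExecutor

# Toy analogy only -- plain Python, NOT Spark.
# The "driver" plans the work; the thread pool plays the role of
# executors running tasks on worker nodes.

def run_task(partition):
    """One 'task': process a single partition of the data."""
    return sum(partition)

def driver_main():
    data = list(range(100))
    # Driver splits the job into tasks, one per partition
    partitions = [data[i:i + 25] for i in range(0, len(data), 25)]
    # The "cluster manager" grants 4 executor slots
    with ThreadPoolExecutor(max_workers=4) as executors:
        partial_results = list(executors.map(run_task, partitions))
    # Driver collects and combines the results
    return sum(partial_results)

print(driver_main())  # 4950
```

The key point of the analogy: the driver never touches the partitions itself; it only plans, distributes, and aggregates.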

Data Flow

Here's how a Spark job travels through the architecture, step by step:

👀 Your Code → 🎼 Driver (SparkContext) → 🎫 Cluster Manager (YARN / Mesos) → 🎵 Executors (Worker Nodes)

The DAG: Spark's Secret Weapon

DAG stands for Directed Acyclic Graph. It's Spark's execution plan: a roadmap of every transformation and action your job needs to perform.

1. Code Interpretation: Spark parses your code; each transformation (map, filter, join) becomes a node in the DAG.

2. DAG Construction: Spark builds the operator graph, connecting all transformations in order.

3. Stage Splitting: The DAG Scheduler splits the graph into stages at shuffle boundaries. Each stage contains tasks that can be pipelined together.

4. Task Scheduling: The Task Scheduler sends individual tasks to Executors via the Cluster Manager.

5. Execution: Executors run tasks on their data partitions and send results back to the Driver.

💡 Why the DAG matters in interviews

"Spark's DAG optimizer can rearrange and pipeline operations for efficiency, unlike MapReduce, which is locked into a rigid map → shuffle → reduce pattern. This is a key reason Spark is faster."
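Step 3 above (stage splitting) can be miniaturized in plain Python. This is a toy model, not Spark's actual DAGScheduler: walk the list of operators and start a new stage at every wide transformation, i.e. at each shuffle boundary, while narrow transformations pipeline into the current stage.

```python
# Toy model of stage splitting -- NOT Spark's real DAGScheduler.
# Narrow transformations (map, filter) pipeline into one stage;
# a wide transformation (shuffle) begins a new stage.

WIDE = {"reduceByKey", "groupByKey", "join", "repartition"}  # shuffle ops

def split_into_stages(ops):
    stages, current = [], []
    for op in ops:
        if op in WIDE and current:
            stages.append(current)   # cut the stage at the shuffle boundary
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages

plan = ["map", "filter", "reduceByKey", "map", "join", "filter"]
print(split_into_stages(plan))
# -> [['map', 'filter'], ['reduceByKey', 'map'], ['join', 'filter']]
```

Six operators collapse into three stages; within each stage the tasks can run as one pipelined pass over a partition, which is exactly the optimization the interview answer above refers to.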

Deployment Modes: Client vs Cluster

Interviewers love this one. Spark has two deploy modes, which differ in where the Driver runs:

💻 Client Mode

Driver runs on YOUR machine (the machine that submitted the job). Good for debugging and interactive shells. If your machine disconnects, the job dies.

☁️ Cluster Mode

Driver runs ON the cluster (managed by YARN/Mesos). Better for production: it survives client disconnection, and network latency between Driver and Executors is lower.

🎯 When to use which?

Client mode: development, debugging, interactive analysis (Spark Shell). Cluster mode: production jobs, jobs submitted from a machine far from the cluster, and jobs that must survive disconnections.
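With spark-submit, the mode is chosen via the --deploy-mode flag (the flags are Spark's documented ones; the application file name is a placeholder):

```shell
# Client mode: Driver runs on the submitting machine (debugging, interactive work)
spark-submit --master yarn --deploy-mode client my_app.py

# Cluster mode: Driver runs inside the cluster (production, survives disconnects)
spark-submit --master yarn --deploy-mode cluster my_app.py
```

Client mode is also the default when --deploy-mode is omitted, which is why interactive shells behave as client-mode applications.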

Cluster Manager Types

Standalone

Spark's built-in manager. Simple to set up. Start master + workers manually. Good for dev/testing.

YARN

Hadoop's resource manager. Most popular in enterprise. Two sub-modes, client and cluster, selected with spark-submit's --deploy-mode flag (the old yarn-client / yarn-cluster master values are deprecated).

Mesos

Apache Mesos provides dynamic partitioning between Spark and other frameworks. Less common today; Spark deprecated Mesos support in version 3.2.

Kubernetes

Modern container orchestration. Growing fast in cloud-native deployments. Each executor runs as a pod.
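Each cluster manager is selected through the --master URL passed to spark-submit (the URL schemes are Spark's documented forms; hostnames and ports below are placeholders):

```shell
spark-submit --master spark://master-host:7077 app.py     # Standalone
spark-submit --master yarn app.py                         # YARN (cluster found via HADOOP_CONF_DIR)
spark-submit --master mesos://master-host:5050 app.py     # Mesos
spark-submit --master k8s://https://api-host:6443 app.py  # Kubernetes
spark-submit --master "local[4]" app.py                   # no cluster: local testing with 4 threads
```

The local[N] form is worth remembering for interviews: it runs Driver and Executors in a single JVM, which is how most development and unit testing is done.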

Quiz: Architecture

Q1: What is the SparkContext responsible for?

A) Storing data on HDFS
B) Coordinating the Spark application: connecting to the cluster manager, creating RDDs, and scheduling tasks
C) Running SQL queries only
D) Managing network security

Q2: Your Spark job is running in Client mode and your laptop goes to sleep. What happens?

A) Nothing; the cluster takes over
B) The entire job fails because the Driver was on your laptop
C) The job pauses and resumes when you wake up

Q3: What does the DAG Scheduler do with the execution plan?

A) Sends the entire plan to one executor
B) Converts it directly to machine code
C) Splits the DAG into stages based on shuffle boundaries, then passes tasks to the Task Scheduler
D) Writes the plan to HDFS for later execution

Q4: For a production pipeline running 24/7 on a remote cluster, which deployment mode should you use?

A) Client mode, so you can watch the logs on your laptop
B) Cluster mode; the Driver runs on the cluster and survives client disconnections
C) It doesn't matter; both modes are identical for production

Key Internal Daemons

These behind-the-scenes processes keep Spark running. Know them for senior-level questions:

🧠 DAGScheduler

Converts logical DAG into physical stages. Determines which tasks can be pipelined together vs. which need shuffles.

📦 TaskScheduler

Takes stages from DAGScheduler and sends tasks to Executors. Implements scheduling policies (FIFO or Fair).

📦 BlockManager

Manages storage of data blocks across the cluster. Handles caching, shuffle data, and broadcast variables.

💾 MemoryStore

Manages in-memory storage for cached RDDs and broadcast variables within each Executor.

// Executor sizing example
// Nodes = 10, Cores per node = 15, RAM per node = 61 GB
// Cores per executor = 5 (rule of thumb)
// Executors per node = 15 / 5 = 3
// Total executors = 10 × 3 = 30

We have 10 machines in our cluster. Each machine has 15 CPU cores and 61 GB of RAM. Best practice: 5 cores per executor, so each machine runs 3 executors. 🎯 Total: 30 executors for the whole job.
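The same arithmetic as a small script. The per-executor memory line at the end is an extra, commonly cited rule of thumb (divide each node's usable RAM across its executors); it is an assumption added here, not part of the figures above.

```python
# Executor sizing from cluster specs (numbers from the example above)
nodes = 10
cores_per_node = 15          # usable cores per machine
ram_per_node_gb = 61         # usable RAM per machine

cores_per_executor = 5       # rule of thumb: ~5 cores per executor
executors_per_node = cores_per_node // cores_per_executor   # 3
total_executors = nodes * executors_per_node                # 30

# Assumption (extra rule of thumb, not from the example above):
# split each node's RAM evenly across its executors.
mem_per_executor_gb = ram_per_node_gb // executors_per_node  # 20

print(total_executors, mem_per_executor_gb)  # 30 20
```

In a real cluster you would also subtract memory overhead (roughly 7-10% per executor) before setting spark.executor.memory, so the value actually configured ends up a little lower.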