Think of Spark like a symphony orchestra. You don't just throw 100 musicians on stage and hope for the best. There's a conductor (the Driver), section leaders (the Cluster Manager), and musicians (Executors), each with a specific role.
Driver: Runs your main() program. Creates the SparkContext. Builds the DAG. Schedules tasks. Collects results back to you.
Cluster Manager: Allocates resources. Launches executors on worker nodes. Types: Standalone, YARN, Mesos, Kubernetes.
Executors: Run on worker nodes. Execute assigned tasks. Store data in memory/disk. Report results back to the Driver.
Worker Node: A machine in the cluster. Can run one or more Executors. Does the actual data processing work.
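The division of labour above can be sketched in plain Python, no Spark required: a "driver" splits the input into partitions, a pool of "executors" runs one task per partition, and the driver collects the results back. All names here are illustrative, not Spark APIs.

```python
from concurrent.futures import ThreadPoolExecutor

def run_job(data, task, num_executors=4):
    """Toy 'driver': partition the data, hand one task per partition
    to a pool of 'executors', and collect the partial results."""
    n = num_executors
    # Split the input into roughly equal partitions (one task each).
    partitions = [data[i::n] for i in range(n)]
    with ThreadPoolExecutor(max_workers=n) as pool:
        # Each 'executor' processes its partition independently.
        partial_results = list(pool.map(task, partitions))
    # The driver merges the partial results, like collect() would.
    return [x for part in partial_results for x in part]

squares = run_job(list(range(10)), lambda part: [x * x for x in part])
```

Note that, as in real Spark, the merged output is not guaranteed to be in input order unless you sort it.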
Watch how a Spark job travels through the architecture, step by step:
DAG stands for Directed Acyclic Graph. It's Spark's execution plan: a roadmap of every transformation and action your job needs to perform.
Spark evaluates your code lazily. Each transformation (map, filter, join) becomes a node in the DAG rather than executing immediately.
Spark builds the operator graph, connecting all transformations in order.
The DAG Scheduler splits the graph into stages based on shuffle boundaries. Each stage contains pipelined tasks.
The Task Scheduler sends individual tasks to Executors via the Cluster Manager.
Executors run tasks on their data partitions and send results back to the Driver.
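The stage-splitting step above can be modelled in a few lines: walk the chain of operators and cut a new stage at every shuffle boundary (wide dependency). The operator names and the `is_wide` flag are simplifications for illustration, not Spark's internal representation.

```python
def split_into_stages(plan):
    """Cut a linear operator plan into stages at shuffle boundaries.
    Each plan entry is (name, is_wide): wide ops such as groupByKey
    or join force a shuffle and therefore start a new stage."""
    stages, current = [], []
    for name, is_wide in plan:
        if is_wide and current:
            stages.append(current)  # close the stage before the shuffle
            current = []
        current.append(name)
    if current:
        stages.append(current)
    return stages

plan = [("map", False), ("filter", False),
        ("groupByKey", True), ("mapValues", False),
        ("join", True), ("map", False)]
stages = split_into_stages(plan)
# Narrow ops pipeline together inside a stage; wide ops cut boundaries.
```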
"Spark's DAG optimizer can rearrange and pipeline operations for efficiency, unlike MapReduce's rigid map → shuffle → reduce pattern. This is a key reason Spark is faster."
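The payoff of pipelining can be seen in miniature: instead of materializing a full intermediate list after every operator (MapReduce style), narrow transformations fuse into a single pass over each record. A pure-Python sketch, not Spark code:

```python
def pipeline(*fns):
    """Fuse a chain of per-record steps into one function, so each
    record flows through every step in a single pass."""
    def fused(x):
        for fn in fns:
            x = fn(x)
        return x
    return fused

# Un-pipelined: two full intermediate lists are built.
doubled = [x * 2 for x in range(1, 6)]
shifted = [x + 1 for x in doubled]

# Pipelined: one pass per record, no intermediate lists.
fused = pipeline(lambda x: x * 2, lambda x: x + 1)
result = [fused(x) for x in range(1, 6)]
```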
Interviewers love this one. There are two modes for running Spark:
Client mode: The Driver runs on YOUR machine (the one that submitted the job). Good for debugging and interactive shells, but if your machine disconnects, the job dies.
Cluster mode: The Driver runs ON the cluster (managed by YARN/Mesos/Kubernetes). Better for production: it survives client disconnection and has lower network latency between the Driver and Executors.
Client mode: development, debugging, interactive analysis (Spark Shell). Cluster mode: production jobs, when your machine is far from the cluster, when the job must survive disconnections.
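In practice the mode is chosen with spark-submit's `--deploy-mode` flag; a sketch, where the application path and resource sizes are placeholders:

```shell
# Client mode: the driver runs on the submitting machine.
spark-submit --master yarn --deploy-mode client \
  --num-executors 4 --executor-memory 4g \
  my_app.py

# Cluster mode: the driver runs inside the cluster and survives
# the submitting machine disconnecting.
spark-submit --master yarn --deploy-mode cluster \
  --num-executors 4 --executor-memory 4g \
  my_app.py
```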
Standalone: Spark's built-in manager. Simple to set up: start the master and workers manually. Good for dev/testing.
YARN: Hadoop's resource manager, the most popular choice in enterprises. Supports both deploy modes: --master yarn with --deploy-mode client or cluster (the old yarn-client/yarn-cluster master URLs are deprecated).
Mesos: Apache Mesos provides dynamic partitioning between Spark and other frameworks. Less common today.
Kubernetes: Modern container orchestration, growing fast in cloud-native deployments. Each executor runs as a pod.
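Each manager above is selected through the `--master` URL passed to spark-submit; a sketch with placeholder hostnames and image names:

```shell
spark-submit --master spark://master-host:7077 app.py    # Standalone
spark-submit --master yarn app.py                        # YARN (cluster located via HADOOP_CONF_DIR)
spark-submit --master mesos://mesos-host:5050 app.py     # Mesos
spark-submit --master k8s://https://k8s-apiserver:6443 \
  --conf spark.kubernetes.container.image=my-spark-image \
  app.py                                                 # Kubernetes
```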
These behind-the-scenes processes keep Spark running. Know them for senior-level questions:
DAGScheduler: Converts the logical DAG into physical stages. Determines which tasks can be pipelined together vs. which need shuffles.
TaskScheduler: Takes stages from the DAGScheduler and sends tasks to Executors. Implements scheduling policies (FIFO or Fair).
BlockManager: Manages storage of data blocks across the cluster. Handles caching, shuffle data, and broadcast variables.
MemoryManager: Manages in-memory storage for cached RDDs and broadcast variables within each Executor.
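The Task Scheduler's two policies mentioned above can be contrasted with a toy model: FIFO drains one job's tasks before the next job starts, while Fair interleaves tasks round-robin so every job gets a share of the slots. A pure-Python illustration, not Spark's actual implementation:

```python
from itertools import chain, zip_longest

# Two queued jobs, each with its own pending tasks.
jobs = {"jobA": ["a1", "a2", "a3"], "jobB": ["b1", "b2"]}

# FIFO: all of jobA's tasks are dispatched before any of jobB's.
fifo_order = list(chain.from_iterable(jobs.values()))

# Fair: tasks are interleaved across jobs, round-robin style.
fair_order = [t for batch in zip_longest(*jobs.values())
              for t in batch if t is not None]
```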