Imagine you have a mountain of data: billions of rows of customer transactions, sensor readings, or social media posts. One computer would take days to crunch through it.
Apache Spark is like hiring a flash mob of workers who split the data into pieces, process them simultaneously across dozens (or thousands) of machines, and bring back the results, often in seconds.
"Spark is a fast, general-purpose distributed computing engine for large-scale data processing, with built-in modules for SQL, streaming, machine learning, and graph processing."
Key point: Spark doesn't store data. It processes data. It can read from HDFS, S3, Cassandra, databases, whatever you've got.
These show up in almost every Spark interview. Know them cold.
Up to 100x faster than Hadoop MapReduce (in memory) and 10x faster on disk. Spark keeps data in RAM between steps instead of writing to disk each time.
Write Spark code in Java, Scala, Python, R, or SQL. Python (PySpark) is the most popular today.
One engine for batch processing, streaming, ML, SQL queries, and graph analysis. No need for separate tools.
Standalone, on YARN, Mesos, Kubernetes, or cloud (AWS EMR, Databricks, GCP Dataproc). Connects to HDFS, S3, Cassandra, HBase.
Spark Streaming processes live data with low latency, essential for fraud detection, IoT, and real-time dashboards.
This is the #1 most-asked Spark comparison question. The short version: Spark keeps data in memory, supports iterative and streaming workloads, and offers rich high-level APIs; MapReduce writes to disk between every map and reduce phase, handles batch only, and requires more hand-written code, but it is simpler to operate and cheaper per node.
Don't say "Spark is always better." Say: "Spark excels at iterative processing and real-time workloads. MapReduce is simpler and more cost-effective for straightforward batch ETL on very large datasets."
Here's how a Spark job flows from your code to results: the driver program builds a DAG of transformations (nothing executes yet), an action triggers the DAG scheduler to split the graph into stages at shuffle boundaries, the task scheduler ships tasks to executors across the cluster, and the executors return results to the driver.
Spark isn't one tool; it's a Swiss Army knife of data processing. Know these components:
Spark Core: The engine underneath everything. Handles job scheduling, memory management, fault recovery, and I/O.
Spark SQL: Run SQL queries on structured data. Uses the Catalyst optimizer. Supports Parquet, JSON, Hive, and JDBC sources.
Spark Streaming: Process live data from Kafka, Flume, Kinesis. Divides streams into micro-batches of RDDs (DStreams).
MLlib: Machine learning library with classification, regression, clustering, collaborative filtering, and pipelines.
GraphX: Graph processing (PageRank, shortest paths, community detection). Uses vertex-cut partitioning.
SparkR: R language bindings for Spark. Use R data frames and dplyr-style verbs within the Spark ecosystem.