RDD = Resilient Distributed Dataset

Think of an RDD like a deck of cards split among friends. Each friend holds a few cards (partitions). If one friend drops their cards, the group remembers which cards were dealt and can recreate them.

🛡️ Resilient

If a partition is lost (machine crashes), Spark rebuilds it using the lineage graph — no data replication needed.

🌐 Distributed

Data is split across multiple machines in the cluster. Each machine processes its chunk in parallel.

📋 Dataset

A collection of records — could be strings, numbers, objects. Similar to a list/array but spread across machines.

🔒 Immutable

Once created, an RDD cannot be changed. You can only create a NEW RDD by transforming an existing one.

Two Ways to Create an RDD

📥 From Internal Data

Parallelize an existing collection in your program. Takes a local list and distributes it across the cluster.

data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)

📂 From External Data

Read from HDFS, S3, local file system, HBase, Cassandra — any supported data source.

rdd = sc.textFile("hdfs:///data/logs.txt")
rdd = sc.textFile("s3://bucket/file.csv")

There are also two types of RDDs: Parallelized Collections (from local data) and Hadoop Datasets (from HDFS/external).

Transformations vs Actions

This is one of the most frequently asked Spark interview questions. Everything you do with an RDD is either a Transformation or an Action.

| Aspect | Transformations | Actions |
|---|---|---|
| What it does | Creates a NEW RDD from an existing one | Returns a value to the Driver or writes to storage |
| Execution | ⏳ Lazy — nothing happens until an action | ⚡ Triggers actual computation |
| Returns | Another RDD | A value (number, list, etc.) or a side effect |
| Examples | map(), filter(), flatMap(), union(), join() | collect(), count(), reduce(), first(), take(n) |
# Transformations (lazy — nothing runs yet)
raw = sc.textFile("movies.txt")
parsed = raw.map(lambda x: x.split("\t"))
filtered = parsed.filter(lambda x: float(x[2]) > 4.0)

# Action (NOW everything executes!)
result = filtered.count()
📄 Read file → creates RDD (but doesn't read yet!)
✂️ Split each line by tab → new RDD (still nothing runs)
🔍 Keep only ratings > 4.0 → new RDD (still lazy!)
⚡ count() is an ACTION — NOW Spark reads the file, splits, filters, and counts — all at once, optimized!

Lazy Evaluation — Why Spark Waits

Imagine you're giving a cooking assistant a list of instructions: "dice onions, then sauté them, then add tomatoes." A lazy assistant writes down all the steps but doesn't start cooking until you say "Serve the dish!"

That's Spark. Lazy evaluation lets Spark optimize the entire pipeline before doing any work.

⏰ Reduces wasted work

If a filter discards 90% of the data, the downstream steps never touch those rows.

🧩 Enables pipelining

Multiple transformations are fused into one pass over the data.

📊 Fewer passes over the data

Spark can combine operations and avoid redundant disk reads.

💡 Lineage Graph (RDD Lineage)

Because Spark is lazy, it tracks how every RDD was created from other RDDs. This chain is called the lineage graph. If any partition is lost, Spark replays the lineage to rebuild ONLY that partition — no need to copy all data across machines like Hadoop does.

Common Transformations Cheat Sheet

| Transformation | What it does | Example |
|---|---|---|
| map(func) | Apply func to each element, return a new RDD | rdd.map(lambda x: x*2) |
| filter(func) | Keep elements where func returns True | rdd.filter(lambda x: x>10) |
| flatMap(func) | Like map, but can return 0+ elements per input | rdd.flatMap(lambda x: x.split()) |
| union(other) | Combine two RDDs | rdd1.union(rdd2) |
| intersection(other) | Elements common to both RDDs | rdd1.intersection(rdd2) |
| reduceByKey(func) | Group by key, apply a reduce function | pairs.reduceByKey(lambda a,b: a+b) |
| groupByKey() | Group values by key | pairs.groupByKey() |
| join(other) | Inner join on keys | rdd1.join(rdd2) |
| coalesce(n) | Reduce the number of partitions (no full shuffle when decreasing) | rdd.coalesce(4) |
| repartition(n) | Change the partition count (always shuffles) | rdd.repartition(10) |
🎯 Interview Favorite: coalesce vs repartition

coalesce can only REDUCE partitions and avoids a full shuffle (faster). repartition can increase OR decrease partitions but always triggers a shuffle. Use coalesce when reducing, repartition when increasing.

Quiz: RDD Mastery

Q1: You call rdd.map(lambda x: x*2) but nothing happens. Why?

A) There's a bug in the lambda function
B) map() is a transformation — it's lazy. Nothing executes until you call an action like collect() or count().
C) The RDD is empty

Q2: An executor crashes and loses partition 3 of your RDD. How does Spark recover?

A) It reads the backup copy from HDFS replication
B) It replays the lineage graph to recompute only partition 3 from its parent RDDs
C) The entire job fails and must be restarted

Q3: You want to go from 1000 partitions to 100. What should you use?

A) coalesce(100) — it can reduce partitions without a full shuffle, making it faster
B) repartition(100) — always use repartition for any partition change
C) filter() — remove data until you have fewer partitions

Q4: What's a Pair RDD and when would you use it?

A) An RDD with exactly two elements
B) An RDD where each element is a (key, value) pair. Enables operations like reduceByKey(), groupByKey(), and join().
C) Two RDDs linked together for backup purposes