Think of an RDD like a deck of cards split among friends. Each friend holds a few cards (partitions). If one friend drops their cards, the group remembers which cards were dealt and can recreate them.
If a partition is lost (machine crashes), Spark rebuilds it using the lineage graph — no data replication needed.
Data is split across multiple machines in the cluster. Each machine processes its chunk in parallel.
A collection of records — could be strings, numbers, objects. Similar to a list/array but spread across machines.
Once created, an RDD cannot be changed. You can only create a NEW RDD by transforming an existing one.
Parallelize an existing collection in your program. Takes a local list and distributes it across the cluster.
Read from HDFS, S3, local file system, HBase, Cassandra — any supported data source.
These two creation paths correspond to the two types of RDDs: Parallelized Collections (built from local data) and Hadoop Datasets (built from HDFS or other external sources).
This is one of the most frequently asked Spark interview questions. Everything you do with an RDD is either a Transformation or an Action.
| Aspect | Transformations | Actions |
|---|---|---|
| What it does | Creates a NEW RDD from an existing one | Returns a value to the Driver or writes to storage |
| Execution | ⏳ Lazy — nothing happens until an action | ⚡ Triggers actual computation |
| Returns | Another RDD | A value (number, list, etc.) or side effect |
| Examples | map() filter() flatMap() union() join() | collect() count() reduce() first() take(n) |
Imagine you're giving a cooking assistant a list of instructions: "dice onions, then sauté them, then add tomatoes." A lazy assistant writes down all the steps but doesn't start cooking until you say "Serve the dish!"
That's Spark. Lazy evaluation lets Spark optimize the entire pipeline before doing any work.
If you filter 90% of data away, Spark doesn't process those rows at all.
Multiple transformations are fused into one pass over the data.
Spark can combine operations and reduce disk reads.
Because Spark is lazy, it tracks how every RDD was created from other RDDs. This chain is called the lineage graph. If any partition is lost, Spark replays the lineage to rebuild ONLY that partition — no need to copy all data across machines like Hadoop does.
| Transformation | What it does | Example |
|---|---|---|
| `map(func)` | Apply `func` to each element, return a new RDD | `rdd.map(lambda x: x*2)` |
| `filter(func)` | Keep elements where `func` returns True | `rdd.filter(lambda x: x>10)` |
| `flatMap(func)` | Like map, but can return 0+ elements per input | `rdd.flatMap(lambda x: x.split())` |
| `union(other)` | Combine two RDDs | `rdd1.union(rdd2)` |
| `intersection(other)` | Elements common to both RDDs | `rdd1.intersection(rdd2)` |
| `reduceByKey(func)` | Group by key, apply reduce function | `pairs.reduceByKey(lambda a,b: a+b)` |
| `groupByKey()` | Group values by key | `pairs.groupByKey()` |
| `join(other)` | Inner join on keys | `rdd1.join(rdd2)` |
| `coalesce(n)` | Reduce partition count, avoiding a full shuffle | `rdd.coalesce(4)` |
| `repartition(n)` | Change partition count (always shuffles) | `rdd.repartition(10)` |
coalesce can only REDUCE the partition count (by default) and merges partitions without a full shuffle, so it's cheaper. repartition can increase OR decrease the count but always triggers a full shuffle. Rule of thumb: use coalesce when shrinking, repartition when growing.