Imagine you're a chef who keeps using the same chopped onions in 5 different dishes. Instead of chopping fresh onions each time, you prep them once and keep them on the counter. That's caching in Spark.
Both cache() and persist() save an RDD/DataFrame in memory so it doesn't need to be recomputed. The difference?

**cache()** — Shorthand for persist(MEMORY_ONLY) on RDDs (for DataFrames/Datasets the default is MEMORY_AND_DISK). Stores deserialized objects in the JVM heap. Simple, fast, no options.

**persist()** — YOU choose the storage level. Can store on disk, serialize data, replicate across nodes. More flexible.
To remove cached data, call rdd.unpersist(). If memory fills up, Spark also evicts cached partitions automatically in LRU (least recently used) order.
This table is interview gold. Know when to use each level:
| Level | Where Stored | Serialized? | When to Use |
|---|---|---|---|
| MEMORY_ONLY | JVM heap | No (fastest access) | Default. RDD fits in memory. Best performance. |
| MEMORY_ONLY_SER | JVM heap | Yes (saves space) | RDD barely fits. Trade CPU for less memory usage. |
| MEMORY_AND_DISK | Memory + Disk | No | RDD doesn't fully fit in memory. Spill overflow to disk. |
| MEMORY_AND_DISK_SER | Memory + Disk | Yes | Large RDD, tight memory. Serialize + spill. |
| DISK_ONLY | Disk only | Yes | Huge data, don't want recomputation, have disk space. |
RDD fits in memory? → MEMORY_ONLY. Barely fits? → MEMORY_ONLY_SER. Doesn't fit? → MEMORY_AND_DISK. Need fault tolerance? → Add _2 suffix (e.g., MEMORY_ONLY_2) for replication to 2 nodes.
A shuffle is like rearranging seats at a wedding — everyone has to get up and find their new table. It's expensive because data moves across the network between machines.
Operations that trigger a shuffle: groupByKey, reduceByKey, join, cogroup, repartition, and coalesce (with shuffle=true).
Use reduceByKey instead of groupByKey. Use broadcast joins for small tables. Keep spark.shuffle.compress=true (it's the default).
groupByKey shuffles ALL data, THEN reduces. reduceByKey reduces locally on each partition FIRST, then shuffles only partial results. reduceByKey is almost always better — less data moves across the network.
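The difference is easy to see with a toy simulation in plain Python (no Spark involved): count how many records each strategy would send across the network for a small word count.

```python
from collections import defaultdict

# Two partitions of (word, 1) pairs, as a word-count job would produce.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("a", 1), ("b", 1)],
]

# groupByKey-style: every record is shuffled as-is.
group_by_key_shuffled = sum(len(p) for p in partitions)   # 7 records

# reduceByKey-style: combine locally first, then shuffle
# only one record per key per partition.
reduce_by_key_shuffled = 0
for part in partitions:
    local = defaultdict(int)
    for key, value in part:
        local[key] += value                # map-side combine
    reduce_by_key_shuffled += len(local)   # 2 records per partition here

print(group_by_key_shuffled, reduce_by_key_shuffled)  # 7 vs 4
```

With millions of rows and a handful of distinct keys, that gap becomes enormous.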
Two special Spark tools for sharing data efficiently across your cluster:

**Broadcast variable** — A read-only variable sent to all nodes ONCE (not with every task). Perfect for lookup tables, configuration, or small reference datasets used in joins.

**Accumulator** — A write-only variable (from the workers' point of view) that tasks can ADD to. Used for counters and sums. Only the driver can read the final value.
Spark eats memory like a buffet. Here's how to keep it under control:
Cache an RDD, then check the Storage page in Spark Web UI (port 4040) to see actual memory used.
Prefer arrays and primitive types over HashMap and LinkedList. Fastutil library gives Java-compatible collection classes for primitives.
Use integer IDs instead of string keys. Strings are expensive (40+ bytes each).
Set the JVM flag -XX:+UseCompressedOops to make object pointers 4 bytes instead of 8 (only works for heaps under roughly 32GB; modern JVMs enable it by default).
More partitions = better parallelism (up to a point). Rule of thumb: 2-4 partitions per CPU core. For a 128MB HDFS block, Spark creates 1 partition by default. You can request more with sc.textFile("file", 20) (the second argument is a minimum partition count). Having more partitions than blocks is often recommended for better performance.
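The rule of thumb is just arithmetic. A quick sanity check in plain Python (the file size and core count below are illustrative):

```python
import math

BLOCK_SIZE_MB = 128    # default HDFS block size

def default_partitions(file_size_mb):
    """One partition per HDFS block, as Spark does by default."""
    return max(1, math.ceil(file_size_mb / BLOCK_SIZE_MB))

def suggested_partitions(total_cores, factor=3):
    """2-4 partitions per core; 3 is used here as a middle value."""
    return total_cores * factor

print(default_partitions(1024))   # 1GB file -> 8 partitions
print(suggested_partitions(16))   # 16-core cluster -> 48 partitions
```

So a 1GB file read on a 16-core cluster starts with only 8 tasks; repartitioning toward the 32-48 range keeps every core busy.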