cache() vs persist()

Imagine you're a chef who keeps using the same chopped onions in 5 different dishes. Instead of chopping fresh onions each time, you prep them once and keep them on the counter. That's caching in Spark.

Both cache() and persist() keep a computed RDD/DataFrame around (in memory, and optionally on disk) so later actions don't recompute it. The difference?

cache()

Shorthand for persist(MEMORY_ONLY). Stores deserialized objects in JVM heap. Simple, fast, no options.

persist(level)

YOU choose the storage level. Can store on disk, serialize data, replicate across nodes. More flexible.

To remove cached data: call rdd.unpersist(). Spark also auto-evicts using LRU eviction.
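LRU (least-recently-used) eviction is easy to picture with a toy sketch in plain Python. This is illustrative only — Spark's actual memory manager is far more involved — but the eviction order is the same idea: when space runs out, the block touched longest ago goes first.

```python
# Toy LRU cache (illustrative, NOT Spark's memory manager)
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()   # insertion order doubles as recency order

    def get(self, key):
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)   # evict least recently used

cache = LRUCache(2)
cache.put("rdd1", "blocks1")
cache.put("rdd2", "blocks2")
cache.get("rdd1")               # rdd1 is now most recently used
cache.put("rdd3", "blocks3")    # over capacity -> evicts rdd2, not rdd1
```

After the last `put`, `rdd2` has been evicted while `rdd1` and `rdd3` survive — touching `rdd1` saved it.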

Persistence Levels

This table is interview gold. Know when to use each level:

| Level | Where Stored | Serialized? | When to Use |
|---|---|---|---|
| MEMORY_ONLY | JVM heap | No (fastest access) | Default. RDD fits in memory. Best performance. |
| MEMORY_ONLY_SER | JVM heap | Yes (saves space) | RDD barely fits. Trade CPU for less memory usage. |
| MEMORY_AND_DISK | Memory + disk | No | RDD doesn't fully fit in memory. Spill overflow to disk. |
| MEMORY_AND_DISK_SER | Memory + disk | Yes | Large RDD, tight memory. Serialize + spill. |
| DISK_ONLY | Disk only | Yes | Huge data, don't want recomputation, have disk space. |
🎯 Decision Flowchart

RDD fits in memory? → MEMORY_ONLY. Barely fits? → MEMORY_ONLY_SER. Doesn't fit? → MEMORY_AND_DISK. Need fault tolerance? → Add _2 suffix (e.g., MEMORY_ONLY_2) for replication to 2 nodes.
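The serialized-vs-deserialized trade-off can be felt with plain Python's pickle (Spark uses its own serializers, so the exact numbers are illustrative): a list of deserialized objects pays per-object overhead, while one serialized byte string is far more compact but costs CPU to decode.

```python
# Rough sketch of MEMORY_ONLY vs MEMORY_ONLY_SER (CPython sizes, illustrative)
import pickle
import sys

data = list(range(10_000))

# Deserialized: the container plus one full Python object per element
deserialized_bytes = sys.getsizeof(data) + sum(sys.getsizeof(x) for x in data)

# Serialized: a single compact byte string (cheap to store, CPU to decode)
serialized_bytes = len(pickle.dumps(data))

print(deserialized_bytes, serialized_bytes)  # serialized is much smaller
```

Same data, several times less space once serialized — exactly the trade `_SER` levels make.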

Shuffle — The Performance Killer

A shuffle is like rearranging seats at a wedding — everyone has to get up and find their new table. It's expensive because data moves across the network between machines.

🚨 Operations That Cause Shuffles

groupByKey, reduceByKey, join, cogroup, repartition, coalesce (with shuffle=true)

💡 How to Minimize Shuffles

Use reduceByKey instead of groupByKey. Use broadcast joins for small tables. Set spark.shuffle.compress=true.

🎯 reduceByKey vs groupByKey

groupByKey shuffles ALL data, THEN reduces. reduceByKey reduces locally on each partition FIRST, then shuffles only partial results. reduceByKey is almost always better — less data moves across the network.
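The difference is easy to count with plain Python (no Spark required). Below, `partitions` and `local_combine` are illustrative stand-ins for Spark's partitions and reduceByKey's map-side combine — we just count how many records would cross the network in each case.

```python
# Toy model: two partitions of (key, value) pairs
from collections import defaultdict

partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]

# groupByKey: EVERY record is shuffled across the network
group_shuffled = sum(len(p) for p in partitions)   # 6 records

# reduceByKey: combine locally first, then ship one record per key per partition
def local_combine(part):
    acc = defaultdict(int)
    for k, v in part:
        acc[k] += v
    return list(acc.items())

reduce_shuffled = sum(len(local_combine(p)) for p in partitions)   # 4 records

print(group_shuffled, reduce_shuffled)
```

Even on this tiny example reduceByKey ships fewer records; with millions of rows and few distinct keys, the gap becomes enormous.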

Broadcast Variables & Accumulators

Two special Spark tools for sharing data efficiently across your cluster:

📡 Broadcast Variable

A read-only variable sent to all nodes ONCE (not with every task). Perfect for lookup tables, configuration, or small reference datasets used in joins.

🔢 Accumulator

A write-only variable that workers can ADD to. Used for counters and sums. Only the Driver can read the final value.

```python
# 📡 Broadcast: share a lookup table
lookup = {"US": "United States", "UK": "United Kingdom"}   # 📚 create the lookup table
bcast = sc.broadcast(lookup)                               # sent to each worker ONCE
rdd.map(lambda x: bcast.value[x])                          # workers read bcast.value (no network cost)

# 🔢 Accumulator: count errors
err_count = sc.accumulator(0)                              # counter starting at 0

def process(row):
    if row.invalid:                                        # if a row is invalid...
        err_count.add(1)                                   # ...workers increment the counter

# Only the Driver reads the final value (err_count.value)
```

Memory Tuning Tips

Spark eats memory like a hungry guest at a buffet. Here's how to keep it under control:

📊 Measure First

Cache an RDD, then check the Storage page in Spark Web UI (port 4040) to see actual memory used.

📦 Use Arrays, Not Maps

Prefer arrays and primitive types over HashMap and LinkedList. Fastutil library gives Java-compatible collection classes for primitives.
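A quick CPython size check makes the point (shallow container sizes only, and CPython-specific — JVM numbers differ, but the direction is the same): a hash-table-backed dict costs more per entry than a flat list.

```python
# Shallow container overhead: dict's hash table vs a flat list (CPython-specific)
import sys

n = 1_000
as_dict = {i: i for i in range(n)}
as_list = list(range(n))

dict_bytes = sys.getsizeof(as_dict)   # hash table: buckets, hashes, padding
list_bytes = sys.getsizeof(as_list)   # flat array of pointers

print(dict_bytes, list_bytes)   # the dict is several times larger
```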

🔢 Numeric IDs > Strings

Use integer IDs instead of string keys. Strings are expensive (40+ bytes each).
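You can see the overhead directly in Python (sizes are CPython-specific; the ~40-byte overhead figure comes from the JVM side, but the pattern holds everywhere): a short string key dwarfs the equivalent integer.

```python
# String keys carry header + encoding overhead that plain ints don't (CPython sizes)
import sys

string_key = "user_000042"   # hypothetical ID formats, for illustration
int_key = 42

string_bytes = sys.getsizeof(string_key)   # ~60 bytes in CPython
int_bytes = sys.getsizeof(int_key)         # ~28 bytes in CPython

print(string_bytes, int_bytes)
```

Multiply that gap by billions of keys in a shuffle and the savings are real.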

🗜️ Compress Pointers

Set JVM flag -XX:+UseCompressedOops to make pointers 4 bytes instead of 8.
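One way to pass the flag is via `spark-submit` (a sketch — `my_app.py` is a placeholder; `spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions` are the standard conf keys for extra JVM flags). Note that modern JVMs typically enable compressed oops by default for heaps under 32 GB, so this mostly matters if you've disabled it or tuned flags manually.

```shell
spark-submit \
  --conf "spark.driver.extraJavaOptions=-XX:+UseCompressedOops" \
  --conf "spark.executor.extraJavaOptions=-XX:+UseCompressedOops" \
  my_app.py
```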

🎯 Partitioning Best Practice

More partitions = better parallelism (up to a point). Rule of thumb: 2-4 partitions per CPU core. For a 128MB HDFS block, Spark creates 1 partition. You can increase with sc.textFile("file", 20) to create 20 partitions. More partitions than blocks is often recommended for better performance.
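The rule of thumb is just arithmetic — here's a back-of-the-envelope sketch (the numbers are hypothetical, not an API):

```python
# Back-of-the-envelope partition count using the 2-4x-cores rule of thumb
import math

file_size_mb = 1280     # a hypothetical 1.25 GB input file
hdfs_block_mb = 128     # default HDFS block size
cores = 8               # total CPU cores in the cluster

default_partitions = math.ceil(file_size_mb / hdfs_block_mb)   # 10: one per block
target_partitions = max(default_partitions, cores * 3)         # 24: ~3 per core

print(default_partitions, target_partitions)
```

You'd then request the higher count up front, e.g. `sc.textFile("file", 24)`.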

Quiz: Performance

Q1: What's the difference between cache() and persist()?

A) cache() stores to disk, persist() stores to memory
B) cache() is shorthand for persist(MEMORY_ONLY). persist() lets you choose the storage level.
C) They are identical — there's no difference

Q2: Why is reduceByKey better than groupByKey in most cases?

A) reduceByKey combines values locally on each partition BEFORE shuffling, so less data crosses the network
B) reduceByKey uses less memory because it deletes the keys
C) groupByKey doesn't work with large datasets

Q3: You have a 100MB lookup table that every task needs. How should you share it?

A) Put it in a global variable
B) Use a Broadcast Variable — sent to each node once and cached
C) Write it to HDFS and read from each task

Q4: Your RDD doesn't fit in memory. Which persistence level is best?

A) MEMORY_ONLY — and just let Spark recompute overflowing partitions
B) MEMORY_AND_DISK — keeps what fits in memory, spills the rest to disk
C) DISK_ONLY — always safest