Screen 1 of 5

RDD vs DataFrame vs Dataset

Think of these as three generations of data shipping containers. Each newer version is smarter and more optimized than the last.

| Feature       | RDD                      | DataFrame             | Dataset               |
|---------------|--------------------------|-----------------------|-----------------------|
| Generation    | 1st (original)           | 2nd                   | 3rd (best of both)    |
| Type safety   | ✅ At compile time       | ❌ Runtime only       | ✅ At compile time    |
| Optimization  | ❌ Manual                | ✅ Catalyst optimizer | ✅ Catalyst optimizer |
| Schema        | No schema                | Has schema            | Has schema + types    |
| API style     | Functional (map, filter) | SQL-like              | Both!                 |
| Best for      | Low-level control        | SQL queries, ETL      | Type-safe pipelines   |
🎯 Interview Gold
"When should you use RDDs?" → Almost never in modern Spark. Use DataFrames for untyped operations (SQL-like) or Datasets for type-safe pipelines. RDDs lack Catalyst optimization and are significantly slower.
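To make the three APIs concrete, here is a minimal sketch of the same aggregation written three ways. This assumes a local Spark setup; the `Sale` case class and the data are illustrative, not from the lesson.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("ApiComparison").master("local[*]").getOrCreate()
import spark.implicits._ // encoders for toDF/toDS

case class Sale(product: String, amount: Double)
val sales = Seq(Sale("book", 12.0), Sale("pen", 2.5), Sale("book", 8.0))

// 1) RDD: functional and type-safe, but no Catalyst optimization
val rddTotals = spark.sparkContext.parallelize(sales)
  .map(s => (s.product, s.amount))
  .reduceByKey(_ + _)

// 2) DataFrame: SQL-like; column names are checked only at runtime
val dfTotals = sales.toDF()
  .groupBy("product")
  .sum("amount")

// 3) Dataset: compile-time types AND Catalyst optimization
val dsTotals = sales.toDS()
  .groupByKey(_.product)
  .mapValues(_.amount)
  .reduceGroups(_ + _)

spark.stop()
```

Note how a typo like `groupBy("prodcut")` in the DataFrame version only fails when the job runs, while `_.prodcut` in the Dataset version fails at compile time.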
Screen 2 of 5

Writing a Spark Job in Scala

Think of a Spark job like running a factory operation: start the factory (SparkSession), load raw materials (data), process them (transformations), and ship the product (actions).

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("WordCount")
  .getOrCreate()
import spark.implicits._ // encoders needed for flatMap/groupByKey

val data = spark.read.textFile("data.txt") // Dataset[String], one element per line
val words = data
  .flatMap(_.split(" "))
  .groupByKey(identity)
  .count()

words.show()
spark.stop()
SparkSession is the entry point — like opening the factory doors.
.appName names your job (visible in Spark UI).
Read the text file as a Dataset[String] — one element per line.
Split each line into words, group identical words, count them.
.show() is an ACTION — this triggers actual execution!
Clean up resources when done.
Screen 3 of 5

Lazy Evaluation & Transformations vs Actions

Spark is like a brilliant but lazy architect. It draws up the entire blueprint (DAG) first, then finds the most efficient way to build — only when you say "GO!"

.read.text() → .filter(...) → .map(...) → .groupBy(...) → .count() 🚀

📝 Transformations (lazy)

map, filter, flatMap, groupBy, join, select — these build the plan but do NOTHING yet. Like writing a recipe.

🚀 Actions (trigger execution)

count, show, collect, save, first — these say "GO!" Spark optimizes the plan then executes. Like cooking the recipe.

💡 Why Lazy?
Spark can optimize the ENTIRE pipeline before running: it pushes filters earlier, combines operations, and minimizes data shuffling. An eager system can't do this because it runs each step immediately.
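You can watch laziness in action with `explain()`, which prints the plan Catalyst has built so far without running anything. A minimal sketch, assuming a local Spark setup and an illustrative file path:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.appName("LazyDemo").master("local[*]").getOrCreate()

val plan = spark.read.text("data.txt")    // transformation: nothing read yet
  .filter(col("value").contains("ERROR")) // transformation: the plan grows

plan.explain() // prints the optimized plan -- still no execution
val n = plan.count() // the action: only NOW does Spark read and compute

spark.stop()
```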
Screen 4 of 5

Optimizing Spark Jobs — Top Interview Questions

Data engineering interviews LOVE asking about Spark optimization. Here are the top strategies:

📊 Use DataFrames

The Catalyst optimizer can make DataFrame code 10-100x faster than the equivalent RDD code. Prefer DataFrames (or Datasets) unless you genuinely need low-level control.

📡 Broadcast Joins

When joining a small table with a large one, broadcast the small table to all nodes. Avoids expensive shuffles.
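A minimal sketch of the hint, assuming a local Spark setup; the tiny in-memory tables stand in for the real large/small sides:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder.appName("BroadcastJoin").master("local[*]").getOrCreate()
import spark.implicits._

val orders    = Seq((1, "US", 100.0), (2, "DE", 50.0)).toDF("id", "country", "amount") // stand-in for the large table
val countries = Seq(("US", "United States"), ("DE", "Germany")).toDF("country", "name") // small lookup

// broadcast() hints Spark to copy the small table to every executor,
// so the large table is joined in place -- no shuffle of the big side.
val joined = orders.join(broadcast(countries), Seq("country"))

joined.show()
spark.stop()
```

Spark also auto-broadcasts tables below `spark.sql.autoBroadcastJoinThreshold` (10MB by default), but an explicit hint makes the intent clear when you know the table is small.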

📦 Caching

Use .cache() or .persist() for DataFrames you read multiple times. Avoids re-computation.
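A sketch of the pattern, assuming a local Spark setup and an illustrative log file:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.appName("CacheDemo").master("local[*]").getOrCreate()

val errors = spark.read.text("logs.txt")
  .filter(col("value").contains("ERROR"))

errors.cache()                 // default storage level for DataFrames: MEMORY_AND_DISK
val total  = errors.count()    // 1st action: reads the file and fills the cache
val sample = errors.take(5)    // 2nd action: served from the cache, no re-read
errors.unpersist()             // release the memory when done

spark.stop()
```

Without the `.cache()`, both actions would re-read and re-filter the file from scratch.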

🔧 Partitioning

Right number of partitions = right parallelism. Too few = underutilized cluster. Too many = overhead.
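The main knobs look like this; a sketch assuming a local Spark setup, with an illustrative DataFrame:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.appName("Partitions").master("local[*]").getOrCreate()
import spark.implicits._

val df = (1 to 1000).map(i => (i % 10, i)).toDF("user_id", "value")

val byKey = df.repartition(200, col("user_id")) // full shuffle: 200 partitions, same keys co-located
val fewer = byKey.coalesce(50)                  // cheaply merge down to 50, no full shuffle

// Parallelism for shuffles produced by groupBy/join (200 is Spark's default):
spark.conf.set("spark.sql.shuffle.partitions", "200")

spark.stop()
```

`coalesce` only ever reduces the partition count; use `repartition` when you need to increase it or redistribute by key.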

🚫 Avoid Shuffles

Operations like groupBy and join shuffle data across the network. Minimize them or use pre-partitioned data.

⬇️ Filter Early

Push filters as early as possible in the pipeline. Less data flowing through = faster everything.
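A before/after sketch, assuming a local Spark setup; the column names are illustrative:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.appName("FilterEarly").master("local[*]").getOrCreate()
import spark.implicits._

val events = Seq((1, "click"), (2, "view")).toDF("user_id", "action")
val users  = Seq((1, "US"), (2, "DE")).toDF("user_id", "country")

// Filter AFTER the join: every row gets shuffled, most are thrown away later.
val late  = events.join(users, Seq("user_id")).filter(col("country") === "US")

// Filter BEFORE the join: only US users ever reach the shuffle.
val early = events.join(users.filter(col("country") === "US"), Seq("user_id"))

spark.stop()
```

In simple cases like this Catalyst will push the filter down for you, but it can't see through UDFs or opaque expressions, so filtering early by hand remains a good habit.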

🎯 Common Interview Q
"How do you handle data skew?" → Salt the join key (add random prefix to distribute evenly), use adaptive query execution (AQE in Spark 3+), or isolate skewed keys and process them separately.
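Key salting is easiest to see in code. A sketch assuming a local Spark setup; `SALT` and the tiny tables are illustrative (the "hot" key stands in for a skewed one):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, concat_ws, explode, lit, rand}

val spark = SparkSession.builder.appName("SaltedJoin").master("local[*]").getOrCreate()
import spark.implicits._

val SALT  = 8
val big   = Seq(("hot", 1), ("hot", 2), ("cold", 3)).toDF("key", "v") // skewed side
val small = Seq(("hot", "a"), ("cold", "b")).toDF("key", "w")

// Skewed side: scatter each hot key across SALT buckets at random.
val saltedBig = big.withColumn("skey",
  concat_ws("_", col("key"), (rand() * SALT).cast("int")))

// Other side: replicate every row once per bucket so all matches survive.
val saltedSmall = small
  .withColumn("salt", explode(array((0 until SALT).map(lit): _*)))
  .withColumn("skey", concat_ws("_", col("key"), col("salt")))

val joined = saltedBig.join(saltedSmall, "skey") // skew now spread over 8 partitions per key

spark.stop()
```

In Spark 3+, enabling `spark.sql.adaptive.enabled` and `spark.sql.adaptive.skewJoin.enabled` lets AQE split skewed partitions for you without manual salting.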
Screen 5 of 5

Test Yourself 🧠

Q1: You write df.filter(...).map(...).groupBy(...) in Spark. How much computation has happened?
✅ None — these are all transformations (lazy). No action has triggered execution.
All three operations have run
Only filter has run
It depends on the data size
Q2: You're joining a 1TB table with a 50MB lookup table. What optimization should you use?
Repartition both tables to the same number of partitions
✅ Broadcast join — send the 50MB table to all nodes to avoid shuffling the 1TB table
Convert both to RDDs first
Use collect() to bring both tables to the driver
Q3: Why are Datasets preferred over RDDs in modern Spark?
Datasets are easier to write
Datasets use less memory
✅ Datasets have compile-time type safety AND benefit from Spark's Catalyst query optimizer
Datasets can only be used with Scala, not Python