Think of these as three generations of data shipping containers. Each newer version is smarter and more optimized than the last.
| Feature | RDD | DataFrame | Dataset |
|---|---|---|---|
| Generation | 1st (original) | 2nd | 3rd (best of both) |
| Type safety | ✅ At compile time | ❌ Runtime only | ✅ At compile time |
| Optimization | ❌ Manual | ✅ Catalyst optimizer | ✅ Catalyst optimizer |
| Schema | No schema | Has schema | Has schema + types |
| API style | Functional (map, filter) | SQL-like | Both! |
| Best for | Low-level control | SQL queries, ETL | Type-safe pipelines |
Think of a Spark job like running a factory operation: start the factory (SparkSession), load raw materials (data), process them (transformations), and ship the product (actions).
`.appName` names your job (visible in the Spark UI). `.show()` is an ACTION: it triggers actual execution! Spark is like a brilliant but lazy architect. It draws up the entire blueprint (the DAG) first, then finds the most efficient way to build, and only starts when you say "GO!"
**Transformations:** map, filter, flatMap, groupBy, join, select. These build the plan but do NOTHING yet. Like writing a recipe.
**Actions:** count, show, collect, save, first. These say "GO!" Spark optimizes the plan, then executes. Like cooking the recipe.
Data engineering interviews LOVE asking about Spark optimization. Here are the top strategies:
**Prefer DataFrames over RDDs.** The Catalyst optimizer can make DataFrame operations 10-100x faster than equivalent RDD code for the same logic. Always prefer DataFrames unless you truly need low-level control.
**Broadcast joins.** When joining a small table with a large one, broadcast the small table to every node. This avoids an expensive shuffle of the large table.
**Caching.** Use .cache() or .persist() on DataFrames you read multiple times. This avoids recomputing the same lineage for every action.
**Partition tuning.** The right number of partitions gives the right parallelism: too few means an underutilized cluster; too many means scheduling overhead.
**Minimize shuffles.** Operations like groupBy and join shuffle data across the network, which is one of the most expensive things Spark does. Minimize them, or use pre-partitioned data.
**Filter early.** Push filters as early as possible in the pipeline. Less data flowing through means everything downstream is faster.
Quick check: you write df.filter(...).map(...).groupBy(...) in Spark. How much computation has happened? None. All three are transformations, so Spark has only built the plan; nothing executes until you call an action like count() or show().