From RDDs to DataFrames

If RDDs are like a bag of unmarked boxes — you know stuff is inside but not what — then DataFrames are like a neatly labeled spreadsheet. Every column has a name and a type.

A DataFrame is conceptually like a SQL table. It knows that column A is a string, column B is an integer, column C is a date — and uses that information to optimize queries automatically.

📊 Named Columns

Data is organized into columns with names and types, unlike the opaque objects an RDD holds.

⚡ Catalyst Optimized

Spark's Catalyst optimizer auto-optimizes your queries — predicate pushdown, column pruning, join reordering.

🌐 Multi-Source

Read from JSON, Parquet, CSV, Hive, JDBC, Avro, Elasticsearch, Cassandra — with one API.

🚀 2-3x Faster with a Schema

Providing a schema explicitly (instead of letting Spark infer one) boosts read performance 2-3x, because Spark skips the extra scan over the data it would need to guess the types.

RDD vs DataFrame vs Dataset

This comparison is asked in nearly every Spark interview. Memorize this table:

| Feature | RDD | DataFrame | Dataset |
|---|---|---|---|
| Abstraction | Low-level, unstructured | High-level, named columns | High-level + type-safe |
| Schema | No schema | Has schema (inferred or defined) | Has schema + compile-time types |
| Optimization | None; you optimize manually | Catalyst + Tungsten optimized | Catalyst + Tungsten optimized |
| Type Safety | Yes (at compile time) | No (runtime errors only) | Yes (at compile time) |
| API | Functional (map, filter, reduce) | Declarative (SQL-like) | Both functional + declarative |
| Languages | Scala, Java, Python | Scala, Java, Python, R | Scala, Java only |
| Performance | Slowest | Faster | Faster (like DataFrame) |
| When to use | Legacy code, unstructured data | Most use cases, SQL queries | When you need type safety in Scala/Java |
💡 The Evolution

RDD (Spark 1.0) → DataFrame (Spark 1.3) → Dataset (Spark 1.6). In Scala, DataFrame is just a type alias for Dataset[Row]; Python and R expose only the DataFrame API. For interviews, know that DataFrames are the default choice for most modern Spark work.

The Catalyst Optimizer

Think of Catalyst like a GPS for your query. You tell it WHERE you want to go (the result), and it figures out the FASTEST route (execution plan).

1️⃣ Analysis

Resolves column names and types. Checks that your query makes sense.

2️⃣ Logical Optimization

Applies rules: push filters down, prune unused columns, simplify expressions.

3️⃣ Physical Planning

Generates multiple execution plans and picks the most efficient one using cost estimation.

4️⃣ Code Generation

Tungsten generates optimized Java bytecode for the chosen plan, which runs directly on the CPU without interpreter overhead.

🎯 Predicate Pushdown

When reading from a database via JDBC, Catalyst pushes WHERE clauses to the database itself — so the database filters data BEFORE sending it to Spark. Less data transferred = faster processing.
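A sketch of what that looks like in code. The URL, table name, and credentials below are placeholders, and this won't run without a reachable database:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-pushdown-demo").getOrCreate()

# All connection details here are hypothetical
jdbc_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/shop")
    .option("dbtable", "orders")
    .option("user", "reader")
    .option("password", "secret")
    .load()
)

# Thanks to predicate pushdown, this filter is translated into SQL and
# executed inside the database; only matching rows cross the network.
filtered = jdbc_df.filter("amount > 100")
filtered.explain()  # look for PushedFilters in the scan node of the plan
```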

Parquet — The Favorite File Format

If CSV is like a handwritten notebook — readable but slow to search — then Parquet is like a filing cabinet with labeled folders. You can grab exactly what you need without opening everything.

📦 Columnar Format

Stores data by column, not row. Reading only the columns you need is blazing fast.

🗜️ Compressed

Built-in compression (Snappy, Gzip). Files are much smaller than CSV/JSON equivalents.

📋 Schema Embedded

The file contains its own schema. No need for a separate schema definition.

🌐 Universal

Works with Spark, Hive, Presto, Athena, BigQuery, Pandas — the lingua franca of big data.

```python
# 📂 Read a Parquet file from S3 into a DataFrame
df = spark.read.parquet("s3://data/orders.parquet")

# 💾 Write DataFrame back as Parquet
df.write.parquet("s3://out/results.parquet")

# 🏷️ Register the DataFrame as a temp SQL table
df.createOrReplaceTempView("orders")

# 🔍 Now you can query it with plain SQL!
# Get total amount per product where amount > 100
result = spark.sql("""
    SELECT product, SUM(amount)
    FROM orders
    WHERE amount > 100
    GROUP BY product
""")
```

Spark SQL Deep Dive

Spark SQL lets you write regular SQL queries against your Spark data. Behind the scenes, it uses the same Catalyst optimizer as DataFrame operations.

📊 Data Sources

Parquet, JSON, CSV, ORC, Hive tables, JDBC/ODBC databases, Avro, Elasticsearch.

🔗 JDBC

Connect to MySQL, PostgreSQL, Oracle. Spark pushes predicates to the database, so filtering happens at the source and far less data crosses the network.

🗂️ Schema Options

Infer: Spark reads data and guesses types (convenient but slow). Define: You specify types (2-3x faster).

💡 Corrupt Records in JSON

When using spark.read.json(), corrupt records appear in a column called _corrupt_record by default. Spark won't crash — it just sets aside the bad rows. Know this for interviews!

Quiz: DataFrames & SQL

Q1: Why are DataFrames faster than RDDs for structured data?

A) DataFrames use more memory for caching
B) DataFrames are optimized by the Catalyst optimizer and Tungsten engine, which auto-optimize query plans, push down predicates, and generate efficient bytecode
C) DataFrames only work with small data so they're naturally faster

Q2: What's the advantage of defining your own schema instead of letting Spark infer it?

A) 2-3x faster performance (no need to scan all data), custom types, and only parsing needed fields
B) It makes the data more secure
C) Spark requires a schema — inference doesn't actually work

Q3: Which file format stores data in columns (not rows), includes its schema, and supports compression?

A) CSV
B) JSON
C) Parquet
D) XML

Q4: You want compile-time type safety in Scala with all the performance benefits of DataFrames. What should you use?

A) RDD — always type-safe
B) DataFrame — has type checks at runtime
C) Dataset — combines type safety with Catalyst/Tungsten optimizations