If RDDs are like a bag of unmarked boxes — you know stuff is inside but not what — then DataFrames are like a neatly labeled spreadsheet. Every column has a name and a type.
A DataFrame is conceptually like a SQL table. It knows that column A is a string, column B is an integer, column C is a date — and uses that information to optimize queries automatically.
Data is organized into columns with names and types, not the opaque objects an RDD holds.
Spark's Catalyst optimizer auto-optimizes your queries — predicate pushdown, column pruning, join reordering.
Read from JSON, Parquet, CSV, Hive, JDBC, Avro, Elasticsearch, Cassandra — with one API.
Providing a schema (vs. inferring one) typically boosts read performance 2-3x, because Spark skips the extra pass over the data it would otherwise need to guess column types.
This comparison is asked in nearly every Spark interview. Memorize this table:
| Feature | RDD | DataFrame | Dataset |
|---|---|---|---|
| Abstraction | Low-level, unstructured | High-level, named columns | High-level + type-safe |
| Schema | No schema | Has schema (inferred or defined) | Has schema + compile-time types |
| Optimization | None — you optimize manually | Catalyst + Tungsten optimized | Catalyst + Tungsten optimized |
| Type Safety | Yes at compile time (Scala/Java) | No (analysis errors at runtime) | Yes (at compile time) |
| API | Functional (map, filter, reduce) | Declarative (SQL-like) | Both functional + declarative |
| Languages | Scala, Java, Python | Scala, Java, Python, R | Scala, Java only |
| Performance | Slowest | Faster | Faster (like DataFrame) |
| When to use | Legacy code, unstructured data | Most use cases, SQL queries | When you need type safety in Scala/Java |
RDD (Spark 1.0) → DataFrame (Spark 1.3) → Dataset (Spark 1.6). In Scala, DataFrame is simply an alias for Dataset[Row]; Python and R expose only the DataFrame API. For interviews, know that DataFrames are the default choice for most modern Spark work.
Think of Catalyst like a GPS for your query. You tell it WHERE you want to go (the result), and it figures out the FASTEST route (execution plan).
Analysis: resolves column names and data types against the catalog and checks that the query makes sense.
Logical optimization: applies rule-based rewrites such as filter pushdown, column pruning, and expression simplification.
Physical planning: generates multiple candidate execution plans and picks the cheapest one using cost estimation.
Code generation: Tungsten compiles the chosen plan into optimized JVM bytecode (whole-stage codegen) instead of interpreting it row by row.
When reading from a database via JDBC, Catalyst pushes WHERE clauses to the database itself — so the database filters data BEFORE sending it to Spark. Less data transferred = faster processing.
If CSV is like a handwritten notebook — readable but slow to search — then Parquet is like a filing cabinet with labeled folders. You can grab exactly what you need without opening everything.
Stores data by column, not row. Reading only the columns you need is blazing fast.
Built-in compression (Snappy, Gzip). Files are much smaller than CSV/JSON equivalents.
The file contains its own schema. No need for a separate schema definition.
Works with Spark, Hive, Presto, Athena, BigQuery, Pandas — the lingua franca of big data.
Spark SQL lets you write regular SQL queries against your Spark data. Behind the scenes, it uses the same Catalyst optimizer as DataFrame operations.
Parquet, JSON, CSV, ORC, Hive tables, JDBC/ODBC databases, Avro, Elasticsearch.
Connect to MySQL, PostgreSQL, Oracle, and more. Spark pushes predicates down to the database, so only matching rows cross the network.
Infer: Spark reads data and guesses types (convenient but slow). Define: You specify types (2-3x faster).
When using spark.read.json(), corrupt records appear in a column called _corrupt_record by default. Spark won't crash — it just sets aside the bad rows. Know this for interviews!