πŸ”₯ The Tricky Questions

These are the questions that trip up the majority of candidates. Read each question, try to answer it yourself, then check:

TRICKY Do you need HDFS to run Spark?

No! This is a trick question. Spark can read from and write to many storage systems: HDFS, S3, the local filesystem, OpenStack Swift, MapR FS, and data stores like Cassandra and HBase. HDFS is just ONE option. Spark is a processing engine, not a storage system.

TRICKY Do you need to install Spark on all YARN nodes?

No. Spark runs ON TOP of YARN. YARN manages resources and launches Spark executors. The Spark jars are shipped automatically to worker nodes. You don't install Spark on each node separately.

TRICKY Is Spark always better than MapReduce?

No! MapReduce is better when: (1) budget is tight (less memory needed), (2) straightforward batch ETL with no iteration, (3) data is too large for in-memory processing, (4) you already have a mature Hadoop ecosystem. Never say "Spark is always better" in an interview.

HOT What are common developer mistakes with Spark?

β€’ Running everything on a single local node instead of distributing across the cluster
β€’ Hitting web services repeatedly from multiple cluster nodes
β€’ Using groupByKey when reduceByKey would be more efficient
β€’ Not caching RDDs that are used multiple times
β€’ Having too many small files (Spark struggles with many small gzipped files)
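The groupByKey vs. reduceByKey mistake above comes down to shuffle volume. Here is a pure-Python sketch of the semantics (not the Spark API): reduceByKey combines values locally on each partition before the shuffle (a map-side combine), so far fewer records cross the network. The partition data is hypothetical.

```python
from collections import defaultdict

# Hypothetical word-count records spread over two "partitions".
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("a", 1), ("b", 1)],
]

# groupByKey-style: EVERY record crosses the shuffle before being summed.
shuffled_records = sum(len(p) for p in partitions)

# reduceByKey-style: combine locally first (map-side combine), so at most
# one record per key per partition crosses the shuffle.
def local_combine(part):
    acc = defaultdict(int)
    for k, v in part:
        acc[k] += v
    return list(acc.items())

combined = [local_combine(p) for p in partitions]
reduced_shuffle = sum(len(p) for p in combined)

# Final merge on the "reduce side" — the totals are identical either way.
totals = defaultdict(int)
for part in combined:
    for k, v in part:
        totals[k] += v
```

With this toy data, groupByKey-style shuffling moves 7 records while the combined version moves only 4 — the gap grows with the number of duplicate keys per partition.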

SENIOR What is speculative execution?

When a task runs significantly slower than its peers (a "straggler"), Spark launches a duplicate copy on another node. Whichever finishes first wins. Enable with spark.speculation=true. Useful for avoiding bottlenecks from slow hardware.
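Speculation is tuned with a few related properties. A minimal sketch for spark-defaults.conf — the values shown are illustrative, not recommendations:

```properties
# spark-defaults.conf — illustrative values, tune for your cluster
spark.speculation            true
spark.speculation.multiplier 1.5    # a task is a straggler if 1.5x slower than the median
spark.speculation.quantile   0.75   # wait until 75% of tasks finish before checking
```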

SENIOR How does Spark achieve fault tolerance without replication?

Through RDD lineage. Unlike Hadoop which replicates data 3x, Spark tracks how each RDD was derived from others. If a partition is lost, Spark replays the transformations to rebuild ONLY that partition. This is cheaper than replication.
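The lineage idea can be sketched in plain Python (a toy model, not the Spark API): a lost partition is rebuilt by replaying the recorded transformations from the source data, rather than being restored from a replica. The data and transformations here are hypothetical.

```python
# Toy model of lineage-based recovery: a "partition" is rebuilt by replaying
# recorded transformations from its source, not restored from a replica.
source = {0: [1, 2, 3], 1: [4, 5, 6]}          # base data: partition id -> records
lineage = [lambda x: x * 2, lambda x: x + 1]   # recorded transformations, in order

def rebuild(partition_id):
    records = source[partition_id]
    for fn in lineage:
        records = [fn(r) for r in records]
    return records

cached = {0: rebuild(0), 1: rebuild(1)}
del cached[1]                # simulate losing one partition
cached[1] = rebuild(1)       # replay the lineage for ONLY that partition
```

Note the key property: partition 0 is never touched during recovery — only the lost partition pays the recomputation cost.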

HOT How is DSM (Distributed Shared Memory) different from RDD?

RDD: Coarse-grained reads/writes, immutable, fault recovery via lineage, stragglers mitigated with backup tasks, overflows to disk.
DSM: Fine-grained reads/writes, mutable, fault recovery via checkpointing, hard to mitigate stragglers, performance degrades when RAM is full.

⚑ Rapid Fire Q&A

Quick answers for quick questions. Memorize these:

What is SchemaRDD?
An RDD of Row objects plus schema info about column types. Predecessor to DataFrames; renamed to DataFrame in Spark 1.3.
Default executor memory?
1 GB (spark.executor.memory)
What file format for analytics?
Parquet β€” columnar, compressed, schema-embedded.
How to remove cached data?
rdd.unpersist() or automatic LRU eviction.
What is Tachyon?
Memory-centric storage layer for reliable file sharing at memory speed across cluster frameworks. Now known as Alluxio.
What is BlinkDB?
Query engine for approximate SQL queries on huge data with error bars.
Spark security options?
SSL/TLS encryption, shared-secret authentication (spark.authenticate), and access controls on the UI and event logs.
What port for Spark Web UI?
Port 4040 by default.
Optimal cores per executor?
5 cores (rule of thumb).
Types of partitioners?
HashPartitioner, RangePartitioner, Custom Partitioner.
Dense vs Sparse Vector?
Dense: array of all values. Sparse: two arrays (indices + non-zero values). Use sparse when most values are zero.
Supported data formats?
Parquet, JSON, CSV, Avro, ORC, XML, TSV, and RCFile.
How to achieve HA?
Single-node recovery with the local file system, or standby Masters coordinated via ZooKeeper.
What is pipe() in Spark?
Passes each RDD partition through a UNIX shell command for processing.
subtractByKey() does what?
Removes elements whose key exists in another RDD.
How to trigger metadata cleanup?
Set spark.cleaner.ttl or split long jobs into batches writing intermediate results to disk.
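The dense vs. sparse vector answer above is easy to demonstrate. A pure-Python sketch of the representation (mirroring the shape of MLlib's sparse format — size plus parallel index/value arrays — but not using the MLlib API):

```python
# Dense vs. sparse representation of the same 8-element vector.
dense = [0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 7.0, 0.0]

# Sparse form: the size plus parallel arrays of non-zero indices and values.
size = len(dense)
indices = [i for i, v in enumerate(dense) if v != 0.0]
values = [dense[i] for i in indices]

def sparse_get(i):
    # indices absent from the sparse arrays are implicitly zero
    return values[indices.index(i)] if i in indices else 0.0
```

Here the dense form stores 8 floats while the sparse form stores 2 indices and 2 values — the savings grow as the fraction of zeros grows, which is why sparse is the right choice for mostly-zero feature vectors.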

🎯 Scenario Quiz

S1: Your Spark job is slow. You notice excessive shuffle operations. What's your FIRST optimization step?

A) Add more machines to the cluster
B) Replace groupByKey with reduceByKey, use broadcast joins for small tables, and enable spark.shuffle.compress
C) Switch to MapReduce
D) Increase executor memory

S2: You're joining a 500GB dataset with a 50MB lookup table. How do you optimize this?

A) Use a regular join β€” Spark will handle it
B) Broadcast the 50MB table with sc.broadcast() β€” each executor gets a copy, avoiding a massive shuffle
C) Split the 500GB dataset into smaller files first
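The broadcast-join idea in option B can be sketched in plain Python (a simulation of the mechanism, not the Spark API): the small table is copied to every "executor" as an ordinary dict — which is what sc.broadcast() would ship — so each partition of the big dataset is joined locally with no shuffle. The tables here are hypothetical.

```python
# Map-side join sketch: the small lookup table is available on every
# "executor", so each big-table partition joins locally, with no shuffle.
lookup = {"US": "United States", "DE": "Germany"}   # the small (50MB-class) table

big_partitions = [
    [("US", 100), ("DE", 200)],
    [("US", 300), ("FR", 400)],
]

def join_partition(part, small):
    # inner join: rows whose key is missing from the lookup are dropped
    return [(k, v, small[k]) for k, v in part if k in small]

joined = [row for part in big_partitions
          for row in join_partition(part, lookup)]
```

In real Spark, DataFrame joins below spark.sql.autoBroadcastJoinThreshold (10MB by default) are broadcast automatically; above that you can hint with broadcast() from pyspark.sql.functions.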

S3: Your streaming job crashes and you lose in-flight data. What should you have set up?

A) More executor memory
B) Faster network between nodes
C) Write-Ahead Logging (WAL) + checkpointing β€” WAL persists received data before processing, checkpointing saves state

S4: You have 10 nodes, 15 cores each, 61GB RAM each. How many executors should you configure?

A) 10 executors (1 per node)
B) 150 executors (1 per core)
C) 30 executors (5 cores per executor Γ— 3 executors per node Γ— 10 nodes)
D) 15 executors (15 cores / 1 per executor per node)
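The arithmetic behind answer C, worked through in a short Python sketch. The ~7% overhead factor is an assumption standing in for YARN/off-heap memory overhead; in practice you would also often reserve a core and a gigabyte per node for the OS and daemons.

```python
# Worked sizing for S4, using the ~5-cores-per-executor rule of thumb
# (good HDFS throughput without excessive GC pauses).
nodes, cores_per_node, ram_per_node_gb = 10, 15, 61

cores_per_executor = 5
executors_per_node = cores_per_node // cores_per_executor   # 3 per node
total_executors = executors_per_node * nodes                # 30 in total

# Split node RAM across its executors, then leave headroom for
# memory overhead (~7% assumed here) before setting spark.executor.memory.
mem_per_executor_gb = ram_per_node_gb // executors_per_node
mem_to_request_gb = int(mem_per_executor_gb * 0.93)
```

That yields 30 executors with roughly 18-19 GB each to request — matching option C.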

S5: Your RDD has a long lineage chain (50+ transformations). Recovery after failure takes forever. What do you do?

A) Add data checkpoints at intermediate steps to break the lineage chain and speed up recovery
B) Increase the number of partitions
C) Use DISK_ONLY persistence

S6: You need to count errors across all executors during processing. What Spark feature should you use?

A) A broadcast variable
B) An accumulator β€” a write-only variable that workers can add to, and only the Driver reads the final value
C) A global Python variable
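The accumulator contract in option B — workers only ever add, and only the driver reads the merged total — can be modeled in plain Python (a toy simulation, not the Spark API; the record data is hypothetical):

```python
# Toy accumulator: each "task" adds to a local count, and the driver merges
# the per-task results. Mirrors Spark's write-only-on-workers /
# read-only-on-driver contract for accumulators.
partitions = [
    ["ok", "error", "ok"],
    ["error", "error", "ok"],
]

def process_partition(part):
    local_errors = 0
    for record in part:
        if record == "error":
            local_errors += 1   # workers only ever ADD
    return local_errors

# Driver side: merge the task results into the final value.
error_count = sum(process_partition(p) for p in partitions)
```

This also shows why option C fails: a global Python variable updated inside a task would only change a copy living in that worker process, never the driver's value.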

πŸ† You Made It!

You've covered the most-asked Apache Spark interview topics:

βœ… Module 1
What Spark is, key features, Spark vs Hadoop
βœ… Module 2
Driver, Executors, DAG, deployment modes
βœ… Module 3
RDDs, transformations, actions, lazy evaluation
βœ… Module 4
DataFrames, Datasets, Catalyst, Parquet
βœ… Module 5
Caching, persistence, shuffles, broadcast
βœ… Module 6
Streaming, DStreams, MLlib, GraphX
βœ… Module 7
Tricky questions, scenarios, rapid fire
βœ… Module 8
Spark UI navigation, query plans, task metrics, skew detection
βœ… Module 9
Bottleneck triage, join optimization, AQE, decision tables
πŸš€ Final Tips for the Interview

β€’ Always explain WHY, not just WHAT. "Spark uses in-memory processing" β†’ "Spark uses in-memory processing to avoid expensive disk I/O between steps, making iterative algorithms dramatically faster."
β€’ Use trade-offs. Never say X is always better than Y.
β€’ Give examples from real use cases: fraud detection, ETL pipelines, recommendation engines.
β€’ If you don't know, say "I'm not sure, but I'd reason through it like this…" β€” interviewers love to see thinking.
