These are the questions that trip up most candidates. Each looks simple but hides a trap:
Does Spark require HDFS? No, this is a trick question. Spark supports many storage systems: HDFS, S3, the local filesystem, Cassandra, HBase, OpenStack Swift, and MapR FS. HDFS is just one option. Spark is a processing engine, not a storage system.
Do you need to install Spark on every node of a YARN cluster? No. Spark runs ON TOP of YARN: YARN manages resources and launches Spark executors, and the Spark jars are shipped to worker nodes automatically. You don't install Spark on each node separately.
Is Spark always better than MapReduce? No. MapReduce is the better choice when: (1) the budget is tight, since it needs less memory; (2) the job is straightforward batch ETL with no iteration; (3) the data is too large for in-memory processing; (4) you already have a mature Hadoop ecosystem. Never say "Spark is always better" in an interview.
• Running everything on a single local node instead of distributing work across the cluster
• Hitting external web services repeatedly from many cluster nodes at once
• Using groupByKey when reduceByKey would be more efficient
• Not caching RDDs that are reused multiple times
• Creating too many small files (Spark struggles with large numbers of small gzipped files)
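The groupByKey vs. reduceByKey point is worth being able to demonstrate. Here is a plain-Python sketch (no Spark needed; the partitioning scheme is invented for illustration) of why reduceByKey shuffles less data: it combines values on the map side, so only one record per key per partition crosses the network, while groupByKey ships every record:

```python
from collections import defaultdict

def group_by_key_sum(pairs):
    """groupByKey-style: ship EVERY record across the shuffle, then combine."""
    shuffled = list(pairs)                     # all records cross the network
    groups = defaultdict(list)
    for k, v in shuffled:
        groups[k].append(v)
    return {k: sum(vs) for k, vs in groups.items()}, len(shuffled)

def reduce_by_key_sum(pairs, partitions=4):
    """reduceByKey-style: pre-aggregate within each partition before shuffling."""
    parts = [pairs[i::partitions] for i in range(partitions)]  # toy partitioning
    shuffled = []
    for part in parts:
        local = defaultdict(int)
        for k, v in part:                      # map-side combine
            local[k] += v
        shuffled.extend(local.items())         # one record per key per partition
    totals = defaultdict(int)
    for k, v in shuffled:
        totals[k] += v
    return dict(totals), len(shuffled)

pairs = [("a", 1)] * 100 + [("b", 1)] * 100
g_res, g_shuffled = group_by_key_sum(pairs)    # ships 200 records
r_res, r_shuffled = reduce_by_key_sum(pairs)   # ships 8 records (2 keys x 4 partitions)
```

Both produce the same totals, but the pre-aggregated version moves a fraction of the data, which is exactly the behavior Spark's reduceByKey gives you over groupByKey.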
When a task runs significantly slower than its peers (a "straggler"), Spark launches a duplicate copy of that task on another node, and whichever copy finishes first wins. Enable it with spark.speculation=true. This helps avoid bottlenecks caused by slow or failing hardware.
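In practice speculation is tuned with a small group of related properties, typically set in spark-defaults.conf or via --conf on spark-submit (the property names below are Spark's; the exact values you would choose are workload-dependent):

```
# spark-defaults.conf
spark.speculation              true
spark.speculation.interval     100ms   # how often Spark checks for stragglers
spark.speculation.multiplier   1.5     # a task is a straggler if 1.5x slower than the median
spark.speculation.quantile     0.75    # fraction of tasks that must finish before checking
```

Mentioning the multiplier and quantile knobs, not just the on/off switch, is an easy way to stand out on this question.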
Through RDD lineage. Unlike HDFS, which replicates every block three times by default, Spark records how each RDD was derived from its parents. If a partition is lost, Spark replays the transformations to rebuild ONLY that partition. This is much cheaper than replication.
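The idea fits in a few lines. This is a toy model, not Spark's actual API: each "RDD" remembers its parent and the transformation that derived it, so a lost partition can be recomputed in isolation:

```python
class ToyRDD:
    """Toy illustration of lineage-based recovery (not Spark's real classes)."""

    def __init__(self, partitions, parent=None, transform=None):
        self.partitions = partitions      # list of lists; None marks a lost partition
        self.parent = parent              # the RDD this one was derived from
        self.transform = transform        # per-element function used to derive it

    def map(self, fn):
        child = [[fn(x) for x in p] for p in self.partitions]
        return ToyRDD(child, parent=self, transform=fn)

    def recover(self, i):
        """Rebuild ONLY partition i by replaying the transform on the parent."""
        self.partitions[i] = [self.transform(x) for x in self.parent.partitions[i]]

base = ToyRDD([[1, 2], [3, 4], [5, 6]])
derived = base.map(lambda x: x * 10)
derived.partitions[1] = None              # simulate losing one partition
derived.recover(1)                        # replay lineage for that partition only
# derived.partitions -> [[10, 20], [30, 40], [50, 60]]
```

The key point for the interview: nothing was replicated up front, and the other partitions were never touched during recovery.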
RDD: coarse-grained writes (bulk transformations), immutable, fault recovery via lineage, stragglers mitigated with speculative backup tasks, spills to disk when RAM is full.
DSM (distributed shared memory): fine-grained reads and writes, mutable, fault recovery via checkpointing, stragglers hard to mitigate, performance degrades when RAM is full.
You've covered the most-asked Apache Spark interview topics:
• Always explain WHY, not just WHAT. "Spark uses in-memory processing" → "Spark uses in-memory processing to avoid expensive disk I/O between steps, making iterative algorithms dramatically faster."
• Discuss trade-offs. Never say X is always better than Y.
• Give examples from real use cases: fraud detection, ETL pipelines, recommendation engines.
• If you don't know, say "I'm not sure, but I'd reason through it like this…" and talk it out; interviewers love to see your thinking.