Spark Streaming

Think of Spark Streaming like a newspaper printing press that runs continuously. News (data) arrives non-stop, the press slices it into pages (micro-batches), prints each page, and delivers it β€” all without stopping.

Spark Streaming takes live data from sources like Kafka, Flume, or Kinesis, slices it into tiny batches (e.g., every 1 second), and processes each batch using the regular Spark engine.

🌊 DStream

DStream = Discretized Stream. A continuous stream represented as a sequence of RDDs, one per time interval.

πŸ”„ Micro-Batching

Data arrives continuously but is processed in small batches. Not true real-time, but near-real-time (latency in seconds).
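The discretization idea can be shown without Spark at all. This is a pure-Python sketch (the function name and use of a count instead of a time interval are illustrative, not Spark API):

```python
from itertools import islice

def discretize(stream, batch_size):
    """Slice a continuous stream into micro-batches, the way Spark
    Streaming turns live input into one RDD per time interval.
    (Here a fixed count stands in for the time interval.)"""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch  # in Spark, each yielded batch would become an RDD

events = ["e1", "e2", "e3", "e4", "e5"]
batches = list(discretize(events, 2))
print(batches)  # [['e1', 'e2'], ['e3', 'e4'], ['e5']]
```

Each micro-batch is then handed to the regular Spark engine, which is why DStream processing looks just like batch RDD processing.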

πŸ›‘οΈ Fault Tolerant

Each micro-batch is an RDD β€” inherits RDD fault tolerance via lineage. Checkpointing adds extra safety.

DStream Transformations

DStreams support the same transformations as RDDs, plus some streaming-specific ones:

Stateless

Each batch is independent. No memory of previous batches. Examples: map(), filter(), reduceByKey()

Stateful

Processing depends on previous batches. Requires checkpointing. Examples: sliding window operations, updateStateByKey()
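The function you hand to updateStateByKey is ordinary Python, so the stateful logic can be tested on its own. A sketch of a running count per key (the batch sequence below is made up):

```python
def update_count(new_values, running_count):
    """Signature expected by DStream.updateStateByKey: the new values
    seen for a key in this batch, plus the previous state for that key
    (None on the first batch). Returns the new state."""
    return sum(new_values) + (running_count or 0)

# Simulating three consecutive batches for one key:
state = None
for batch_values in [[1, 1], [1], []]:
    state = update_count(batch_values, state)
print(state)  # 3
```

In a real job you would call pairs.updateStateByKey(update_count), and because the state must survive failures, Spark requires ssc.checkpoint(...) to be set first.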

🎯 Sliding Window Operations

A window slides over the stream, combining data from multiple batches. E.g., "count tweets in the last 30 seconds, updating every 10 seconds." Two params: window length (30s) and slide interval (10s). This is frequently asked in interviews!
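The window arithmetic is worth internalizing: a window covers window length Γ· batch interval batches, and a new result is emitted every slide interval. A pure-Python sketch of the tweet-count example (in Spark this would be tweets.countByWindow(30, 10)):

```python
def window_counts(batch_counts, batch_interval, window_length, slide_interval):
    """Sketch of countByWindow: batch_counts[i] is the number of tweets
    in batch i; each emitted result sums the batches that fall inside a
    window_length-second window, sliding every slide_interval seconds."""
    per_window = window_length // batch_interval  # batches per window
    step = slide_interval // batch_interval       # batches per slide
    results = []
    for end in range(per_window, len(batch_counts) + 1, step):
        results.append(sum(batch_counts[end - per_window:end]))
    return results

# 10-second batches, 30s window, sliding every 10s:
print(window_counts([5, 3, 7, 2, 4], 10, 30, 10))  # [15, 12, 13]
```

Note that both parameters must be multiples of the batch interval, since a window can only be assembled from whole micro-batches.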

Receivers

Receivers consume data from sources and store it in Spark's memory for processing. Two types:

βœ… Reliable Receivers

Send acknowledgment to the data source AFTER data is received and replicated in Spark. Guarantees no data loss.

⚠️ Unreliable Receivers

No acknowledgment sent. Simpler but data can be lost if the receiver fails.
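Why the timing of the acknowledgment matters can be seen in a toy model (all names here are invented for illustration): the source keeps an event until it is acknowledged, so an ack sent before storage means a crash loses data, while an ack sent after storage lets the source redeliver.

```python
def deliver(events, receiver_fails_at, reliable):
    """Toy model: a reliable receiver acks only AFTER storing an event;
    an unreliable one effectively acks on receipt, so an event in
    flight when the receiver crashes is lost for good."""
    stored, pending = [], list(events)
    for i, event in enumerate(list(pending)):
        if not reliable:
            pending.remove(event)   # acked before storing
        if i == receiver_fails_at:
            break                   # receiver crashes here
        stored.append(event)
        if reliable:
            pending.remove(event)   # acked after storing + replicating
    return stored + pending         # source redelivers anything unacked

print(deliver(["a", "b", "c"], 1, reliable=True))   # ['a', 'b', 'c']
print(deliver(["a", "b", "c"], 1, reliable=False))  # ['a', 'c']
```

With the reliable receiver the crashed-on event "b" is still pending at the source and survives; with the unreliable one it is gone.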

Structured Streaming vs DStreams

Structured Streaming (Spark 2.x+) is the modern replacement for DStreams. Think of DStreams as the flip phone and Structured Streaming as the smartphone.

πŸ“± Structured Streaming

Built on DataFrame/Dataset API. Uses Catalyst optimizer. Supports event-time windows. Better exactly-once guarantees. The recommended approach.

πŸ“Ÿ DStreams (Legacy)

Built on RDD API. Manual state management. Simpler model but less optimized. Still works but not actively improved.
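The conceptual shift behind Structured Streaming is the "unbounded table": each trigger appends new rows to an ever-growing input table and the query result is (logically) recomputed over the whole table. A pure-Python sketch of that model, not the real API:

```python
def run_triggers(micro_batches, query):
    """Sketch of Structured Streaming's model: every trigger appends
    the newly arrived rows to an unbounded input table and re-runs the
    query over it, emitting the updated result. (The real engine
    computes this incrementally via Catalyst rather than rescanning.)"""
    input_table, outputs = [], []
    for batch in micro_batches:
        input_table.extend(batch)  # the input table only ever grows
        outputs.append(query(input_table))
    return outputs

# A running count of transactions, updated at every trigger:
counts = run_triggers([[10.0, 25.5], [99.0], [3.2, 4.4]], query=len)
print(counts)  # [2, 3, 5]
```

This is why Structured Streaming code reads like ordinary DataFrame code: you describe the query once, and the engine worries about what changed since the last trigger.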

⚑ Kafka: Hey Spark! Here's a stream of credit card transactions…
πŸ“± Structured Streaming: Got it! I'll treat this as an ever-growing DataFrame. Let me run the fraud detection query…
πŸ“± Structured Streaming: Found 3 suspicious transactions in the last 30-second window! Flagging them…
πŸ“Š Dashboard: Alert displayed! Analyst notified. Total processing time: 2.3 seconds.
πŸ“± Structured Streaming: Checkpointed my state. If I crash, I'll pick up right where I left off. No data loss. πŸ›‘οΈ

MLlib & GraphX

Two powerful libraries that come bundled with Spark:

🧠 MLlib β€” Machine Learning

Scalable ML library. Supports classification, regression, clustering, collaborative filtering, dimensionality reduction, and more.

πŸ”— GraphX β€” Graph Processing

Process graph-structured data. PageRank, shortest paths, community detection. Uses vertex-cut partitioning for efficiency.
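For intuition, here is the PageRank algorithm (which GraphX ships as a built-in) in plain Python, assuming every page has at least one outgoing link; GraphX runs the same iteration distributed over a partitioned graph:

```python
def pagerank(links, iterations=20, d=0.85):
    """Iterative PageRank: links maps each page to the pages it points
    to. Each iteration, every page distributes its current rank evenly
    across its outgoing links; d is the damping factor."""
    n = len(links)
    ranks = {page: 1.0 / n for page in links}
    for _ in range(iterations):
        new = {page: (1 - d) / n for page in links}
        for page, outgoing in links.items():
            share = ranks[page] / len(outgoing)  # assumes no dead ends
            for target in outgoing:
                new[target] += d * share
        ranks = new
    return ranks

ranks = pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]})
# "a" ends up ranked highest: both "b" and "c" link to it
```

The vertex-cut partitioning mentioned above is what lets GraphX spread the "distribute rank along edges" step across machines without shuffling whole vertices around.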

MLlib Key Concepts

Transformer

Takes a DataFrame IN, outputs a new DataFrame with predictions/features added. Implements transform().

Estimator

A learning algorithm. Takes data IN, produces a trained Model (which IS a Transformer). Implements fit().

Pipeline

A sequence of stages (Transformers + Estimators). Data flows through each stage in order. E.g.: Tokenize β†’ HashTF β†’ LogisticRegression.
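The Transformer/Estimator/Pipeline contract can be mimicked in a few lines of plain Python. This is a toy (real MLlib stages operate on DataFrames and have Params), but the fit()/transform() relationships are the same:

```python
class Doubler:
    """A Transformer: transform() takes data in, returns new data."""
    def transform(self, rows):
        return [x * 2 for x in rows]

class MaxScaler:
    """An Estimator: fit() learns a parameter from the data and returns
    a trained Model — and that Model IS a Transformer."""
    def fit(self, rows):
        peak = max(rows)
        class Model:
            def transform(self, rows):
                return [x / peak for x in rows]
        return Model()

class Pipeline:
    """Stages run in order; each Estimator is fitted on the data as
    transformed by the stages before it, mirroring MLlib's contract."""
    def __init__(self, stages):
        self.stages = stages
    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)  # Estimator -> trained Model
            rows = stage.transform(rows)
            fitted.append(stage)
        return fitted  # all Transformers now: the "PipelineModel"

model = Pipeline([Doubler(), MaxScaler()]).fit([1, 2, 4])
# Doubler -> [2, 4, 8]; MaxScaler learns peak=8 -> [0.25, 0.5, 1.0]
```

Fitting a Pipeline thus yields a PipelineModel made entirely of Transformers, which is exactly why the same object can then be used for prediction.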

🎯 Interview Gotcha: Spark + Reinforcement Learning

"Is Spark good for reinforcement learning?" Answer: No. Spark works well for batch ML algorithms (classification, clustering, regression) but NOT for reinforcement learning, which requires sequential interaction with an environment.

Checkpointing & WAL

Two safety nets for Spark Streaming applications:

πŸ’Ύ Checkpointing

Periodically saves streaming state to reliable storage (HDFS). Two types: Metadata (config, DStream ops, incomplete batches) and Data (actual RDD data for stateful operations).

πŸ“ Write-Ahead Log (WAL)

Before processing data, write it to a durable log. If the driver crashes, replay the log to recover. Enable with spark.streaming.receiver.writeAheadLog.enable=true

```python
# πŸ’Ύ Enable checkpointing: save streaming state to HDFS periodically
ssc.checkpoint("hdfs:///checkpoints")

# πŸ“ Turn on Write-Ahead Logging: all received data is written
# to a durable log before being processed; if the driver crashes,
# Spark replays the log
conf.set("spark.streaming.receiver.writeAheadLog.enable", "true")

# 🧹 Auto-clean metadata older than 3600s (1 hour) to prevent
# memory buildup in long-running jobs
conf.set("spark.cleaner.ttl", "3600")
```

Quiz: Streaming & Libraries

Q1: What is a DStream?

A) A direct stream that connects to databases
B) A Discretized Stream β€” a continuous sequence of RDDs, each containing data from a time interval
C) A data structure for storing ML models

Q2: What's the difference between stateless and stateful DStream transformations?

A) Stateless: each batch is independent (map, filter). Stateful: processing depends on previous batches (window operations).
B) Stateless is faster, stateful is slower. That's the only difference.
C) Stateless = no output, stateful = has output

Q3: In MLlib, what's the difference between a Transformer and an Estimator?

A) They're the same thing with different names
B) Transformer: takes DataFrame, outputs new DataFrame (transform()). Estimator: takes data, trains a Model which IS a Transformer (fit()).
C) Transformer changes data types, Estimator estimates file sizes

Q4: Your Spark Streaming driver crashes. How does WAL help?

A) WAL writes received data to a durable log BEFORE processing. On restart, Spark replays the log to recover unprocessed data.
B) WAL automatically restarts the driver on another node
C) WAL sends alerts to the operations team