Think of Spark Streaming like a newspaper printing press that runs continuously. News (data) arrives non-stop, the press slices it into pages (micro-batches), prints each page, and delivers it, all without stopping.
Spark Streaming takes live data from sources like Kafka, Flume, or Kinesis, slices it into tiny batches (e.g., every 1 second), and processes each batch using the regular Spark engine.
DStream = Discretized Stream. A continuous stream represented as a sequence of RDDs, one per time interval.
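To make "discretized stream" concrete, here is a pure-Python sketch (not the Spark API) of the core idea: a continuous stream of timestamped events is sliced into one batch per time interval, just as a DStream is a sequence of RDDs, one per batch interval.

```python
def discretize(events, batch_interval):
    """Group (timestamp, value) events into consecutive per-interval batches."""
    batches = {}
    for ts, value in events:
        batch_id = ts // batch_interval  # which interval this event falls in
        batches.setdefault(batch_id, []).append(value)
    return [batches[k] for k in sorted(batches)]

# A "continuous" stream of (timestamp, event) pairs, sliced every 1 second:
stream = [(0.2, "a"), (0.7, "b"), (1.1, "c"), (2.5, "d"), (2.9, "e")]
print(discretize(stream, 1))  # [['a', 'b'], ['c'], ['d', 'e']]
```

Each inner list plays the role of one micro-batch RDD that the regular Spark engine then processes.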
Data arrives continuously but is processed in small batches. Not true real-time, but near-real-time (latency in seconds).
Each micro-batch is an RDD, so it inherits RDD fault tolerance via lineage. Checkpointing adds extra safety.
DStreams support the same transformations as RDDs, plus some streaming-specific ones:
Stateless transformations: each batch is processed independently, with no memory of previous batches. Examples: map(), filter(), reduceByKey()
Stateful transformations: processing depends on previous batches and requires checkpointing. Examples: sliding window operations, updateStateByKey()
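The stateful case can be illustrated with a pure-Python sketch (not the Spark API) of the idea behind updateStateByKey(): per-key state is carried forward and updated as each new micro-batch arrives.

```python
def update_state(batches):
    state = {}  # key -> running count; this survives across batches
    for batch in batches:
        for key in batch:
            state[key] = state.get(key, 0) + 1
        yield dict(state)  # snapshot of the state after each batch

# Two micro-batches of keys arriving over time:
batches = [["a", "b", "a"], ["b", "c"]]
snapshots = list(update_state(batches))
print(snapshots[0])  # {'a': 2, 'b': 1}
print(snapshots[1])  # {'a': 2, 'b': 2, 'c': 1}
```

Because the state must survive failures, Spark requires checkpointing for this kind of operation.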
A window slides over the stream, combining data from multiple batches. E.g., "count tweets in the last 30 seconds, updating every 10 seconds." Two params: window length (30s) and slide interval (10s). This is frequently asked in interviews!
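Here is a pure-Python sketch of that exact example (not the Spark API): with a 10-second batch interval, a 30-second window is 3 batches and a 10-second slide is 1 batch, so each result sums the 3 most recent batch counts.

```python
def windowed_counts(batch_counts, window_len, slide):
    """Sum counts over a sliding window of `window_len` batches, moving `slide` at a time."""
    results = []
    for end in range(window_len, len(batch_counts) + 1, slide):
        window = batch_counts[end - window_len:end]  # the last `window_len` batches
        results.append(sum(window))
    return results

tweets_per_batch = [5, 3, 7, 2, 6]  # one tweet count per 10-second batch
print(windowed_counts(tweets_per_batch, window_len=3, slide=1))  # [15, 12, 15]
```

Note that consecutive windows overlap, which is why windowed operations need access to data from multiple past batches.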
Receivers consume data from sources and store it in Spark's memory for processing. Two types:
Reliable receivers: send an acknowledgment to the data source only AFTER the data has been received and replicated in Spark. Guarantees no data loss.
Unreliable receivers: no acknowledgment is sent. Simpler, but data can be lost if the receiver fails.
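A small pure-Python simulation (not the Spark API) shows why the acknowledgment order matters: the source retains each record until it is acked, so a reliable receiver that crashes before storing simply retries, losing nothing.

```python
class Source:
    """A source that retains each record until the receiver acknowledges it."""
    def __init__(self, records):
        self.pending = list(records)
    def next(self):
        return self.pending[0] if self.pending else None
    def ack(self):
        self.pending.pop(0)

def reliable_receive(source, store, crash_before_store=False):
    record = source.next()
    if record is None:
        return
    if crash_before_store:
        return            # crash: no ack was sent, so the source still holds the record
    store.append(record)  # store and replicate first...
    source.ack()          # ...then acknowledge

src, store = Source(["e1"]), []
reliable_receive(src, store, crash_before_store=True)  # simulated receiver failure
reliable_receive(src, store)                           # retry succeeds
print(store, src.pending)  # ['e1'] []
```

An unreliable receiver would ack (or the source would discard) before the store step, so the same crash would lose "e1" permanently.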
Structured Streaming (Spark 2.x+) is the modern replacement for DStreams. Think of DStreams as the flip phone and Structured Streaming as the smartphone.
Structured Streaming: built on the DataFrame/Dataset API. Uses the Catalyst optimizer. Supports event-time windows. Better exactly-once guarantees. The recommended approach.
DStreams (legacy): built on the RDD API. Manual state management. Simpler model but less optimized. Still works but no longer actively improved.
Two powerful libraries that come bundled with Spark:
MLlib: scalable machine learning library. Supports classification, regression, clustering, collaborative filtering, dimensionality reduction, and more.
GraphX: processes graph-structured data. PageRank, shortest paths, community detection. Uses vertex-cut partitioning for efficiency.
Transformer: takes a DataFrame in and outputs a new DataFrame with predictions/features added. Implements transform().
Estimator: a learning algorithm. Takes data in and produces a trained Model (which is itself a Transformer). Implements fit().
Pipeline: a sequence of stages (Transformers + Estimators). Data flows through each stage in order, e.g. Tokenizer → HashingTF → LogisticRegression.
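The fit()/transform() contract is easy to see in a pure-Python sketch of the pattern (mirroring MLlib's design, not the real API; the class names here are made up for illustration):

```python
class Lowercase:                      # a Transformer: data in -> transformed data out
    def transform(self, rows):
        return [r.lower() for r in rows]

class VocabEstimator:                 # an Estimator: fit() learns from data...
    def fit(self, rows):
        vocab = sorted({w for r in rows for w in r.split()})
        return VocabModel(vocab)      # ...and returns a Model, which IS a Transformer

class VocabModel:
    def __init__(self, vocab):
        self.vocab = vocab
    def transform(self, rows):
        # encode each row as word counts over the learned vocabulary
        return [[r.split().count(w) for w in self.vocab] for r in rows]

def run_pipeline(stages, rows):
    for stage in stages:
        if hasattr(stage, "fit"):     # Estimator: fit first, then transform with the Model
            stage = stage.fit(rows)
        rows = stage.transform(rows)  # Transformer: just transform
    return rows

out = run_pipeline([Lowercase(), VocabEstimator()],
                   ["Spark is fast", "spark streaming"])
print(out)  # [[1, 1, 1, 0], [0, 0, 1, 1]]  (vocab: fast, is, spark, streaming)
```

This is the same shape as an MLlib Pipeline: fitting the pipeline runs fit() on each Estimator stage, producing a chain of Transformers that can then be applied to new data.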
"Is Spark good for reinforcement learning?" Answer: No. Spark works well for batch ML algorithms (classification, clustering, regression) but NOT for reinforcement learning, which requires sequential interaction with an environment.
Two safety nets for Spark Streaming applications:
Checkpointing: periodically saves streaming state to reliable storage (HDFS). Two types: Metadata (config, DStream ops, incomplete batches) and Data (actual RDD data for stateful operations).
Write-Ahead Log (WAL): before processing data, write it to a durable log. If the driver crashes, replay the log to recover. Enable with spark.streaming.receiver.writeAheadLog.enable=true
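The write-ahead idea itself is simple, and a pure-Python sketch (not Spark's internal implementation) shows the recovery path: every record hits the durable log before processing, so replaying the log after a crash reproduces all the work.

```python
import json, os, tempfile

def wal_append(log_path, record):
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")  # durable first...

def process(record, results):
    results.append(record["value"] * 2)     # ...then process

def replay(log_path, results):
    """Recovery: re-run processing for every record in the log."""
    with open(log_path) as f:
        for line in f:
            process(json.loads(line), results)

log = os.path.join(tempfile.mkdtemp(), "wal.log")
results = []
wal_append(log, {"value": 1})
process({"value": 1}, results)
wal_append(log, {"value": 2})   # logged, but "driver crashes" before processing

recovered = []
replay(log, recovered)          # after restart, replay the whole log
print(recovered)  # [2, 4]  -> the unprocessed record was not lost
```

Real Spark replays only unfinished work (tracked via checkpoint metadata), but the safety argument is the same: nothing is processed that was not first made durable.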