Apache Spark Streaming (micro-batch stream processor)¶
Spark Streaming (later Structured Streaming) is the streaming extension of Apache Spark. Its defining architectural choice is micro-batching: instead of processing each record as it arrives (continuous event-at-a-time, like Flink or Kafka Streams), it groups records into small time-bounded batches and runs Spark's familiar RDD/DataFrame-style computation on each batch.
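The micro-batch model is easy to sketch outside Spark. This is a hypothetical pure-Python illustration (not Spark code): timestamped records are bucketed into fixed time intervals, and each bucket is then handed to an ordinary batch computation.

```python
# Illustrative sketch of micro-batching (not Spark itself): group
# (timestamp_ms, value) records into fixed time-bounded batches.
from collections import defaultdict

def micro_batches(records, interval_ms):
    """Group (timestamp_ms, value) records into time-bounded batches."""
    batches = defaultdict(list)
    for ts, value in records:
        batches[ts // interval_ms].append(value)
    # Each batch is then processed with ordinary batch logic.
    return [batches[k] for k in sorted(batches)]

records = [(0, "a"), (150, "b"), (480, "c"), (620, "d")]
# With a 500 ms interval, the first three records land in one batch.
print(micro_batches(records, 500))  # [['a', 'b', 'c'], ['d']]
```

A continuous processor like Flink would instead invoke user code once per record as it arrives; the bucketing step is what makes this micro-batching.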
Key properties:
- The micro-batch interval (typically 100 ms to a few seconds) is configurable; it sets both the throughput-vs-freshness trade-off and the minimum end-to-end latency floor.
- Windowed aggregations are first-class: `count`, `sum`, and `groupByKey` over a window are the common shapes.
- Integrates natively with Kafka (and other sources) as a stream input; typically writes to Kafka or a storage sink.
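As a rough sketch of the sum-over-a-window shape, assuming `(timestamp_ms, key, value)` events and tumbling windows (plain Python, not the Spark API; all names here are illustrative):

```python
# Tumbling-window sum per key: a minimal model of the
# "sum over a window" aggregation shape.
def windowed_sum(events, window_ms):
    """Sum values per key within each tumbling window."""
    sums = {}
    for ts, key, value in events:
        window_start = (ts // window_ms) * window_ms
        window = sums.setdefault(window_start, {})
        window[key] = window.get(key, 0) + value
    return sums

events = [(100, "a", 3), (700, "a", 2), (1200, "b", 5)]
print(windowed_sum(events, 1000))  # {0: {'a': 5}, 1000: {'b': 5}}
```

In Spark the same shape is expressed declaratively (a `groupBy` over a window column); the point is that the window boundary, not the individual record, is the unit of output.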
Trade-offs¶
Advantages (per Hotstar, 2024):
- Support for micro-batching and aggregations fits any UX whose cadence is already discrete (e.g. a 2-second emoji-swarm refresh).
- Better community support than Flink (at 2019 decision time).
- Same SQL/DataFrame API as batch Spark — ops teams already know it.
Structural downsides (implicit; some dated):
- End-to-end latency floor = batch interval + processing time. No sub-batch-interval freshness by construction. Fine for 2-second animation; wrong for millisecond-sensitive pricing/bidding.
- Event-time semantics were historically weaker than Flink's watermark model; Structured Streaming has narrowed this gap.
- Resource shape — one Spark executor pool fights with Spark batch workloads on a shared cluster if not isolated.
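The latency-floor point above is simple arithmetic, worth making concrete. The processing figure below is illustrative, not from the source:

```python
def latency_floor_ms(batch_interval_ms, processing_ms):
    """Worst-case freshness: a record arriving just after a batch cut
    waits a full interval, then the batch itself must finish."""
    return batch_interval_ms + processing_ms

# With a 2 s interval and, say, 300 ms of batch processing (illustrative),
# no downstream consumer sees a result fresher than ~2.3 s.
print(latency_floor_ms(2000, 300))  # 2300
```

No amount of tuning inside the batch removes the `batch_interval_ms` term; only shrinking the interval (or switching to a continuous processor) does.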
Seen in¶
- sources/2024-03-26-highscalability-capturing-a-billion-emojions — Hotstar picked Spark Streaming over Flink, Storm, and Kafka Streams for the emoji-swarm aggregation stage. The batch window is 2 seconds — count emojis-per-type over 2-second windows and write the aggregates back to a second Kafka topic. Canonical wiki instance of "pick micro-batching when the UX cadence is already discrete."
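The Hotstar stage described above can be sketched end to end in plain Python (function and record shapes are hypothetical, not Hotstar's code): count emojis-per-type over 2-second tumbling windows and emit one aggregate record per (window, emoji), as would be produced to the second Kafka topic.

```python
# Illustrative model of the emoji-swarm aggregation stage:
# (timestamp_ms, emoji) events in, (window_start, emoji, count) records out.
from collections import Counter

def aggregate_emojis(events, window_ms=2000):
    windows = {}
    for ts, emoji in events:
        window_start = (ts // window_ms) * window_ms
        windows.setdefault(window_start, Counter())[emoji] += 1
    # One output record per (window_start, emoji_type, count).
    return [(start, emoji, n)
            for start in sorted(windows)
            for emoji, n in sorted(windows[start].items())]

events = [(100, "fire"), (500, "heart"), (1500, "fire"), (2100, "heart")]
print(aggregate_emojis(events))
# [(0, 'fire', 2), (0, 'heart', 1), (2000, 'heart', 1)]
```

Because the UX refreshes every 2 seconds anyway, the micro-batch boundary and the product's animation cadence coincide, which is exactly the "pick micro-batching when the UX cadence is already discrete" lesson.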
Related¶
- systems/apache-spark — the general-purpose compute engine Spark Streaming extends.
- systems/kafka — the standard upstream/downstream pair for a Spark Streaming pipeline.
- concepts/micro-batching — the processing model Spark Streaming popularised.
- concepts/streaming-aggregation — the family of computations Spark Streaming is well-suited to.
- systems/hotstar-emojis — canonical production-scale consumer.