SYSTEM Cited by 1 source

Apache Spark Streaming (micro-batch stream processor)

Spark Streaming (the original DStream API, later succeeded by Structured Streaming) is the streaming extension of Apache Spark. Its defining architectural choice is micro-batching: instead of processing each record as it arrives (event-at-a-time, as Flink and Kafka Streams do), it groups records into small time-bounded batches and runs Spark's familiar RDD/DataFrame-style computation on each batch.
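
As a rough illustration of the grouping step, here is a plain-Python sketch. It is not Spark's API: `micro_batches`, the record tuples, and the 2-second interval are assumptions made for this example. Records are bucketed by arrival time, and the same batch computation then runs once per bucket.

```python
from collections import defaultdict

def micro_batches(records, interval_s):
    """Group (arrival_time_s, value) records into fixed-length time buckets.

    Returns a list of (batch_start_s, [values]) sorted by batch start.
    Hypothetical helper for illustration; Spark does this internally.
    """
    buckets = defaultdict(list)
    for arrival_s, value in records:
        # Integer division maps every arrival time to the start of its batch.
        batch_start = int(arrival_s // interval_s) * interval_s
        buckets[batch_start].append(value)
    return sorted(buckets.items())

# Five records arriving over roughly five seconds, batched every 2 seconds.
records = [(0.1, "a"), (0.9, "b"), (2.5, "c"), (3.9, "d"), (4.2, "e")]
batches = micro_batches(records, interval_s=2)

# The "job" is an ordinary batch computation applied to each micro-batch.
batch_sizes = [(start, len(values)) for start, values in batches]
```

Nothing is computed for a record until its batch closes, which is where the latency characteristics discussed on this page come from.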

Key properties:

  • Micro-batch interval is configurable (typically 100 ms to a few seconds); it sets the throughput-versus-freshness trade-off and the end-to-end latency floor.
  • Windowed aggregations are first-class — count, sum, groupByKey over a window are the common shapes.
  • Integrates natively with Kafka (and other sources) as a stream input; typically writes to Kafka or a storage sink.

Trade-offs

Advantages (per Hotstar, 2024):

  • Support for micro-batching and aggregations fits any UX whose cadence is already discrete (e.g. a 2-second emoji-swarm refresh).
  • Better community support than Flink at the time of the 2019 decision.
  • Same SQL/DataFrame API as batch Spark — ops teams already know it.

Structural downsides (implicit; some dated):

  • End-to-end latency floor = batch interval + processing time. By construction, results cannot be fresher than the batch interval. Fine for a 2-second animation; wrong for millisecond-sensitive pricing or bidding.
  • Event-time semantics were historically weaker than Flink's watermark-based model; Structured Streaming, which has its own watermark support, has narrowed this gap.
  • Resource shape: on a shared cluster, the streaming job's executors compete with Spark batch workloads for resources unless the two are isolated.
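
The latency bullet above can be made concrete with a small worked sketch in plain Python. The model is deliberately simplified and is an assumption of this example, not a description of Spark's scheduler: batches close on exact interval boundaries and every batch takes the same fixed processing time. `result_latency_ms` and the 300 ms processing figure are hypothetical.

```python
def result_latency_ms(arrival_ms, interval_ms, processing_ms):
    """Delay between a record arriving and its batch's result being ready.

    Simplified model (an assumption of this sketch): batches close on exact
    interval boundaries and each batch takes a fixed processing time.
    """
    # The record waits until the end of the batch interval it arrived in...
    batch_end_ms = (arrival_ms // interval_ms + 1) * interval_ms
    # ...and then for the batch job itself to finish.
    return (batch_end_ms - arrival_ms) + processing_ms

INTERVAL_MS = 2000    # 2-second batches, as in the Hotstar example
PROCESSING_MS = 300   # hypothetical per-batch processing time

# A record arriving just as a batch opens waits the whole interval.
at_batch_start = result_latency_ms(0, INTERVAL_MS, PROCESSING_MS)
# A record arriving 1 ms before the batch closes waits almost nothing.
near_batch_end = result_latency_ms(1999, INTERVAL_MS, PROCESSING_MS)
```

Under these assumptions a record can wait up to the full interval plus processing time (2300 ms here), while a lucky record waits only 301 ms. Either way, results are published at most once per interval, which is the constraint that matters for a UX with a fixed refresh cadence.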

Seen in

  • sources/2024-03-26-highscalability-capturing-a-billion-emojions: Hotstar picked Spark Streaming over Flink, Storm, and Kafka Streams for the emoji-swarm aggregation stage. The batch window is 2 seconds: the job counts emojis per type over each 2-second window and writes the aggregates to a second Kafka topic. Canonical wiki instance of "pick micro-batching when the UX cadence is already discrete."
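
The per-window aggregation in the Hotstar entry can be sketched in plain Python. This is neither Hotstar's code nor Spark's API; the event tuples, `count_per_window`, and the shape of the output records are assumptions for illustration only.

```python
from collections import Counter, defaultdict

WINDOW_S = 2  # window length taken from the Hotstar write-up

def count_per_window(events, window_s=WINDOW_S):
    """Count emojis per type within each tumbling window.

    `events` is an iterable of (timestamp_s, emoji_type) pairs.
    Returns one dict per (window, emoji type), ordered by window start
    and then by type, standing in for the aggregate records that would
    be written to the output topic.
    """
    windows = defaultdict(Counter)
    for ts, emoji in events:
        # Map each event to the start of the window that contains it.
        window_start = int(ts // window_s) * window_s
        windows[window_start][emoji] += 1
    return [
        {"window_start": start, "emoji": emoji, "count": n}
        for start in sorted(windows)
        for emoji, n in sorted(windows[start].items())
    ]

# Hypothetical emoji events: (seconds since stream start, emoji type).
events = [(0.2, "heart"), (0.7, "clap"), (1.5, "heart"), (2.1, "clap"), (3.8, "clap")]
aggregates = count_per_window(events)
```

Each output record is small and keyed by window, so downstream consumers only see one compact update per emoji type every 2 seconds rather than the raw event firehose.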