SYSTEM Cited by 3 sources

Apache Spark Streaming (micro-batch stream processor)

Spark Streaming (the original DStream API, later superseded by Structured Streaming) is the streaming extension of Apache Spark. Its defining architectural choice is micro-batching: instead of processing each record as it arrives (continuous, event-at-a-time processing, as in Flink or Kafka Streams), it groups records into small time-bounded batches and runs Spark's familiar RDD/DataFrame-style computation on each batch.

Key properties:

  • Micro-batch interval (typically 100 ms to several seconds) is configurable; it sets both the throughput-versus-freshness trade-off and the end-to-end latency floor.
  • Windowed aggregations are first-class — count, sum, groupByKey over a window are the common shapes.
  • Integrates natively with Kafka (and other sources) as a stream input; typically writes to Kafka or a storage sink.
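
The micro-batch idea can be sketched without Spark at all. The following is a minimal pure-Python simulation, not Spark code: the function names `micro_batches` and `count_per_batch` are hypothetical, event timestamps are treated as arrival times, and the 2000 ms interval is chosen only to echo the 2-second example discussed later on this page.

```python
from collections import Counter, defaultdict


def micro_batches(events, interval_ms):
    """Group (timestamp_ms, key) events into time-bounded batches.

    Returns a dict mapping each batch start time to the keys whose
    timestamps fall in [start, start + interval_ms).
    """
    batches = defaultdict(list)
    for ts_ms, key in events:
        batch_start = (ts_ms // interval_ms) * interval_ms
        batches[batch_start].append(key)
    return dict(sorted(batches.items()))


def count_per_batch(events, interval_ms):
    """Run one aggregation (count per key) over each micro-batch."""
    return {
        start: dict(Counter(keys))
        for start, keys in micro_batches(events, interval_ms).items()
    }


# Illustrative input: (arrival time in ms, emoji type).
events = [(100, "heart"), (900, "clap"), (1900, "heart"),
          (2100, "clap"), (3999, "clap"), (4000, "heart")]
counts = count_per_batch(events, interval_ms=2000)
print(counts)
# {0: {'heart': 2, 'clap': 1}, 2000: {'clap': 2}, 4000: {'heart': 1}}
```

Nothing is emitted for a batch until all of its records have been collected, which is where the latency floor comes from.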

Real-Time Mode (2026)

Spark Structured Streaming now supports a Real-Time Mode that eliminates the micro-batch latency floor by processing records and firing timers with sub-second precision — event-at-a-time semantics rather than batch-at-a-time.

Key properties:

  • No batch interval — records are processed as they arrive; timers fire at sub-second granularity instead of waiting for the next batch boundary.
  • transformWithState operator — a next-generation stateful API offering object-oriented state (MapState, ListState), composite types, processing-time timers, automatic TTL, and schema evolution. A single StatefulProcessor class implements both reactive (handleInputRows) and proactive (handleExpiredTimer) paths.
  • Backward compatible — switching from micro-batch to Real-Time Mode is "a single trigger change" with no code rewrite.
  • Production numbers (gaming sessionization, 2026): 432 ms p99 end-to-end Kafka-to-Kafka, 20× faster than micro-batch on the same workload. 500K input events/min, 4M concurrent stateful sessions, 8M heartbeat outputs/min (16× amplification).

(Source: sources/2026-06-03-databricks-apache-spark-real-time-mode-for-gaming)
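
The reactive and proactive paths described above can be illustrated with a toy model. This is a plain-Python sketch and not the transformWithState API: the class `SessionProcessor`, its snake_case method names, and the `advance_to` driver are hypothetical stand-ins, with two dictionaries playing the role of map state and processing-time timers. The 30-second heartbeat period is taken from the gaming example.

```python
HEARTBEAT_MS = 30_000  # heartbeat period from the gaming example


class SessionProcessor:
    """Toy stand-in for a stateful processor; not the Spark API."""

    def __init__(self):
        self.sessions = {}  # session_id -> start time (plays the role of map state)
        self.timers = {}    # session_id -> next timer expiry, in processing-time ms

    def handle_input_rows(self, session_id, event_type, now_ms):
        """Reactive path: runs when a record arrives."""
        if event_type == "GameStart":
            self.sessions[session_id] = now_ms
            self.timers[session_id] = now_ms + HEARTBEAT_MS
        elif event_type == "GameEnd":
            self.sessions.pop(session_id, None)
            self.timers.pop(session_id, None)

    def handle_expired_timer(self, session_id, expiry_ms):
        """Proactive path: runs when a timer fires, with no input record."""
        self.timers[session_id] = expiry_ms + HEARTBEAT_MS  # re-arm the timer
        return (session_id, "heartbeat", expiry_ms)

    def advance_to(self, now_ms):
        """Fire every timer due at or before now_ms; return the emitted rows."""
        out = []
        while True:
            due = [(t, s) for s, t in self.timers.items() if t <= now_ms]
            if not due:
                return out
            expiry, sid = min(due)  # earliest timer first
            out.append(self.handle_expired_timer(sid, expiry))


p = SessionProcessor()
p.handle_input_rows("s1", "GameStart", now_ms=0)
heartbeats = p.advance_to(65_000)      # timers due at 30 s and 60 s
p.handle_input_rows("s1", "GameEnd", now_ms=70_000)
after_end = p.advance_to(200_000)      # session ended, so no timers remain
print(heartbeats)  # [('s1', 'heartbeat', 30000), ('s1', 'heartbeat', 60000)]
print(after_end)   # []
```

The point of the sketch is that output rows can be produced by a timer alone, without any new input record arriving for that session.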

Trade-offs

Advantages (per Hotstar, 2024):

  • Support for micro-batching and aggregations fits any UX whose cadence is already discrete (e.g. a 2-second emoji-swarm refresh).
  • Better community support than Flink (at 2019 decision time).
  • Same SQL/DataFrame API as batch Spark — ops teams already know it.

Structural downsides (implicit; some dated):

  • End-to-end latency floor = batch interval + processing time. No sub-batch-interval freshness by construction. Fine for 2-second animation; wrong for millisecond-sensitive pricing/bidding.
  • Event-time semantics were historically weaker than Flink's watermark model; Structured Streaming has narrowed this gap.
  • Resource shape — on a shared cluster, the streaming job's executors compete with Spark batch workloads for resources unless the two are isolated.
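
One way to read the latency formula above is as a small worked calculation. The sketch below assumes a batch closes exactly at the interval boundary and its results become visible a fixed processing time later, ignoring queuing and scheduling delay. The function name `micro_batch_latency_ms` and the 300 ms processing time are assumptions for illustration only; the 2-second interval echoes the animation example.

```python
def micro_batch_latency_ms(arrival_offset_ms, interval_ms, processing_ms):
    """Latency for a record arriving arrival_offset_ms into a batch interval.

    Simplified model: the record waits for the batch to close, then for
    the batch to be processed. No queuing or scheduling delay is modeled.
    """
    if not 0 <= arrival_offset_ms < interval_ms:
        raise ValueError("offset must fall inside the interval")
    wait_for_batch_close = interval_ms - arrival_offset_ms
    return wait_for_batch_close + processing_ms


interval_ms, processing_ms = 2000, 300  # 300 ms is an assumed figure
worst = micro_batch_latency_ms(0, interval_ms, processing_ms)
best = micro_batch_latency_ms(1999, interval_ms, processing_ms)
print(worst, best)  # 2300 301
```

Under this model a record that arrives just after a batch opens sees the full interval plus processing time, which is acceptable for a 2-second refresh but not for millisecond-sensitive workloads.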

Seen in

  • sources/2024-03-26-highscalability-capturing-a-billion-emojions — Hotstar picked Spark Streaming over Flink, Storm, and Kafka Streams for the emoji-swarm aggregation stage. The batch window is 2 seconds — count emojis-per-type over 2-second windows and write the aggregates back to a second Kafka topic. Canonical wiki instance of "pick micro-batching when the UX cadence is already discrete."
  • sources/2026-05-29-databricks-databricks-at-sigmod-2026 — Spark Structured Streaming as the streaming-track engine inside SDP, paired with the Enzyme IVM engine on the materialized-view track. The companion VLDB 2026 paper "A Decade of Apache Spark Structured Streaming: How We Evolved the Architecture To Meet Real-world Needs" is announced in this source — the first wiki disclosure that Structured Streaming's evolution has been published as a first-author Databricks academic paper. The blog post characterises the SDP streaming track as "a wide variety of constructs — from stateful operators to watermarks, making it easy to express complicated business logic like custom aggregations". The VLDB paper itself is not yet available; this source establishes the forward reference.
  • sources/2026-06-03-databricks-apache-spark-real-time-mode-for-gaming — Real-Time Mode for gaming sessionization. Demonstrates transformWithState with MapState + processing-time timers tracking 4M concurrent sessions at 432 ms p99 (20× vs micro-batch). The reactive + proactive dual-path pattern: handleInputRows for GameStart/End, handleExpiredTimer for 30-second heartbeats. 16× output amplification (500K in → 8M out per minute).
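
The reported figures in the last entry are mutually consistent, as a quick arithmetic check shows. The check assumes every one of the 4M sessions stays live for the whole minute and emits exactly one heartbeat per 30-second timer; the source does not state this explicitly.

```python
sessions = 4_000_000            # concurrent stateful sessions (from the source)
heartbeat_period_s = 30         # heartbeat timer period (from the source)
input_events_per_min = 500_000  # input rate (from the source)

# Assumption: each live session fires one heartbeat per timer period.
heartbeats_per_min = sessions * (60 // heartbeat_period_s)
amplification = heartbeats_per_min / input_events_per_min
print(heartbeats_per_min, amplification)  # 8000000 16.0
```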