Apache Spark Streaming (micro-batch stream processor)¶
Spark Streaming (later Structured Streaming) is the streaming extension of Apache Spark. Its defining architectural choice is micro-batching: instead of processing each record as it arrives (continuous event-at-a-time, like Flink or Kafka Streams), it groups records into small time-bounded batches and runs Spark's familiar RDD/DataFrame-style computation on each batch.
Key properties:
- Micro-batch interval (typically 100 ms to a few seconds) is configurable; it sets both the throughput-versus-freshness trade-off and the minimum end-to-end latency floor.
- Windowed aggregations are first-class: `count`, `sum`, `groupByKey` over a window are the common shapes.
- Integrates natively with Kafka (and other sources) as a stream input; typically writes to Kafka or a storage sink.
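The micro-batching idea can be sketched in plain Python, without Spark: records are bucketed by arrival time into fixed-width batches, and a per-key count runs once per batch. This is a toy model under stated assumptions, not Spark's API; the function name `micro_batch_counts` and the sample events are hypothetical.

```python
from collections import Counter, defaultdict

def micro_batch_counts(events, batch_interval_s):
    """Group (timestamp_s, key) events into fixed-width batches and count keys per batch.

    Returns {batch_start_s: Counter}. Toy model only; assumes an integer interval in seconds.
    """
    batches = defaultdict(Counter)
    for ts, key in events:
        # Every event in [start, start + interval) lands in the same batch.
        batch_start = int(ts // batch_interval_s) * batch_interval_s
        batches[batch_start][key] += 1
    return dict(batches)

# Hypothetical sample stream: (arrival time in seconds, emoji type).
events = [(0.1, "heart"), (0.9, "clap"), (1.7, "heart"), (2.3, "clap"), (3.8, "clap")]
result = micro_batch_counts(events, batch_interval_s=2)
```

With a 2-second interval, the first three events fall into the batch starting at 0 and the last two into the batch starting at 2. No count is available for a batch until that batch closes, which is where the latency floor comes from.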
Real-Time Mode (2026)¶
Spark Structured Streaming now supports a Real-Time Mode that eliminates the micro-batch latency floor by processing records and firing timers with sub-second precision — event-at-a-time semantics rather than batch-at-a-time.
Key properties:
- No batch interval — records are processed as they arrive; timers fire at sub-second granularity instead of waiting for the next batch boundary.
- `transformWithState` operator: a next-generation stateful API offering object-oriented state (`MapState`, `ListState`), composite types, processing-time timers, automatic TTL, and schema evolution. A single `StatefulProcessor` class implements both reactive (`handleInputRows`) and proactive (`handleExpiredTimer`) paths.
- Backward compatible: switching from micro-batch to Real-Time Mode is "a single trigger change" with no code rewrite.
- Production numbers (gaming sessionization, 2026): 432 ms p99 end-to-end Kafka-to-Kafka, 20× faster than micro-batch on the same workload. 500K input events/min, 4M concurrent stateful sessions, 8M heartbeat outputs/min (16× amplification).
(Source: sources/2026-06-03-databricks-apache-spark-real-time-mode-for-gaming)
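The reactive and proactive paths described above can be illustrated with a small plain-Python model. This is a sketch only and deliberately does not use the Spark API: `ToySessionProcessor`, its method names, the `advance_to` driver, and the event strings are all hypothetical stand-ins. The 30-second heartbeat period is taken from the gaming example cited on this page.

```python
class ToySessionProcessor:
    """Toy stand-in for a stateful processor with two paths. NOT the Spark API."""

    HEARTBEAT_S = 30  # heartbeat period from the gaming example

    def __init__(self):
        self.sessions = {}  # session_id -> start time (plays the role of map state)
        self.timers = {}    # session_id -> next timer fire time

    def handle_input_rows(self, session_id, event, now):
        """Reactive path: runs when a record arrives."""
        if event == "GameStart":
            self.sessions[session_id] = now
            self.timers[session_id] = now + self.HEARTBEAT_S
            return []
        if event == "GameEnd" and session_id in self.sessions:
            started = self.sessions.pop(session_id)
            self.timers.pop(session_id, None)  # cancel the pending heartbeat
            return [("session_closed", session_id, now - started)]
        return []

    def handle_expired_timer(self, session_id, now):
        """Proactive path: runs when a timer fires, with no input record."""
        self.timers[session_id] = now + self.HEARTBEAT_S  # re-arm the timer
        return [("heartbeat", session_id, now - self.sessions[session_id])]

    def advance_to(self, now):
        """Hypothetical driver: fire every timer due at or before `now`, in time order."""
        out = []
        while True:
            due = [(t, sid) for sid, t in self.timers.items() if t <= now]
            if not due:
                return out
            fire_at, sid = min(due)
            out.extend(self.handle_expired_timer(sid, fire_at))


proc = ToySessionProcessor()
proc.handle_input_rows("s1", "GameStart", now=0)
heartbeats = proc.advance_to(65)  # timers fire at t=30 and t=60
closed = proc.handle_input_rows("s1", "GameEnd", now=70)
```

The point of the sketch is that output can be produced without any input record arriving: the two heartbeats come purely from timers, while the session-closed record comes from the input path.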
Trade-offs¶
Advantages (per Hotstar, 2024):
- Support for micro-batching and aggregations fits any UX whose cadence is already discrete (e.g. a 2-second emoji-swarm refresh).
- Better community support than Flink (at 2019 decision time).
- Same SQL/DataFrame API as batch Spark — ops teams already know it.
Structural downsides (implicit; some dated):
- End-to-end latency floor = batch interval + processing time. No sub-batch-interval freshness by construction. Fine for 2-second animation; wrong for millisecond-sensitive pricing/bidding.
- Event-time semantics were historically weaker than Flink's watermark model; Structured Streaming has narrowed this gap.
- Resource shape — one Spark executor pool fights with Spark batch workloads on a shared cluster if not isolated.
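The latency-floor formula above can be turned into a quick feasibility check. The numbers below are hypothetical (a 2-second batch, 300 ms of processing, and assumed freshness budgets for an animation and for bidding); only the formula itself comes from the text.

```python
def latency_floor_ms(batch_interval_ms, processing_ms):
    """Latency floor per the formula above: batch interval plus processing time."""
    return batch_interval_ms + processing_ms

def fits_budget(batch_interval_ms, processing_ms, budget_ms):
    """True if the latency floor is within the freshness budget."""
    return latency_floor_ms(batch_interval_ms, processing_ms) <= budget_ms

# Hypothetical workload: 2 s batches with 300 ms of processing per batch.
floor = latency_floor_ms(2000, 300)
ok_for_animation = fits_budget(2000, 300, budget_ms=4000)  # assumed UX budget
ok_for_bidding = fits_budget(2000, 300, budget_ms=50)      # assumed bidding budget
```

Under these assumptions the floor is 2300 ms, which fits the animation budget and misses the bidding budget by a wide margin. Shrinking the batch interval helps only until processing time dominates.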
Seen in¶
- sources/2024-03-26-highscalability-capturing-a-billion-emojions — Hotstar picked Spark Streaming over Flink, Storm, and Kafka Streams for the emoji-swarm aggregation stage. The batch window is 2 seconds — count emojis-per-type over 2-second windows and write the aggregates back to a second Kafka topic. Canonical wiki instance of "pick micro-batching when the UX cadence is already discrete."
- sources/2026-05-29-databricks-databricks-at-sigmod-2026 — Spark Structured Streaming as the streaming-track engine inside SDP, paired with the Enzyme IVM engine on the materialized-view track. The companion VLDB 2026 paper "A Decade of Apache Spark Structured Streaming: How We Evolved the Architecture To Meet Real-world Needs" is announced in this source — the first wiki disclosure that Structured Streaming's evolution has been published as a first-author Databricks academic paper. The blog post characterises the SDP streaming track as "a wide variety of constructs — from stateful operators to watermarks, making it easy to express complicated business logic like custom aggregations". The VLDB paper itself is not yet available; this source establishes the forward reference.
- sources/2026-06-03-databricks-apache-spark-real-time-mode-for-gaming: Real-Time Mode for gaming sessionization. Demonstrates `transformWithState` with `MapState` + processing-time timers tracking 4M concurrent sessions at 432 ms p99 (20× vs micro-batch). The reactive + proactive dual-path pattern: `handleInputRows` for GameStart/End, `handleExpiredTimer` for 30-second heartbeats. 16× output amplification (500K in → 8M out per minute).
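One way the gaming figures fit together, worked as arithmetic. The inputs are the numbers quoted on this page; the assumption, not stated by the source, is that every concurrent session emits exactly one heartbeat per 30-second period.

```python
# Figures quoted from the gaming sessionization source.
concurrent_sessions = 4_000_000
heartbeat_period_s = 30
input_events_per_min = 500_000

# Assumption: each live session emits one heartbeat per period.
heartbeats_per_session_per_min = 60 // heartbeat_period_s
heartbeat_outputs_per_min = concurrent_sessions * heartbeats_per_session_per_min

# Output records produced per input record.
amplification = heartbeat_outputs_per_min / input_events_per_min
```

Under that assumption the result is 8,000,000 heartbeats per minute and a 16× amplification, matching the reported numbers. Most output is therefore timer-driven rather than triggered by input records.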
Related¶
- systems/apache-spark — the general-purpose compute engine Spark Streaming extends.
- systems/kafka — the standard upstream/downstream pair for a Spark Streaming pipeline.
- concepts/micro-batching — the processing model Spark Streaming popularised.
- concepts/streaming-aggregation — the family of computations Spark Streaming is well-suited to.
- systems/hotstar-emojis — canonical production-scale consumer.
- systems/lakeflow-spark-declarative-pipelines — Databricks' SDP uses Structured Streaming as its streaming-track engine.
- systems/enzyme-ivm — the materialized-view-track engine paired with Structured Streaming inside SDP.
- concepts/declarative-vs-imperative-stream-api — Structured Streaming sits on the imperative end of this distinction inside SDP.
- concepts/stateful-stream-processing: the paradigm that `transformWithState` + Real-Time Mode makes first-class in Spark.
- patterns/timer-driven-heartbeat-emission: proactive output pattern enabled by Real-Time Mode's sub-second timer precision.
- patterns/reactive-plus-proactive-stream-processing: the dual-path execution model demonstrated by `transformWithState`.