
Apache Spark Streaming (micro-batch stream processor)¶

Spark Streaming (and its successor, Structured Streaming) is the streaming extension of Apache Spark. Its defining architectural choice is micro-batching: instead of processing each record as it arrives (event-at-a-time, as in Flink or Kafka Streams), it groups records into small time-bounded batches and runs Spark's familiar batch computation on each one (RDD operations in the original DStream API, DataFrame operations in Structured Streaming).

Key properties:

The micro-batch interval (typically 100 ms to a few seconds) is configurable; it sets the throughput-versus-freshness trade-off and the floor on end-to-end latency.
Windowed aggregations are first-class — count, sum, groupByKey over a window are the common shapes.
Integrates natively with Kafka (and other sources) as a stream input; typically writes to Kafka or a storage sink.
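
The micro-batch and windowed-count properties above can be illustrated without Spark. The following is a minimal pure-Python sketch, not Spark API: the function name `micro_batch_counts`, the sample events, and the 2-second interval are illustrative assumptions. It only shows the shape of the computation, which is to assign each record to a fixed time interval and aggregate per key inside that interval.

```python
from collections import Counter, defaultdict


def micro_batch_counts(events, interval_s=2.0):
    """Count keys per fixed time interval.

    events: iterable of (timestamp_seconds, key) pairs.
    Returns {interval_start_seconds: Counter}, ordered by interval start.
    Toy model only (assumption): a real engine also handles late data,
    state checkpoints, and delivery guarantees.
    """
    batches = defaultdict(Counter)
    for ts, key in events:
        # Align the timestamp down to the start of its interval.
        batch_start = (ts // interval_s) * interval_s
        batches[batch_start][key] += 1
    return dict(sorted(batches.items()))


# Hypothetical sample input: (seconds since start, emoji type).
events = [(0.1, "heart"), (0.9, "laugh"), (1.7, "heart"), (2.3, "heart"), (3.9, "clap")]
counts = micro_batch_counts(events)

for start, counter in counts.items():
    print(start, dict(counter))
```

Each printed line corresponds to one interval: nothing about the first interval is emitted until that interval has closed, which is the freshness cost the batch interval imposes.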

Real-Time Mode (2026)¶

Spark Structured Streaming now supports a Real-Time Mode that eliminates the micro-batch latency floor by processing records and firing timers with sub-second precision — event-at-a-time semantics rather than batch-at-a-time.

Key properties:

No batch interval — records are processed as they arrive; timers fire at sub-second granularity instead of waiting for the next batch boundary.
transformWithState operator — a next-generation stateful API offering object-oriented state (MapState, ListState), composite types, processing-time timers, automatic TTL, and schema evolution. A single StatefulProcessor class implements both reactive (handleInputRows) and proactive (handleExpiredTimer) paths.
Backward compatible — switching from micro-batch to Real-Time Mode is "a single trigger change" with no code rewrite.
Production numbers (gaming sessionization, 2026): 432 ms p99 end-to-end Kafka-to-Kafka, 20× faster than micro-batch on the same workload. 500K input events/min, 4M concurrent stateful sessions, 8M heartbeat outputs/min (16× amplification).

(Source: sources/2026-06-03-databricks-apache-spark-real-time-mode-for-gaming)
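
The reactive-plus-proactive split described above can be sketched in plain Python. This is a toy stand-in, not the Spark `transformWithState` API: the class `SessionTracker`, its snake_case method names, and the event strings are assumptions that loosely mirror `handleInputRows` and `handleExpiredTimer`. The 30-second heartbeat period is the one reported for the gaming workload.

```python
HEARTBEAT_S = 30  # heartbeat period reported for the gaming workload


class SessionTracker:
    """Toy model of one processor with a reactive and a proactive path."""

    def __init__(self):
        # session_id -> time (seconds) at which the next heartbeat is due
        self.next_timer = {}

    def handle_input_rows(self, session_id, event, now):
        """Reactive path: runs because an input event arrived."""
        if event == "GameStart":
            self.next_timer[session_id] = now + HEARTBEAT_S
            return [(session_id, "started", now)]
        if event == "GameEnd":
            self.next_timer.pop(session_id, None)
            return [(session_id, "ended", now)]
        return []

    def handle_expired_timers(self, now):
        """Proactive path: runs because time passed, with no input event."""
        out = []
        for sid, due in sorted(self.next_timer.items()):
            while due <= now:
                out.append((sid, "heartbeat", due))
                due += HEARTBEAT_S
            self.next_timer[sid] = due  # re-arm the timer
        return out


tracker = SessionTracker()
started = tracker.handle_input_rows("s1", "GameStart", now=0)
beats = tracker.handle_expired_timers(now=60)   # heartbeats due at 30 and 60
ended = tracker.handle_input_rows("s1", "GameEnd", now=70)
after = tracker.handle_expired_timers(now=120)  # session ended, nothing due

# Consistency check on the reported figures: one heartbeat per session
# every 30 seconds across 4M sessions, against 500K input events per minute.
heartbeats_per_min = 4_000_000 * (60 // HEARTBEAT_S)
amplification = heartbeats_per_min / 500_000
```

Under these assumptions the arithmetic lines up with the reported numbers: 4M sessions at two heartbeats per minute give 8M outputs per minute, which is 16 times the 500K inputs per minute. Most output therefore comes from the timer path rather than from arriving events.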

Trade-offs¶

Advantages (per Hotstar, 2024):

Support for micro-batching and aggregations fits any UX whose cadence is already discrete (e.g. a 2-second emoji-swarm refresh).
Better community support than Flink (at 2019 decision time).
Same SQL/DataFrame API as batch Spark — ops teams already know it.

Structural downsides (implicit; some dated):

End-to-end latency floor = batch interval + processing time in micro-batch mode. No sub-batch-interval freshness by construction. Fine for a 2-second animation; wrong for millisecond-sensitive pricing or bidding.
Event-time semantics were historically weaker than Flink's watermark model; Structured Streaming has narrowed this gap.
Resource contention: on a shared cluster, the streaming job's executor pool competes with Spark batch workloads unless it is isolated.
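
The latency relationship in the first downside can be worked through with a small model. This is a simplified sketch under stated assumptions, not a measurement: batches are assumed to close at exact multiples of the interval, processing time is assumed fixed, and the function name `micro_batch_latency` is hypothetical.

```python
import math


def micro_batch_latency(arrival_s, interval_s, processing_s):
    """Seconds from an event's arrival until its batch finishes processing.

    Assumptions: batches close at multiples of interval_s, and every batch
    takes exactly processing_s to process. Scheduling and sink delays are ignored.
    """
    # First batch boundary strictly after the arrival time.
    batch_close = (math.floor(arrival_s / interval_s) + 1) * interval_s
    return batch_close + processing_s - arrival_s


interval, processing = 2.0, 0.5
early = micro_batch_latency(0.0, interval, processing)  # arrives as a batch opens
late = micro_batch_latency(1.5, interval, processing)   # arrives late in the batch

print(early, late)
```

In this model an event that arrives as a batch opens waits the full interval plus the processing time (2.5 s here), while one that arrives late in the batch waits less (1.0 s here). Output is still emitted only once per interval, so shortening the interval is the only way to improve freshness in micro-batch mode.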

Seen in¶

sources/2024-03-26-highscalability-capturing-a-billion-emojions — Hotstar picked Spark Streaming over Flink, Storm, and Kafka Streams for the emoji-swarm aggregation stage. The batch window is 2 seconds — count emojis-per-type over 2-second windows and write the aggregates back to a second Kafka topic. Canonical wiki instance of "pick micro-batching when the UX cadence is already discrete."
sources/2026-05-29-databricks-databricks-at-sigmod-2026 — Spark Structured Streaming as the streaming-track engine inside SDP, paired with the Enzyme IVM engine on the materialized-view track. The companion VLDB 2026 paper "A Decade of Apache Spark Structured Streaming: How We Evolved the Architecture To Meet Real-world Needs" is announced in this source — the first wiki disclosure that Structured Streaming's evolution has been published as a first-author Databricks academic paper. The blog post characterises the SDP streaming track as "a wide variety of constructs — from stateful operators to watermarks, making it easy to express complicated business logic like custom aggregations". The VLDB paper itself is not yet available; this source establishes the forward reference.
sources/2026-06-03-databricks-apache-spark-real-time-mode-for-gaming — Real-Time Mode for gaming sessionization. Demonstrates transformWithState with MapState + processing-time timers tracking 4M concurrent sessions at 432 ms p99 (20× vs micro-batch). The reactive + proactive dual-path pattern: handleInputRows for GameStart/End, handleExpiredTimer for 30-second heartbeats. 16× output amplification (500K in → 8M out per minute).

Related¶

systems/apache-spark — the general-purpose compute engine Spark Streaming extends.
systems/kafka — the standard upstream/downstream pair for a Spark Streaming pipeline.
concepts/micro-batching — the processing model Spark Streaming popularised.
concepts/streaming-aggregation — the family of computations Spark Streaming is well-suited to.
systems/hotstar-emojis — canonical production-scale consumer.
systems/lakeflow-spark-declarative-pipelines — Databricks' SDP uses Structured Streaming as its streaming-track engine.
systems/enzyme-ivm — the materialized-view-track engine paired with Structured Streaming inside SDP.
concepts/declarative-vs-imperative-stream-api — Structured Streaming sits on the imperative end of this distinction inside SDP.
concepts/stateful-stream-processing — the paradigm that transformWithState + Real-Time Mode makes first-class in Spark.
patterns/timer-driven-heartbeat-emission — proactive output pattern enabled by Real-Time Mode's sub-second timer precision.
patterns/reactive-plus-proactive-stream-processing — the dual-path execution model demonstrated by transformWithState.
