CONCEPT Cited by 1 source

Micro-batching

Definition

Micro-batching is a stream-processing execution model that groups incoming records into small, time-bounded batches (typically 100 ms – a few seconds) and processes each batch as a mini batch job — instead of processing each record the moment it arrives (continuous / event-at-a-time, like Flink or Kafka Streams).

Canonical implementation: Apache Spark Streaming (later Structured Streaming), which reuses Spark's batch RDD / DataFrame engine on each micro-batch.

Shape

time →  ▮ ▮▮ ▮   ▮ ▮▮▮ ▮▮   ▮ ▮
         └─2s─┘ └─2s─┘ └─2s─┘ ...
           │      │      │
           ▼      ▼      ▼
        batch  batch  batch
        job    job    job
  • Batch interval (configurable, e.g. 2 s at Hotstar) sets the throughput-vs-freshness trade-off.
  • End-to-end latency floor = batch interval + processing time; nothing is observable at the output at a granularity finer than one batch.
  • Windowed aggregations (count / sum / groupByKey over N batches) are first-class.
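The shape above can be sketched as a toy simulation in plain Python (not Spark; `assign_batches`, `run_batch_job`, and the sample timestamps are all illustrative):

```python
# Toy simulation of micro-batching: records carry arrival timestamps (seconds)
# and are grouped into fixed 2 s intervals, each processed as one batch job.

def assign_batches(records, batch_interval=2.0):
    """Group (timestamp, value) records into consecutive time-bounded batches."""
    batches = {}
    for ts, value in records:
        batch_id = int(ts // batch_interval)  # which interval this record falls in
        batches.setdefault(batch_id, []).append(value)
    return batches

def run_batch_job(values):
    """Stand-in for the per-batch job: here, just an aggregate count and sum."""
    return {"count": len(values), "sum": sum(values)}

records = [(0.1, 1), (0.9, 1), (2.3, 1), (3.8, 1), (3.9, 1), (5.0, 1)]
results = {bid: run_batch_job(vals) for bid, vals in assign_batches(records).items()}
# A record arriving at t=0.1 is only visible once its interval closes at t=2.0,
# which is where the "batch interval + processing time" latency floor comes from.
```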

When it fits

Micro-batching is the right choice when the downstream UX or data contract is already discrete at a cadence of ~100 ms or longer. In those cases continuous event-at-a-time processing adds engineering complexity without delivering any freshness the product (or its UI animations) could actually surface.

From sources/2024-03-26-highscalability-capturing-a-billion-emojions:

"Spark has support for micro batching and aggregations which are essential for our use case and better community support compared to competitors like Flink."

Hotstar picked Spark Streaming over Flink, Storm, and Kafka Streams for their emoji-swarm aggregator. The swarm UI refreshes every ~2 s — exactly the batch window. Continuous processing would not have been observable to users.
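The aggregation side of that pipeline can be sketched in plain Python (a toy stand-in for the actual Spark job; the 2 s interval follows the article, while the emoji names and `aggregate_batch` are illustrative):

```python
from collections import Counter

BATCH_INTERVAL = 2.0  # seconds, matching the ~2 s swarm refresh

def aggregate_batch(submissions):
    """One micro-batch job: collapse raw (timestamp, emoji) submissions into
    aggregate counts. Only these per-batch counts, not individual events,
    flow downstream to the output topic."""
    return Counter(emoji for _ts, emoji in submissions)

# Submissions that arrived within one 2 s window.
batch = [(0.2, "fire"), (0.7, "heart"), (1.1, "fire"), (1.9, "fire")]
counts = aggregate_batch(batch)  # published once per batch interval
```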

When it doesn't fit

  • Sub-batch-interval freshness required (sub-100 ms trading, fraud detection mid-transaction). Flink or Kafka Streams are better fits — no per-batch scheduling latency.
  • Strict event-time semantics with watermarks. Historically Spark Streaming was weaker here than Flink; Structured Streaming has narrowed the gap.
  • Per-record side effects with low cardinality (e.g. sending one webhook per record). Batching forces you to loop inside the batch.
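The last case can be made concrete with a toy sketch (plain Python; `send_webhook` is a hypothetical stub): inside a micro-batch engine the per-record side effect degenerates into a plain loop over the batch, so the framework adds scheduling latency without adding value.

```python
def process_batch(records, send_webhook):
    """Per-record side effects inside a micro-batch: just a loop over
    the batch, with each call delayed by the batch interval."""
    for record in records:
        send_webhook(record)  # one call per record, serialized inside the batch

sent = []
process_batch(["a", "b", "c"], sent.append)  # stub webhook just records calls
```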

Trade-off summary

Axis                              Micro-batch (Spark)              Continuous (Flink)
Minimum latency                   batch interval (~0.1 s–a few s)  tens of ms
API familiarity for batch users   high (same DataFrame API)        new
Event-time / watermark maturity   improving                        mature
Operational complexity            lower (reuses batch infra)       dedicated streaming infra
Recomputing windows on late data  expensive                        cheap

Seen in

  • sources/2024-03-26-highscalability-capturing-a-billion-emojions — Hotstar emoji-swarm: 2-second Spark Streaming micro-batch over Kafka-ingested user emoji submissions; output aggregate counts flow to a second Kafka topic and then to PubSub WebSocket fanout. Canonical wiki instance of "pick micro-batching when the UX cadence is already discrete."