CONCEPT Cited by 2 sources

Micro-batching

Definition

Micro-batching is a stream-processing execution model that groups incoming records into small, time-bounded batches (typically 100 ms to a few seconds) and processes each batch as a small batch job, rather than processing each record the moment it arrives (the continuous, event-at-a-time model used by Flink and Kafka Streams).

Canonical implementation: Apache Spark Streaming (later Structured Streaming), which reuses Spark's batch RDD / DataFrame engine on each micro-batch.

Shape

time →  ▮ ▮▮ ▮   ▮ ▮▮▮ ▮▮   ▮ ▮
         └─2s─┘ └─2s─┘ └─2s─┘ ...
           │      │      │
           ▼      ▼      ▼
        batch  batch  batch
        job    job    job
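  • Batch interval (configurable, e.g. 2 s at Hotstar) sets the trade-off between throughput and freshness.
  • Worst-case end-to-end latency ≈ batch interval + processing time: a record that arrives just after a batch opens waits almost a full interval before its batch is even processed. The output updates at most once per batch interval, so no change finer than one batch is observable downstream.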
  • Windowed aggregations (count / sum / groupByKey over N batches) are first-class.
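
The latency bullet above can be made concrete with a little arithmetic. The sketch below is a simplified model, not Spark's scheduler: it assumes batches close on fixed interval boundaries and that each batch takes a constant time to process. The function names and the 300 ms processing time are hypothetical; only the 2 s interval comes from the Hotstar example.

```python
def emit_time_ms(arrival_ms: int, interval_ms: int, processing_ms: int) -> int:
    """Time at which the result containing this record becomes visible.

    Simplified model: the record joins the batch that closes at the next
    interval boundary, and the result appears once that batch is processed.
    """
    batch_close_ms = (arrival_ms // interval_ms + 1) * interval_ms
    return batch_close_ms + processing_ms


def latency_ms(arrival_ms: int, interval_ms: int, processing_ms: int) -> int:
    """End-to-end latency for a single record under the model above."""
    return emit_time_ms(arrival_ms, interval_ms, processing_ms) - arrival_ms


INTERVAL_MS = 2000    # 2 s batch interval, as in the Hotstar example
PROCESSING_MS = 300   # hypothetical per-batch processing time

# A record arriving as the batch opens waits the whole interval.
early = latency_ms(0, INTERVAL_MS, PROCESSING_MS)
# A record arriving just before the batch closes waits almost nothing.
late = latency_ms(1999, INTERVAL_MS, PROCESSING_MS)

print(early, late)
```

Under these assumptions the per-record latency ranges from roughly the processing time up to the batch interval plus the processing time, depending only on where in the interval the record lands.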

When it fits

Micro-batching is the right choice when the downstream UX or data contract is already discrete at a cadence of ~100 ms or longer. In those cases, continuous event-at-a-time processing adds engineering complexity without adding any freshness that the UI or the product can actually show.

From sources/2024-03-26-highscalability-capturing-a-billion-emojions:

"Spark has support for micro batching and aggregations which are essential for our use case and better community support compared to competitors like Flink."

Hotstar picked Spark Streaming over Flink, Storm, and Kafka Streams for its emoji-swarm aggregator. The swarm UI refreshes every ~2 s, which is exactly the batch window, so the extra freshness of continuous processing would not have been observable to users.
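
As an illustration of the per-batch aggregation pattern described here, the following is a minimal pure-Python sketch. It is not Hotstar's code and does not use Spark: the function name, the event tuples, and the emoji labels are all hypothetical, and only the 2 s interval comes from the source.

```python
from collections import Counter, defaultdict


def aggregate_micro_batches(events, interval_ms):
    """Group (timestamp_ms, emoji) events into fixed intervals and count per batch.

    Returns a dict mapping each batch start time to its emoji counts.
    """
    batches = defaultdict(Counter)
    for ts_ms, emoji in events:
        # Every event in the same interval lands in the same batch.
        batch_start = (ts_ms // interval_ms) * interval_ms
        batches[batch_start][emoji] += 1
    return {start: dict(counts) for start, counts in sorted(batches.items())}


# Hypothetical emoji submissions: (arrival time in ms, emoji label).
events = [
    (100, "heart"), (900, "heart"), (1500, "clap"),
    (2100, "clap"), (3999, "heart"),
    (4000, "fire"),
]

result = aggregate_micro_batches(events, interval_ms=2000)
for start, counts in result.items():
    print(start, counts)
```

Each batch yields one aggregate, so a consumer that redraws every 2 s receives exactly one update per redraw.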

When it doesn't fit

  • Sub-batch-interval freshness required (sub-100 ms trading, fraud detection mid-transaction). Flink or Kafka Streams is a better fit because neither pays a per-batch scheduling latency.
  • Strict event-time semantics with watermarks. Historically Spark Streaming was weaker here than Flink; Structured Streaming has narrowed the gap.
  • Per-record side effects with low cardinality (e.g. sending one webhook per record). Batching forces you to loop inside the batch.
  • Timer-driven proactive output at sub-second precision. When processing-time timers must fire with sub-second granularity (e.g., 30-second heartbeats for 4M sessions), micro-batching rounds each timer fire up to the next batch boundary. Spark's Real-Time Mode (2026) removes this constraint: same API, a single trigger change, and a roughly 20× latency improvement (Source: sources/2026-06-03-databricks-apache-spark-real-time-mode-for-gaming).
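
The timer rounding in the last bullet can be sketched as follows. This is a simplified model that only captures alignment to batch boundaries; it ignores processing time, scheduling, and state-store overhead, so it does not reproduce the latency figures reported in the source. The function names and the 2 s interval are assumptions for illustration.

```python
def micro_batch_fire_ms(due_ms: int, interval_ms: int) -> int:
    """First batch boundary at or after the timer's due time."""
    # Ceiling division using integer arithmetic only.
    return -(-due_ms // interval_ms) * interval_ms


def timer_delay_ms(due_ms: int, interval_ms: int) -> int:
    """How late the timer fires relative to when it was due."""
    return micro_batch_fire_ms(due_ms, interval_ms) - due_ms


INTERVAL_MS = 2000  # hypothetical batch interval

# A 30 s heartbeat that falls exactly on a boundary fires on time.
on_boundary = timer_delay_ms(30_000, INTERVAL_MS)
# One that falls just after a boundary waits for the next batch.
just_after = timer_delay_ms(30_001, INTERVAL_MS)

print(on_boundary, just_after)
```

In this model the added delay is anywhere from zero to just under one batch interval, which is why sub-second timer precision is out of reach when the interval itself is a second or more.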

Trade-off summary

Axis                               | Micro-batch (Spark)             | Continuous (Flink)
-----------------------------------+---------------------------------+---------------------------
Minimum latency                    | batch interval (~0.1–few s)     | tens of ms
API familiarity for batch users    | high (same DataFrame API)       | new
Event-time / watermark maturity    | improving                       | mature
Operational complexity             | lower (reuses batch infra)      | dedicated streaming infra
Recomputing windows on late data   | expensive                       | cheap

Seen in

  • sources/2024-03-26-highscalability-capturing-a-billion-emojions — Hotstar emoji-swarm: 2-second Spark Streaming micro-batch over Kafka-ingested user emoji submissions; output aggregate counts flow to a second Kafka topic and then to PubSub WebSocket fanout. Canonical wiki instance of "pick micro-batching when the UX cadence is already discrete."
  • sources/2026-06-03-databricks-apache-spark-real-time-mode-for-gaming: Documents micro-batch mode's limitation for timer-driven gaming sessionization: ~8.6 s p99 latency because timers align to batch boundaries. Real-Time Mode (same Spark engine, trigger change only) drops latency to 432 ms p99, the clearest production evidence that the latency limit comes from the micro-batch trigger rather than from the Spark engine itself.
Last updated · 542 distilled / 1,571 read