CONCEPT Cited by 1 source

Micro-batching

Definition

Micro-batching is a stream-processing execution model that groups incoming records into small, time-bounded batches (typically 100 ms – a few seconds) and processes each batch as a mini batch job — instead of processing each record the moment it arrives (continuous / event-at-a-time, like Flink or Kafka Streams).

Canonical implementation: Apache Spark Streaming (later Structured Streaming), which reuses Spark's batch RDD / DataFrame engine on each micro-batch.

Shape

time →  ▮ ▮▮ ▮   ▮ ▮▮▮ ▮▮   ▮ ▮
         └─2s─┘ └─2s─┘ └─2s─┘ ...
           │      │      │
           ▼      ▼      ▼
        batch  batch  batch
        job    job    job
  • Batch interval (configurable, e.g. 2 s at Hotstar) sets the throughput-vs-freshness trade-off.
  • End-to-end latency floor = batch interval + processing time; nothing is observable at the output at a granularity finer than one batch.
  • Windowed aggregations (count / sum / groupByKey over N batches) are first-class.
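The shape above can be sketched as a toy simulation in plain Python (not Spark; `assign_batches`, `run_batch_job`, and the sample timestamps are all illustrative):

```python
# Toy simulation of micro-batching: records carry arrival timestamps (seconds)
# and are grouped into fixed 2 s intervals, each processed as one batch job.

def assign_batches(records, batch_interval=2.0):
    """Group (timestamp, value) records into consecutive time-bounded batches."""
    batches = {}
    for ts, value in records:
        batch_id = int(ts // batch_interval)  # which interval this record falls in
        batches.setdefault(batch_id, []).append(value)
    return batches

def run_batch_job(values):
    """Stand-in for the per-batch job: here, just an aggregate count and sum."""
    return {"count": len(values), "sum": sum(values)}

records = [(0.1, 1), (0.9, 1), (2.3, 1), (3.8, 1), (3.9, 1), (5.0, 1)]
results = {bid: run_batch_job(vals) for bid, vals in assign_batches(records).items()}
# A record arriving at t=0.1 is only visible once its interval closes at t=2.0,
# which is where the "batch interval + processing time" latency floor comes from.
```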

When it fits

Micro-batching is the right choice when the downstream UX or data contract is already discrete at a cadence of ~100 ms or longer. In those cases continuous event-at-a-time processing adds engineering complexity without delivering any freshness the product (or its UI animations) could actually surface.

From sources/2024-03-26-highscalability-capturing-a-billion-emojions:

"Spark has support for micro batching and aggregations which are essential for our use case and better community support compared to competitors like Flink."

Hotstar picked Spark Streaming over Flink, Storm, and Kafka Streams for their emoji-swarm aggregator. The swarm UI refreshes every ~2 s — exactly the batch window. Continuous processing would not have been observable to users.
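The aggregation side of that pipeline can be sketched in plain Python (a toy stand-in for the actual Spark job; the 2 s interval follows the article, while the emoji names and `aggregate_batch` are illustrative):

```python
from collections import Counter

BATCH_INTERVAL = 2.0  # seconds, matching the ~2 s swarm refresh

def aggregate_batch(submissions):
    """One micro-batch job: collapse raw (timestamp, emoji) submissions into
    aggregate counts. Only these per-batch counts, not individual events,
    flow downstream to the output topic."""
    return Counter(emoji for _ts, emoji in submissions)

# Submissions that arrived within one 2 s window.
batch = [(0.2, "fire"), (0.7, "heart"), (1.1, "fire"), (1.9, "fire")]
counts = aggregate_batch(batch)  # published once per batch interval
```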

When it doesn't fit

  • Sub-batch-interval freshness required (sub-100 ms trading, fraud detection mid-transaction). Flink or Kafka Streams are better fits — no per-batch scheduling latency.
  • Strict event-time semantics with watermarks. Historically Spark Streaming was weaker here than Flink; Structured Streaming has narrowed the gap.
  • Per-record side effects with low cardinality (e.g. sending one webhook per record). Batching forces you to loop inside the batch.
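The last case can be made concrete with a toy sketch (plain Python; `send_webhook` is a hypothetical stub): inside a micro-batch engine the per-record side effect degenerates into a plain loop over the batch, so the framework adds scheduling latency without adding value.

```python
def process_batch(records, send_webhook):
    """Per-record side effects inside a micro-batch: just a loop over
    the batch, with each call delayed by the batch interval."""
    for record in records:
        send_webhook(record)  # one call per record, serialized inside the batch

sent = []
process_batch(["a", "b", "c"], sent.append)  # stub webhook just records calls
```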

Trade-off summary

Axis                              Micro-batch (Spark)              Continuous (Flink)
Minimum latency                   batch interval (~0.1 s–a few s)  tens of ms
API familiarity for batch users   high (same DataFrame API)        new
Event-time / watermark maturity   improving                        mature
Operational complexity            lower (reuses batch infra)       dedicated streaming infra
Recomputing windows on late data  expensive                        cheap

Seen in

  • sources/2024-03-26-highscalability-capturing-a-billion-emojions — Hotstar emoji-swarm: 2-second Spark Streaming micro-batch over Kafka-ingested user emoji submissions; output aggregate counts flow to a second Kafka topic and then to PubSub WebSocket fanout. Canonical wiki instance of "pick micro-batching when the UX cadence is already discrete."