

Per-topic batch diagnosis

Definition

Per-topic batch diagnosis is the discipline of measuring effective batch size disaggregated by topic, not as a cluster-wide average. The motivating observation: an aggregate cluster bytes / batches figure that looks healthy can still hide several high-volume topics with sub-4 KB batches that are the actual source of CPU saturation and tail-latency blowup.

Canonicalised on the wiki from Redpanda's 2024-11-26 part-2 batch-tuning retrospective — the load-bearing diagnostic insight from a real-customer cluster-consolidation investigation. (Source: sources/2024-11-26-redpanda-batch-tuning-in-redpanda-to-optimize-performance-part-2)

The aggregation trap

Verbatim from the post, describing the customer's initial situation:

"Initially, everyone believed that the effective batch size was close to the configured batch size of their producers. What nobody had accounted for, however, was the multiple use cases flowing through the cluster, each contributing its own shape to the traffic. In the sizing sessions, we originally evaluated behavior based on aggregate cluster volume and throughput, not the individual impact of heavier-weight topics, nor had this taken into account an extremely small producer linger configuration."

The discovery:

"As we dug into the per-topic effective batch size, we noticed that some high volume topics were batching well below their expected size. These topics created hundreds of thousands of tiny batches, driving up the Redpanda request rates. In turn, this was causing a backlog of requests to stack up on all brokers, driving up all latencies from median to tail."

Why aggregation hides the offender

Two mechanisms:

  1. Volume-weighted average pulls toward the big topics. A cluster with 10 topics, nine at healthy 16 KB batches and one at 512 B batches, can still show a weighted average of ~14 KB, above the 4 KB NVMe floor, provided the 512 B topic carries only a small share of total bytes (on the order of 0.5%). But that one sub-4 KB topic is doing all the damage.
  2. CPU cost is not linear in batch size. One topic's millions of tiny batches can consume more CPU than all the healthy topics combined, even if it's a small fraction of bytes. A byte-weighted average can't see this.
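Both mechanisms are easy to demonstrate with arithmetic. A minimal sketch (topic names and volumes are illustrative assumptions, not from the post) of how a byte-weighted cluster average hides a 512 B offender:

```python
KIB = 1024
GIB = 1024 * 1024 * KIB

# (bytes written, effective batch size) per topic; illustrative numbers.
# Nine healthy topics at 16 KiB batches, one offender at 512 B batches
# carrying only ~0.5% of total bytes.
topics = {f"healthy-{i}": (1 * GIB, 16 * KIB) for i in range(9)}
topics["offender"] = (46 * 1024 * KIB, 512)

def batch_count(bytes_written, batch_size):
    return bytes_written // batch_size

total_bytes = sum(b for b, _ in topics.values())
total_batches = sum(batch_count(b, s) for b, s in topics.values())

# The cluster-wide average looks healthy: ~13.9 KiB, well above 4 KiB.
cluster_avg_kib = total_bytes / total_batches / KIB
print(f"cluster average: {cluster_avg_kib:.1f} KiB/batch")

# The per-topic view exposes the offender: it produces more batches than
# any healthy topic despite its tiny byte share.
for name, (b, s) in sorted(topics.items(), key=lambda kv: -batch_count(*kv[1])):
    print(f"{name:>10}: {s / KIB:5.2f} KiB/batch, {batch_count(b, s):>7,} batches")
```

If anything, the sketch understates the problem: per-batch request overhead means the offender's CPU cost exceeds what its batch count alone suggests.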

The PromQL discipline

Every effective-batch-size query in a production dashboard must carry a by (topic) clause. Canonical template from the post:

sum(irate(vectorized_storage_log_written_bytes{topic!~"^_.*"}[5m])) by (topic)
  /
sum(irate(vectorized_storage_log_batches_written{topic!~"^_.*"}[5m])) by (topic)

The companion:

sum(irate(vectorized_storage_log_batches_written{topic!~"^_.*"}[5m])) by (topic)

A single topic with batches_written rate orders of magnitude above the rest is the visual fingerprint of the offender.
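To surface that fingerprint directly instead of eyeballing many series, the companion query can be wrapped in topk (standard PromQL, not from the post) to rank topics by batch rate:

topk(5, sum(irate(vectorized_storage_log_batches_written{topic!~"^_.*"}[5m])) by (topic))

Pairing the top entry with the effective-batch-size query confirms the topic is batching small, not merely busy.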

Operator workflow

  1. Start with cluster-aggregate. Check whether any problem exists (scheduler backlog, CPU utilisation).
  2. If aggregate looks healthy, disaggregate by topic. Never stop at the cluster view. The aggregate can be a false negative.
  3. Rank topics by batch-rate. High batch-rate × low per-batch bytes = tiny-batch offender.
  4. Fix linger.ms or batch.size on the offending topic's producer. Apply the change to that producer only, not cluster-wide.
  5. Verify scheduler backlog + tail latency fell. The aggregate metrics will usually improve, but the per-topic view is how you confirm the fix landed on the right topic.
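Step 4 in practice, sketched as a producer config using the standard Kafka producer property names (the broker address and the specific values are illustrative assumptions, not from the post):

```python
# Per-topic fix: applied only to the producer feeding the offending topic.
offender_producer_config = {
    "bootstrap.servers": "redpanda.example.com:9092",  # placeholder address
    # Give the accumulator time to fill a batch instead of flushing
    # near-empty ones; the post's customer had an extremely small linger.
    "linger.ms": 50,
    # Upper bound on batch bytes; effective batches usually land below it.
    "batch.size": 128 * 1024,
}

# Producers on the healthy topics keep their existing settings; the fix
# is deliberately not cluster-wide.
```

The linger value trades a bounded amount of produce latency for fewer, fuller batches; whether 50 ms is acceptable depends on the offending topic's latency budget.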
