PATTERN
Prometheus effective-batch-size dashboard¶
Problem¶
Streaming-broker operators need a single Grafana dashboard that exposes effective batch size, request rate, scheduler backlog, and CPU utilisation at the cluster / broker / topic level — the four signals required to (a) detect whether a cluster is in the saturated regime where linger tuning pays off, and (b) identify which topic is the tiny-batch offender. Producer-side metrics cannot answer either question.
Structure¶
Five canonical PromQL queries, each paired to a panel. Canonicalised from Redpanda's 2024-11-26 batch-tuning part 2 (Source: sources/2024-11-26-redpanda-batch-tuning-in-redpanda-to-optimize-performance-part-2):
1. Average effective batch size, by topic¶
sum(irate(vectorized_storage_log_written_bytes{topic!~"^_.*"}[5m])) by (topic)
/
sum(irate(vectorized_storage_log_batches_written{topic!~"^_.*"}[5m])) by (topic)
- Panel type: time series, one line per topic.
- Reference lines at 4 KB (NVMe write-amplification floor) and 16 KB (Redpanda recommended sweet spot).
- The topic!~"^_.*" filter excludes internal topics (_schemas, __consumer_offsets, __transaction_state).
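The ratio in panel 1 is just bytes written divided by batches written over the same window. A minimal sketch of the same computation over two raw counter samples (metric semantics and the 4 KB / 16 KB reference lines are from the panel above; the function and sample numbers are illustrative):

```python
# Effective batch size = delta(bytes written) / delta(batches written),
# the same ratio panel 1 computes with irate() over a 5m window.
def effective_batch_size(bytes_t0, bytes_t1, batches_t0, batches_t1):
    """Counter samples at two scrape times; returns average bytes per batch."""
    delta_batches = batches_t1 - batches_t0
    if delta_batches <= 0:
        return 0.0
    return (bytes_t1 - bytes_t0) / delta_batches

# Reference lines from the panel: 4 KB NVMe floor, 16 KB recommended sweet spot.
NVME_FLOOR = 4 * 1024
SWEET_SPOT = 16 * 1024

avg = effective_batch_size(0, 1_000_000, 0, 2_000)
print(avg, avg < NVME_FLOOR)  # → 500.0 True — a tiny-batch topic, under the floor
```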
2. Batch write rate per core¶
sum(irate(vectorized_storage_log_batches_written{topic!~"^_.*"}[5m])) by (cluster)
/
count(redpanda_cpu_busy_seconds_total{}) by (cluster)
- Panel type: time series, one line per cluster.
- High values correlate with CPU saturation + tiny-batch workloads.
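Panel 2's denominator uses count() over the per-shard CPU metric as a core count (one series per shard), so the query reduces to a simple division. A sketch with hypothetical numbers:

```python
# Panel 2 in miniature: cluster-wide batch write rate divided by core count,
# where count(redpanda_cpu_busy_seconds_total) supplies the number of cores.
# The figures below are illustrative, not from the source.
def batch_rate_per_core(total_batches_per_sec, core_count):
    return total_batches_per_sec / core_count

print(batch_rate_per_core(120_000, 24))  # → 5000.0 batches/sec/core
```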
3. Batches per second, by topic¶
- Panel type: time series, one line per topic.
- The tiny-batch offender fingerprint: one topic's line orders of magnitude above the rest.
4. Scheduler queue backlog¶
- Panel type: time series, one line per group (task category).
- Main group rising above baseline ⇒ cluster is in the saturated regime. Linger tuning will reduce this.
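Panel 4's visual test ("main group rising above baseline") can be expressed as a tiny heuristic. The margin and fraction thresholds below are invented for illustration, not from the source:

```python
# Hypothetical saturation check for panel 4: flag the saturated regime when
# most recent backlog samples for the main scheduler group sit above a
# multiple of the quiet-period baseline. margin/fraction values are invented.
def main_group_saturated(backlog_samples, baseline, margin=2.0, fraction=0.8):
    over = sum(1 for b in backlog_samples if b > baseline * margin)
    return over >= fraction * len(backlog_samples)

print(main_group_saturated([50, 55, 60, 58, 52], baseline=20))  # → True
```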
5. CPU utilisation heatmap¶
- Panel type: heatmap (pods × shards).
- Thread-per-core hot spots and uneven shard distribution become visible.
Why these five compose¶
The five signals answer five distinct operational questions:
| Panel | Question answered |
|---|---|
| Effective batch size by topic | "Are we amortising fixed cost?" |
| Batch rate per core | "Is the broker thrashing on request overhead?" |
| Batches/sec by topic | "Which topic is the offender?" |
| Scheduler backlog | "Are we in the saturated regime?" |
| CPU utilisation heatmap | "Is the saturation hot-spotted or uniform?" |
Read together, they form a complete diagnostic: regime detection (4), offender identification (3), confirmation that the effective batch is too small (1), the broker CPU-thrashing signature (2), and shard-hotspot visibility (5).
Use¶
The dashboard is the primary tool for iterative linger-tuning: before each config change, snapshot the five signals; after each change, compare deltas. The production case in the Redpanda part-2 explainer walked three linger-change iterations against exactly this dashboard, achieving p99 128 ms → 17 ms and p99.999 490 ms → 130 ms.
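The snapshot-and-compare loop can be sketched against Prometheus's standard HTTP instant-query endpoint (`/api/v1/query`); the helper functions and server URL are hypothetical, and the worked example reuses the p99 / p99.999 numbers quoted above rather than a live server:

```python
import json
import urllib.parse
import urllib.request

def query_scalar(prom_url, promql):
    """Evaluate one instant query via Prometheus's /api/v1/query endpoint
    and return the first sample's value (enough for single-series panels)."""
    url = prom_url + "/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url) as resp:
        result = json.load(resp)["data"]["result"]
    return float(result[0]["value"][1]) if result else float("nan")

def diff_snapshots(before, after):
    """Per-signal delta between two {signal: value} snapshots taken
    before and after a linger-config change."""
    return {name: after[name] - before[name] for name in before}

# Offline example using the part-2 case's latency numbers (ms):
print(diff_snapshots({"p99": 128.0, "p99_999": 490.0},
                     {"p99": 17.0, "p99_999": 130.0}))
# → {'p99': -111.0, 'p99_999': -360.0}
```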
Seen in¶
- sources/2024-11-26-redpanda-batch-tuning-in-redpanda-to-optimize-performance-part-2 — canonical wiki source. All five PromQL queries quoted verbatim; described as "a simple Grafana visualization with a PromQL query" for each panel.
Related¶
- concepts/broker-effective-batch-size-observability — the metrics substrate.
- concepts/per-topic-batch-diagnosis — why by (topic) matters on panels 1 and 3.
- concepts/batching-latency-tradeoff — panel 4 confirms regime.
- patterns/iterative-linger-tuning-production-case — the workflow this dashboard supports.
- systems/prometheus, systems/grafana, systems/redpanda.