
PATTERN

Prometheus effective-batch-size dashboard

Problem

Streaming-broker operators need a single Grafana dashboard that exposes effective batch size, request rate, scheduler backlog, and CPU utilisation at the cluster, broker, and topic level: the four signals required to (a) detect whether a cluster is in the saturated regime where linger tuning pays off, and (b) identify which topic is the tiny-batch offender. Producer-side metrics alone cannot answer either question.

Structure

Five canonical PromQL queries, each paired with a panel. Canonicalised from Redpanda's 2024-11-26 batch-tuning part 2 (Source: sources/2024-11-26-redpanda-batch-tuning-in-redpanda-to-optimize-performance-part-2):

1. Average effective batch size, by topic

sum(irate(vectorized_storage_log_written_bytes{topic!~"^_.*"}[5m])) by (topic)
  /
sum(irate(vectorized_storage_log_batches_written{topic!~"^_.*"}[5m])) by (topic)
  • Panel type: time series, one line per topic.
  • Reference lines at 4 KB (NVMe write-amplification floor) and 16 KB (Redpanda recommended sweet spot).
  • The topic!~"^_.*" filter excludes internal topics (_schemas, __consumer_offsets, __transaction_state).
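The per-topic ratio above is reused by several panels, so it can be worth persisting as a Prometheus recording rule rather than re-evaluating it in each panel. A minimal sketch; the group and rule names here are assumptions, not from the source:

```yaml
# Prometheus rules file -- sketch only; group/rule names are illustrative
groups:
  - name: redpanda-batch-size
    rules:
      - record: topic:effective_batch_size_bytes:irate5m
        expr: |
          sum(irate(vectorized_storage_log_written_bytes{topic!~"^_.*"}[5m])) by (topic)
            /
          sum(irate(vectorized_storage_log_batches_written{topic!~"^_.*"}[5m])) by (topic)
```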

2. Batch write rate per core

sum(irate(vectorized_storage_log_batches_written{topic!~"^_.*"}[5m])) by (cluster)
  /
count(redpanda_cpu_busy_seconds_total{}) by (cluster)
  • Panel type: time series, one line per cluster.
  • High values correlate with CPU saturation + tiny-batch workloads.

3. Batches per second, by topic

sum(irate(vectorized_storage_log_batches_written{topic!~"^_.*"}[5m])) by (topic)
  • Panel type: time series, one line per topic.
  • The tiny-batch offender fingerprint: one topic's line orders of magnitude above the rest.
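On clusters with many topics the per-topic lines become hard to scan; a `topk` variant keeps only the likely offenders on the panel. The cut-off of 5 is an arbitrary choice, not part of the source dashboard:

```promql
topk(5, sum(irate(vectorized_storage_log_batches_written{topic!~"^_.*"}[5m])) by (topic))
```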

4. Scheduler queue backlog

sum(vectorized_scheduler_queue_length{}) by (cluster, group)
  • Panel type: time series, one line per group (task category).
  • Main group rising > baseline ⇒ cluster is in the saturated regime. Linger tuning will reduce this.
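Because a rising main-group queue is the saturation signal, panel 4 is also the natural alerting hook. A hedged sketch of an alerting rule; the threshold, duration, and rule name are illustrative assumptions, and the `group="main"` label value should be verified against your actual scrape:

```yaml
# Prometheus alerting rule -- sketch only; threshold of 10 for 10m is an assumption
groups:
  - name: redpanda-saturation
    rules:
      - alert: SchedulerBacklogRising
        expr: sum(vectorized_scheduler_queue_length{group="main"}) by (cluster) > 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Scheduler backlog rising on {{ $labels.cluster }}; likely saturated regime, consider linger tuning"
```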

5. CPU utilisation heatmap

avg(deriv(redpanda_cpu_busy_seconds_total{}[5m])) by (pod, shard)
  • Panel type: heatmap (pods × shards).
  • Thread-per-core hot spots and uneven shard distribution become visible.

Why these five compose

The five signals answer five distinct operational questions:

Panel                            Question answered
Effective batch size by topic    "Are we amortising fixed cost?"
Batch rate per core              "Is the broker thrashing on request overhead?"
Batches/sec by topic             "Which topic is the offender?"
Scheduler backlog                "Are we in the saturated regime?"
CPU utilisation heatmap          "Is the saturation hot-spotted or uniform?"

Read together, they form a complete diagnostic: saturation regime (4) + offender identification (3) + confirmation of effective batch too small (1) + broker CPU-thrashing signature (2) + shard-hotspot visibility (5).

Use

The dashboard is the primary tool for iterative linger-tuning: before each config change, snapshot the five signals; after each change, compare deltas. The production case in the Redpanda part-2 explainer walked three linger-change iterations against exactly this dashboard, achieving p99 128 ms → 17 ms and p99.999 490 ms → 130 ms.
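For the before/after comparison, PromQL's `offset` modifier can render the delta directly on a panel instead of eyeballing two snapshots. For example, effective batch size now versus one hour ago; the 1h window is an arbitrary choice for illustration, not from the source:

```promql
(
  sum(irate(vectorized_storage_log_written_bytes{topic!~"^_.*"}[5m])) by (topic)
    /
  sum(irate(vectorized_storage_log_batches_written{topic!~"^_.*"}[5m])) by (topic)
)
  -
(
  sum(irate(vectorized_storage_log_written_bytes{topic!~"^_.*"}[5m] offset 1h)) by (topic)
    /
  sum(irate(vectorized_storage_log_batches_written{topic!~"^_.*"}[5m] offset 1h)) by (topic)
)
```

A positive per-topic value after a linger change means the effective batch size grew relative to the pre-change baseline.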
