CONCEPT Cited by 1 source

Broker-side effective-batch-size observability

Definition

Broker-side effective-batch-size observability is the practice of measuring the effective batch size arriving at a streaming broker (as opposed to the producer's configured ceiling) by dividing the rate of a bytes-written counter by the rate of a batches-written counter, both emitted from the broker's own telemetry surface. The ratio, computed from Prometheus counters, is the canonical operations-team answer to "is our batching actually working?"

For Redpanda, the 2024-11-26 batch-tuning explainer names the following private metrics and one public sibling verbatim (Source: sources/2024-11-26-redpanda-batch-tuning-in-redpanda-to-optimize-performance-part-2):

vectorized_storage_log_written_bytes — private; bytes written since process start.
vectorized_storage_log_batches_written — private; batches written since process start.
vectorized_scheduler_queue_length — private; broker's internal backlog of tasks.
redpanda_cpu_busy_seconds_total — public; CPU utilisation.

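The ratio vectorized_storage_log_written_bytes / vectorized_storage_log_batches_written is the average effective batch size at the broker. Because both counters accumulate from process start, dividing the raw values gives a lifetime average; dividing their rates over the same window gives the recent average, which is the form the queries on this page use.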
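The arithmetic behind the ratio can be sketched in Python. This is a minimal illustration, not Redpanda or Prometheus code: the `CounterSample` type, the `effective_batch_size` helper and the sample values are all hypothetical, and the sketch assumes both counters were scraped at the same two instants.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CounterSample:
    """One scrape of the two cumulative broker counters (hypothetical type)."""
    written_bytes: float    # value of vectorized_storage_log_written_bytes
    batches_written: float  # value of vectorized_storage_log_batches_written


def effective_batch_size(earlier: CounterSample, later: CounterSample) -> Optional[float]:
    """Average bytes per batch written between two scrapes.

    Returns None when no batches were written in the window, or when a
    counter went backwards (for example after a broker restart).
    """
    delta_bytes = later.written_bytes - earlier.written_bytes
    delta_batches = later.batches_written - earlier.batches_written
    if delta_batches <= 0 or delta_bytes < 0:
        return None
    return delta_bytes / delta_batches


# Made-up values for illustration only.
earlier = CounterSample(written_bytes=1_000_000, batches_written=2_000)
later = CounterSample(written_bytes=1_819_200, batches_written=2_400)

window_avg = effective_batch_size(earlier, later)            # 819_200 / 400
lifetime_avg = later.written_bytes / later.batches_written   # 1_819_200 / 2_400

print(window_avg, lifetime_avg)
```

With these made-up numbers the window average is 2048 bytes per batch while the lifetime average is 758, which shows why a windowed ratio is more useful for checking a recent tuning change than the raw counter ratio.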

Canonical PromQL templates

Per-topic effective batch size (bytes / batch):

sum(irate(vectorized_storage_log_written_bytes{topic!~"^_.*"}[5m])) by (topic)
  /
sum(irate(vectorized_storage_log_batches_written{topic!~"^_.*"}[5m])) by (topic)
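After each side is aggregated by (topic), the division pairs series that carry the same topic label and drops topics that appear on only one side. The sketch below mimics that matching with plain dictionaries; the function name and the rates are hypothetical, and it simplifies one detail, noted in the comments, around division by zero.

```python
from typing import Dict


def bytes_per_batch_by_topic(
    byte_rates: Dict[str, float],
    batch_rates: Dict[str, float],
) -> Dict[str, float]:
    """Divide per-topic byte rates by per-topic batch rates.

    Topics missing from either input are dropped, which is how one-to-one
    vector matching behaves in PromQL. Simplification: topics with a zero
    batch rate are skipped here, whereas PromQL would return NaN or +Inf.
    """
    result: Dict[str, float] = {}
    for topic, byte_rate in byte_rates.items():
        batch_rate = batch_rates.get(topic)
        if batch_rate is None or batch_rate == 0:
            continue
        result[topic] = byte_rate / batch_rate
    return result


# Hypothetical per-second rates, keyed by topic label.
byte_rates = {"orders": 4_096_000.0, "clicks": 512_000.0}
batch_rates = {"orders": 250.0, "clicks": 1_000.0, "audit": 5.0}

per_topic = bytes_per_batch_by_topic(byte_rates, batch_rates)
print(per_topic)
```

In this made-up data "orders" averages 16384 bytes per batch and "clicks" only 512, a difference that a single cluster-wide ratio would blur.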

Per-core batch write rate (batches / sec / core):

sum(irate(vectorized_storage_log_batches_written{topic!~"^_.*"}[5m])) by (cluster)
  /
count(redpanda_cpu_busy_seconds_total{}) by (cluster)

Scheduler backlog:

sum(vectorized_scheduler_queue_length{}) by (cluster, group)

CPU utilisation per pod/shard:

avg(deriv(redpanda_cpu_busy_seconds_total{}[5m])) by (pod, shard)
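PromQL's deriv() estimates a per-second slope by simple linear regression over the samples in the window. Assuming redpanda_cpu_busy_seconds_total accumulates busy seconds, as its name suggests, that slope reads as the fraction of wall-clock time the shard was busy. The sketch below reproduces the regression on made-up samples; the function name and data are hypothetical.

```python
from typing import List, Tuple


def least_squares_slope(samples: List[Tuple[float, float]]) -> float:
    """Slope of the least-squares line through (timestamp, value) pairs."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("all samples share the same timestamp")
    return covariance / variance


# Hypothetical scrapes every 15 s: busy seconds grow by 9 s per scrape.
busy_seconds = [(0.0, 100.0), (15.0, 109.0), (30.0, 118.0), (45.0, 127.0)]

utilisation = least_squares_slope(busy_seconds)
print(f"{utilisation:.0%} busy")
```

Here the shard accrues 9 busy seconds per 15-second scrape, so the slope works out to 0.6 busy seconds per second, or roughly 60% utilisation.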

Why broker-side beats producer-side observability

The producer's configured batch.size and linger.ms are ceilings, not descriptions. The producer's own metrics (e.g. kafka.producer.record-send-rate) can show sent-record counts but not how those records were aggregated into batches by the time they reached the broker. The broker is the only vantage point where bytes / batches reflects the effect of the full seven-factor effective-batch-size pipeline (message rate, partitioning, producer fan-out, buffer memory, backpressure, etc.).

Why the `topic!~"^_.*"` filter

Internal Kafka/Redpanda topics (_schemas, __consumer_offsets, __transaction_state) have different byte/batch profiles from application traffic: typically very small records with strict ordering requirements. Including them in a bytes / batches average pulls the number down and hides the tuning signal from application topics.
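A small worked example shows how much an internal topic can distort the average. The figures below are invented for illustration, and the helper names are hypothetical; the regular expression mirrors the `^_.*` pattern from the queries, applied with a full match because PromQL anchors label regexes at both ends.

```python
import re
from typing import Dict, Tuple

# Same pattern the PromQL filter excludes with topic!~"^_.*".
INTERNAL_TOPIC = re.compile(r"^_.*")


def is_internal(topic: str) -> bool:
    """True for topics whose name starts with an underscore."""
    return INTERNAL_TOPIC.fullmatch(topic) is not None


def average_batch_size(
    per_topic: Dict[str, Tuple[int, int]],
    include_internal: bool,
) -> float:
    """Total bytes divided by total batches across the selected topics."""
    total_bytes = 0
    total_batches = 0
    for topic, (bytes_written, batches) in per_topic.items():
        if not include_internal and is_internal(topic):
            continue
        total_bytes += bytes_written
        total_batches += batches
    return total_bytes / total_batches


# Invented (bytes written, batches written) over one window.
window = {
    "orders": (8_000_000, 500),            # 16000 bytes per batch
    "__consumer_offsets": (200_000, 1_500),  # about 133 bytes per batch
}

with_internal = average_batch_size(window, include_internal=True)
without_internal = average_batch_size(window, include_internal=False)
print(with_internal, without_internal)
```

With the internal topic included the average drops from 16000 to 4100 bytes per batch, even though nothing about the application's batching changed.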

Seen in

sources/2024-11-26-redpanda-batch-tuning-in-redpanda-to-optimize-performance-part-2 — canonical wiki source. Names the four private + one public Prometheus metric + the five PromQL one-liners + the production case study that validates them.

concepts/effective-batch-size — what this measures.
concepts/per-topic-batch-diagnosis — why the by (topic) disaggregation is load-bearing.
concepts/batching-latency-tradeoff — normal-vs-saturated regime framing; scheduler queue length confirms regime.
patterns/prometheus-effective-batch-size-dashboard — dashboard-shape canonicalisation.
systems/prometheus, systems/grafana, systems/redpanda.