Redpanda — Batch tuning in Redpanda to optimize performance (part 2)¶
Summary¶
Part 2 of James Kinley's two-part batch-tuning series for Redpanda (and, by Kafka-API compatibility, Apache Kafka). Where part 1 built the first-principles framework (fixed-vs-variable request cost, seven-factor effective-batch-size, normal-vs-saturated regime, backpressure-inflated batches), part 2 is the operations-manual companion: (1) the four Prometheus private metrics (plus one public) that let operators measure effective batch size at the cluster, broker, and per-topic level; (2) canonical PromQL formulas for average effective batch size, per-core batch write rate, scheduler backlog, and CPU utilisation; (3) a target number — ≥ 4 KB, ideally ≥ 16 KB — motivated by NVMe SSD 4 KB page-alignment write amplification; (4) write caching as the broker-side mitigation when client-side tuning is unavailable (relaxes to Kafka-legacy durability, ack-on-memory, batched flush); and (5) a real production case study with quantitative linger-tuning gains across three config changes: p50 25 ms → < 1 ms, p99 128 ms → 17 ms, p99.999 490 ms → 130 ms; network bandwidth 1.1 GB/sec → 575 MB/sec for the same 1.2 M msg/sec flow; and ultimately collapsing a two-cluster deployment into one cluster holding 2.5–2.7 M msg/sec at ~50% CPU.
Tier-3 substrate-qualifying on multiple axes: Prometheus observability cookbook, NVMe-storage write-amplification mechanism explanation, broker-level durability-relaxation feature disclosure (write caching), and a production incident-retrospective-shape case study with percentile-by-percentile latency deltas. The per-topic-diagnosis insight — that aggregate-cluster effective batch size hides per-topic tiny-batch offenders — is the load-bearing operational contribution that the ingested wiki corpus had been missing.
Key takeaways¶
- Four private + one public Prometheus metrics define the effective-batch-size observability surface. Verbatim:
  - `vectorized_storage_log_written_bytes` — "private metric for the number of bytes written since process start".
  - `vectorized_storage_log_batches_written` — "private metric for the count of message batches since process start".
  - `vectorized_scheduler_queue_length` — "private metric for the broker's internal backlog of tasks to work through".
  - `redpanda_cpu_busy_seconds_total` — "public metric for evaluating CPU utilization".

  The ratio `log_written_bytes / log_batches_written` is the effective batch size — canonicalised as concepts/broker-effective-batch-size-observability. (Source: sources/2024-11-26-redpanda-batch-tuning-in-redpanda-to-optimize-performance-part-2)
- Per-topic breakdown is load-bearing. Aggregate cluster numbers hide topic-level offenders. Verbatim production framing: "As we dug into the per-topic effective batch size, we noticed that some high volume topics were batching well below their expected size. These topics created hundreds of thousands of tiny batches, driving up the Redpanda request rates." The canonical PromQL pattern is `sum(irate(...)) by (topic)` with the `topic!~"^_.*"` filter (excluding internal `_schemas` / `__consumer_offsets`). Canonicalised as concepts/per-topic-batch-diagnosis.
- Target effective batch size: ≥ 4 KB, ideally ≥ 16 KB. The 4 KB floor is dictated by NVMe SSD page alignment, not broker-internal preference. Verbatim: "NVMe storage tends to write out data in 4 KB aligned pages. No problem if your message batch is 4 KB or larger. But what happens if you're sending millions of tiny, single message batches per second? Each message will be written alone in a 4 KB sized write, no matter how small it is: causing a large degree of write amplification and inefficient use of the available disk IO." Canonicalised as concepts/small-batch-nvme-write-amplification.
- Write caching is the broker-side mitigation for client-side-untunable workloads. Verbatim: "Write caching shifts the broker's behavior when writing out messages to disk. In Redpanda's default mode, where write caching is off, the batch is sent to the leader broker, it writes it to disk, replicates it to the replica brokers, and once they acknowledge the successful write to disk, the leader confirms back to the client… With write caching enabled, the same procedure happens, but the brokers acknowledge as soon as the message is in memory. Once a quorum is reached, the leader acknowledges back to the client, and the brokers can flush larger blocks of messages to disk, taking better advantage of the storage." Durability trade-off stated: "the data durability guarantees are relaxed but no worse than a legacy Kafka cluster." Canonicalised as concepts/broker-write-caching + the pattern patterns/broker-write-caching-as-client-tuning-substitute.
- Production case study — three-round linger tuning under CPU saturation. 2024 customer migration to Redpanda Cloud BYOC. Workload 1.75–2 GB/sec, several million msg/sec on a Tier-7 cluster (1.2 GB/sec produce ceiling), split across two clusters at 50% load each because one cluster saturated CPU at 90%. Root cause: "extremely small producer linger configuration" never allowed the configured `batch.size` to fill. Three rounds of linger tuning measured against five percentile levels:

  | Percentile | Original | Change 1 | Change 2 | Change 3 |
  |---|---|---|---|---|
  | p50 | 25 ms | 15 ms | 4 ms | < 1 ms |
  | p85 | 55 ms | 32 ms | 17 ms | 3 ms |
  | p95 | 90 ms | 57 ms | 32 ms | 6 ms |
  | p99 | 128 ms | 100 ms | 63 ms | 17 ms |
  | p99.999 | 490 ms | 260 ms | 240 ms | 130 ms |

  Order-of-magnitude tail-latency improvements at every percentile. Canonicalised as patterns/iterative-linger-tuning-production-case.
- Network-bandwidth reduction as a second-order effect. "After we completed the effective batch size improvements, we went from requiring about 1.1 GB/sec of bandwidth to handle 1.2 million messages per second to about 575 MB/sec of bandwidth." ~48% bandwidth reduction at identical message rate. Mechanism verbatim: "improved compression of the payloads, as well as a reduction of various Kafka metadata overhead." Canonical datum that batching is not just a CPU / latency primitive — it's a network-cost primitive too.
- Cluster consolidation follow-on. Post-tuning headroom enabled collapsing the two Tier-7 clusters into one that handled 2.5–2.7 M msg/sec at ~50% CPU on 1 GB/sec of network bandwidth — roughly double the throughput of the pre-tuning two-cluster deployment at half the hardware cost. Verbatim: "capacity to spare for use by other features like Redpanda data transforms or the Apache Iceberg integration."
- Scheduler queue length is the load-bearing saturation-regime symptom. "The primary queue backlog was for the 'main' tasks, where produce requests are handled inside the broker. The higher this line, the more contention there is for tasks requiring work to be accomplished. Ideally, we want this as low as possible. This tends to correlate with request volume when CPU is saturated." Scheduler queue length is the metric that confirms whether a system is in the saturated regime where the linger-tuning latency inversion applies.
Architectural framing¶
The Prometheus observability surface¶
Five PromQL queries that the post canonicalises — the first two reproduced verbatim, the remaining three captured here only by name:
Average effective batch size by topic (bytes/batch):

```promql
sum(irate(vectorized_storage_log_written_bytes{topic!~"^_.*"}[5m])) by (topic)
  /
sum(irate(vectorized_storage_log_batches_written{topic!~"^_.*"}[5m])) by (topic)
```

Average batch write rate per core (batches/sec/core):

```promql
sum(irate(vectorized_storage_log_batches_written{topic!~"^_.*"}[5m])) by (cluster)
  /
count(redpanda_cpu_busy_seconds_total{}) by (cluster)
```
Batches written per second per topic:
Size of the scheduler backlog:
CPU utilisation per pod/shard:
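The post's exact text for these last three queries is not captured in this note. Plausible reconstructions from the metric names above — assumptions for orientation, not verbatim from the post — would look like:

```promql
# Batches written per second per topic (assumed shape)
sum(irate(vectorized_storage_log_batches_written{topic!~"^_.*"}[5m])) by (topic)

# Size of the scheduler backlog (assumed shape; the metric is a gauge, so no rate)
sum(vectorized_scheduler_queue_length) by (instance)

# CPU utilisation per pod/shard (assumed shape; busy-seconds counter rated over time)
irate(redpanda_cpu_busy_seconds_total[5m])
```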
All five compose into a single Grafana panel the post describes
as the operator's primary tuning dashboard. The topic!~"^_.*"
regex filter excludes Redpanda's internal control-plane topics
(_schemas, __consumer_offsets, __transaction_state, etc.)
which have a different byte/batch profile than application
traffic.
The 4 KB NVMe write-amplification floor¶
The explainer's substrate argument for why 4 KB is the batch-size floor:
- NVMe SSDs write data in 4 KB aligned pages.
- A single-message batch smaller than 4 KB still consumes one full 4 KB page — the slack at the end of the page is wasted.
- The write-amplification factor (physical bytes written / logical bytes) grows as batches shrink below 4 KB, and at millions of batches per second the wasted physical writes dominate the available disk IO.
- CPU saturation compounds this: each small batch is also a full request on the CPU side, so small batches saturate CPU and disk IO simultaneously.
This connects producer-side batching economics to storage write amplification as a single-axis optimisation target. Canonicalised as concepts/small-batch-nvme-write-amplification.
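The page-alignment arithmetic can be made concrete with a small sketch (not from the post; it assumes the simplest case of one page-aligned write per batch):

```python
# Write amplification when each batch is flushed as a 4 KiB page-aligned
# write, assuming one physical write per batch (illustrative model).
import math

PAGE = 4096  # NVMe page size in bytes

def write_amplification(batch_bytes: int) -> float:
    """Physical bytes written / logical bytes, for one batch per write."""
    physical = math.ceil(batch_bytes / PAGE) * PAGE
    return physical / batch_bytes

# A 100-byte single-message batch still costs a full 4 KiB page:
print(write_amplification(100))    # ~41x amplification
print(write_amplification(4096))   # 1.0 at the 4 KB floor
print(write_amplification(16384))  # 1.0 at the recommended 16 KB target
```

Under this model the amplification factor for sub-4 KB batches is simply `4096 / batch_bytes`, which is why the floor sits exactly at the page size.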
Write caching — the broker-side escape hatch¶
When client-side tuning is unavailable ("your architecture makes it hard to do client-side tuning, change the producer behavior, or adjust topic and partition design"), Redpanda's write-caching feature relaxes broker durability semantics to approximate Kafka's legacy buffer-cache behaviour.
Default mode (write caching off):
Client → leader → write to disk → replicate to followers
→ followers write to disk + ack → leader ack to client
Write caching on:
Client → leader → write to memory → replicate to followers
→ followers ack on in-memory-write → leader ack to client
→ (background) brokers flush larger blocks to disk
The similarity to Kafka's default behaviour is load-bearing — the explainer frames write caching as "similar in function to how Kafka commits batches to in-memory buffer cache and then periodically flushes that cache out to disk." Enabling write caching relaxes Redpanda's durability guarantees, but to no worse than a legacy Kafka cluster.
Mechanisms this combines:
- Ack-on-memory once a quorum is reached → lower per-request latency by removing the disk-fsync tax from the commit path.
- Batched flush → the broker aggregates many small in-memory writes into a single large storage write, reclaiming the large-sequential-write economics that small-batch workloads forfeit.
- Storage controller analogy: "similar to how a storage controller might take many small writes and assemble them into a single large write to preserve storage IOPS capacity."
Canonicalised as concepts/broker-write-caching + the pattern patterns/broker-write-caching-as-client-tuning-substitute.
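As a concrete sketch of enabling the feature — the post names no knobs, so the property names `write.caching` and `write_caching_default` below are assumptions based on Redpanda's usual naming and should be verified against the docs for your version:

```shell
# Assumed property names; verify against Redpanda documentation.
# Per-topic opt-in:
rpk topic alter-config my-topic --set write.caching=true

# Or set the cluster-wide default:
rpk cluster config set write_caching_default true
```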
The production case study — three-round linger tuning¶
Timeline reconstruction from the post:
Early 2024: Customer migration from Kafka to Redpanda Cloud BYOC. Calculated workload 1.75–2 GB/sec, several million msg/sec. Initial load-in onto Tier-7 cluster (1.2 GB/sec produce ceiling). At 50% load ported, CPU saturated >90% during peak hours. Decision: split load across two Tier-7 clusters (at 50% load each) rather than move to a higher tier, for "other capacity and timeline requirements."
Summer 2024: Customer returns asking how to consolidate back to one cluster (next tier up) to reduce infrastructure costs.
Discovery phase: Joint investigation with the customer's application team. "Initially, everyone believed that the effective batch size was close to the configured batch size of their producers. What nobody had accounted for, however, was the multiple use cases flowing through the cluster, each contributing its own shape to the traffic. In the sizing sessions, we originally evaluated behavior based on aggregate cluster volume and throughput, not the individual impact of heavier-weight topics, nor had this taken into account an extremely small producer linger configuration." — the per-topic diagnosis canonical statement.
Root cause: "The extremely low linger configuration never allowed the configured batch size to fill, so the batch was always triggering early."
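The root-cause mechanism — a producer batch is sent when it reaches `batch.size` bytes or when `linger.ms` expires, whichever comes first — can be sketched with a toy steady-state model (the message size, rate, and linger values below are illustrative assumptions, not the customer's):

```python
# Toy model of the Kafka producer batch trigger: a batch flushes when it
# reaches batch.size bytes OR when linger.ms elapses, whichever fires first.

def effective_batch_bytes(msg_bytes: int, msgs_per_sec: float,
                          linger_ms: float, batch_size: int) -> float:
    """Average batch size for a steady stream filling one batch at a time."""
    bytes_per_ms = msg_bytes * msgs_per_sec / 1000.0
    filled_during_linger = bytes_per_ms * linger_ms
    # The size trigger wins if the batch fills before linger expires;
    # a batch always carries at least one message.
    return min(batch_size, max(msg_bytes, filled_during_linger))

# 200-byte messages at 50k msg/sec into one partition, batch.size = 16 KiB:
print(effective_batch_bytes(200, 50_000, linger_ms=0.1, batch_size=16_384))
# linger too small: ~1 KB effective batches despite a 16 KiB batch.size
print(effective_batch_bytes(200, 50_000, linger_ms=5, batch_size=16_384))
# 5 ms linger: the batch fills to 16 KiB (size trigger fires first)
```

This is exactly the failure shape the post describes: with a near-zero linger, the configured `batch.size` is unreachable and every flush is time-triggered and tiny.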
Fix rollout: Three separate linger adjustments across the customer's application fleet over "several days", each evaluated against the observability dashboard before proceeding to the next. Latency results in the percentile table in takeaway 5.
Post-fix: Effective batch size above 4 KB across all high-volume topics, scheduler queue length near zero, CPU utilisation dropped enough to re-consolidate the workload onto a single cluster. Validation: 2.5–2.7 M msg/sec at ~50% CPU on one Tier-7 cluster (vs. the original 1.2 M msg/sec at >90% CPU across two clusters).
Shape, quantitatively:
- Message rate: ~1.2 M → 2.6 M msg/sec (2.2×) on one cluster.
- CPU: >90% across 2 clusters → ~50% on 1 cluster.
- Tail latency (p99): 128 ms → 17 ms (~7.5× improvement).
- Network bandwidth: 1.1 GB/sec → 575 MB/sec at identical 1.2 M msg/sec (~48% reduction); 1 GB/sec at 2.6 M msg/sec post-consolidation.
Per-topic diagnosis as the load-bearing insight¶
The explainer's diagnostic breakthrough is the observation that aggregate-cluster effective batch size can be healthy while individual topics are pathologically small:
"In the sizing sessions, we originally evaluated behavior based on aggregate cluster volume and throughput, not the individual impact of heavier-weight topics, nor had this taken into account an extremely small producer linger configuration."
"As we dug into the per-topic effective batch size, we noticed that some high volume topics were batching well below their expected size."
The implication for operators: the PromQL queries must
by (topic) disaggregate — a cluster-wide
log_written_bytes / log_batches_written average that looks fine
can still hide several high-volume topics with sub-4 KB batches
that are driving the CPU saturation.
Canonicalised as concepts/per-topic-batch-diagnosis with the concrete PromQL template + patterns/prometheus-effective-batch-size-dashboard for the full dashboard shape.
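A toy calculation (illustrative numbers, not from the post) shows how a healthy-looking aggregate can hide a pathological topic:

```python
# Aggregate effective batch size can sit above the 4 KB floor while one
# high-volume topic produces tiny batches. Numbers are illustrative.

topics = {
    # topic: (batches per second, average batch bytes)
    "events":  (200_000, 16_384),  # well-batched
    "metrics": (500_000, 512),     # tiny-batch offender
}

total_bytes = sum(rate * size for rate, size in topics.values())
total_batches = sum(rate for rate, _ in topics.values())

aggregate = total_bytes / total_batches
print(f"aggregate effective batch size: {aggregate:.0f} B")  # ~5047 B, looks fine

# The per-topic view exposes the offender immediately:
for name, (rate, size) in topics.items():
    flag = "OK" if size >= 4096 else "BELOW 4 KB FLOOR"
    print(f"{name}: {size} B/batch  [{flag}]")
```

The cluster-wide average clears the 4 KB floor even though the `metrics` topic, at 500k batches/sec, is doing 8× write amplification on every batch — which is why the queries must disaggregate `by (topic)`.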
Systems touched¶
- systems/redpanda — subject of the explainer. Production case-study details the real-customer BYOC cluster lifecycle.
- systems/kafka — Kafka-API-compatible; batching semantics and the `linger.ms` / `batch.size` trigger logic apply identically to Kafka clients.
- systems/prometheus — the observability substrate Redpanda emits metrics for. Named explicitly with private vs. public endpoint distinction.
- systems/grafana — named as the visualisation layer ("We can use a simple Grafana visualization with a PromQL query…").
- systems/nvme-ssd — the substrate that dictates the 4 KB batch-size floor.
Concepts touched¶
- concepts/effective-batch-size — part-1 concept; part 2 extends it with the broker-side measurement view (vs. part-1's producer-side configuration view).
- concepts/batching-latency-tradeoff — part-1 concept; part 2 supplies the quantitative percentile-by-percentile production case study that validates the normal-vs-saturated regime framing.
- concepts/producer-backpressure-batch-growth — part-1 concept; part 2 is the retrospective that measured the saturation regime with scheduler queue length.
- concepts/fixed-vs-variable-request-cost — part-1 concept; part 2's network-bandwidth-48%-reduction datum is the variable- cost side quantified.
- concepts/broker-effective-batch-size-observability — new canonical concept for the four Prometheus private/public metrics + PromQL cookbook.
- concepts/small-batch-nvme-write-amplification — new canonical concept connecting producer-side batch size < 4 KB to NVMe storage write amplification.
- concepts/broker-write-caching — new canonical concept for the broker-side mitigation when client tuning is unavailable; Kafka-buffer-cache equivalence.
- concepts/per-topic-batch-diagnosis — new canonical concept for the per-topic vs. aggregate-cluster measurement dichotomy.
- concepts/cpu-utilization-vs-saturation — existing; scheduler queue length is the canonical saturation-regime symptom.
- concepts/tail-latency-at-scale — existing; companion framing, now quantified by the three-round tuning case study.
- concepts/write-amplification — existing; small-batch-NVMe is a new instance.
- concepts/disk-block-size-alignment — existing; 4 KB page alignment is the substrate.
Patterns touched¶
- patterns/batch-over-network-to-broker — existing; part 2 quantifies the network-cost axis.
- patterns/prometheus-effective-batch-size-dashboard — new canonical pattern for the five-PromQL-query Grafana dashboard.
- patterns/iterative-linger-tuning-production-case — new canonical pattern for the three-round-with-latency-table production tuning workflow.
- patterns/broker-write-caching-as-client-tuning-substitute — new canonical pattern for choosing broker-side caching when producer fleet isn't tunable.
Operational numbers¶
Production case-study deltas¶
| Dimension | Pre-tuning | Post-tuning |
|---|---|---|
| Clusters needed | 2 × Tier-7 | 1 × Tier-7 |
| CPU utilisation | >90% at peak | ~50% |
| Msg/sec per cluster | ~1.2 M (at 50% load each) | 2.5–2.7 M |
| Network bandwidth | 1.1 GB/sec @ 1.2 M msg/sec | 575 MB/sec @ 1.2 M msg/sec; 1 GB/sec @ 2.6 M msg/sec |
| p50 producer latency | 25 ms | < 1 ms |
| p85 | 55 ms | 3 ms |
| p95 | 90 ms | 6 ms |
| p99 | 128 ms | 17 ms |
| p99.999 | 490 ms | 130 ms |
Batch-size guidance¶
| Target | Rationale |
|---|---|
| ≥ 4 KB | NVMe page-alignment floor — below this, write amplification dominates |
| ≥ 16 KB | "to really unlock performance" — the authors' recommended sweet spot |
Tier-7 cluster ceiling¶
| Metric | Value |
|---|---|
| Produce ceiling | 1.2 GB/sec |
| Observed peak (pre-tuning, one cluster) | ~0.6 GB/sec at >90% CPU |
| Observed peak (post-tuning, one cluster) | ~1.3 GB/sec (2.6 M msg/sec at variable message size) at ~50% CPU |
Caveats¶
- Vendor-blog voice, single case study. The production example is one customer, anonymised. No broader fleet distribution of linger-tuning outcomes.
- Percentile table lacks measurement methodology. The post doesn't disclose sampling window, measurement tool, or client-side vs. broker-side measurement point. Latencies are labelled "average Redpanda producer latency at each percentile for each tuning test" — likely broker-measured, not end-to-end.
- Three linger values are not stated. The post says "three separate adjustments of linger time" but does not publish the specific `linger.ms` values at each change. Reproducibility is impossible from the post alone.
- Write-caching default-off is a design choice, not a benchmark. The post asserts durability equivalence to legacy Kafka but does not benchmark per-commit latency with vs. without write caching. Durability regression conditions (simultaneous leader+follower memory loss before flush) are not enumerated.
- 4 KB NVMe-alignment claim is stated, not proven. The post asserts the 4 KB page-alignment write-amplification mechanism without showing NVMe-level metrics (device write rate, wear leveling indicators, fsync counters). The mechanism is well-established in the storage literature; this post is asserting it applies, not demonstrating it.
- Per-topic diagnosis insight is a customer-investigation narrative. Not a tooling or alerting recommendation. Operators have to know to look at per-topic breakdowns; no automated detection of small-batch offenders is described.
- "Adding brokers can decrease batch size" (from part 1) is not revisited. The production case study does add broker capacity (consolidation into a bigger cluster post-tuning) but the batch-size behaviour of that specific transition isn't disclosed.
- Tier-7 cluster dimensions are opaque. Redpanda Cloud's tier nomenclature isn't decoded in the post (no per-tier CPU core count, RAM, storage bandwidth). Only the 1.2 GB/sec produce ceiling is stated.
- Iceberg topics + data transforms capacity-spare claim is marketing framing. "capacity to spare for use by other features" — no throughput-overhead numbers for those features disclosed.
- Compression ratio improvement is qualitative. The network-bandwidth halving is attributed to "improved compression of the payloads, as well as a reduction of various Kafka metadata overhead" but the attribution split isn't quantified (how much of 1.1 → 0.575 GB/sec is compression vs. metadata-overhead reduction).
- Grafana panel screenshots aren't reproduced in text. The post includes graph images (CPU heatmap, per-topic batch rate, scheduler backlog) but the markdown corpus captures only image captions. Operators must visit the original post for visual references.
Cross-source continuity¶
- Part 1 is the substrate; part 2 is the operations manual. sources/2024-11-19-redpanda-batch-tuning-in-redpanda-for-optimized-performance-part-1 canonicalised the seven-factor effective-batch-size framework; part 2 canonicalises how to measure and tune it with a real case study.
- Composes with Kozlovski 2024-05-09 Kafka 101 substrate-altitude explainer: Kozlovski's broker-side sequential-write framing is exactly what write-caching reinstates for small-batch workloads.
- Composes with Olson's EBS retrospective on queueing theory: the scheduler-queue-length observation here is a micro-scale example of the macro-scale cross-layer-queue interference Olson describes.
- Companion to Dicken's io-devices-and-latency substrate post on NVMe: the 4 KB page-alignment write- amplification mechanism is the direct producer-side consequence of Dicken's "50 μs NVMe round-trip + 4 KB page-aligned writes" substrate.
- Quantitative complement to concepts/tail-latency-at-scale: the p99.999 490 ms → 130 ms delta is a rare disclosed quantitative instance of reducing request rate under saturation shrinking the extreme tail.
- Operational shape sibling to Noach's throttler anatomy: both are series-format broker-internal-metrics deep-dives that canonicalise observability-driven tuning as the load-bearing operational discipline.
Source¶
- Original: https://www.redpanda.com/blog/batch-tuning-optimize-performance-part-2
- Raw markdown:
raw/redpanda/2024-11-26-batch-tuning-in-redpanda-to-optimize-performance-part-2-fb7c14ff.md
Related¶
- systems/redpanda
- systems/kafka
- systems/prometheus
- systems/grafana
- systems/nvme-ssd
- concepts/effective-batch-size
- concepts/batching-latency-tradeoff
- concepts/producer-backpressure-batch-growth
- concepts/broker-effective-batch-size-observability
- concepts/small-batch-nvme-write-amplification
- concepts/broker-write-caching
- concepts/per-topic-batch-diagnosis
- patterns/batch-over-network-to-broker
- patterns/prometheus-effective-batch-size-dashboard
- patterns/iterative-linger-tuning-production-case
- patterns/broker-write-caching-as-client-tuning-substitute
- sources/2024-11-19-redpanda-batch-tuning-in-redpanda-for-optimized-performance-part-1