
PATTERN

Iterative linger-tuning production case

Problem

A Kafka-API streaming cluster is CPU-saturated under heavy produce load. Operators suspect producer-side linger.ms is too low — producer batches are closing early (before batch.size fills), flooding the broker with tiny requests. But which linger.ms value is correct? How much latency can be reclaimed? Does the cluster need more hardware or less configuration friction?

Classic single-step guessing fails because:

  • Latency can temporarily rise when linger.ms is raised in the normal regime — an operator who reverts on the first signal of higher average latency misses the saturation-regime inversion.
  • Percentile-by-percentile behaviour under tuning is not monotonic — p50 can improve smoothly while p99.999 lags several rounds of tuning.

Solution

Iteratively adjust linger.ms in multiple rounds, each round evaluated against a quantitative percentile-by-percentile latency table and the Prometheus effective-batch-size dashboard. Canonicalised from Redpanda's 2024 customer case study (Source: sources/2024-11-26-redpanda-batch-tuning-in-redpanda-to-optimize-performance-part-2).

Round structure

Each round:

  1. Pick the next linger.ms value. Move gradually (e.g. 2×, 5×, 10× the current value), not in one large jump: the saturation regime is entered and exited gradually.
  2. Apply to producers across the fleet. If application-team owned, coordinate a deploy; if broker-owned, hot-reload.
  3. Wait for steady state — several minutes of production traffic at the new setting.
  4. Record percentile table: p50, p85, p95, p99, p99.999 at minimum.
  5. Record dashboard state: per-topic effective batch size, scheduler backlog, CPU utilisation, batch-rate-per-core.
  6. Decide: continue tuning if latency is still decreasing; stop when diminishing returns (or cluster CPU is healthily below saturation).
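
The six steps above can be sketched as a loop. This is a minimal sketch in Python; `apply_linger_ms` and `sample_latencies_ms` are assumed callables standing in for your deploy tooling and latency pipeline, not a real API.

```python
import time

PERCENTILES = (50, 85, 95, 99, 99.999)

def percentile_table(samples_ms):
    """Nearest-rank percentile table over a list of latency samples (ms)."""
    ordered = sorted(samples_ms)
    return {
        p: ordered[min(len(ordered) - 1, int(len(ordered) * p / 100))]
        for p in PERCENTILES
    }

def tune(apply_linger_ms, sample_latencies_ms, linger_ms,
         rounds=3, factor=2, min_gain=0.05, settle_s=300):
    """Run up to `rounds` tuning rounds; stop on diminishing p50 returns."""
    history = []
    for _ in range(rounds):
        linger_ms *= factor                      # 1. move gradually
        apply_linger_ms(linger_ms)               # 2. apply fleet-wide
        time.sleep(settle_s)                     # 3. wait for steady state
        table = percentile_table(sample_latencies_ms())  # 4. record table
        history.append((linger_ms, table))       # 5. keep round history
        if len(history) >= 2:                    # 6. diminishing returns?
            prev, cur = history[-2][1][50], table[50]
            if prev - cur < min_gain * prev:
                break
    return history
```

Each entry in the returned history pairs a `linger.ms` value with its percentile table, which is exactly the progression curve the multi-round argument below relies on.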

The Redpanda post applied three rounds over "several days" on a real Tier-7 BYOC cluster. Results table verbatim:

Percentile   Original   Change 1   Change 2   Change 3
p50          25 ms      15 ms      4 ms       < 1 ms
p85          55 ms      32 ms      17 ms      3 ms
p95          90 ms      57 ms      32 ms      6 ms
p99          128 ms     100 ms     63 ms      17 ms
p99.999      490 ms     260 ms     240 ms     130 ms
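
The improvement factors implied by the table can be recomputed directly (treating "< 1 ms" as 1 ms, so the p50 factor is a conservative lower bound):

```python
# Verbatim table values in ms, one row per percentile, columns
# Original → Change 1 → Change 2 → Change 3.
table = {
    "p50":     [25, 15, 4, 1],
    "p85":     [55, 32, 17, 3],
    "p95":     [90, 57, 32, 6],
    "p99":     [128, 100, 63, 17],
    "p99.999": [490, 260, 240, 130],
}

# Every percentile improved at every round (strictly decreasing rows).
assert all(row[i] > row[i + 1] for row in table.values() for i in range(3))

factors = {p: row[0] / row[-1] for p, row in table.items()}
for p, f in factors.items():
    print(f"{p}: {f:.1f}x overall")
```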

Non-obvious outcomes

Every percentile improved at every round. No regression at any level — evidence that the cluster was deep in the saturated regime throughout the tuning. In the normal regime, raising linger.ms would have hurt p50 while helping tail — the monotonic improvement across all percentiles is the signature of saturation-regime tuning.

The tail improves more slowly than the median. p50 dropped 25× across three rounds (25 ms → < 1 ms); p99.999 dropped 3.8× (490 ms → 130 ms). The saturation-regime latency inversion primarily reclaims the median and mid-to-high percentiles; extreme-tail gains accumulate only over repeated rounds.

Network bandwidth dropped 48% at identical message rate (1.1 GB/sec → 575 MB/sec for 1.2 M msg/sec). Attributed to better compression and reduced Kafka-metadata overhead per batch. Not a pre-stated goal but a clean second-order gain.

Cluster consolidation became possible. Post-tuning CPU dropped to ~50% on one cluster — the two-cluster deployment was consolidated to one handling 2.5–2.7 M msg/sec (the pre-tuning 1.2 M × 2 clusters → post-tuning 2.6 M × 1 cluster, ~2.2× throughput per cluster).
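
The two second-order figures quoted above follow from simple arithmetic; a quick check (GB taken as 1000 MB):

```python
# Bandwidth reduction at identical message rate: 1.1 GB/sec → 575 MB/sec.
before_mb, after_mb = 1.1 * 1000, 575
bandwidth_drop = 1 - after_mb / before_mb   # fraction of bandwidth reclaimed

# Consolidation: 1.2 M msg/sec per cluster before, ~2.6 M on one after.
per_cluster_gain = 2.6 / 1.2                # throughput multiple per cluster
```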

Structure by stage

Round 0 (baseline): measure.

Round 1: linger.ms × ~2. Measure percentile table + dashboard.
         Expect: 30–50% latency improvement across percentiles,
         effective batch size rising, scheduler backlog falling.

Round 2: linger.ms × ~5 vs. baseline. Measure.
         Expect: further 2–3× latency improvement on median,
         less on tail.

Round 3: linger.ms × ~10 vs. baseline, or topic-targeted fine-tune.
         Expect: diminishing returns — if effective batch size is
         now > 16 KB, stop.

Validation: cluster CPU well below saturation, scheduler backlog
            near zero, per-topic effective batch size above 4 KB
            for every topic.
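
The stage multipliers above map onto standard Kafka producer settings. A sketch, assuming a hypothetical 5 ms baseline (the real baseline comes from round 0's measurement):

```python
BASELINE_LINGER_MS = 5  # hypothetical; take the real value from round 0

def producer_config(round_multiplier):
    # Standard Kafka producer batching settings. batch.size is the ceiling
    # a batch may reach before linger.ms expires; 16384 bytes (16 KB) is
    # the Kafka default and the "stop" threshold in round 3 above.
    return {
        "linger.ms": BASELINE_LINGER_MS * round_multiplier,
        "batch.size": 16384,
        "compression.type": "lz4",  # bigger batches also compress better
    }

round_configs = {name: producer_config(mult)
                 for name, mult in [("round1", 2), ("round2", 5), ("round3", 10)]}
```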

Why multi-round

One-shot tuning can't answer the question "is this the right value?" because there is no counterfactual. Three rounds generate a progression curve that shows whether diminishing returns have set in, which percentile bands are still responsive, and whether the cluster has exited the saturated regime.

The three-round structure is also rollback-safe: if round 3 regresses, round 2's settings are a known-good fallback.

Prerequisites

  • Prometheus effective-batch-size dashboard operational before round 1.
  • Per-topic tracking: concepts/per-topic-batch-diagnosis discipline — if one topic has pathologically low batches, target the tune there, not cluster-wide.
  • Coordinated deploy window when producers are application-team-owned.
  • Latency measurement infrastructure that can hold percentile-table observations across rounds.
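
Effective batch size itself is just produced bytes over produced batches per window. A sketch of the per-topic check; the PromQL metric names in the query string are placeholders, not real broker metric names:

```python
# Effective batch size over one scrape window = bytes / batches produced.
def effective_batch_size(bytes_delta, batches_delta):
    return bytes_delta / batches_delta if batches_delta else 0.0

# PromQL shape only -- substitute your broker's real byte/batch counters:
QUERY = ('sum by (topic) (rate(produced_bytes_total[5m]))'
         ' / sum by (topic) (rate(produced_batches_total[5m]))')

# Topics to target individually rather than tuning cluster-wide
# (4096 bytes matches the per-topic validation threshold above):
def topics_below(per_topic_batch_size, threshold_bytes=4096):
    return sorted(t for t, size in per_topic_batch_size.items()
                  if size < threshold_bytes)
```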

Consequences

Positive:

  • Order-of-magnitude tail-latency improvements possible (p99 7.5×, p50 25× in the case study).
  • Network-bandwidth reduction at identical message rate.
  • Cluster consolidation enabled by CPU headroom.

Negative / risks:

  • Producer-side per-record latency rises in the normal regime — if the cluster leaves the saturated regime mid-tuning, subsequent rounds can start hurting instead of helping.
  • Producer memory pressure rises with bigger linger.ms + unchanged buffer.memory — a third-dimension tuning surface.
  • Tuning is multi-team when producers span services.

Seen in
