
PATTERN

Iterative linger-tuning production case

Problem

A Kafka-API streaming cluster is CPU-saturated under heavy produce load. Operators suspect producer-side linger.ms is too low — producer batches are closing early (before batch.size fills), flooding the broker with tiny requests. But which linger.ms value is correct? How much latency can be reclaimed? Does the cluster need more hardware or less configuration friction?

Classic single-step guessing fails because:

  • Latency can temporarily rise when linger.ms is raised in the normal regime — an operator who reverts on the first signal of higher average latency misses the saturation-regime inversion.
  • Percentile-by-percentile behaviour under tuning is not monotonic — p50 can improve smoothly while p99.999 lags several rounds of tuning.

Solution

Iteratively adjust linger.ms in multiple rounds, each round evaluated against a quantitative percentile-by-percentile latency table and the Prometheus effective-batch-size dashboard. Canonicalised from Redpanda's 2024 customer case study (Source: sources/2024-11-26-redpanda-batch-tuning-in-redpanda-to-optimize-performance-part-2).

Round structure

Each round:

  1. Pick the next linger.ms value. Move gradually (e.g. 2×, 5×, 10× the current value), not in one large jump: the saturation regime is entered and exited gradually.
  2. Apply to producers across the fleet. If application-team owned, coordinate a deploy; if broker-owned, hot-reload.
  3. Wait for steady state — several minutes of production traffic at the new setting.
  4. Record percentile table: p50, p85, p95, p99, p99.999 at minimum.
  5. Record dashboard state: per-topic effective batch size, scheduler backlog, CPU utilisation, batch-rate-per-core.
  6. Decide: continue tuning if latency is still decreasing; stop when diminishing returns (or cluster CPU is healthily below saturation).
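
The six steps above can be sketched as a loop. This is a minimal sketch in Python; `apply_linger_ms` and `sample_latencies_ms` are assumed callables standing in for your deploy tooling and latency pipeline, not a real API.

```python
import time

PERCENTILES = (50, 85, 95, 99, 99.999)

def percentile_table(samples_ms):
    """Nearest-rank percentile table over a list of latency samples (ms)."""
    ordered = sorted(samples_ms)
    return {
        p: ordered[min(len(ordered) - 1, int(len(ordered) * p / 100))]
        for p in PERCENTILES
    }

def tune(apply_linger_ms, sample_latencies_ms, linger_ms,
         rounds=3, factor=2, min_gain=0.05, settle_s=300):
    """Run up to `rounds` tuning rounds; stop on diminishing p50 returns."""
    history = []
    for _ in range(rounds):
        linger_ms *= factor                      # 1. move gradually
        apply_linger_ms(linger_ms)               # 2. apply fleet-wide
        time.sleep(settle_s)                     # 3. wait for steady state
        table = percentile_table(sample_latencies_ms())  # 4. record table
        history.append((linger_ms, table))       # 5. keep round history
        if len(history) >= 2:                    # 6. diminishing returns?
            prev, cur = history[-2][1][50], table[50]
            if prev - cur < min_gain * prev:
                break
    return history
```

Each entry in the returned history pairs a `linger.ms` value with its percentile table, which is exactly the progression curve the multi-round argument below relies on.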

The Redpanda post applied three rounds over "several days" on a real Tier-7 BYOC cluster. Results table verbatim:

Percentile   Original   Change 1   Change 2   Change 3
p50          25 ms      15 ms      4 ms       < 1 ms
p85          55 ms      32 ms      17 ms      3 ms
p95          90 ms      57 ms      32 ms      6 ms
p99          128 ms     100 ms     63 ms      17 ms
p99.999      490 ms     260 ms     240 ms     130 ms
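
The improvement factors implied by the table can be recomputed directly (treating "< 1 ms" as 1 ms, so the p50 factor is a conservative lower bound):

```python
# Verbatim table values in ms, one row per percentile, columns
# Original → Change 1 → Change 2 → Change 3.
table = {
    "p50":     [25, 15, 4, 1],
    "p85":     [55, 32, 17, 3],
    "p95":     [90, 57, 32, 6],
    "p99":     [128, 100, 63, 17],
    "p99.999": [490, 260, 240, 130],
}

# Every percentile improved at every round (strictly decreasing rows).
assert all(row[i] > row[i + 1] for row in table.values() for i in range(3))

factors = {p: row[0] / row[-1] for p, row in table.items()}
for p, f in factors.items():
    print(f"{p}: {f:.1f}x overall")
```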

Non-obvious outcomes

Every percentile improved at every round. No regression at any level — evidence that the cluster was deep in the saturated regime throughout the tuning. In the normal regime, raising linger.ms would have hurt p50 while helping tail — the monotonic improvement across all percentiles is the signature of saturation-regime tuning.

The tail improves more slowly than the median. p50 dropped 25× across three rounds (25 ms → < 1 ms); p99.999 dropped 3.8× (490 ms → 130 ms). The saturation-regime latency inversion primarily reclaims the median and mid-to-high percentiles; extreme-tail gains accumulate only over repeated rounds.

Network bandwidth dropped 48% at identical message rate (1.1 GB/sec → 575 MB/sec for 1.2 M msg/sec). Attributed to better compression and reduced Kafka-metadata overhead per batch. Not a pre-stated goal but a clean second-order gain.

Cluster consolidation became possible. Post-tuning CPU dropped to ~50% on one cluster — the two-cluster deployment was consolidated to one handling 2.5–2.7 M msg/sec (the pre-tuning 1.2 M × 2 clusters → post-tuning 2.6 M × 1 cluster, ~2.2× throughput per cluster).
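
The two second-order figures quoted above follow from simple arithmetic; a quick check (GB taken as 1000 MB):

```python
# Bandwidth reduction at identical message rate: 1.1 GB/sec → 575 MB/sec.
before_mb, after_mb = 1.1 * 1000, 575
bandwidth_drop = 1 - after_mb / before_mb   # fraction of bandwidth reclaimed

# Consolidation: 1.2 M msg/sec per cluster before, ~2.6 M on one after.
per_cluster_gain = 2.6 / 1.2                # throughput multiple per cluster
```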

Structure by stage

Round 0 (baseline): measure.

Round 1: linger.ms × ~2. Measure percentile table + dashboard.
         Expect: 30–50% latency improvement across percentiles,
         effective batch size rising, scheduler backlog falling.

Round 2: linger.ms × ~5 vs. baseline. Measure.
         Expect: further 2–3× latency improvement on median,
         less on tail.

Round 3: linger.ms × ~10 vs. baseline, or topic-targeted fine-tune.
         Expect: diminishing returns — if effective batch size is
         now > 16 KB, stop.

Validation: cluster CPU well below saturation, scheduler backlog
            near zero, per-topic effective batch size above 4 KB
            for every topic.
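
The stage multipliers above map onto standard Kafka producer settings. A sketch, assuming a hypothetical 5 ms baseline (the real baseline comes from round 0's measurement):

```python
BASELINE_LINGER_MS = 5  # hypothetical; take the real value from round 0

def producer_config(round_multiplier):
    # Standard Kafka producer batching settings. batch.size is the ceiling
    # a batch may reach before linger.ms expires; 16384 bytes (16 KB) is
    # the Kafka default and the "stop" threshold in round 3 above.
    return {
        "linger.ms": BASELINE_LINGER_MS * round_multiplier,
        "batch.size": 16384,
        "compression.type": "lz4",  # bigger batches also compress better
    }

round_configs = {name: producer_config(mult)
                 for name, mult in [("round1", 2), ("round2", 5), ("round3", 10)]}
```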

Why multi-round

One-shot tuning can't answer the question "is this the right value?" because there is no counterfactual. Three rounds generate a progression curve that shows whether diminishing returns have set in, which percentile bands are still responsive, and whether the cluster has exited the saturated regime.

The three-round structure is also rollback-safe: if round 3 regresses, round 2's settings are a known-good fallback.

Prerequisites

  • Prometheus effective-batch-size dashboard operational before round 1.
  • Per-topic tracking: concepts/per-topic-batch-diagnosis discipline — if one topic has pathologically low batches, target the tune there, not cluster-wide.
  • Coordinated deploy window when producers are application-team-owned.
  • Latency measurement infrastructure that can hold percentile-table observations across rounds.
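
Effective batch size itself is just produced bytes over produced batches per window. A sketch of the per-topic check; the PromQL metric names in the query string are placeholders, not real broker metric names:

```python
# Effective batch size over one scrape window = bytes / batches produced.
def effective_batch_size(bytes_delta, batches_delta):
    return bytes_delta / batches_delta if batches_delta else 0.0

# PromQL shape only -- substitute your broker's real byte/batch counters:
QUERY = ('sum by (topic) (rate(produced_bytes_total[5m]))'
         ' / sum by (topic) (rate(produced_batches_total[5m]))')

# Topics to target individually rather than tuning cluster-wide
# (4096 bytes matches the per-topic validation threshold above):
def topics_below(per_topic_batch_size, threshold_bytes=4096):
    return sorted(t for t, size in per_topic_batch_size.items()
                  if size < threshold_bytes)
```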

Consequences

Positive:

  • Order-of-magnitude tail-latency improvements possible (p99 7.5×, p50 25× in the case study).
  • Network-bandwidth reduction at identical message rate.
  • Cluster consolidation enabled by CPU headroom.

Negative / risks:

  • Producer-side per-record latency rises in the normal regime — if the cluster leaves the saturated regime mid-tuning, subsequent rounds can start hurting instead of helping.
  • Producer memory pressure rises with bigger linger.ms + unchanged buffer.memory — a third-dimension tuning surface.
  • Tuning is multi-team when producers span services.

Seen in
