Skip to content

CLOUDFLARE 2026-05-12 Tier 1

Read original ↗

Cloudflare — When "idle" isn't idle: how a Linux kernel optimization became a QUIC bug

One-paragraph summary

Cloudflare engineering post (2026-05-12) on a subtle bug in quiche — Cloudflare's open-source Rust QUIC / HTTP/3 implementation — where CUBIC's congestion window (cwnd) got permanently pinned at the two-packet minimum and never recovered after a congestion-collapse event. The bug was found via an ingress-proxy integration test that ran a 10 MB HTTP/3 download with 30% random packet loss injected during the first 2 seconds of a RTT = 10 ms connection. Expected behaviour: CUBIC takes hits, drops cwnd, then ramps back up and finishes in 4–5 s. Observed: ~60% of runs failed the generous 10-second timeout, with cwnd locked at 2,700 bytes (two full-size packets) and CUBIC oscillating between congestion_avoidance and recovery state 999 times in ~6.7 s — one transition every ~14 ms, suspiciously close to the connection's RTT. Root cause: a 2017 Linux-kernel optimisation that shifts CUBIC's epoch forward by the application-idle duration (to preserve the growth-curve shape across idle periods) was ported to quiche in 2020 without the follow-up kernel patch (~1 week later) that prevented epoch_start from being set into the future. In Linux TCP the fix leans on the ACK-processing path's access to epoch_start; quiche ports the original logic inside on_packet_sent() using now - last_sent_time as the idle delta. At minimum cwnd, bytes_in_flight drops to zero on every ACK cycle, and last_sent_time is the start of the previous RTT — so the "idle" delta is ~14 ms (the full RTT) rather than the actual ~0 ms gap between last-ACK and next-send. Recovery-start time is pushed into the future every send, in_congestion_recovery() returns true on every ACK, cwnd growth is skipped, the pipe drains, and the cycle repeats. The fix (near-one line of logic): add a last_ack_time timestamp, update it on ACK, and compute the idle delta from max(last_ack_time, last_sent_time) — when bytes_in_flight dips transiently to zero between ACK and next send, the ACK is the right anchor and the idle delta is ~0. Restored 100% test pass rate; fix is contributed back to cloudflare/quiche.

Key takeaways

  1. The bug is invisible at high speeds and only surfaces in the minimum-cwnd corner of CCA state space. Verbatim: "Recovery after congestion collapse is an uncommon regime, but it is exactly the regime a congestion controller exists to handle. Most congestion control tests exercise the steady-state and growth phases of an algorithm; far fewer probe what happens at minimum cwnd, after the connection has been beaten down. Bugs in this corner of the state space are invisible in throughput dashboards, undetectable by static review, and only surface when you deliberately drive a CCA into it and watch whether it can climb back out." Canonical wiki instance of patterns/adversarial-corner-case-test-for-recovery.
  2. The trigger is a death spiral at cwnd = 2 × MSS. Five-step loop (from the post): (a) sender transmits the two-packet window; (b) after one RTT, both packets are ACKed and bytes_in_flight drops to zero; (c) on_packet_sent() sees bytes_in_flight == 0 and assumes the connection was idle — but it was actually congestion- limited; (d) the idle-delta computed as now - last_sent_time is ~14 ms (the full RTT + rounding errors), not the true ~0-ms processing gap between the last ACK and the next send; (e) recovery_start_time += delta pushes the recovery boundary into the future, in_congestion_recovery() returns true on every incoming ACK, cwnd growth is skipped, and the pipe drains completely on the next ACK, restarting the cycle.
  3. The bug needs three conditions to trigger simultaneously: a real loss event to set congestion_recovery_start_time, congestion-avoidance (post-slow-start) to be running, and cwnd collapsed to the two-packet floor. Before exiting slow-start, congestion_recovery_start_time is unset so the buggy branch has no boundary to advance — this is why the bug doesn't fire at connection start even though bytes_in_flight == 0 is common there.
  4. Reno was the control experiment. The team re-ran the exact test with Reno swapped in — 100 % pass rate, clean recovery after the loss phase ends at T=2s, download completes by ~5s. Same loss regime, same timeout, same RTT = 10 ms — only the CCA differs. That is both the smoking gun (the bug is CUBIC-specific, not a platform issue) and a textbook instance of paired CCA experiments to localise the failure to one algorithm.
  5. The bug's RTT-matched oscillation period was the key clue. The CCA state machine flipped between congestion_avoidance and recovery every ~14 ms — "suspiciously close to the connection's RTT (10ms)" — which told the team the trigger was happening once per round trip, in lockstep with the ACK clock. Canonical concepts/ack-clock diagnostic instance: on a download, ACKs travel client-to-server; every time they land, bytes_in_flight drops to zero and the server sends the next two-packet burst; that is the trigger.
  6. Linux TCP CUBIC's 2017 fix (shift epoch forward by idle duration, preserving curve shape) shipped with a known bug. The canonical kernel commit (30927520dbae) was followed ~1 week later by a patch (c2e7204d180f) titled "tcp_cubic: do not set epoch_start in the future". The follow-up commit's message names the precise failure mode: "Tracking idle time in bictcp_cwnd_event() is imprecise, as epoch_start is normally set at ACK processing time, not at send time." Linux TCP CUBIC has the ACK-processing path available to it; quiche's port inside on_packet_sent() does not. Canonical patterns/userspace-port-of-kernel-primitive-risk instance.
  7. The fix is structurally tiny relative to the investigation effort. Verbatim: "After weeks of instrumenting qlogs and analyzing visualizations to find the root cause, the solution required changing just three lines of code." Add last_ack_time state; update it on every ACK; compute idle delta as now - max(last_ack_time, last_sent_time) in on_packet_sent(). When bytes_in_flight dips transiently to zero between an ACK and the next send (the bug condition), last_ack_time > last_sent_time and the delta captures the true ~0-ms processing gap. For a genuinely idle connection (no ACKs in a while), last_ack_time is far in the past and the original epoch-shift behaviour is preserved. Canonical patterns/measure-idle-from-last-ack-not-last-send instance.
  8. qlog was the investigation substrate. The team instrumented quiche's qlog output with packet-loss events and built visualisations showing cwnd, bytes-in-flight, and CCA state over time. The 999-state- transitions-in-6.7-seconds count and the ~14 ms oscillation period came from that visualisation. Canonical concepts/qlog-quic-instrumentation instance — the standardised JSON event log that makes this class of bug visible at all.
  9. Lessons named explicitly in the post. (a) "'Idle' is harder to define than it sounds. Normal pipeline delays at small windows can look like idleness to simple checks." (b) "Minimum-cwnd dynamics are a unique corner case. The bug was invisible at high speeds and only triggered after severe loss." (c) "The fix was surprisingly small compared to the complexity of the behavior."
  10. Forward posture: CCA work at Cloudflare is broader than loss- based algorithms. The post's close flags that Cloudflare also runs a model-based BBRv3 implementation via quiche's modular CC design, "now enabled for a growing percentage of our QUIC deployments."

Operational numbers

  • Test regime: localhost quiche HTTP/3 client and server, RTT = 10 ms (configured), 10 MB file download over HTTP/3, CUBIC CCA, 30% random packet loss during the first 2 seconds, loss stops entirely after 2 s, 10-second timeout (generous vs expected 4–5 s completion).
  • Failure rate: ~60% of 100-run test batches failed the 10-second timeout with CUBIC.
  • Oscillation count: 999 state transitions between congestion_avoidance and recovery in ~6.7 seconds — one transition every ~14 ms.
  • Minimum cwnd: 2,700 bytes = 2 full-size packets.
  • RTT: 10 ms (configured) + ~4 ms rounding / scheduler jitter produces the observed 14 ms oscillation period.
  • Reno control: 100% pass rate, completion by ~5 s.
  • Fix size: three lines of logic (one new state field last_ack_time; one update in the ACK-handling path; one delta computation in on_packet_sent()).

Caveats

  • Measurements are on a localhost test fixture with simulated packet loss + RTT — not wide-area production telemetry. The post does not disclose production-fleet impact numbers (how many real connections detoned into the death spiral under real wide-area conditions), only that this was a CI-visible test regression.
  • The post does not disclose the specific bytes_in_flight + last_sent_time + last_ack_time accuracy floor under production-workload Linux-scheduler jitter — the argument "accumulation of small deviations — from scheduler jitter and ACK processing variance — lets the <= boundary in in_congestion_recovery() slip behind the next packet's send time, breaking the cycle" is given qualitatively without a distribution.
  • No production-scale quiche deployment number is disclosed in this post for the fix rollout (number of fleet nodes, percent of QUIC traffic on fixed quiche version, date of global rollout). The fix is contributed back to cloudflare/quiche as a public PR.
  • The BBRv3 "growing percentage" is qualitative; no concrete QUIC-deployment-share number is disclosed in this post (an adjacent New standards post is linked for congestion-control specifics).
  • The two-packet minimum cwnd is "delayed-ACK slow-start behavior" — a specific CUBIC implementation floor; the specific numeric floor of 2,700 bytes depends on the MSS configured in quiche. Other QUIC implementations may use different minimums.

Source

Last updated · 542 distilled / 1,571 read