Cloudflare — When "idle" isn't idle: how a Linux kernel optimization became a QUIC bug
One-paragraph summary
Cloudflare engineering post (2026-05-12) on a subtle bug in
quiche — Cloudflare's open-source Rust QUIC /
HTTP/3 implementation — where
CUBIC's congestion window (cwnd) got permanently pinned at the
two-packet minimum and never recovered after a congestion-collapse
event. The bug was found via an ingress-proxy integration test that
ran a 10 MB HTTP/3 download with 30% random packet loss injected
during the first 2 seconds of an RTT = 10 ms connection. Expected
behaviour: CUBIC takes hits, drops cwnd, then ramps back up and
finishes in 4–5 s. Observed: ~60% of runs failed the generous
10-second timeout, with cwnd locked at 2,700 bytes (two full-size
packets) and CUBIC oscillating between congestion_avoidance and
recovery state 999 times in ~6.7 s — one transition every
~14 ms, suspiciously close to the connection's RTT. Root cause: a
2017 Linux-kernel optimisation that shifts CUBIC's epoch forward by
the application-idle duration (to preserve the growth-curve shape
across idle periods) was ported to quiche in 2020 without the
follow-up kernel patch (~1 week later) that prevented
epoch_start from being set into the future. In Linux TCP the fix
leans on the ACK-processing path's access to epoch_start; quiche
ports the original logic inside on_packet_sent() using
now - last_sent_time as the idle delta. At minimum cwnd,
bytes_in_flight drops to zero on every ACK cycle, and
last_sent_time is the start of the previous RTT — so the "idle"
delta is ~14 ms (the full RTT) rather than the actual ~0 ms gap
between last-ACK and next-send. Recovery-start time is pushed into
the future every send, in_congestion_recovery() returns true on
every ACK, cwnd growth is skipped, the pipe drains, and the cycle
repeats. The fix (three lines of code): add a last_ack_time
timestamp, update it on every ACK, and compute the idle delta from
max(last_ack_time, last_sent_time). When bytes_in_flight dips
transiently to zero between an ACK and the next send, the ACK is the
right anchor and the idle delta is ~0. The fix restored a 100% test
pass rate and has been contributed back to cloudflare/quiche.
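The cycle described above can be reproduced in a toy model. The sketch below is not quiche's code: it uses integer milliseconds instead of real timestamps, keeps cwnd fixed at the floor, assumes a loss was detected 5 ms after the first flight, and assumes each new flight leaves the instant its ACK is processed. The `IdleAnchor` enum and `flights_stuck_in_recovery` function are hypothetical names introduced for illustration. It counts how many flights are sent at or before the shifted recovery boundary, which is the condition under which their ACKs would not grow cwnd.

```rust
/// Which timestamp the idle check measures from when bytes_in_flight hits zero.
#[derive(Clone, Copy, Debug, PartialEq)]
enum IdleAnchor {
    /// As originally ported: now - last_sent_time.
    LastSent,
    /// After the fix: now - max(last_ack_time, last_sent_time).
    LatestOfAckOrSend,
}

/// Counts how many of `rounds` flights are sent at or before the (shifted)
/// recovery boundary, i.e. would be treated as still "in recovery".
fn flights_stuck_in_recovery(anchor: IdleAnchor, rtt_ms: u64, rounds: u32) -> u32 {
    // Toy starting point (assumption): a flight left at t = 0 and a loss was
    // detected at t = 5 ms, which set the recovery boundary.
    let mut last_sent_ms: u64 = 0;
    let mut recovery_start_ms: u64 = 5;
    let mut stuck = 0;

    for _ in 0..rounds {
        // One RTT later the whole flight is ACKed: bytes_in_flight is now 0.
        let last_ack_ms = last_sent_ms + rtt_ms;
        // The next flight goes out right away (a ~0 ms processing gap).
        let now_ms = last_ack_ms;

        // Send path: bytes_in_flight == 0 looks like "idle", so the boundary
        // is shifted forward by the measured idle time.
        let idle_ms = match anchor {
            IdleAnchor::LastSent => now_ms - last_sent_ms,
            IdleAnchor::LatestOfAckOrSend => now_ms - last_ack_ms.max(last_sent_ms),
        };
        recovery_start_ms += idle_ms;
        last_sent_ms = now_ms;

        // A packet sent at or before the boundary counts as in recovery,
        // so its ACK will not grow cwnd.
        if now_ms <= recovery_start_ms {
            stuck += 1;
        }
    }
    stuck
}

fn main() {
    let buggy = flights_stuck_in_recovery(IdleAnchor::LastSent, 14, 100);
    let fixed = flights_stuck_in_recovery(IdleAnchor::LatestOfAckOrSend, 14, 100);
    println!("stuck flights: last-send anchor = {}, latest-activity anchor = {}", buggy, fixed);
}
```

With the last-send anchor the boundary advances by a full RTT on every send and stays 5 ms ahead of each new flight, so every flight is stuck. With the latest-activity anchor the measured idle time is zero, the boundary stays put, and no flight after the first is stuck.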
Key takeaways
- The bug is invisible at high speeds and only surfaces in the minimum-cwnd corner of CCA state space. Verbatim: "Recovery after congestion collapse is an uncommon regime, but it is exactly the regime a congestion controller exists to handle. Most congestion control tests exercise the steady-state and growth phases of an algorithm; far fewer probe what happens at minimum cwnd, after the connection has been beaten down. Bugs in this corner of the state space are invisible in throughput dashboards, undetectable by static review, and only surface when you deliberately drive a CCA into it and watch whether it can climb back out." Canonical wiki instance of patterns/adversarial-corner-case-test-for-recovery.
- The trigger is a death
spiral at cwnd = 2 × MSS. Five-step loop (from the post): (a) the sender transmits the two-packet window; (b) after one RTT, both packets are ACKed and bytes_in_flight drops to zero; (c) on_packet_sent() sees bytes_in_flight == 0 and assumes the connection was idle, when it was actually congestion-limited; (d) the idle delta computed as now - last_sent_time is ~14 ms (the full RTT plus rounding errors), not the true ~0 ms processing gap between the last ACK and the next send; (e) recovery_start_time += delta pushes the recovery boundary into the future, in_congestion_recovery() returns true on every incoming ACK, cwnd growth is skipped, and the pipe drains completely on the next ACK, restarting the cycle.
- The bug needs three conditions to trigger simultaneously: a
real loss event to set congestion_recovery_start_time, congestion avoidance (post-slow-start) to be running, and cwnd collapsed to the two-packet floor. Before slow start exits, congestion_recovery_start_time is unset, so the buggy branch has no boundary to advance; this is why the bug doesn't fire at connection start even though bytes_in_flight == 0 is common there.
- Reno was the control experiment. The team re-ran the exact
test with Reno swapped in: 100% pass
rate, clean recovery after the loss phase ends at T = 2 s, download
completes by ~5 s. Same loss regime, same timeout, same
RTT = 10 ms; only the CCA differs. That is both the smoking gun (the bug is CUBIC-specific, not a platform issue) and a textbook instance of paired CCA experiments to localise the failure to one algorithm.
- The bug's RTT-matched oscillation period was the key clue.
The CCA state machine flipped between congestion_avoidance and recovery every ~14 ms, "suspiciously close to the connection's RTT (10ms)", which told the team the trigger was happening once per round trip, in lockstep with the ACK clock. Canonical concepts/ack-clock diagnostic instance: on a download, ACKs travel client-to-server; every time they land, bytes_in_flight drops to zero and the server sends the next two-packet burst; that is the trigger.
- Linux TCP CUBIC's 2017 fix (shift epoch forward by idle
duration, preserving curve shape) shipped with a bug of its own.
The canonical kernel commit (30927520dbae) was followed ~1 week later
by a patch (c2e7204d180f) titled "tcp_cubic: do not set epoch_start in
the future". The follow-up commit's message names the precise failure
mode: "Tracking idle time in bictcp_cwnd_event() is imprecise, as epoch_start is normally set at ACK processing time, not at send time." Linux TCP CUBIC has the ACK-processing path available to it; quiche's port inside on_packet_sent() does not. Canonical patterns/userspace-port-of-kernel-primitive-risk instance.
- The fix is structurally tiny relative to the investigation
effort. Verbatim: "After weeks of instrumenting qlogs and
analyzing visualizations to find the root cause, the solution
required changing just three lines of code." Add last_ack_time state; update it on every ACK; compute the idle delta as now - max(last_ack_time, last_sent_time) in on_packet_sent(). When bytes_in_flight dips transiently to zero between an ACK and the next send (the bug condition), last_ack_time > last_sent_time and the delta captures the true ~0 ms processing gap. For a genuinely idle connection (no ACKs in a while), last_ack_time is far in the past and the original epoch-shift behaviour is preserved. Canonical patterns/measure-idle-from-last-ack-not-last-send instance.
- qlog was the investigation substrate. The team instrumented quiche's qlog output with packet-loss events and built visualisations showing cwnd, bytes-in-flight, and CCA state over time. The 999-state-transitions-in-6.7-seconds count and the ~14 ms oscillation period came from that visualisation. Canonical concepts/qlog-quic-instrumentation instance: the standardised JSON event log that makes this class of bug visible at all.
- Lessons named explicitly in the post. (a) "'Idle' is harder to define than it sounds. Normal pipeline delays at small windows can look like idleness to simple checks." (b) "Minimum-cwnd dynamics are a unique corner case. The bug was invisible at high speeds and only triggered after severe loss." (c) "The fix was surprisingly small compared to the complexity of the behavior."
- Forward posture: CCA work at Cloudflare is broader than loss-based algorithms. The post's close flags that Cloudflare also runs a model-based BBRv3 implementation via quiche's modular CC design, "now enabled for a growing percentage of our QUIC deployments."
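The gating conditions and the corrected idle measurement from the takeaways above can be put together in a short sketch. It is a minimal illustration, not quiche's implementation: the `CcState` struct and its two methods are hypothetical, and only the field names mentioned in the text (bytes_in_flight, congestion_recovery_start_time, last_sent_time, last_ack_time) are taken from the source. The send hook shifts the recovery boundary only when nothing is in flight and a recovery period exists, and it measures idleness from whichever of the last send or last ACK happened most recently.

```rust
use std::time::{Duration, Instant};

/// Hypothetical slice of congestion-control state; not quiche's actual struct.
struct CcState {
    bytes_in_flight: usize,
    /// None until a real loss event has started a recovery period.
    congestion_recovery_start_time: Option<Instant>,
    last_sent_time: Option<Instant>,
    last_ack_time: Option<Instant>,
}

impl CcState {
    /// ACK path: release the acked bytes and remember when the ACK was processed.
    fn on_ack(&mut self, now: Instant, acked_bytes: usize) {
        self.bytes_in_flight = self.bytes_in_flight.saturating_sub(acked_bytes);
        self.last_ack_time = Some(now);
    }

    /// Send path: returns the shift applied to the recovery boundary.
    fn on_packet_sent(&mut self, now: Instant, sent_bytes: usize) -> Duration {
        let mut shift = Duration::ZERO;
        if self.bytes_in_flight == 0 {
            // Anchor on whichever happened last: the previous send or the last ACK.
            let anchor = self
                .last_sent_time
                .map(|sent| self.last_ack_time.map_or(sent, |ack| ack.max(sent)));
            // Without a recovery period (no loss yet) there is nothing to shift.
            if let (Some(anchor), Some(start)) = (anchor, self.congestion_recovery_start_time) {
                shift = now.saturating_duration_since(anchor);
                self.congestion_recovery_start_time = Some(start + shift);
            }
        }
        self.last_sent_time = Some(now);
        self.bytes_in_flight += sent_bytes;
        shift
    }
}

fn main() {
    let t0 = Instant::now();
    let ms = Duration::from_millis;
    let mut cc = CcState {
        bytes_in_flight: 0,
        congestion_recovery_start_time: Some(t0),
        last_sent_time: None,
        last_ack_time: None,
    };
    cc.on_packet_sent(t0, 2_700); // first two-packet flight
    cc.on_ack(t0 + ms(14), 2_700); // ACKed one RTT later
    let busy = cc.on_packet_sent(t0 + ms(14), 2_700); // sent right after the ACK
    cc.on_ack(t0 + ms(28), 2_700);
    let idle = cc.on_packet_sent(t0 + ms(528), 2_700); // after 500 ms of real idleness
    println!("shift when busy: {:?}, shift after real idle: {:?}", busy, idle);
}
```

In this sketch the back-to-back send produces no shift, while the send after a genuine 500 ms pause shifts the boundary by the full idle time, which matches the behaviour the takeaway describes for the two cases.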
Operational numbers
- Test regime: localhost quiche HTTP/3 client and server, RTT = 10 ms (configured), 10 MB file download over HTTP/3, CUBIC CCA, 30% random packet loss during the first 2 seconds, loss stops entirely after 2 s, 10-second timeout (generous vs expected 4–5 s completion).
- Failure rate: ~60% of runs in 100-run test batches failed the 10-second timeout with CUBIC.
- Oscillation count: 999 state transitions between congestion_avoidance and recovery in ~6.7 seconds, one transition every ~14 ms.
- Minimum cwnd: 2,700 bytes = 2 full-size packets.
- RTT: 10 ms (configured) + ~4 ms rounding / scheduler jitter produces the observed 14 ms oscillation period.
- Reno control: 100% pass rate, completion by ~5 s.
- Fix size: three lines of logic (one new state field, last_ack_time; one update in the ACK-handling path; one delta computation in on_packet_sent()).
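A back-of-envelope check on the figures above shows why a connection pinned at the floor cannot meet the timeout. The sketch below only does arithmetic on the quoted numbers. It assumes 10 MB means 10,000,000 bytes, that exactly one 2,700-byte flight is delivered per ~14 ms cycle, and it ignores headers, retransmissions, and whatever was delivered before the collapse. All constant and function names are illustrative.

```rust
// Figures quoted in this section.
const MIN_CWND_BYTES: u64 = 2_700;
const PACKETS_AT_FLOOR: u64 = 2;
const STATE_TRANSITIONS: u64 = 999;
const OBSERVED_WINDOW_MS: f64 = 6_700.0;
const CYCLE_MS: u64 = 14;
// Assumption: "10 MB" is read as decimal megabytes.
const DOWNLOAD_BYTES: u64 = 10_000_000;

/// Size of one full-size packet implied by the two-packet floor.
fn packet_size_bytes() -> u64 {
    MIN_CWND_BYTES / PACKETS_AT_FLOOR
}

/// Average spacing between state transitions over the observed window.
fn ms_per_transition() -> f64 {
    OBSERVED_WINDOW_MS / STATE_TRANSITIONS as f64
}

/// Number of floor-sized flights needed to move the whole download (rounded up).
fn flights_needed_at_floor() -> u64 {
    (DOWNLOAD_BYTES + MIN_CWND_BYTES - 1) / MIN_CWND_BYTES
}

/// Time to finish if every flight takes one ~14 ms cycle.
fn ms_to_finish_at_floor() -> u64 {
    flights_needed_at_floor() * CYCLE_MS
}

fn main() {
    println!("packet size: {} bytes", packet_size_bytes());
    println!("average gap between transitions: {:.2} ms", ms_per_transition());
    println!("flights needed at the floor: {}", flights_needed_at_floor());
    println!("time to finish at the floor: {} ms", ms_to_finish_at_floor());
}
```

Under these assumptions the download needs 3,704 flights and about 52 seconds at the floor, far beyond the 10-second timeout. The average gap between transitions works out to about 6.7 ms, so the ~14 ms period lines up with one full cycle of two transitions (into recovery and back out); that reading is an inference from the arithmetic, not something the section states.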
Caveats
- Measurements are on a localhost test fixture with simulated packet loss and RTT, not wide-area production telemetry. The post does not disclose production-fleet impact numbers (how many real connections fell into the death spiral under real wide-area conditions), only that this was a CI-visible test regression.
- The post does not disclose the specific bytes_in_flight + last_sent_time + last_ack_time accuracy floor under production-workload Linux-scheduler jitter; the argument "accumulation of small deviations — from scheduler jitter and ACK processing variance — lets the <= boundary in in_congestion_recovery() slip behind the next packet's send time, breaking the cycle" is given qualitatively without a distribution.
- No production-scale quiche deployment number is disclosed in
this post for the fix rollout (number of fleet nodes, percent
of QUIC traffic on the fixed quiche version, date of global
rollout). The fix is contributed back to cloudflare/quiche as a public PR.
- The BBRv3 "growing percentage" is qualitative; no concrete QUIC-deployment-share number is disclosed in this post (an adjacent New standards post is linked for congestion-control specifics).
- The two-packet minimum cwnd is described as "delayed-ACK slow-start behavior"; it is an implementation-specific CUBIC floor, and its numeric value of 2,700 bytes depends on the MSS configured in quiche. Other QUIC implementations may use different minimums.
Source
- Original: https://blog.cloudflare.com/quic-death-spiral-fix/
- Raw markdown: raw/cloudflare/2026-05-12-when-idle-isnt-idle-how-a-linux-kernel-optimization-became-a-bff1ad35.md
- Upstream Linux-kernel references: CUBIC after-idle commit (30927520dbae), follow-up kernel fix (c2e7204d180f), RFC 9438 (CUBIC).
Related
- systems/quiche
- systems/cubic-congestion-control
- systems/bbrv3
- systems/tcp-reno
- concepts/cubic-epoch
- concepts/bytes-in-flight
- concepts/minimum-cwnd-death-spiral
- concepts/false-idle-detection
- concepts/ack-clock
- concepts/qlog-quic-instrumentation
- concepts/congestion-window
- concepts/user-space-congestion-control
- concepts/quic-transport
- concepts/http-3
- patterns/measure-idle-from-last-ack-not-last-send
- patterns/adversarial-corner-case-test-for-recovery
- patterns/userspace-port-of-kernel-primitive-risk
- companies/cloudflare