PATTERN Cited by 1 source

Adversarial corner-case test for recovery

Pattern

Deliberately drive a stateful system into its worst-case regime and measure whether it can climb back out. Applied to congestion control algorithms (CCAs): inject severe packet loss early in a connection's life, wait for the CCA to collapse cwnd to its minimum, stop injecting loss, and verify that the connection recovers within a fixed timeout.

Most tests exercise happy-path regimes — steady-state throughput, cold-start growth, fair-share behaviour under mild competition. Few tests exercise the corner of state space the system exists to handle: the recovery regime after something has gone very wrong.

Canonical instance: Cloudflare's quiche integration-test fixture (Source: sources/2026-05-12-cloudflare-when-idle-isnt-idle-how-a-linux-kernel-optimization-became-a-quic-bug):

The canonical test fixture

  • HTTP/3 over quiche, with client and server both on localhost.
  • RTT = 10 ms (configured).
  • 10 MB file download over HTTP/3.
  • CUBIC congestion control (the algorithm under test).
  • 30% random packet loss injected during the first 2 seconds of the connection.
  • After 2 s, loss stops entirely.
  • 10-second timeout. Under normal operation, the 10 MB should finish in 4–5 s.

Expected behaviour: CUBIC takes hits during the loss phase, cwnd collapses toward the minimum, loss stops, CUBIC rebuilds cwnd and finishes the download. Observed behaviour (pre-fix): ~60% of 100-run batches failed the 10 s timeout with cwnd pinned at the two-packet minimum.

The test finds the bug only because it deliberately drives the CCA into recovery-after-collapse. A normal throughput test would never reach the minimum-cwnd condition that is the bug's precondition.
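The shape of the fixture (loss phase, clean phase, hard timeout) can be sketched as a toy round-based simulation. This is a minimal sketch under stated assumptions, not the quiche test: the controller below is a simplified halve-on-loss, double-on-clean-round model rather than CUBIC, and MSS, MAX_CWND, run_fixture and the stuck_at_floor flag are hypothetical names introduced for illustration. The flag emulates a generic "cannot grow once the floor has been hit" defect, not the actual bug from the post. Timings are not calibrated against the real fixture.

```python
import random
from dataclasses import dataclass

MSS = 1200        # bytes per packet (assumed value)
MIN_CWND = 2      # packets: the two-packet floor
RTT_S = 0.010     # 10 ms round trip, as in the fixture
MAX_CWND = 1000   # packets: arbitrary cap for this toy model


@dataclass
class Result:
    finished: bool
    elapsed_s: float
    lowest_cwnd: int


def run_fixture(total_bytes=10_000_000, loss_rate=0.30, loss_phase_s=2.0,
                timeout_s=10.0, seed=0, stuck_at_floor=False):
    """Simulate one download; each loop iteration is one RTT.

    NOT CUBIC: the window halves on any loss in a round and doubles on a
    clean round. `stuck_at_floor` emulates a hypothetical defect that
    blocks all growth once cwnd has touched the minimum.
    """
    rng = random.Random(seed)
    cwnd, lowest, sent, rounds = 10, 10, 0, 0
    hit_floor = False
    while sent < total_bytes and rounds * RTT_S < timeout_s:
        # Adversarial phase: drop packets only during the first loss_phase_s.
        in_loss_phase = rounds * RTT_S < loss_phase_s
        lost = sum(1 for _ in range(cwnd)
                   if in_loss_phase and rng.random() < loss_rate)
        sent += (cwnd - lost) * MSS
        if lost:
            cwnd = max(MIN_CWND, cwnd // 2)
        elif not (stuck_at_floor and hit_floor):
            cwnd = min(MAX_CWND, cwnd * 2)
        hit_floor = hit_floor or cwnd == MIN_CWND
        lowest = min(lowest, cwnd)
        rounds += 1
    return Result(sent >= total_bytes, rounds * RTT_S, lowest)
```

In this model a healthy run should touch the floor during the loss phase and still finish well inside the timeout, while a run with stuck_at_floor=True stays at two packets per RTT and cannot move 10 MB in 10 s. The verdict comes from the timeout alone, which is why the loss phase has to be severe enough to reach the floor in the first place.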

Why this pattern specifically

The post articulates the principle explicitly:

"Recovery after congestion collapse is an uncommon regime, but it is exactly the regime a congestion controller exists to handle. Most congestion control tests exercise the steady-state and growth phases of an algorithm; far fewer probe what happens at minimum cwnd, after the connection has been beaten down. Bugs in this corner of the state space are invisible in throughput dashboards, undetectable by static review, and only surface when you deliberately drive a CCA into it and watch whether it can climb back out — which is exactly what this test did."

Three specific reasons the recovery regime is where bugs hide:

  1. Minimum-state conditions trigger branches not exercised elsewhere. At cwnd = 2 × MSS a CCA runs through branches (idle-period adjustment, recovery-time arithmetic, minimum-window enforcement) that behave differently from, or are never reached at, cwnd = 1000 × MSS. Normal-state tests simply don't execute those paths.
  2. Static review can't see it. Code that correctly handles large-state cases can subtly fail at boundary-state cases without any reviewer being able to tell from the source.
  3. Throughput dashboards don't show it. A connection stuck at minimum cwnd produces low throughput; so does a connection on a slow network link. Distinguishing requires CCA-internal state inspection (e.g. via qlog).
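The third point can be made concrete with a small check over cwnd samples. This is a sketch only: it assumes the samples have already been extracted from a trace (for example a qlog file) into (time_s, cwnd_bytes) tuples, and the extraction step is not shown. The function name pinned_at_floor and both sample traces are hypothetical.

```python
def pinned_at_floor(samples, floor_bytes, after_s):
    """Return True if every cwnd sample at or after `after_s` is at the floor.

    `samples` is an iterable of (time_s, cwnd_bytes) tuples. An empty tail
    returns False so that a missing trace is never mistaken for a finding.
    """
    tail = [cwnd for t, cwnd in samples if t >= after_s]
    return bool(tail) and all(cwnd <= floor_bytes for cwnd in tail)


FLOOR = 2 * 1200  # two packets of an assumed 1200-byte MSS

# Hypothetical traces: both connections could show low throughput on a
# dashboard, but only the first never rebuilds its window after loss stops.
stuck = [(0.5, 12000), (1.5, 2400), (3.0, 2400), (6.0, 2400), (9.0, 2400)]
slow_link = [(0.5, 12000), (1.5, 2400), (3.0, 9600), (6.0, 14400), (9.0, 14400)]

print(pinned_at_floor(stuck, FLOOR, after_s=2.0))
print(pinned_at_floor(slow_link, FLOOR, after_s=2.0))
```

Setting after_s to the end of the loss phase keeps the check from firing on the collapse itself, which is expected behaviour.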

Anti-patterns the pattern guards against

  • Happy-path-only integration tests. Great for regression detection on common paths, useless for corner-state bugs.
  • Unit tests that stub out the network. They exercise code paths but not the ACK-clock interaction that drives bugs like minimum-cwnd death spirals.
  • Fuzzing without state-machine coverage tracking. Random inputs rarely drive a complex stateful system into a specific corner; directed adversarial injection does.

Paired control experiment is a corollary

Once the test reliably fails, running the same fixture with a different algorithm or configuration localises the bug. The 2026-05-12 post's control was Reno: same 10 MB download, same 30% loss for 2 s, same 10 s timeout. Reno passed 100% of runs, which immediately narrowed the search to CUBIC-specific logic.
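A paired control reduces to running one fixture against several candidates and comparing failure rates. The sketch below assumes each candidate is a callable that runs the fixture once and returns True on success. The names failure_rate and localise are hypothetical, and fake_cubic and fake_reno are stand-ins that only loosely mirror the reported outcome; they are not real measurements or real quiche calls.

```python
import random


def failure_rate(run_once, runs=100, seed=0):
    """Run the fixture `runs` times and return the fraction that failed."""
    rng = random.Random(seed)
    failures = sum(0 if run_once(rng) else 1 for _ in range(runs))
    return failures / runs


def localise(candidates, runs=100):
    """Map each candidate name to its failure rate on the same fixture."""
    return {name: failure_rate(fn, runs) for name, fn in candidates.items()}


# Stand-in runners (hypothetical): a real harness would start the client
# and server with the given algorithm and report whether the download
# finished before the timeout.
def fake_cubic(rng):
    return rng.random() >= 0.6   # fails often, for illustration only


def fake_reno(rng):
    return True                  # control: always passes


rates = localise({"cubic": fake_cubic, "reno": fake_reno})
suspects = [name for name, rate in rates.items() if rate > 0.0]
print(rates, suspects)
```

Everything except the algorithm is held constant, so a candidate with a non-zero failure rate next to a control at zero points at logic specific to that candidate.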

Generalisation

The pattern applies beyond congestion control:

  • Database recovery from WAL replay. Force a crash mid-transaction, restart, and verify that the database's consistency invariants hold and the subsequent workload proceeds normally.
  • Distributed consensus after partition. Partition a majority, heal the partition, verify the cluster regains availability and the committed-log-prefix invariant holds.
  • Autoscaler recovery from zero capacity. Kill all replicas, verify the autoscaler scales back up within a bounded time.
  • Circuit breaker half-open probing. Drive a dependency to 100% error rate, recover the dependency, verify the breaker re-closes within a bounded number of probes.

In every case: the stated correctness claim of the subsystem is that it handles a specific kind of disaster. The test directly demonstrates that claim.
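The common shape of these tests (inject the disaster, confirm the worst-case state was actually reached, remove the disaster, wait a bounded time for recovery) can be sketched as a small generic harness. Everything here is hypothetical and for illustration: recovers_within, ToyAutoscaler and its multiplicative flag are invented names, and steps stand in for wall-clock time so the example stays deterministic.

```python
def _run_until(step, done, max_steps):
    """Advance the system until `done()` holds or the step budget runs out."""
    for _ in range(max_steps):
        if done():
            return True
        step()
    return done()


def recovers_within(step, set_fault, is_collapsed, is_recovered,
                    max_collapse_steps, max_recovery_steps):
    """Drive the system into collapse, lift the fault, and report recovery."""
    set_fault(True)
    if not _run_until(step, is_collapsed, max_collapse_steps):
        # Without reaching the corner state, a pass would prove nothing.
        raise RuntimeError("system never reached the worst-case state")
    set_fault(False)
    return _run_until(step, is_recovered, max_recovery_steps)


class ToyAutoscaler:
    """Toy model: a fault kills all replicas; each step scales back up."""

    def __init__(self, target=4, multiplicative=False):
        self.target = target
        self.replicas = target
        self.fault = False
        self.multiplicative = multiplicative

    def set_fault(self, on):
        self.fault = on

    def step(self):
        if self.fault:
            self.replicas = 0
        elif self.multiplicative:
            # Scaling relative to current capacity: 0 * 2 stays 0 forever.
            self.replicas = min(self.target, self.replicas * 2)
        else:
            self.replicas = min(self.target, self.replicas + 1)


def check(scaler):
    return recovers_within(scaler.step, scaler.set_fault,
                           lambda: scaler.replicas == 0,
                           lambda: scaler.replicas == scaler.target,
                           max_collapse_steps=5, max_recovery_steps=10)


print(check(ToyAutoscaler()), check(ToyAutoscaler(multiplicative=True)))
```

The multiplicative variant behaves identically to the additive one from any non-zero starting point, so only a test that first forces replicas to zero can tell the two apart.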
