CONCEPT Cited by 1 source

Uplink saturation from backoff¶

What it is¶

A failure mode in which a retry loop against one dependency unintentionally amplifies writes to a second dependency, and that second dependency's write traffic saturates the network uplinks. The hazard is that the retry loop is usually assumed to be read-only (a "harmless reconnect"), while each retry actually produces write traffic in the second system as a side effect. The fleet-wide retry rate is the fleet size times the per-node retry rate, which can dwarf the normal write load.

Mechanism¶

Preconditions:

Two systems share a code path at the client. A single operation in your application touches System A (the dependency that breaks) and System B (the dependency that ends up saturating).
System A goes down. Clients enter their reconnect/backoff loop.
Each reconnect attempt re-executes the shared code path. Every retry of the A-reconnect quietly re-invokes the write path to B.
B has no independent rate limit on the writes that client retries generate.

Now every node in a fleet of thousands is retrying multiple times per second, and each retry produces a write to B. The write load offered to B scales linearly with fleet size × retry rate, which can easily exceed any provisioned uplink.
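The scaling can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative only: the function name, the fleet size, the retry rate, the payload size, and the single 10 Gbit/s uplink are all hypothetical numbers chosen for the example, not figures from the Fly.io incident. It also assumes, for simplicity, that every retry-driven write crosses the same uplink.

```python
def offered_write_load_bps(
    fleet_size: int,
    retries_per_sec: float,
    bytes_per_write: int,
    writes_per_retry: int = 1,
) -> float:
    """Aggregate write traffic, in bits per second, that retries push at system B.

    Linear in fleet size and in per-node retry rate, as described above.
    """
    return fleet_size * retries_per_sec * writes_per_retry * bytes_per_write * 8


# Hypothetical inputs, for illustration only.
FLEET_SIZE = 8_000          # nodes stuck in the reconnect loop
RETRIES_PER_SEC = 5.0       # per-node retry rate while A is down
BYTES_PER_WRITE = 32 * 1024 # bytes on the wire per write to B
UPLINK_BPS = 10e9           # one assumed 10 Gbit/s uplink

load = offered_write_load_bps(FLEET_SIZE, RETRIES_PER_SEC, BYTES_PER_WRITE)
utilisation = load / UPLINK_BPS
print(f"offered load: {load / 1e9:.2f} Gbit/s ({utilisation:.0%} of the uplink)")
```

Under these assumed numbers the retry-driven writes alone exceed the uplink's capacity, before any normal traffic is counted. Halving the retry rate only halves the load, which is why backoff on its own does not remove the hazard.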

Canonical wiki instance — Fly.io Consul cert expiry¶

For a long time Fly.io ran Consul and Corrosion side-by-side for resiliency. A Consul mTLS certificate expired; every worker severed its Consul connection and entered a reconnect backoff loop. Each reconnect attempt re-invoked a code path to update Fly Machine state. That code path incurred a Corrosion write.

"By the time we've figured out what the hell is happening, we're literally saturating our uplinks almost everywhere in our fleet. We apologize to our uplink providers." (sources/2025-10-22-flyio-corrosion)

Consul (the dead system) wasn't the saturating load. Corrosion, the healthy system that was being written to as a side effect of the retry loop, was.

Mitigations¶

Independent rate-limiters on any dependency's write path. Don't assume that because System A's retries don't affect System A's throughput, they don't affect anything else.
Audit shared code paths — at least at the clients — for write amplification on retry.
Circuit breakers on shared-code-path failure. If A is unhealthy, break the shared path entirely rather than keeping B-writes active as a side-effect.
Jittered exponential backoff is necessary but not sufficient: backoff lowers the retry rate, but each retry still carries a write to B, so the saturation is driven by the write amplification and not by the retry rate alone.
Design for "what if my passive dependency becomes write-amplifying?" — an underappreciated class of blast-radius question.
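Two of the mitigations above, an independent rate limit on B's write path and a circuit breaker on the shared code path, can be sketched together. This is a minimal sketch under stated assumptions, not Fly.io's implementation: `TokenBucket`, `GuardedWriter`, the `a_healthy` flag, and the `write_to_b` callback are hypothetical names introduced for illustration, and the health flag is assumed to be maintained by whatever component monitors the connection to A.

```python
import time
from typing import Any, Callable


class TokenBucket:
    """Allows roughly `rate` operations per second, with bursts up to `burst`."""

    def __init__(self, rate: float, burst: int,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.tokens = float(burst)
        self.last = clock()

    def try_acquire(self) -> bool:
        # Refill in proportion to elapsed time, capped at the burst size.
        now = self.clock()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class GuardedWriter:
    """Wraps the write to B so that retries against A cannot amplify it unboundedly."""

    def __init__(self, write_to_b: Callable[[Any], None], limiter: TokenBucket):
        self.write_to_b = write_to_b  # hypothetical callback that performs the B write
        self.limiter = limiter
        self.a_healthy = True         # assumed to be flipped by A's connection monitor
        self.dropped = 0

    def write(self, payload: Any) -> bool:
        # Circuit breaker: while A is unhealthy, the shared path does not touch B.
        if not self.a_healthy:
            self.dropped += 1
            return False
        # Independent rate limit: B's write path has its own budget.
        if not self.limiter.try_acquire():
            self.dropped += 1
            return False
        self.write_to_b(payload)
        return True
```

A caller would route every B write on the shared path through `GuardedWriter.write` and treat a `False` return as "skipped, try again later". Whether a skipped write may be dropped or must be queued depends on the data involved; this sketch simply counts drops and leaves that decision open.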

Structural lesson¶

A distributed system's worst failure mode is often one in which a passive dependency (retried over and over to re-establish connectivity) accidentally becomes a write-amplifying dependency via retry feedback. The cost of the retry is not in the dependency being retried; it is in the side effect.

Seen in¶

sources/2025-10-22-flyio-corrosion — canonical primary source. The Consul-cert-expiry incident is the wiki's named instance.

concepts/thundering-herd — the classical sibling (many clients hitting one dependency at once).
systems/consul — the dependency whose outage triggered the shape at Fly.io.
systems/corrosion-swim — the write-amplification target.
concepts/blast-radius — the generalised framing.