CONCEPT Cited by 1 source
Uplink saturation from backoff¶
What it is¶
A failure mode in which a retry loop against one dependency accidentally write-amplifies a second dependency, and the resulting write traffic saturates the network uplinks. The hazard is that the retry loop looks read-only (a "harmless reconnect"), but each retry has a side effect in the second system: a write. The aggregate write rate is roughly the fleet size times the per-node retry rate, which can dwarf normal write load.
Mechanism¶
Preconditions:
- Two systems share a code path at the client. A single operation in your application touches System A (the dependency that breaks) and System B (the dependency that ends up saturating).
- System A goes down. Clients enter their reconnect/backoff loop.
- Each reconnect attempt re-executes the shared code path. Every retry of the A-reconnect quietly re-invokes the write path to B.
- B has no independent rate-limit against client retries.
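The preconditions above can be sketched in a few lines. This is a minimal illustration, not code from any real system: `SystemA`, `SystemB`, `sync_state`, and `reconnect_loop` are hypothetical names, and the ordering inside `sync_state` (write to B first, then contact A) is an assumption chosen to make the coupling visible.

```python
class SystemA:
    """Hypothetical dependency that is currently down."""

    def connect(self):
        raise ConnectionError("A is unavailable")


class SystemB:
    """Hypothetical healthy dependency; counts the writes it receives."""

    def __init__(self):
        self.writes = 0

    def write(self, key, value):
        self.writes += 1


def sync_state(a, b, node_id):
    """Shared code path: one operation touches both B and A."""
    b.write(node_id, "state")  # side effect lands before A is contacted
    a.connect()                # raises while A is down


def reconnect_loop(a, b, node_id, attempts):
    """Naive retry loop: every attempt re-runs the whole shared path."""
    for _ in range(attempts):
        try:
            sync_state(a, b, node_id)
            return True
        except ConnectionError:
            continue  # backoff sleep omitted for brevity
    return False


a, b = SystemA(), SystemB()
for node in range(100):  # 100 nodes, 10 reconnect attempts each
    reconnect_loop(a, b, f"node-{node}", attempts=10)
print(b.writes)
```

No attempt ever reaches A successfully, yet B receives one write per attempt per node. Nothing in B's own interface distinguishes these writes from legitimate ones.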
Now every node in a fleet of thousands is retrying multiple times per second, each retry producing a write to B. The write load offered to B scales linearly with fleet size × retry rate, which can easily exceed the provisioned uplink capacity.
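As a back-of-envelope check on that scaling, the following sketch multiplies the terms out. Every number here is an illustrative assumption, not a figure from the source; `fanout` is a hypothetical multiplier for systems that forward each write to several peers.

```python
def retry_write_load(nodes, retries_per_sec, bytes_per_write, fanout=1):
    """Return (writes per second, bits per second) offered to System B."""
    writes_per_sec = nodes * retries_per_sec
    bits_per_sec = writes_per_sec * bytes_per_write * fanout * 8
    return writes_per_sec, bits_per_sec


# Assumed values: 5,000 nodes, 4 retries/s each, 2 KiB per write,
# each write forwarded to 3 peers.
writes, bits = retry_write_load(
    nodes=5_000, retries_per_sec=4, bytes_per_write=2_048, fanout=3
)
print(f"{writes:,} writes/s, {bits / 1e9:.2f} Gbit/s")
# prints "20,000 writes/s, 0.98 Gbit/s"
```

Doubling either the fleet size or the retry rate doubles the offered load, which is why slowing retries alone only moves the threshold rather than removing it.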
Canonical wiki instance — Fly.io Consul cert expiry¶
For a long time Fly.io ran Consul and Corrosion side-by-side for resiliency. A Consul mTLS certificate expired; every worker severed its Consul connection and entered a reconnect backoff loop. Each reconnect attempt re-invoked a code path to update Fly Machine state. That code path incurred a Corrosion write.
"By the time we've figured out what the hell is happening, we're literally saturating our uplinks almost everywhere in our fleet. We apologize to our uplink providers." (sources/2025-10-22-flyio-corrosion)
Consul (the dead system) wasn't the saturating load. Corrosion (the healthy system), written to as a side effect of the retry loop, was.
Mitigations¶
- Independent rate-limiters on any dependency's write path. Don't assume that because System A's retries don't affect System A's throughput, they don't affect anything else.
- Audit shared code paths — at least at the clients — for write amplification on retry.
- Circuit breakers on shared-code-path failure. If A is unhealthy, break the shared path entirely rather than keeping B-writes active as a side-effect.
- Jittered exponential backoff is necessary but not sufficient — the saturation is driven by the write amplification, not just the retry rate.
- Design for "what if my passive dependency becomes write-amplifying?" — an underappreciated class of blast-radius question.
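Two of the mitigations above, an independent rate limit on B's write path and a breaker on the shared path, can be sketched together. This is one possible shape under stated assumptions, not a prescribed implementation: `TokenBucket`, `Breaker`, `guarded_sync`, and `FakeClock` are hypothetical names, and the thresholds are arbitrary.

```python
import time


class TokenBucket:
    """Caps B-writes per node regardless of how often the caller retries."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate, self.burst, self.clock = rate, burst, clock
        self.tokens = float(burst)
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class Breaker:
    """Open after `threshold` consecutive A failures; closes on A's first success."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.failures = 0

    @property
    def open(self):
        return self.failures >= self.threshold

    def record(self, ok):
        self.failures = 0 if ok else self.failures + 1


def guarded_sync(connect_a, write_b, breaker, bucket):
    """Shared path with both guards. Returns True if B was written."""
    wrote = False
    if not breaker.open and bucket.allow():
        write_b()
        wrote = True
    try:
        connect_a()  # A is still probed so the breaker can close again
        breaker.record(True)
    except ConnectionError:
        breaker.record(False)
    return wrote


class FakeClock:
    """Deterministic clock for the demo (assumption: real code uses time.monotonic)."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


def failing_connect():
    raise ConnectionError("A is unavailable")


clock = FakeClock()
bucket = TokenBucket(rate=1.0, burst=2, clock=clock)
breaker = Breaker(threshold=3)
b_writes = []

for _ in range(50):  # 50 retries, 0.1 s apart, while A is down
    guarded_sync(failing_connect, lambda: b_writes.append(1), breaker, bucket)
    clock.now += 0.1

print(len(b_writes))  # 2 writes reach B instead of 50
```

The bucket bounds B's load even if the breaker logic is wrong, and the breaker stops B-writes entirely once A is clearly unhealthy. The reconnect attempts to A continue either way; only the side effect is suppressed.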
Structural lesson¶
A distributed system's worst failure mode is often one in which a passive dependency (retried over and over to re-establish connectivity) accidentally becomes a write-amplifying dependency through retry feedback. The cost of the retry wasn't in the dependency being retried; it was in the side effect.
Seen in¶
- sources/2025-10-22-flyio-corrosion — canonical primary source. The Consul-cert-expiry incident is the wiki's named instance.
Related¶
- concepts/thundering-herd — the classical sibling (many clients hitting one dependency at once).
- systems/consul — the dependency whose outage triggered the shape at Fly.io.
- systems/corrosion-swim — the write-amplification target.
- concepts/blast-radius — the generalised framing.