
CONCEPT Cited by 1 source

Uplink saturation from backoff

What it is

A failure mode in which a retry loop against one dependency write-amplifies a second dependency until the second dependency's write traffic saturates the network uplinks. The hazard is that the retry loop looks read-only (a "harmless reconnect"), but each retry has a side effect in the second system: a write. The aggregate write rate is the fleet size times the per-node retry rate, which can dwarf the normal write load.

Mechanism

Preconditions and trigger:

  1. Two systems share a code path at the client. A single operation in your application touches System A (the dependency that breaks) and System B (the dependency that ends up saturating).
  2. System A goes down. Clients enter their reconnect/backoff loop.
  3. Each reconnect attempt re-executes the shared code path. Every retry of the A-reconnect quietly re-invokes the write path to B.
  4. B has no independent rate-limit against client retries.
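
The four conditions above can be reproduced in a few lines. This is a minimal sketch, not code from any real system: `FakeStoreB`, `connect_to_a`, `sync_state`, and `reconnect_loop` are hypothetical names, and the real backoff sleep is omitted so the example runs instantly.

```python
class SystemADown(Exception):
    """Raised when the (hypothetical) System A cannot be reached."""


class FakeStoreB:
    """Stand-in for System B; it only counts the writes it receives."""

    def __init__(self):
        self.writes = 0

    def write(self, key, value):
        self.writes += 1


def connect_to_a():
    # System A is down for the whole example (condition 2).
    raise SystemADown("System A is unreachable")


def sync_state(store_b):
    # The shared code path (condition 1): a write to B, then a call to A.
    store_b.write("machine-state", "starting")
    connect_to_a()


def reconnect_loop(store_b, attempts):
    # Each retry re-runs the whole shared path (condition 3), and nothing
    # limits how often B is written (condition 4).
    for _ in range(attempts):
        try:
            sync_state(store_b)
            return True
        except SystemADown:
            continue  # a real client would sleep with backoff here
    return False


store = FakeStoreB()
reconnected = reconnect_loop(store, attempts=50)
print(reconnected, store.writes)
```

A never comes back, so the loop never succeeds, yet B has absorbed one write per attempt. The retry looks like a reconnect to A; its cost lands on B.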

Now every node in a fleet of thousands is retrying multiple times per second, and each retry produces a write to B. The write load on B scales linearly with fleet size × retry rate, which can easily exceed the provisioned uplink capacity.
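
A back-of-the-envelope calculation makes the scaling concrete. Every number below is an illustrative assumption (fleet size, retry rate, write size, baseline write rate); none comes from the Fly.io incident. The function `aggregate_write_bps` is a hypothetical helper.

```python
def aggregate_write_bps(nodes, writes_per_sec_per_node, bytes_per_write):
    """Fleet-wide write traffic to B, in bits per second."""
    return nodes * writes_per_sec_per_node * bytes_per_write * 8


# Assumed values, chosen only to illustrate the scaling.
NODES = 5000             # fleet size
BYTES_PER_WRITE = 2048   # payload of one write to B
NORMAL_RATE = 0.25       # writes/sec per node in steady state
RETRY_RATE = 4           # retries/sec per node during the outage

normal_bps = aggregate_write_bps(NODES, NORMAL_RATE, BYTES_PER_WRITE)
storm_bps = aggregate_write_bps(NODES, RETRY_RATE, BYTES_PER_WRITE)

print(f"steady state: {normal_bps / 1e6:.1f} Mbit/s")
print(f"retry storm:  {storm_bps / 1e6:.1f} Mbit/s")
print(f"amplification: {storm_bps / normal_bps:.0f}x")
```

Under these assumptions the retry storm carries 16 times the steady-state write traffic. The ratio depends only on retry rate divided by normal write rate, so a tighter retry loop or a quieter baseline makes it worse.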

Canonical wiki instance — Fly.io Consul cert expiry

For a long time Fly.io ran Consul and Corrosion side-by-side for resiliency. A Consul mTLS certificate expired; every worker severed its Consul connection and entered a reconnect backoff loop. Each reconnect attempt re-invoked a code path to update Fly Machine state. That code path incurred a Corrosion write.

"By the time we've figured out what the hell is happening, we're literally saturating our uplinks almost everywhere in our fleet. We apologize to our uplink providers." (sources/2025-10-22-flyio-corrosion)

Consul, the dead system, was not the source of the saturating load. Corrosion, the healthy system that was written to as a side effect of the retry loop, was.

Mitigations

  • Independent rate-limiters on any dependency's write path. Don't assume that because System A's retries don't affect System A's throughput, they don't affect anything else.
  • Audit shared code paths — at least at the clients — for write amplification on retry.
  • Circuit breakers on shared-code-path failure. If A is unhealthy, break the shared path entirely rather than keeping B-writes active as a side effect.
  • Jittered exponential backoff is necessary but not sufficient — the saturation is driven by the write amplification, not just the retry rate.
  • Design for "what if my passive dependency becomes write-amplifying?" — an underappreciated class of blast-radius question.
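
Three of the mitigations above (an independent rate limiter on B's write path, a breaker that cuts B-writes while A is unhealthy, and jittered exponential backoff) can be sketched together. This is an illustration under stated assumptions, not Fly.io's implementation: `TokenBucket`, `GuardedWriter`, `backoff_delay`, and the `a_healthy` flag are hypothetical names, and the clock is injected so the demo is deterministic.

```python
import random
import time


class TokenBucket:
    """Allows at most `burst` writes at once, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.clock = clock
        self.last = clock()

    def allow(self):
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class GuardedWriter:
    """Wraps the write path to B with a breaker and a rate limiter."""

    def __init__(self, write_fn, bucket):
        self.write_fn = write_fn
        self.bucket = bucket
        self.a_healthy = True  # set to False by whoever monitors System A
        self.dropped = 0

    def write(self, key, value):
        # Breaker first: while A is unhealthy, the shared path sends nothing to B.
        if not self.a_healthy or not self.bucket.allow():
            self.dropped += 1
            return False
        self.write_fn(key, value)
        return True


def backoff_delay(attempt, base=0.5, cap=30.0):
    """Full-jitter exponential backoff: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


# Demo with a fake clock: 100 back-to-back retries, then a pause, then a breaker trip.
fake_now = [0.0]
sent = []
bucket = TokenBucket(rate_per_sec=1, burst=2, clock=lambda: fake_now[0])
writer = GuardedWriter(lambda k, v: sent.append((k, v)), bucket)

for i in range(100):
    writer.write("machine-state", i)
print(len(sent), writer.dropped)  # only the burst gets through

fake_now[0] = 3.0                 # three seconds later the bucket has refilled
writer.write("machine-state", "after-pause")

writer.a_healthy = False          # breaker open: B-writes stop entirely
writer.write("machine-state", "during-outage")
print(len(sent), writer.dropped)
```

The limiter caps B's write rate no matter how fast the A-reconnect loop spins, which is why it has to sit on B's path rather than on A's. Backoff alone only slows the loop; it does not remove the write from each iteration.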

Structural lesson

A distributed system's worst failure mode often arises when a passive dependency (one that is merely retried to re-establish connectivity) accidentally becomes a write-amplifying dependency through retry feedback. The cost of the retry is not in the dependency being retried; it is in the side effect.

Seen in

Last updated · 200 distilled / 1,178 read