
Cascading failure

Definition

A failure that grows over time via a positive feedback loop: one node's failure shifts its load onto the remaining nodes, raising their probability of failure, which shifts still more load, and so on until the system collapses. First-principles definition from HDM Stuttgart (via sources/2022-07-11-highscalability-stuff-the-internet-says-on-scalability-for-july-11th-2022):

"A cascading failure is a failure that increases in size over time due to a positive feedback loop. The typical behavior is initially triggered by a single node or subsystem failing. This spreads the load across fewer nodes of the remaining system, which in turn increases the likelihood of further system failures resulting in a vicious circle or snowball effect."

Root cause pattern

The most common cause is server overload, or a direct consequence of it. When load crosses a threshold:

  1. Per-server resources (CPU, memory, threads, connections) are exhausted.
  2. Latency (P99) rises; error rate climbs; health checks start to fail.
  3. Failed/marked-unhealthy servers drop out of the pool.
  4. Traffic is load-balanced to the remaining healthy servers, increasing per-server load.
  5. Go to step 1 on more servers.
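The five-step loop above can be sketched as a toy simulation (names and numbers are illustrative, not from the source): each server has a fixed capacity, one overloaded server drops out per round, and its traffic is rebalanced onto the survivors.

```python
def simulate_cascade(total_load: float, capacity: float, servers: int) -> list[int]:
    """Return the number of healthy servers after each round of the loop."""
    healthy = servers
    history = [healthy]
    while healthy > 0:
        per_server = total_load / healthy  # step 4: traffic rebalanced
        if per_server <= capacity:
            break                          # pool can absorb the load; no cascade
        healthy -= 1                       # steps 1-3: one server exhausts its
        history.append(healthy)            # resources and drops out of the pool
    return history

# Below the threshold the pool is stable; above it, every rebalance pushes
# per-server load higher, so the collapse runs to completion.
print(simulate_cascade(total_load=95, capacity=11, servers=10))   # stable
print(simulate_cascade(total_load=115, capacity=11, servers=10))  # total collapse
```

The sketch makes the threshold nature visible: once per-server load exceeds capacity for the full pool, shrinking the pool can only raise per-server load further, so there is no stable intermediate state short of total collapse.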

Mitigation toolkit

From the HDM Stuttgart post (Source: sources/2022-07-11-highscalability-stuff-the-internet-says-on-scalability-for-july-11th-2022):

  • Add resources — first and most intuitive, but reactive.
  • Avoid health-check failures/deaths — don't let a transiently-slow server be marked dead and shed to already-hot peers.
  • Restart servers on thread-blocking / deadlock conditions so they can resume absorbing load.
  • Drop traffic significantly, then ramp back up — let servers breathe and gradually recover instead of returning full-blast traffic that re-overloads them.
  • Switch to degraded mode — drop specific classes of traffic (e.g. search but not checkout).
  • Eliminate batch / bad traffic — surgery at the input layer to reduce total system load.
  • Move from orchestration to choreography — pub/sub decoupling so a slow consumer doesn't back-pressure a fast producer.
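The "drop traffic significantly, then ramp back up" item can be sketched as a small admission controller (all names here are illustrative, not from the source): on an overload signal it sheds most incoming requests, then linearly restores the admit rate so recovering servers are not hit full-blast again.

```python
import random

class AdmissionController:
    """Probabilistic admission gate with shed-then-ramp recovery."""

    def __init__(self, recovery_steps: int = 10):
        self.admit_fraction = 1.0          # 1.0 = admit all traffic
        self.step = 1.0 / recovery_steps   # how quickly to ramp back up

    def on_overload(self):
        # Shed aggressively so the backend can breathe.
        self.admit_fraction = 0.2

    def on_healthy_tick(self):
        # Gradually restore traffic instead of returning it full-blast,
        # which would re-overload the just-recovered servers.
        self.admit_fraction = min(1.0, self.admit_fraction + self.step)

    def admit(self) -> bool:
        # Each request is admitted with probability admit_fraction.
        return random.random() < self.admit_fraction
```

The same shed-then-ramp shape underlies the degraded-mode item as well: instead of a uniform probability, the controller would admit by traffic class (checkout yes, search no) until load subsides.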

Slack's 2022-02-22 outage: canonical modern example

Slack's postmortem (linked from the roundup):

"What caused us to go from a stable serving state to a state of overload? The answer turned out to lie in complex interactions between our application, the Vitess datastores, caching system, and our service discovery system."

Complex cross-subsystem positive-feedback loops are a dominant failure mode of modern microservice architectures.
