Cascading failure
Definition
A failure that grows over time via a positive feedback loop: when one node fails, its load is redistributed across the remaining nodes, raising their probability of failure, which shifts still more load, and so on until the system collapses. First-principles definition from HDM Stuttgart (via sources/2022-07-11-highscalability-stuff-the-internet-says-on-scalability-for-july-11th-2022):
"A cascading failure is a failure that increases in size over time due to a positive feedback loop. The typical behavior is initially triggered by a single node or subsystem failing. This spreads the load across fewer nodes of the remaining system, which in turn increases the likelihood of further system failures resulting in a vicious circle or snowball effect."
Root cause pattern
The most common cause is server overload, or a problem that is a direct consequence of it. Once load crosses the overload threshold, the loop runs:
1. Per-server resources (CPU, memory, threads, connections) are exhausted.
2. Tail latency (p99) rises, error rates climb, and health checks start to fail.
3. Failed or unhealthy-marked servers drop out of the pool.
4. Traffic is load-balanced onto the remaining healthy servers, increasing per-server load.
5. Return to step 1, now with fewer servers and more load on each (a minimal simulation of this loop follows the list).
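To make the loop concrete, here is a minimal sketch. It assumes a pool of identical servers with a fixed per-server capacity and a load balancer that evenly rebalances traffic when a server is evicted; all numbers are illustrative, not from the source.

```python
# Illustrative simulation of the overload feedback loop described above.
# All parameters (pool size, per-server capacity, offered load) are made-up numbers.

def simulate_cascade(servers: int, capacity_per_server: float, total_load: float) -> None:
    """Evict one overloaded server per step, rebalance onto the survivors,
    and repeat until the pool stabilises or empties."""
    step = 1
    while servers > 0:
        load_per_server = total_load / servers
        print(f"step {step}: {servers} servers, {load_per_server:.1f} load each")
        if load_per_server <= capacity_per_server:
            print("pool is stable")
            return
        # An overloaded server fails its health checks and drops out of the pool;
        # the load balancer shifts its traffic onto the remaining servers.
        servers -= 1
        step += 1
    print("total collapse: no healthy servers left")

# 10 servers with capacity 100 each can absorb 1000 units of load;
# 1050 units tips the first server over and the loop runs to collapse.
simulate_cascade(servers=10, capacity_per_server=100.0, total_load=1050.0)
```

Note how close the stable and collapsing regimes are: at 1000 units the pool is fine, at 1050 it empties entirely, which is why cascading failures often look like a cliff rather than a gradual degradation.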
Mitigation toolkit
From the HDM Stuttgart post (Source: sources/2022-07-11-highscalability-stuff-the-internet-says-on-scalability-for-july-11th-2022):
- Add resources — first and most intuitive, but reactive.
- Avoid health-check failures/deaths — don't let a transiently slow server be marked dead and its load shed onto already-hot peers (first sketch below).
- Restart servers on thread-blocking / deadlock conditions so they can resume absorbing load.
- Drop traffic significantly, then ramp back up — let servers breathe and gradually recover instead of returning full-blast traffic that re-overloads them (the second sketch below combines this with degraded mode).
- Switch to degraded mode — drop specific classes of traffic (e.g. search but not checkout).
- Eliminate batch / bad traffic — surgery at the input layer to reduce total system load.
- Move from orchestration to choreography — pub/sub decoupling so a slow consumer doesn't back-pressure a fast producer.
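One common way to avoid health-check deaths is to make eviction deliberately sluggish. The sketch below is a hypothetical policy, not anything from the source post (the class name, thresholds, and pool-floor rule are assumptions): a server is evicted only after several consecutive probe failures, and never if doing so would shrink the healthy pool below a floor.

```python
# Hypothetical health-check policy (names and thresholds are illustrative,
# not from the source): evict a server only after several consecutive probe
# failures, and never if that would shrink the healthy pool below a floor.

from dataclasses import dataclass, field

@dataclass
class LenientHealthChecker:
    failures_before_eviction: int = 3      # tolerate transient slowness
    min_healthy_fraction: float = 0.5      # never evict below this pool floor
    consecutive_failures: dict = field(default_factory=dict)

    def observe(self, server: str, probe_ok: bool, healthy: int, total: int) -> bool:
        """Record one probe result; return True if `server` should be evicted."""
        if probe_ok:
            self.consecutive_failures[server] = 0
            return False
        count = self.consecutive_failures.get(server, 0) + 1
        self.consecutive_failures[server] = count
        if count < self.failures_before_eviction:
            return False  # likely a transient blip; keep the server in the pool
        if (healthy - 1) / total < self.min_healthy_fraction:
            return False  # evicting would concentrate load on too few peers
        return True

checker = LenientHealthChecker()
# Two transient failures do not evict the server...
print(checker.observe("web-7", probe_ok=False, healthy=10, total=10))  # False
print(checker.observe("web-7", probe_ok=False, healthy=10, total=10))  # False
# ...but a third consecutive failure does, since the pool is still mostly healthy.
print(checker.observe("web-7", probe_ok=False, healthy=10, total=10))  # True
```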
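Degraded mode and drop-then-ramp-back compose naturally at the ingress layer. The sketch below is illustrative only (the traffic classes, admission floor, and ramp schedule are assumptions): low-priority classes are shed outright while the remaining traffic is admitted at a fraction that climbs back to 100% over a recovery window.

```python
# Hypothetical ingress-side shedder (traffic classes, admission floor, and ramp
# schedule are illustrative assumptions): shed whole low-priority classes in
# degraded mode, and ramp the rest from 10% back to 100% over a recovery window.

import random
import time

DROPPED_CLASSES = {"search", "batch"}   # degraded mode: shed these outright
RAMP_SECONDS = 300                      # reach full admission five minutes into recovery

def admit(request_class: str, recovery_started_at: float) -> bool:
    """Decide whether a request is admitted while the pool recovers."""
    if request_class in DROPPED_CLASSES:
        return False
    # Linearly ramp the admitted fraction so recovering servers are not hit
    # with full-blast traffic the moment they come back.
    elapsed = time.time() - recovery_started_at
    admitted_fraction = min(1.0, 0.1 + 0.9 * (elapsed / RAMP_SECONDS))
    return random.random() < admitted_fraction

# Example: during recovery, checkout traffic ramps back in while search is shed.
recovery_start = time.time()
print(admit("search", recovery_start))    # always False while degraded mode is on
print(admit("checkout", recovery_start))  # True ~10% of the time at first, rising to 100%
```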
Slack's 2022-02-22 outage: canonical modern example
Slack's postmortem (linked from the roundup):
"What caused us to go from a stable serving state to a state of overload? The answer turned out to lie in complex interactions between our application, the Vitess datastores, caching system, and our service discovery system."
Complex cross-subsystem positive-feedback loops are the dominant failure mode of modern microservice architectures.
Related
- concepts/blast-radius — the bounding concept for the worst-case spread.
- concepts/load-shedding-at-ingestion — entry-point mitigation.
- patterns/shadow-mode-alert-before-paging — detection without amplification.
- sources/2022-07-11-highscalability-stuff-the-internet-says-on-scalability-for-july-11th-2022.
- companies/highscalability.