
Cascading failure

Definition

A failure that grows over time via a positive feedback loop: one node's failure shifts its load onto the remaining nodes, raising their probability of failure, which shifts still more load, and so on until the system collapses. First-principles definition from HDM Stuttgart (via sources/2022-07-11-highscalability-stuff-the-internet-says-on-scalability-for-july-11th-2022):

"A cascading failure is a failure that increases in size over time due to a positive feedback loop. The typical behavior is initially triggered by a single node or subsystem failing. This spreads the load across fewer nodes of the remaining system, which in turn increases the likelihood of further system failures resulting in a vicious circle or snowball effect."

Root cause pattern

The most common cause is server overload, or a direct consequence of it. When load crosses a threshold:

  1. Per-server resources (CPU, memory, threads, connections) are exhausted.
  2. Latency (P99) rises; error rate climbs; health checks start to fail.
  3. Failed/marked-unhealthy servers drop out of the pool.
  4. Traffic is load-balanced to the remaining healthy servers, increasing per-server load.
  5. Go to step 1 on more servers.
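The five-step loop above can be sketched as a toy simulation (names and numbers are illustrative, not from the source): each server has a fixed capacity, one overloaded server drops out per round, and its traffic is rebalanced onto the survivors.

```python
def simulate_cascade(total_load: float, capacity: float, servers: int) -> list[int]:
    """Return the number of healthy servers after each round of the loop."""
    healthy = servers
    history = [healthy]
    while healthy > 0:
        per_server = total_load / healthy  # step 4: traffic rebalanced
        if per_server <= capacity:
            break                          # pool can absorb the load; no cascade
        healthy -= 1                       # steps 1-3: one server exhausts its
        history.append(healthy)            # resources and drops out of the pool
    return history

# Below the threshold the pool is stable; above it, every rebalance pushes
# per-server load higher, so the collapse runs to completion.
print(simulate_cascade(total_load=95, capacity=11, servers=10))   # stable
print(simulate_cascade(total_load=115, capacity=11, servers=10))  # total collapse
```

The sketch makes the threshold nature visible: once per-server load exceeds capacity for the full pool, shrinking the pool can only raise per-server load further, so there is no stable intermediate state short of total collapse.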

Mitigation toolkit

From the HDM Stuttgart post (Source: sources/2022-07-11-highscalability-stuff-the-internet-says-on-scalability-for-july-11th-2022):

  • Add resources — first and most intuitive, but reactive.
  • Avoid health-check failures/deaths — don't let a transiently-slow server be marked dead and shed to already-hot peers.
  • Restart servers on thread-blocking / deadlock conditions so they can resume absorbing load.
  • Drop traffic significantly, then ramp back up — let servers breathe and gradually recover instead of returning full-blast traffic that re-overloads them.
  • Switch to degraded mode — drop specific classes of traffic (e.g. search but not checkout).
  • Eliminate batch / bad traffic — surgery at the input layer to reduce total system load.
  • Move from orchestration to choreography — pub/sub decoupling so a slow consumer doesn't back-pressure a fast producer.
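The "drop traffic significantly, then ramp back up" item can be sketched as a small admission controller (all names here are illustrative, not from the source): on an overload signal it sheds most incoming requests, then linearly restores the admit rate so recovering servers are not hit full-blast again.

```python
import random

class AdmissionController:
    """Probabilistic admission gate with shed-then-ramp recovery."""

    def __init__(self, recovery_steps: int = 10):
        self.admit_fraction = 1.0          # 1.0 = admit all traffic
        self.step = 1.0 / recovery_steps   # how quickly to ramp back up

    def on_overload(self):
        # Shed aggressively so the backend can breathe.
        self.admit_fraction = 0.2

    def on_healthy_tick(self):
        # Gradually restore traffic instead of returning it full-blast,
        # which would re-overload the just-recovered servers.
        self.admit_fraction = min(1.0, self.admit_fraction + self.step)

    def admit(self) -> bool:
        # Each request is admitted with probability admit_fraction.
        return random.random() < self.admit_fraction
```

The same shed-then-ramp shape underlies the degraded-mode item as well: instead of a uniform probability, the controller would admit by traffic class (checkout yes, search no) until load subsides.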

Slack's 2022-02-22 outage: canonical modern example

Slack's postmortem (linked from the roundup):

"What caused us to go from a stable serving state to a state of overload? The answer turned out to lie in complex interactions between our application, the Vitess datastores, caching system, and our service discovery system."

Complex cross-subsystem positive-feedback loops are a dominant failure mode of modern microservice architectures.
