
CONCEPT

Partial failure

Definition

Partial failure is the state of a distributed system where a component is neither fully up nor fully down — it is running but delivering degraded correctness, throughput, or latency. Unlike binary failure (crash, unavailable, unreachable), partial failure cannot be detected with a boolean health check and often propagates upward in surprising ways.

"In complex systems, failure isn't a binary outcome. Cloud native systems are built without single paths of failure, but partial failure can still result in degraded performance, loss of user-facing availability, and undefined behavior. Often, minor failure in one part of the stack appears as a full failure in others." (Source: sources/2025-03-18-planetscale-the-real-failure-rate-of-ebs.)

Canonical upward-propagation example

From the PlanetScale post:

If a single instance inside of a multi-node distributed caching system runs out of networking resources, the downstream application will interpret error cases as cache misses to avoid failing the request. This will overwhelm the database when the application floods it with queries to fetch data as though it was missing. In this, a partial failure at one level results in a full failure of the database tier, causing downtime.

One cache node is mostly working: the process is up, but requests to it error out. The fallback path (treat the error as a miss) is correct in isolation, yet the composition produces a full cache bypass → thundering herd → DB overload. No individual component failed outright on its own.

Subtypes on this wiki

Design implications

  • Boolean health checks are insufficient. "Is the process running?" does not surface a 200–500 ms/op latency spike. Latency + saturation + error-rate monitoring is required.
  • Graceful-degradation logic composes badly. Every component's fallback is correct locally, but the aggregate behaviour may amplify the partial failure (see cache-miss example above). See concepts/graceful-degradation.
  • Clamp the impact window rather than eliminate failure. The dominant customer-facing reliability investment becomes "how fast can we detect and react?", not "can we make this never fail?". See patterns/automated-volume-health-monitoring + patterns/zero-downtime-reparent-on-degradation.
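The first bullet can be made concrete. A minimal sketch of a multi-signal health verdict, assuming invented threshold values (the field names, thresholds, and `NodeStats` type are all illustrative, not from the source):

```python
from dataclasses import dataclass

@dataclass
class NodeStats:
    process_up: bool
    p99_latency_ms: float
    error_rate: float   # fraction of requests failing, 0.0-1.0
    saturation: float   # e.g. connection-pool or IO-queue utilisation, 0.0-1.0

def is_healthy(s: NodeStats,
               max_p99_ms: float = 50.0,     # hypothetical SLO thresholds
               max_error_rate: float = 0.01,
               max_saturation: float = 0.8) -> bool:
    """A boolean liveness probe would return s.process_up and stop there;
    this verdict also requires latency, errors, and saturation in bounds."""
    return (s.process_up
            and s.p99_latency_ms <= max_p99_ms
            and s.error_rate <= max_error_rate
            and s.saturation <= max_saturation)

# A partially failed node: alive, no errors, but 350 ms/op and nearly saturated.
degraded = NodeStats(process_up=True, p99_latency_ms=350.0,
                     error_rate=0.0, saturation=0.95)
print(is_healthy(degraded))  # → False: the latency/saturation signals catch it
```

The design point is that the verdict is a conjunction over several signals, so a node that passes the liveness check can still be marked unhealthy and taken out of rotation.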

Seen in

  • sources/2025-03-18-planetscale-the-real-failure-rate-of-ebs — canonical wiki instance. Partial failure of an EBS volume (200–500 ms latency spike, idle % → 0) appears as full failure to the application and user. The cache-node cache-miss thundering-herd example is the load-bearing upward-propagation datum.