
CONCEPT

Graceful degradation

Graceful degradation is the design property that when a subsystem or dependency fails, the service as a whole continues to operate in a reduced-but-useful mode, rather than propagating the failure upward into a user-visible outage. It is the architectural prerequisite that makes chaos engineering safe.

Definition

"We can use techniques like graceful degradation on dependency failures, as well as node-, rack-, datacenter-/availability-zone-, and even regionally-redundant deployments. But just designing a fault tolerant architecture is not enough. We have to constantly test our ability to actually survive these 'once in a blue moon' failures." — Netflix, The Netflix Simian Army (2011).

Graceful degradation is what makes the difference between "one instance / AZ / dependency failed" and "the service is down." A gracefully-degrading service substitutes reduced capability for unavailability when a failure happens.

Examples of graceful degradation

  • Serve cached content when the backing DB is unreachable. The cache is stale, but the user gets a page instead of an error.
  • Return a default recommendation when the personalisation service times out. Generic recommendations are worse than personalised ones; no recommendations are worse than generic ones.
  • Skip non-essential decorations when under load. Core functionality stays fast; bells and whistles yield.
  • Fall back to a secondary region when the primary is degraded. See patterns/runtime-backend-swap-on-failure.
  • Return partial search results rather than waiting for all shards. See patterns/sticky-session-scatter-gather + patterns/slo-aware-early-response.
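The first example above, serving stale cache on a database failure, can be sketched as a small wrapper. This is an illustrative sketch, not code from the source; `primary` stands in for any backing store with a `get(key)` method that may raise on failure.

```python
import time

class CachedFallbackStore:
    """Serve a stale cached copy when the primary store is unreachable.

    Hypothetical names for illustration: `primary` is any object whose
    get(key) may raise when the backing DB is down.
    """

    def __init__(self, primary, ttl_seconds=60):
        self.primary = primary
        self.ttl = ttl_seconds
        self._cache = {}  # key -> (value, stored_at)

    def get(self, key):
        try:
            value = self.primary.get(key)
            self._cache[key] = (value, time.time())
            return value, "fresh"
        except Exception:
            # Primary failed: degrade to the cached copy, even past its TTL.
            # The user gets a stale page instead of an error.
            if key in self._cache:
                value, _stored_at = self._cache[key]
                return value, "stale"
            raise  # nothing cached — this caller cannot degrade
```

Note the last line: graceful degradation needs something to degrade *to*; a cold cache offers no fallback, which is why the preconditions below matter.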

Preconditions

Graceful degradation requires each caller to have:

  • A fallback plan — a cached response, a default, a secondary provider, or a safely-incomplete response.
  • A timeout — without a timeout, a slow dependency propagates its latency into the caller's response time.
  • A circuit breaker — without one, a failed dependency gets hammered with retries, amplifying the load at the moment it is least able to absorb it.
  • A decision: fail-open or fail-closed? See concepts/fail-open-vs-fail-closed. A permissive cache is fail-open (stale content is acceptable); a billing check is fail-closed (approving a transaction without verification is not acceptable).
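The four preconditions compose into one call path. A minimal sketch, assuming a dependency callable that accepts a `timeout` keyword (in practice the timeout must be enforced by the transport, e.g. an HTTP client's request timeout) and a fail-open decision, i.e. the fallback is acceptable:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after `threshold` consecutive
    failures, allows a half-open trial call after `reset_seconds`."""

    def __init__(self, threshold=3, reset_seconds=30.0):
        self.threshold = threshold
        self.reset_seconds = reset_seconds
        self.failures = 0
        self.opened_at = None  # None means closed

    def allow(self):
        if self.opened_at is None:
            return True
        # Half-open: permit one trial call once the reset window elapses.
        return time.time() - self.opened_at >= self.reset_seconds

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.time()


def call_with_degradation(dependency, fallback, breaker, timeout=0.5):
    """Fallback + timeout + circuit breaker + a fail-open decision."""
    if not breaker.allow():
        return fallback()  # circuit open: degrade without touching the dependency
    try:
        value = dependency(timeout=timeout)  # timeout bounds propagated latency
        breaker.record_success()
        return value
    except Exception:
        breaker.record_failure()
        return fallback()  # fail-open: reduced capability, not unavailability
```

The breaker is what stops the retry hammering described above: once open, callers go straight to the fallback and the failed dependency gets a quiet window to recover.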

Relationship to chaos engineering

Chaos engineering is the drill that validates graceful degradation in practice. The relationship is reciprocal:

  • Graceful degradation is the prerequisite for safe chaos engineering. Running a Chaos Monkey on a fleet without graceful degradation will cause outages, not learning.
  • Chaos engineering is the validator of graceful degradation. Designing for graceful degradation without testing it means the fallback paths atrophy (libraries change, configs drift, new services don't implement the timeout) and the first real failure exposes the rot.
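A chaos-style check of a fallback path can be as small as a test that injects the failure and asserts the degraded behaviour. A sketch using the recommendation example from above; `render_home_page` and `GENERIC_DEFAULTS` are hypothetical names, not from the source:

```python
GENERIC_DEFAULTS = ["trending-1", "trending-2"]

def render_home_page(recommender):
    """Degrading caller: substitute generic defaults on any failure."""
    try:
        recs = recommender(timeout=0.2)
    except Exception:
        recs = GENERIC_DEFAULTS  # fail-open to the generic list
    return {"status": 200, "recommendations": recs}

def test_survives_recommendation_outage():
    """Chaos drill: force the dependency to fail, assert the page
    still renders in its reduced-but-useful mode."""
    def failing_recommender(timeout=None):
        raise ConnectionError("injected fault")

    page = render_home_page(failing_recommender)
    assert page["status"] == 200
    assert page["recommendations"] == GENERIC_DEFAULTS
```

Running such a check continuously, rather than once at design time, is exactly what keeps the fallback path from atrophying as libraries and configs drift.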

Relationship to redundancy

Graceful degradation complements redundancy; it does not replace it. Redundancy (node-, rack-, AZ-, region-level) removes single points of failure; graceful degradation handles the remaining failure modes that can't be redundancy-engineered away (dependency slowness, partial data, regional outages that exceed the redundancy budget).

The Netflix 2011 framing lists them together: "graceful degradation on dependency failures, as well as node-, rack-, datacenter-/availability-zone-, and even regionally-redundant deployments."

Seen in

  • sources/2026-01-02-netflix-the-netflix-simian-army — named as the load-bearing technique the Simian Army exercises.
  • Widely referenced across the wiki as an architectural precondition for any system that runs in a cloud environment where real failures occur.