CONCEPT Cited by 3 sources
Graceful degradation¶
Graceful degradation is the design property that when a subsystem or dependency fails, the service as a whole continues to operate in a reduced-but-useful mode, rather than propagating the failure upward into a user-visible outage. It is the architectural prerequisite that makes chaos engineering safe.
Definition¶
"We can use techniques like graceful degradation on dependency failures, as well as node-, rack-, datacenter-/availability-zone-, and even regionally-redundant deployments. But just designing a fault tolerant architecture is not enough. We have to constantly test our ability to actually survive these 'once in a blue moon' failures." — Netflix, The Netflix Simian Army (2011).
Graceful degradation is what makes the difference between "one instance / AZ / dependency failed" and "the service is down." A gracefully-degrading service substitutes reduced capability for unavailability when a failure happens.
Examples of graceful degradation¶
- Serve cached content when the backing DB is unreachable. The cache is stale, but the user gets a page instead of an error.
- Return a default recommendation when the personalisation service times out. Generic recommendations are worse than personalised; no recommendations are worse than generic.
- Skip non-essential decorations when under load. Core functionality stays fast; bells and whistles yield.
- Fall back to a secondary region when the primary is degraded. See patterns/runtime-backend-swap-on-failure.
- Return partial search results rather than waiting for all shards. See patterns/sticky-session-scatter-gather + patterns/slo-aware-early-response.
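The first two examples above can be sketched in a few lines. This is a minimal illustration rather than any particular service's implementation: the names `StaleCache`, `render_page`, `recommendations`, and `DEFAULT_RECOMMENDATIONS` are hypothetical, and the dependency is assumed to signal failure by raising `ConnectionError` or `TimeoutError`.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical generic list served when personalisation is unavailable.
DEFAULT_RECOMMENDATIONS: List[str] = ["top-seller-1", "top-seller-2"]


class StaleCache:
    """Keeps the last good value per key so it can be served when the source fails."""

    def __init__(self) -> None:
        self._store: Dict[str, Tuple[float, str]] = {}

    def put(self, key: str, value: str) -> None:
        self._store[key] = (time.time(), value)

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        return entry[1] if entry else None


def render_page(
    key: str, load_from_db: Callable[[str], str], cache: StaleCache
) -> Tuple[str, bool]:
    """Return (body, degraded). Prefer fresh data; fall back to the stale copy."""
    try:
        body = load_from_db(key)
        cache.put(key, body)
        return body, False
    except ConnectionError:
        stale = cache.get(key)
        if stale is not None:
            return stale, True  # stale page instead of an error
        raise  # nothing cached: the failure has to surface


def recommendations(
    user_id: str, personalised: Callable[[str], List[str]]
) -> List[str]:
    """Generic recommendations are worse than personalised, but better than none."""
    try:
        return personalised(user_id)
    except TimeoutError:
        return DEFAULT_RECOMMENDATIONS
```

Returning a `degraded` flag alongside the body lets the caller log or measure how often the fallback path is taken, which matters when deciding whether a degraded response should count as a success.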
Preconditions¶
Graceful degradation requires each caller to have:
- A fallback plan — either a cached response, a default, a secondary provider, or a safely-incomplete response.
- A timeout — without a timeout, a slow dependency propagates its latency into the caller's response time.
- A circuit breaker — without one, a failed dependency gets hammered with retries, amplifying the load at the moment it is least able to absorb it.
- A decision: fail-open or fail-closed? See concepts/fail-open-vs-fail-closed. A permissive cache is fail-open (stale content is acceptable); a billing check is fail-closed (approving a transaction without verification is not acceptable).
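The four preconditions can be combined in one wrapper around a dependency call. The sketch below is an assumption-laden illustration, not a reference implementation: `CircuitBreaker`, `DependencyUnavailable`, and `call_with_degradation` are hypothetical names, the breaker counts consecutive failures (real libraries often use rolling windows), and the dependency call is assumed to honour the timeout it is given by raising `TimeoutError`.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures; permits trial calls
    again once `reset_after` seconds have passed."""

    def __init__(
        self,
        max_failures: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self._opened_at is None:
            return True
        # Half-open: let calls through once the cool-down has elapsed.
        return self._clock() - self._opened_at >= self.reset_after

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.max_failures:
            self._opened_at = self._clock()  # (re)open and restart the cool-down


class DependencyUnavailable(Exception):
    """Raised when the dependency failed and policy forbids a fallback."""


def call_with_degradation(
    call: Callable[[float], T],
    breaker: CircuitBreaker,
    timeout_s: float,
    fallback: Optional[Callable[[], T]] = None,
    fail_open: bool = True,
) -> T:
    if breaker.allow():
        try:
            result = call(timeout_s)  # the call is expected to enforce the timeout
            breaker.record_success()
            return result
        except (TimeoutError, ConnectionError):
            breaker.record_failure()
    # Breaker open, or the call just failed: apply the fail-open/fail-closed decision.
    if fail_open and fallback is not None:
        return fallback()
    raise DependencyUnavailable("dependency failed and no fallback is permitted")
```

A stale-tolerant cache read would pass `fail_open=True` with a fallback; a billing check would pass `fail_open=False` so that a failed verification is surfaced rather than silently approved.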
Relationship to chaos engineering¶
Chaos engineering is the drill that validates graceful degradation in practice. The relationship is reciprocal:
- Graceful degradation is the prerequisite for safe chaos engineering. Running a Chaos Monkey on a fleet without graceful degradation will cause outages, not learning.
- Chaos engineering is the validator of graceful degradation. Designing for graceful degradation without testing it means the fallback paths atrophy (libraries change, configs drift, new services don't implement the timeout) and the first real failure exposes the rot.
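A small fault-injection drill shows the validation loop in miniature. This is a toy sketch under stated assumptions, not a description of how the Simian Army works: `inject_faults`, `handle_request`, and `run_drill` are hypothetical names, and faults are simulated in-process by raising `ConnectionError` at a configurable rate.

```python
import random
from typing import Callable, Dict, List, Union

Response = Dict[str, Union[List[str], bool]]


def inject_faults(
    dependency: Callable[[str], List[str]],
    failure_rate: float,
    rng: random.Random,
) -> Callable[[str], List[str]]:
    """Wrap a dependency so that a fraction of calls raise ConnectionError."""

    def wrapped(user_id: str) -> List[str]:
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault")
        return dependency(user_id)

    return wrapped


def handle_request(user_id: str, fetch_recs: Callable[[str], List[str]]) -> Response:
    """Handler under test: falls back to generic recommendations on failure."""
    try:
        return {"recs": fetch_recs(user_id), "degraded": False}
    except ConnectionError:
        return {"recs": ["generic-1", "generic-2"], "degraded": True}


def run_drill(requests: int = 1000, failure_rate: float = 0.3, seed: int = 7) -> int:
    """Send traffic through the faulty dependency; return the degraded-response count."""
    rng = random.Random(seed)
    flaky = inject_faults(lambda uid: [f"for-{uid}"], failure_rate, rng)
    degraded = 0
    for i in range(requests):
        response = handle_request(f"user-{i}", flaky)
        # The drill's invariant: every request still gets a usable answer.
        assert response["recs"], "fallback path returned nothing"
        degraded += int(response["degraded"])
    return degraded
```

Running the drill regularly is what keeps the fallback path from atrophying: if a later change removes the `except` branch, the invariant inside `run_drill` fails long before a real outage exposes it.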
Relationship to redundancy¶
Graceful degradation complements redundancy; it does not replace it. Redundancy (node-, rack-, AZ-, region-level) removes single points of failure; graceful degradation handles the failure modes that redundancy cannot engineer away (dependency slowness, partial data, regional outages that exceed the redundancy budget).
The Netflix 2011 framing lists them together: "graceful degradation on dependency failures, as well as node-, rack-, datacenter-/availability-zone-, and even regionally-redundant deployments."
Seen in¶
- sources/2026-04-21-planetscale-graceful-degradation-in-postgres — Canonical database-tier instance. Ben Dicken reframes PlanetScale Traffic Control (previously wiki-canonicalised on the mixed-workload contention axis) as the graceful-degradation lever for user-facing database traffic. Three-tier critical / important / best-effort classification + per-tier resource budgets + live-disable of the best-effort tier under spike. Dicken's canonical phrasing: "what could have been a huge lost-opportunity (your app becomes unusable) is now only a temporary degradation of non-critical functionality. We've kept our users happy and avoided an application outage." Extends the existing Netflix-centric framing (graceful degradation as a "cache the response / return a default / fall back to secondary" property) with a capacity-management instance at the database tier: when the database itself is the scarce resource under spike, graceful degradation is implemented by cutting the lowest-priority query class at the infrastructure layer, not by swapping backends or returning cached fallbacks. Introduces the canonical patterns/shed-low-priority-under-load pattern, with Traffic Control as the mechanism.
- sources/2026-01-02-netflix-the-netflix-simian-army — named as the load-bearing technique the Simian Army exercises.
- sources/2022-04-27-zalando-operation-based-slos — canonicalises the SLI-measurement interaction. Zalando points out that graceful-degradation fallbacks break HTTP-status-based availability SLIs: the "first fallback" returns a "good enough" response (successful from the client's perspective, HTTP 200), but the "second fallback" is "no longer a response of acceptable quality" — and is conceptually a failure even though HTTP says 200. "Even though the response was successful from the client's perspective, we still count it as an error." The fix: transport-agnostic SLIs via OpenTracing's error tag, set by application code when it took the poor-quality fallback. Makes operation-based SLOs honest in the presence of graceful degradation — an SLI that only counts 5xx would give the system credit for a degraded response a user would consider broken.
- — canonicalises the operational enabler for graceful degradation at scale. Designed-in degradation hooks (feature flags, metrics-tier drop levers, catalog-source disable switches) are only useful if someone can pull them during an incident. Zalando's answer is incident playbooks + business-owner pre-approval: the trade-off is judged once at authoring time, so the responder can execute without cross-team real-time approval. 1,200+ playbooks across 100+ on-call teams; canonical worked examples include catalog-page source-disable ordering (outfits → sponsored → teasers, least-impact first per concepts/playbook-ordering-by-business-impact) and ZMON metrics-tier drop (40% TSDB load reduction, zero business impact, 2-min MTTR per patterns/drop-non-critical-metrics-under-tsdb-overload). The operational pattern generalises: designed-in degradation without pre-approval is shelfware; pre-approval without designed-in hooks is paperwork; together they are graceful degradation at fleet scale.
- Widely referenced across the wiki as an architectural precondition for any system that runs in a cloud environment with real failure.
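The database-tier entry above (three-tier classification, per-tier budgets, live-disable of the best-effort tier) can be modelled with a toy admission controller. This sketch does not reflect PlanetScale Traffic Control's actual API or configuration: the `Admission` class, the `Tier` enum, and the use of concurrent-query counts as the budget unit are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Set


class Tier(Enum):
    CRITICAL = "critical"
    IMPORTANT = "important"
    BEST_EFFORT = "best-effort"


@dataclass
class Admission:
    """Admits queries per tier, subject to a budget and a live-disable switch."""

    budgets: Dict[Tier, int]  # max concurrent queries per tier (assumed unit)
    disabled: Set[Tier] = field(default_factory=set)
    in_flight: Dict[Tier, int] = field(default_factory=dict)

    def try_admit(self, tier: Tier) -> bool:
        if tier in self.disabled:
            return False  # tier has been shed
        used = self.in_flight.get(tier, 0)
        if used >= self.budgets[tier]:
            return False  # tier is over its own budget
        self.in_flight[tier] = used + 1
        return True

    def release(self, tier: Tier) -> None:
        self.in_flight[tier] = max(0, self.in_flight.get(tier, 0) - 1)

    def disable(self, tier: Tier) -> None:
        """Shed a tier during a spike. Critical traffic is never shed."""
        if tier is Tier.CRITICAL:
            raise ValueError("critical tier cannot be disabled")
        self.disabled.add(tier)

    def enable(self, tier: Tier) -> None:
        self.disabled.discard(tier)
```

Because each tier has its own budget, a flood of best-effort queries cannot consume capacity reserved for critical ones, and disabling the best-effort tier during a spike rejects that class outright while critical admission is unaffected.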
Related¶
- concepts/chaos-engineering
- concepts/fail-open-vs-fail-closed
- concepts/backpressure
- concepts/blast-radius
- concepts/circular-dependency
- concepts/query-priority-classification
- concepts/warn-mode-vs-enforce-mode
- concepts/operation-based-slo — Zalando's CBO SLIs count the second-fallback case as a failure via OpenTracing's error tag.
- concepts/critical-business-operation
- systems/netflix-simian-army
- systems/netflix-latency-monkey
- systems/planetscale-traffic-control
- systems/opentracing — its error tag is the SLI-measurement primitive that makes graceful-degradation fallbacks count correctly in availability SLOs.
- patterns/partial-restart-fault-recovery
- patterns/runtime-backend-swap-on-failure
- patterns/shed-low-priority-under-load
- patterns/workload-class-resource-budget