
CONCEPT Cited by 2 sources

Static stability

Definition

Static stability is the reliability principle that a system should continue operating with the last known good state when something fails, and should be overprovisioned so a failing part's work can be absorbed by its copies without a capacity event.

Max Englander's canonical framing (sources/2026-04-21-planetscale-the-principles-of-extreme-fault-tolerance):

"When something fails, continue operating with the last known good state. Overprovision so a failing part's work can be absorbed by its copies."

The name comes from control theory / aerodynamics: a statically stable system returns to its equilibrium after a disturbance without requiring active control input. In software, this translates to:

  1. Degrade to a known-working state rather than recompute. When the thing that would give you a fresher answer is broken, serve the last answer you knew was good.
  2. Absorb failure without fetching new resources. If the headroom to survive a failure has to be provisioned at failure time, you're depending on the provisioning-service being healthy during the exact event your system is trying to survive.

Two mechanisms

Last known good state

When a live dependency (control plane, config service, discovery service, metadata store) fails, the local component continues with whatever state it had cached at the last successful fetch. This is the pattern behind:

  • Control plane / data plane separation — data planes cache last-known control-plane state so they survive control-plane outages. Named explicitly by Englander on PlanetScale's architecture.
  • Query buffering during failover — in-flight queries are held at the proxy layer during a topology change and released against the new primary, rather than failing with "connection closed". The client's "last known good state" is "my query is accepted".
  • Sidecar-cached feature flags — sidecars cache the flag snapshot so evaluation works during flag-service outages.
  • DNS TTL — resolvers answer from cache when authoritative servers are unreachable, up to TTL.
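The pattern shared by all four examples can be sketched as a small wrapper around a flaky dependency. This is a minimal illustration, not any of the systems above; `LastKnownGood` and `flaky_config` are hypothetical names:

```python
import time

class LastKnownGood:
    """Serve the freshest successfully fetched value; when the dependency
    is down, continue with the last known good state instead of failing."""

    def __init__(self, fetch):
        self._fetch = fetch       # callable that may raise on dependency failure
        self._value = None        # last known good state
        self._fetched_at = None   # when the last successful fetch happened

    def get(self):
        try:
            self._value = self._fetch()   # try for a fresher answer
            self._fetched_at = time.time()
        except Exception:
            if self._value is None:
                raise                     # never had a good state; nothing to degrade to
            # Dependency is down: keep operating on the cached snapshot.
        return self._value

# Hypothetical flaky dependency: first call succeeds, later calls fail.
calls = {"n": 0}
def flaky_config():
    calls["n"] += 1
    if calls["n"] > 1:
        raise ConnectionError("config service down")
    return {"flag": True}

cfg = LastKnownGood(flaky_config)
fresh = cfg.get()   # dependency healthy: fresh fetch
stale = cfg.get()   # dependency down: served from last known good state
```

Note the asymmetry: an outage that starts before the first successful fetch still fails, which is why real systems persist the snapshot (sidecar files, resolver caches) rather than holding it only in memory.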

Overprovisioning for absorption

When a part fails, its work is picked up by the remaining copies without waiting for new capacity to provision. Requires sizing the fleet so N-1 (or N-k) parts can handle the full load. Named by Englander verbatim: "Overprovision so a failing part's work can be absorbed by its copies."

Worked applications:

  • Multi-AZ database clusters (primary + ≥2 replicas across 3 AZs). The AWS re:Invent 2018 ARC336 Well-Architected talk canonicalised the N+1 AZ sizing argument; PlanetScale's minimum-2-replicas-across-3-AZs rule operationalises it.
  • Pre-warmed connection pools — size the pool larger than steady-state so a connection-shortage can't materialise into failed queries.
  • Reserved instance capacity — keep warm headroom so a traffic spike or primary-failover is not a cold-start event.
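The N+k sizing rule behind these examples is simple arithmetic: provision enough copies that the survivors alone cover peak load. A minimal sketch (`min_fleet_size` is a hypothetical helper, not from any source above):

```python
import math

def min_fleet_size(peak_load, per_node_capacity, tolerated_failures=1):
    """Smallest fleet N such that N - tolerated_failures surviving
    copies can absorb the full peak load without new provisioning."""
    surviving = math.ceil(peak_load / per_node_capacity)  # copies needed after failure
    return surviving + tolerated_failures                 # headroom bought up front

# e.g. 30k qps peak, 10k qps per replica, tolerate losing one AZ's replica:
n = min_fleet_size(30_000, 10_000, tolerated_failures=1)  # -> 4
```

The point of computing this at design time is exactly the one L-K-G makes for state: the headroom exists before the failure, so absorbing it never depends on a provisioning service being healthy mid-incident.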

Not to be confused with

  • Static routing / static allocation. Static stability is about degradation behaviour under failure, not about whether the topology is fixed at configuration time. A dynamically-routed system can be statically stable; a statically-routed system can be statically unstable.
  • Graceful degradation. Static stability is a specific class of graceful degradation — degrade by continuing with last-known-good rather than by shedding traffic or falling back to a cheaper model. Netflix's caches-and-defaults graceful degradation is static stability; shedding low-priority traffic at a database throttler is graceful degradation but not static stability.
  • Failover. A failover is a state-transition event that tests static stability — if the system can make the transition without losing the last known good state (via query buffering, replica promotion onto pre-provisioned headroom), it's statically stable through the transition.
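Query buffering through a failover can be sketched in a few lines. This is a hypothetical toy, not PlanetScale's actual proxy: while no primary is attached, accepted queries are held rather than refused, and promoting a new primary releases the backlog in arrival order:

```python
class BufferingProxy:
    """Hold accepted queries while the primary is absent (failover in
    progress); replay the backlog against the newly promoted primary."""

    def __init__(self):
        self._primary = None   # callable that executes a query; None mid-failover
        self._buffer = []      # queries accepted while no primary exists

    def execute(self, query):
        if self._primary is None:
            self._buffer.append(query)   # hold: the client's query stays "accepted"
            return "buffered"
        return self._primary(query)

    def promote(self, new_primary):
        """Topology change complete: attach the new primary, replay backlog."""
        self._primary = new_primary
        replayed = [new_primary(q) for q in self._buffer]
        self._buffer.clear()
        return replayed

proxy = BufferingProxy()
first = proxy.execute("SELECT 1")              # mid-failover: held, not failed
replayed = proxy.promote(lambda q: f"ok:{q}")  # new primary: backlog released
```

A real proxy also needs a buffering deadline and backpressure on the buffer; unbounded holding turns a short failover into a memory event.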

Seen in

  • sources/2025-06-20-redpanda-behind-the-scenes-redpanda-clouds-response-to-the-gcp-outage — second canonical wiki instance of static stability, instantiated at the storage-capacity axis via deliberate disk reserve: "as a reliability measure, we leave disk space unused and used-but-reclaimable (for caching), which we can reclaim if the situation warrants it." During the 2025-06-12 GCP outage the reserve absorbed the flush backlog without write-path impact — the reserve was the statically-stable buffer that let the broker continue writing on last-known-good storage state while the tiered-storage layer had elevated errors. Complements PlanetScale's compute-layer framing with storage-layer realisation.

  • PlanetScale, Max Englander, The principles of extreme fault tolerance, 2025-07-03 — canonical verbatim framing as the third of three principles (alongside isolation and redundancy). Worked applications on PlanetScale: query buffering during weekly failover, pre-provisioned replica capacity (primary + ≥2 replicas) so failovers don't need provisioning-service health, control-plane dependency isolation so data-plane continues on last-known control-plane state.
