
CONCEPT Cited by 2 sources

Static stability

Definition

Static stability is the reliability principle that a system should continue operating with the last known good state when something fails, and should be overprovisioned so a failing part's work can be absorbed by its copies without a capacity event.

Max Englander's canonical framing (sources/2026-04-21-planetscale-the-principles-of-extreme-fault-tolerance):

"When something fails, continue operating with the last known good state. Overprovision so a failing part's work can be absorbed by its copies."

The name comes from control theory / aerodynamics: a statically stable system returns to its equilibrium after a disturbance without requiring active control input. In software, this translates to:

  1. Degrade to a known-working state rather than recompute. When the thing that would give you a fresher answer is broken, serve the last answer you knew was good.
  2. Absorb failure without fetching new resources. If the headroom to survive a failure has to be provisioned at failure time, you're depending on the provisioning-service being healthy during the exact event your system is trying to survive.

Two mechanisms

Last known good state

When a live dependency (control plane, config service, discovery service, metadata store) fails, the local component continues with whatever state it had cached at the last successful fetch. This is the pattern behind:

  • Control plane / data plane separation — data planes cache last-known control-plane state so they survive control-plane outages. Named explicitly by Englander on PlanetScale's architecture.
  • Query buffering during failover — in-flight queries are held at the proxy layer during a topology change and released against the new primary, rather than failing with "connection closed". The client's "last known good state" is "my query is accepted".
  • Sidecar-cached feature flags — sidecars cache the flag snapshot so evaluation works during flag-service outages.
  • DNS TTL — resolvers answer from cache when authoritative servers are unreachable, up to TTL.
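The pattern shared by all four examples can be sketched as a small wrapper around a flaky dependency. This is a minimal illustration, not any of the systems above; `LastKnownGood` and `flaky_config` are hypothetical names:

```python
import time

class LastKnownGood:
    """Serve the freshest successfully fetched value; when the dependency
    is down, continue with the last known good state instead of failing."""

    def __init__(self, fetch):
        self._fetch = fetch       # callable that may raise on dependency failure
        self._value = None        # last known good state
        self._fetched_at = None   # when the last successful fetch happened

    def get(self):
        try:
            self._value = self._fetch()   # try for a fresher answer
            self._fetched_at = time.time()
        except Exception:
            if self._value is None:
                raise                     # never had a good state; nothing to degrade to
            # Dependency is down: keep operating on the cached snapshot.
        return self._value

# Hypothetical flaky dependency: first call succeeds, later calls fail.
calls = {"n": 0}
def flaky_config():
    calls["n"] += 1
    if calls["n"] > 1:
        raise ConnectionError("config service down")
    return {"flag": True}

cfg = LastKnownGood(flaky_config)
fresh = cfg.get()   # dependency healthy: fresh fetch
stale = cfg.get()   # dependency down: served from last known good state
```

Note the asymmetry: an outage that starts before the first successful fetch still fails, which is why real systems persist the snapshot (sidecar files, resolver caches) rather than holding it only in memory.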

Overprovisioning for absorption

When a part fails, its work is picked up by the remaining copies without waiting for new capacity to provision. Requires sizing the fleet so N-1 (or N-k) parts can handle the full load. Named by Englander verbatim: "Overprovision so a failing part's work can be absorbed by its copies."

Worked applications:

  • Multi-AZ database clusters (primary + ≥2 replicas across 3 AZs). The AWS re:Invent 2018 ARC336 Well-Architected talk canonicalised the N+1 AZ sizing argument; PlanetScale's minimum-2-replicas-across-3-AZs rule operationalises it.
  • Pre-warmed connection pools — size the pool larger than steady-state so a connection-shortage can't materialise into failed queries.
  • Reserved instance capacity — keep warm headroom so a traffic spike or primary-failover is not a cold-start event.
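The N+k sizing rule behind these examples is simple arithmetic: provision enough copies that the survivors alone cover peak load. A minimal sketch (`min_fleet_size` is a hypothetical helper, not from any source above):

```python
import math

def min_fleet_size(peak_load, per_node_capacity, tolerated_failures=1):
    """Smallest fleet N such that N - tolerated_failures surviving
    copies can absorb the full peak load without new provisioning."""
    surviving = math.ceil(peak_load / per_node_capacity)  # copies needed after failure
    return surviving + tolerated_failures                 # headroom bought up front

# e.g. 30k qps peak, 10k qps per replica, tolerate losing one AZ's replica:
n = min_fleet_size(30_000, 10_000, tolerated_failures=1)  # -> 4
```

The point of computing this at design time is exactly the one L-K-G makes for state: the headroom exists before the failure, so absorbing it never depends on a provisioning service being healthy mid-incident.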

Not to be confused with

  • Static routing / static allocation. Static stability is about degradation behaviour under failure, not about whether the topology is fixed at configuration time. A dynamically-routed system can be statically stable; a statically-routed system can be statically unstable.
  • Graceful degradation. Static stability is a specific class of graceful degradation — degrade by continuing with last-known-good rather than by shedding traffic or falling back to a cheaper model. Netflix's caches-and-defaults graceful degradation is static stability; shedding low-priority traffic at a database throttler is graceful degradation but not static stability.
  • Failover. A failover is a state-transition event that tests static stability — if the system can make the transition without losing the last known good state (via query buffering, replica promotion onto pre-provisioned headroom), it's statically stable through the transition.
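Query buffering through a failover can be sketched in a few lines. This is a hypothetical toy, not PlanetScale's actual proxy: while no primary is attached, accepted queries are held rather than refused, and promoting a new primary releases the backlog in arrival order:

```python
class BufferingProxy:
    """Hold accepted queries while the primary is absent (failover in
    progress); replay the backlog against the newly promoted primary."""

    def __init__(self):
        self._primary = None   # callable that executes a query; None mid-failover
        self._buffer = []      # queries accepted while no primary exists

    def execute(self, query):
        if self._primary is None:
            self._buffer.append(query)   # hold: the client's query stays "accepted"
            return "buffered"
        return self._primary(query)

    def promote(self, new_primary):
        """Topology change complete: attach the new primary, replay backlog."""
        self._primary = new_primary
        replayed = [new_primary(q) for q in self._buffer]
        self._buffer.clear()
        return replayed

proxy = BufferingProxy()
first = proxy.execute("SELECT 1")              # mid-failover: held, not failed
replayed = proxy.promote(lambda q: f"ok:{q}")  # new primary: backlog released
```

A real proxy also needs a buffering deadline and backpressure on the buffer; unbounded holding turns a short failover into a memory event.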

Seen in

  • sources/2025-06-20-redpanda-behind-the-scenes-redpanda-clouds-response-to-the-gcp-outage — second canonical wiki instance of static stability, instantiated at the storage-capacity axis via deliberate disk reserve: "as a reliability measure, we leave disk space unused and used-but-reclaimable (for caching), which we can reclaim if the situation warrants it." During the 2025-06-12 GCP outage the reserve absorbed the flush backlog without write-path impact — the reserve was the statically-stable buffer that let the broker continue writing on last-known-good storage state while the tiered-storage layer had elevated errors. Complements PlanetScale's compute-layer framing with storage-layer realisation.

  • PlanetScale, Max Englander, The principles of extreme fault tolerance, 2025-07-03 — canonical verbatim framing as the third of three principles (alongside isolation and redundancy). Worked applications on PlanetScale: query buffering during weekly failover, pre-provisioned replica capacity (primary + ≥2 replicas) so failovers don't need provisioning-service health, control-plane dependency isolation so data-plane continues on last-known control-plane state.
