PlanetScale — The principles of extreme fault tolerance¶
Summary¶
Max Englander's canonical principles essay on the fault-tolerance axis of PlanetScale — the companion manifesto to Shlomi Noach's 2022-05-09 operational-relational-schema-paradigm principles post. Where Noach canonicalised the schema-change paradigm, Englander canonicalises the reliability paradigm — three axiomatic principles (isolation, redundancy, static stability), the architecture that emerges from them (control plane / data plane split + multi-AZ database clusters), the processes that reinforce them (Always be failing over + synchronous replication + progressive delivery), and the specific failure-mode taxonomy the composition tolerates (dependency-service failure, VM/storage failure, AZ failure, region failure, Vitess-operator bug).
Canonical opening framing:
"Our fault tolerance is built on top of principles, processes, and architectures that are easy to understand, but require painstaking work to do well. … Our principles are neither new nor radical. You may find them obvious. Even so, they are foundational for our fault tolerance. Every capability we add, and every optimization we make, is either bound by or born from these principles."
Canonical architectural claim verbatim:
"Parts in the critical path have as few dependencies as possible."
And on static stability:
"When something fails, continue operating with the last known good state. Overprovision so a failing part's work can be absorbed by its copies."
This is a principles essay, not an architecture disclosure — mechanisms are named (MySQL semi-sync, query buffering, feature flags, Vitess, Kubernetes operator) but the post's contribution is the principles-processes-architecture taxonomy into which the mechanisms fit. Its wiki value is therefore as the foundational philosophy post on PlanetScale's reliability model — every subsequent Englander / Lambert / Noach fault-tolerance ingest references this framework implicitly.
Key takeaways¶
- Three principles (isolation, redundancy, static stability), verbatim:
"Isolation. Systems are made from parts that are as physically and logically independent as possible. Failures in one part do not cascade into failures in an independent part. Parts in the critical path have as few dependencies as possible. Redundancy. Each part is copied multiple times, so if one part fails, its copies continue doing its work. Copies of each part are themselves isolated from each other. Static stability. When something fails, continue operating with the last known good state. Overprovision so a failing part's work can be absorbed by its copies."
Canonicalised across three wiki entries: concepts/isolation-as-fault-tolerance-principle (new), redundancy (no dedicated page — content folded into the isolation and patterns/always-be-failing-over-drill pages), and concepts/static-stability (new).
- Control plane / data plane split as the architectural shape that emerges from the principles. Control plane = "Database creation, billing, etc.", multi-AZ redundant, "less critical than the data plane, and so has more dependencies" — including a PlanetScale database as its own metadata store. Data plane = "Stores database data and serves customer application queries", composed of a query routing layer (VTGate) and database clusters; "Does not depend on the control plane". The asymmetry of dependency is load-bearing: the data plane can survive total control-plane failure; the control plane cannot survive data-plane failure (it eats its own dog food). Englander's phrasing canonicalised verbatim:
"The most critical plane, with fewer dependencies than the control plane. Does not depend on the control plane."
- Database cluster = primary + minimum 2 replicas across 3 AZs, with automatic failover. Verbatim:
"Composed of a primary instance and a minimum of two replicas. Each instance is composed of a VM and storage residing in the data plane. Instances evenly distributed across three availability zones. Automatic failovers from primaries to healthy replicas in response to failures."
This is the concrete embodiment of patterns/multi-az-vitess-cluster — 3-AZ placement with Vitess Operator failover — with canonical minimum-2-replicas datum now attributed to the post. Optional read-only regions via Portals; Enterprise-tier optional promotion of read-only regions to primary.
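The placement and failover invariants above can be sketched as a toy model — `place`, `fail_over`, and the AZ names are assumptions for illustration; the real reconciliation is done by the Vitess Operator, not code like this:

```python
# Toy sketch: primary + two replicas spread evenly across three AZs,
# with promotion of a healthy replica when the primary's AZ fails.
import itertools

AZS = ["az-a", "az-b", "az-c"]

def place(n_instances=3):
    """Round-robin a primary and replicas across the three AZs."""
    azs = itertools.cycle(AZS)
    return [{"role": "primary" if i == 0 else "replica",
             "az": next(azs), "healthy": True}
            for i in range(n_instances)]

def fail_over(cluster, failed_az):
    """Mark instances in the failed AZ unhealthy; promote a healthy replica."""
    for inst in cluster:
        if inst["az"] == failed_az:
            inst["healthy"] = False
    primary = next(i for i in cluster if i["role"] == "primary")
    if not primary["healthy"]:
        primary["role"] = "replica"
        promoted = next(i for i in cluster
                        if i["healthy"] and i["role"] == "replica")
        promoted["role"] = "primary"
    return cluster

cluster = place()                  # primary in az-a, replicas in az-b / az-c
fail_over(cluster, "az-a")         # an az-a outage takes out the primary
new_primary = next(i for i in cluster if i["role"] == "primary")
# A replica in a surviving AZ now leads; exactly one primary remains.
```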
- Always be failing over as canonical reliability process. Verbatim:
"Very mature ability to fail over from a failing database primary to a healthy replica. Exercise this ability every week on every customer database as we ship changes. In the event of failing hardware or a network failure — fairly common in a big system running on the cloud — we automatically and aggressively fail over. Query buffering minimizes or eliminates disruption during failovers."
Canonical framing: turn failover into a well-worn code path by exercising it every week — canonicalised as the new patterns/always-be-failing-over-drill pattern. The "every week on every customer database" cadence is the load-bearing datum — not a dry-run drill on a test fleet, but the production shipping mechanism itself. Sibling of Netflix's Simian Army continuous fault injection at a different altitude (Netflix = random failure; PlanetScale = deliberate failover per shipping cycle).
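One plausible reading of "failover as the production shipping mechanism" is a replica-first upgrade loop: upgrade replicas, fail the primary over to an upgraded replica, then upgrade the demoted primary. A hedged sketch — `ship_change` and its steps are assumptions, not PlanetScale's disclosed procedure:

```python
# Toy sketch of shipping a change by failing over, database by database.
# Every ship exercises the failover path, keeping it a well-worn code path.

def ship_change(cluster, version):
    # 1. Upgrade replicas in place.
    for inst in cluster:
        if inst["role"] == "replica":
            inst["version"] = version
    # 2. Fail over: promote an upgraded replica, demote the old primary.
    old = next(i for i in cluster if i["role"] == "primary")
    new = next(i for i in cluster
               if i["role"] == "replica" and i["version"] == version)
    old["role"], new["role"] = "replica", "primary"
    # 3. Upgrade the demoted primary last.
    old["version"] = version
    return cluster

# Three customer databases, each a primary plus two replicas.
fleet = [[{"role": r, "version": "v1"}
          for r in ("primary", "replica", "replica")]
         for _ in range(3)]
for cluster in fleet:              # every database, every shipping cycle
    ship_change(cluster, "v2")
```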
- Semi-sync replication as failover-enabler, not just durability primitive. Verbatim:
"MySQL semi-sync replication, Postgres synchronous commits. Commits stored durably on at least one replica before primary sends acknowledgment to the client. Enables us to treat replicas as potential primaries, and fail over to them immediately as needed."
Canonical reframe of semi-sync at the architectural altitude: not just "durability survives primary loss" (covered on concepts/mysql-semi-sync-replication page), but "any replica can be promoted immediately without data-loss risk" — semi-sync is the substrate that makes the weekly patterns/always-be-failing-over-drill safe. Without semi-sync, every failover would be a data-loss roll-of-the-dice; with semi-sync, failover is a topology change.
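The semi-sync contract can be illustrated with a toy commit path — simplified: real MySQL semi-sync operates on the binlog and has a timeout fallback this sketch omits, and the class names here are invented:

```python
# Toy sketch of the semi-sync acknowledgment contract: the primary acks a
# commit to the client only after at least one replica has stored it durably.

class Replica:
    def __init__(self):
        self.log = []

    def persist(self, txn):
        self.log.append(txn)    # durable store, then ack to the primary
        return True

class Primary:
    def __init__(self, replicas, min_acks=1):
        self.replicas, self.min_acks = replicas, min_acks
        self.log = []

    def commit(self, txn):
        self.log.append(txn)
        acks = sum(1 for r in self.replicas if r.persist(txn))
        if acks < self.min_acks:
            raise RuntimeError("insufficient replica acks; cannot ack client")
        return "ok"             # client ack only after replica durability

r1, r2 = Replica(), Replica()
p = Primary([r1, r2])
p.commit("INSERT ...")
# Every acknowledged transaction is already on a replica, so promoting a
# replica is a pure topology change, not a data-loss gamble.
```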
- Progressive delivery — database by database, via feature flags. Verbatim:
"Data plane changes are shipped gradually to progressively critical environments. Database cluster config and binary changes are shipped database by database using feature flags. Release channels allow us to ship changes to dev branches first, and to wait a week or more before shipping those same changes to production branches. Minimizes the impact of our own mistakes on our customers."
Canonicalised as new concepts/progressive-delivery-per-database concept — per-database feature-flag gating as the fleet-rollout discipline for a database-as-a-service vendor. Distinguished from generic progressive-delivery (cohort-percentage, canary-deployment) by the per-tenant-cell granularity: blast radius of a bad rollout is capped at one customer database at a time, not a percentage of traffic. Dev-branch-first release channel + week-minimum soak amplifies the discipline.
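The per-database gating plus week-minimum soak can be sketched as a toy rollout gate — `Rollout`, the channel names, and the date logic are illustrative assumptions, since the post does not disclose PlanetScale's flag tooling:

```python
# Toy sketch of per-database progressive delivery with a dev-first
# release channel and a week-minimum soak before production branches.
from datetime import date, timedelta

SOAK = timedelta(days=7)

class Rollout:
    def __init__(self, shipped_to_dev):
        self.shipped_to_dev = shipped_to_dev
        self.enabled = set()          # databases the change is live on

    def advance(self, db, channel, today):
        if channel == "dev":
            self.enabled.add(db)      # dev branches get the change first
        elif today - self.shipped_to_dev >= SOAK:
            self.enabled.add(db)      # production only after the soak
        return db in self.enabled

r = Rollout(shipped_to_dev=date(2026, 4, 1))
dev_live = r.advance("tenant-1/dev", "dev", date(2026, 4, 1))
too_soon = r.advance("tenant-2/main", "production", date(2026, 4, 3))
soaked = r.advance("tenant-2/main", "production", date(2026, 4, 9))
```

Because the flag is keyed by database rather than by traffic percentage, a bad change is live on one customer database at a time, which is exactly the blast-radius cap the post claims.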
- Failure-mode taxonomy the composition tolerates. Canonical enumeration verbatim, each mode named with the architectural property that neutralises it:
- Dependency-service failure. Critical-path data plane has extremely few dependencies, so "a hypothetical failure in one of our cloud providers' Docker registry services might impact our ability to create new database instances, but will not impact existing instances' ability to serve queries or store data." Extends the already-canonical concepts/control-plane-impact-without-data-plane-impact concept with a concrete Docker-registry example.
- VM / storage failure. "If a block storage database instance has a failing VM, the elastic volume is detached from that VM and reattached to a new, healthy VM. If a PlanetScale Metal database instance has a failing VM, we surge a replacement instance with a new VM and local NVMe drive, and destroy the failing instance once its replacement is healthy." First canonical wiki disclosure of the block-storage vs Metal failure-mode divergence: EBS-backed instances use EBS volume remount; Metal instances use NVMe surge-then-destroy (since NVMe is physically attached to the failing VM).
- Availability-zone failure. Primary failover to replica in healthy AZ; query routing layer shifts traffic to healthy zones.
- Region failure. "If an entire region goes down, so do database clusters running in that region. However, database clusters running in other regions are unaffected. Enterprise customers have the ability to initiate a failover to one of their read-only regions." Canonical framing: regional fault tolerance is opt-in at the Enterprise tier via Portals read-only-to-primary promotion, not automatic.
- Software bug in Vitess or operator. "A bug in Vitess or the PlanetScale Kubernetes operator rarely impacts more than 1-2 customers, thanks to our extensive use of feature flags to roll out changes. A failure resulting from an infrastructure change, like a Kubernetes upgrade, can have a bigger impact, but very rarely does because of how rigorously we test and gradually we roll out." Canonical framing for per-database progressive delivery as the blast-radius-cap mechanism against the vendor's own bugs.
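The block-storage vs Metal divergence in the VM-failure mode reduces to a dispatch on where the data physically lives; a hedged sketch with invented helper names:

```python
# Toy sketch: an elastic volume outlives its VM and can be reattached,
# while local NVMe dies with the VM, so Metal surges a replacement
# (restored from a replica) and destroys the failed instance.

def recover_vm_failure(instance):
    if instance["storage"] == "block":
        # Data survives on the elastic volume: detach, reattach to a new VM.
        instance["vm"] = "vm-new"
        return "volume reattached"
    else:  # "nvme" (Metal): data is on the failing VM's local drives.
        replacement = {"storage": "nvme", "vm": "vm-new",
                       "restored_from": "replica"}
        # Destroy the failing instance only once its replacement is healthy.
        instance.clear()
        instance.update(replacement)
        return "surged replacement, destroyed original"

ebs = {"storage": "block", "vm": "vm-failed"}
metal = {"storage": "nvme", "vm": "vm-failed"}
ebs_outcome = recover_vm_failure(ebs)      # block storage keeps its volume
metal_outcome = recover_vm_failure(metal)  # Metal rebuilds from replication
```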
- Critical-path dependency minimisation as explicit design goal. The data plane deliberately depends on nothing except the customer's infrastructure and the storage layer. The control plane is allowed more dependencies (including a PlanetScale database for its own metadata) because it is acceptable for the control plane to be down. This inverts the usual intuition — "more-critical = more-redundant" — in favour of "more-critical = fewer-dependencies", because redundancy only helps for failure modes where the redundant copies aren't also taking a dependency.
- Static stability specific application: query buffering during failover. Failover itself is a cutover event, and query buffering at the VTGate layer lets in-flight queries survive the topology change. This is the static-stability principle applied to an in-progress operation: the client's "last known good state" is "my query was accepted and will complete"; the router preserves that by holding the query across the failover boundary rather than returning an error.
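The hold-and-drain shape of query buffering can be sketched as a toy router — not VTGate's actual mechanism, and the names are illustrative:

```python
# Toy sketch of query buffering as static stability: hold in-flight
# queries across the failover window instead of erroring, then drain
# them in order against the new primary.
from collections import deque

class BufferingRouter:
    def __init__(self, primary):
        self.primary = primary
        self.failing_over = False
        self.buffer = deque()

    def execute(self, query):
        if self.failing_over:
            self.buffer.append(query)    # hold rather than return an error
            return "buffered"
        return f"{self.primary}: {query}"

    def complete_failover(self, new_primary):
        self.primary, self.failing_over = new_primary, False
        return [f"{self.primary}: {q}" for q in self.buffer]  # drain in order

router = BufferingRouter("primary-a")
router.failing_over = True               # failover begins
held = router.execute("SELECT 1")        # client sees no error
drained = router.complete_failover("primary-b")
```

From the client's perspective the "last known good state" (the query was accepted and will complete) is preserved: the query completes against the new primary without the failover ever surfacing as an error.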
Systems¶
- systems/planetscale — the umbrella product this post is about.
- PlanetScale Metal — local-NVMe variant with its own VM/storage failure-mode (surge-then-destroy rather than EBS remount).
- PlanetScale Portals — read-only regions + optional Enterprise-tier promotion to primary, the substrate for regional fault tolerance.
- systems/vitess — the sharding / proxy / failover substrate.
- Vitess Operator — the Kubernetes operator that executes automatic failover + reconciles topology.
- systems/mysql — semi-sync replication mode cited as durability substrate.
- systems/postgresql — "Postgres synchronous commits" cited as the Postgres equivalent.
- systems/aws-ebs — EBS volume remount as VM-failure recovery mechanism for block-storage instances.
- systems/kubernetes — the substrate under the Vitess Operator.
Concepts¶
- concepts/isolation-as-fault-tolerance-principle — new. The first of the three principles: physical + logical independence, no cascading failures, dependency minimisation.
- concepts/static-stability — new. The third of the three principles: continue with last-known-good state; overprovision so a failing part's work is absorbed by its copies.
- concepts/always-be-failing-over — new. The reliability-process principle that failover should be exercised routinely, not kept as an emergency-only path.
- concepts/progressive-delivery-per-database — new. The fleet-rollout discipline — feature-flag gating per customer database, with dev-branch-first release channels.
- concepts/control-plane-data-plane-separation — extended. Englander's verbatim phrasing of the asymmetry-of-dependency framing added to the Seen-in.
- concepts/mysql-semi-sync-replication — extended. Semi-sync as failover-enabler framing (any replica promotable immediately) added to the Seen-in.
- concepts/query-buffering-cutover — extended. Query-buffering-as-static-stability-for-in-flight-operations framing added to the Seen-in.
- concepts/feature-flag — extended. Feature-flag-as-per-tenant-blast-radius-cap framing added to the Seen-in.
- concepts/blast-radius — extended. Per-database progressive delivery canonicalised as a blast-radius-cap mechanism.
- concepts/control-plane-impact-without-data-plane-impact — extended. Docker-registry failure as worked example.
Patterns¶
- patterns/always-be-failing-over-drill — new. Turn failover into a well-tested path by exercising it on every customer database each weekly shipping cycle.
- patterns/multi-az-vitess-cluster — extended. Minimum-2-replicas datum + automatic-failover framing added.
- patterns/cross-dc-semi-sync-for-durability — extended. Cross-referenced as the durability substrate that makes patterns/always-be-failing-over-drill safe.
- patterns/direct-attached-nvme-with-replication — extended. Metal-specific surge-then-destroy VM-failure recovery (since NVMe is physically attached) vs EBS remount.
- patterns/shared-nothing-storage-topology — extended. Principles-layer framing — isolation applied at the storage substrate.
Operational numbers¶
- Minimum 2 replicas per cluster. (Verbatim: "a primary instance and a minimum of two replicas".)
- 3 availability zones minimum. ("Instances evenly distributed across three availability zones".)
- Weekly failover cadence on every customer database. ("Exercise this ability every week on every customer database as we ship changes".)
- Week-minimum soak period between dev-branch and production-branch rollouts. ("wait a week or more before shipping those same changes to production branches".)
- 1–2 customers as typical blast radius of a Vitess / operator bug, per the feature-flag-per-database rollout discipline.
Caveats¶
- Principles essay, not architecture disclosure. The post names mechanisms (semi-sync, query buffering, feature flags, Vitess Operator, EBS remount) but doesn't walk any of them in depth — the canonical mechanism disclosures live on sibling posts (Noach's semi-sync post, Morrison II's replication post, Dicken's Metal posts, Sougoumarane's consensus series).
- No production numbers beyond the five bulleted above. No MTTR, no availability measurements, no customer-retention data, no incident retrospective with worked example.
- No quantification of "rarely". "A bug in Vitess or the PlanetScale Kubernetes operator rarely impacts more than 1-2 customers" and "very rarely does" for infrastructure changes — no incidence rates, no MTBF.
- Semi-sync timeout behaviour elided. The post frames semi-sync as the substrate for always-be-failing-over but doesn't discuss timeout fallback (concepts/semi-sync-timeout-fallback) — Noach's 2026-04 semi-sync post covers that gap.
- Progressive-delivery mechanism hand-waved. "feature flags" named without naming the feature-flag system, the rollout-decision machinery, or the telemetry that gates progression. concepts/feature-flag page covers the general primitive; PlanetScale-specific rollout tooling is not disclosed.
- Regional fault tolerance is opt-in. "Enterprise customers have the ability to initiate a failover to one of their read-only regions" — not automatic, not default, not available to lower tiers. The post doesn't explain the operator-initiate-vs-automatic decision (presumably to avoid split-brain risk when the two regions can't be mutually fenced).
- Control plane can eat its own dog food safely. The post names that "the control plane uses a PlanetScale database to store customer and database metadata" — a circular dependency that is only safe because the data plane is designed to survive control-plane failure. The post doesn't walk through what happens at control-plane boot if the metadata-storing database is in a failure state; that case is presumably handled by the same data-plane-survives-control-plane principle recursively.
- Docker-registry example is hypothetical. The post frames "a hypothetical failure in one of our cloud providers' Docker registry services" — not a narrated retrospective. (The AWS us-east-1 2025-10-20 incident retrospective on PlanetScale's blog is the narrated companion piece; see sources/2025-11-03-planetscale-aws-us-east-1-incident-2025-10-20.)
- No cost framing. Overprovisioning + 3-AZ + 2-replica-minimum + weekly-failover-cadence have real cost implications that the principles essay doesn't quantify — the trade-off between reliability and cost is assumed-away.
- Max Englander's first wiki ingest — new byline for PlanetScale on the wiki (joining Dicken / Lambert / Noach / Barnett / Francis / Coutermarsh / Morrison II / Reyes / Van Wiggeren / Martí / Lord / Gangal / Sigireddi / Hazen / Gupta / Guevara / Raju / Taylor / Sougoumarane / Berquist / van Dijk / Longoria / Stojan / Lien / Gage / Gomez / Griggs / Ekechukwu / Kukic / Murty / Robenolt). No prior Englander posts on the wiki to cross-reference.
Source¶
- Original: https://planetscale.com/blog/the-principles-of-extreme-fault-tolerance
- Raw markdown:
raw/planetscale/2026-04-21-the-principles-of-extreme-fault-tolerance-a9a2f30f.md
Related¶
- systems/planetscale — the product this post is about
- systems/planetscale-metal — local-NVMe variant with its own failure-mode recovery
- systems/planetscale-portals — read-only regions + regional fault tolerance
- systems/vitess-operator — the operator executing automatic failover
- concepts/control-plane-data-plane-separation — the architectural shape emerging from the principles
- concepts/static-stability — principle #3, canonicalised by this post
- concepts/isolation-as-fault-tolerance-principle — principle #1, canonicalised by this post
- concepts/always-be-failing-over — the reliability-process canonicalisation
- concepts/progressive-delivery-per-database — the fleet-rollout discipline
- patterns/always-be-failing-over-drill — the new pattern this post canonicalises
- patterns/multi-az-vitess-cluster — the concrete architectural embodiment
- patterns/cross-dc-semi-sync-for-durability — the durability substrate that makes weekly failovers safe
- sources/2026-04-21-planetscale-the-operational-relational-schema-paradigm — Noach's parallel principles essay on the schema-change axis
- sources/2026-04-21-planetscale-mysql-semi-sync-replication-durability-consistency-and-split-brains — Noach's semi-sync mechanism deep-dive
- sources/2026-04-21-planetscale-mysql-replication-best-practices-and-considerations — Morrison II's replication-topology field manual
- sources/2025-11-03-planetscale-aws-us-east-1-incident-2025-10-20 — narrated incident-retrospective companion piece
- companies/planetscale