
Isolation as fault-tolerance principle

Definition

Isolation as a reliability principle means systems are made from parts that are as physically and logically independent as possible, such that failures in one part do not cascade into failures in an independent part, and parts in the critical path have as few dependencies as possible.

Max Englander's canonical framing (sources/2026-04-21-planetscale-the-principles-of-extreme-fault-tolerance):

"Systems are made from parts that are as physically and logically independent as possible. Failures in one part do not cascade into failures in an independent part. Parts in the critical path have as few dependencies as possible."

And on redundancy, the principle Englander pairs with isolation:

"Each part is copied multiple times, so if one part fails, its copies continue doing its work. Copies of each part are themselves isolated from each other."

The second sentence is load-bearing: redundancy without isolation between copies is not fault tolerance — if the copies share a failure domain (same rack, same AZ, same power distribution, same library dependency, same deployment pipeline), they fail together.
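A toy sketch of that point (helper and domain names are hypothetical): copies only add fault tolerance when they sit in distinct failure domains.

```python
def survives(copy_domains, failed_domains):
    """A service survives if at least one copy sits outside every failed domain."""
    return any(d not in failed_domains for d in copy_domains)

# Three copies that all share one failure domain (same AZ, rack, pipeline, ...)
same_domain = ["az-1", "az-1", "az-1"]
# Three copies spread across isolated failure domains
spread = ["az-1", "az-2", "az-3"]

# A single domain failure takes out every copy that shares it.
print(survives(same_domain, {"az-1"}))  # False: redundancy without isolation
print(survives(spread, {"az-1"}))       # True: two isolated copies remain
```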

Three axes of isolation

Physical isolation

Copies placed across distinct physical failure domains — different machines, racks, AZs, regions, power feeds, network links. Canonical application: primary + ≥2 replicas across 3 AZs. Extends to different cloud providers for the strongest isolation (PlanetScale runs on both AWS and GCP).
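The primary + ≥2 replicas layout can be checked mechanically (AZ names are illustrative): any single-AZ failure removes at most one copy.

```python
# Hypothetical placement: one copy per availability zone.
placement = {"primary": "az-1", "replica-1": "az-2", "replica-2": "az-3"}

# Losing any one AZ removes at most one copy, leaving 2 of 3 alive --
# enough for a replica to be promoted and serving to continue.
for failed_az in {"az-1", "az-2", "az-3"}:
    alive = [node for node, az in placement.items() if az != failed_az]
    assert len(alive) == 2
print("single-AZ failure tolerated")
```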

Logical isolation

Copies don't share a software-level failure domain — no shared memory, no shared process, no shared library version that could cascade a bug across copies. The counter-examples are load-bearing:

  • Two replicas on the same MySQL binary version share the "MySQL bug in that version" failure domain. Mitigated by progressive-delivery per database (concepts/progressive-delivery-per-database).
  • Two processes on the same VM share the "VM crashes" failure domain. Mitigated by physical isolation across VMs.
  • Two systems depending on the same configuration service share the "config-service outage" failure domain. Mitigated by the data plane caching last-known-good control-plane state.
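The third mitigation can be sketched as a small wrapper (names and shape are illustrative, not PlanetScale's implementation): the data plane refreshes control-plane state when it can, and keeps serving the last-known-good copy when it can't.

```python
class LastKnownGood:
    """Serve cached control-plane state when the control plane is down."""

    def __init__(self, fetch):
        self.fetch = fetch   # call into the control plane
        self.state = None    # last-known-good state

    def get(self):
        try:
            self.state = self.fetch()   # refresh while the control plane is up
        except Exception:
            if self.state is None:
                raise        # never had good state; nothing to fall back on
        return self.state    # otherwise keep serving last-known-good

# Simulated control plane: one good response, then an outage.
responses = iter([{"routing": "v1"}, RuntimeError("config service down")])
def control_plane():
    r = next(responses)
    if isinstance(r, Exception):
        raise r
    return r

cache = LastKnownGood(control_plane)
print(cache.get())  # fresh state from the control plane
print(cache.get())  # control plane down: last-known-good state still served
```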

Dependency minimisation in the critical path

The critical path has as few dependencies as possible. Each dependency is a potential failure mode — adding a dependency to the critical path subtracts availability from it. PlanetScale's framing of the data plane:

"The most critical plane, with fewer dependencies than the control plane. Does not depend on the control plane."

This inverts the naive "more-critical = more-redundant" intuition into "more-critical = fewer-dependencies". Redundancy only helps against failure modes where the redundant copies aren't also taking the same dependency; shedding a dependency from the critical path is strictly more reliable than making that dependency redundant.
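The arithmetic behind "each dependency subtracts availability": a critical path is only up when all of its serial dependencies are up, so (assuming independent dependencies) their availabilities multiply.

```python
def path_availability(*deps):
    """Availability of a critical path that needs ALL dependencies up."""
    a = 1.0
    for dep_availability in deps:
        a *= dep_availability
    return a

# Each added 99.9%-available dependency drags the whole path down.
print(round(path_availability(0.999), 6))                # ~0.999
print(round(path_availability(0.999, 0.999, 0.999), 6))  # ~0.997
```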

Composition with redundancy

Isolation + redundancy together produce fault tolerance:

  • Redundancy alone: N copies of a part, all sharing a single failure domain. N copies fail together; no benefit.
  • Isolation alone: one part with few dependencies. If the part itself fails (without redundant copies), the system fails.
  • Both: N copies across N distinct failure domains, each with minimal dependencies. A failure in one copy's failure domain — or in one of its dependencies — does not affect the other copies.
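Under a simple independence assumption (each failure domain fails independently with probability p; numbers hypothetical), the cases above work out as:

```python
def p_outage(n_copies, shared_domain, p):
    """Probability the whole service is down.

    shared_domain=True  : all copies in one failure domain (redundancy alone)
    shared_domain=False : one isolated domain per copy (redundancy + isolation)
    """
    if shared_domain:
        return p              # one domain failure kills every copy at once
    return p ** n_copies      # all n independent domains must fail together

p = 0.01  # hypothetical per-domain failure probability
print(p_outage(3, True, p))   # redundancy alone: no better than one copy
print(p_outage(3, False, p))  # isolation + redundancy: roughly p cubed
```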


Applied to PlanetScale

Englander's essay maps the principles directly onto PlanetScale's architecture:

  • Control plane / data plane split — the data plane (query serving) has "extremely few dependencies" so its criticality is matched by dependency minimisation. The control plane (billing, DB creation, metadata) is allowed more dependencies (including a PlanetScale database for its own metadata — a deliberate circular dependency safe only because the data plane survives control-plane failure). Canonicalised as concepts/control-plane-data-plane-separation.
  • Regional + zonal redundancy of both planes — not just the data plane. Both planes are multi-AZ.
  • Database clusters = primary + ≥2 replicas across 3 AZs — the concrete embodiment of a multi-AZ Vitess cluster.
  • Static stability (concepts/static-stability) as the behaviour isolation + redundancy enables: "survive the isolation-broken-part by continuing on the other copies with their last-known-good state".
