
Isolation as fault-tolerance principle

Definition

Isolation as a reliability principle means systems are made from parts that are as physically and logically independent as possible, such that failures in one part do not cascade into failures in an independent part, and parts in the critical path have as few dependencies as possible.

Max Englander's canonical framing (sources/2026-04-21-planetscale-the-principles-of-extreme-fault-tolerance):

"Systems are made from parts that are as physically and logically independent as possible. Failures in one part do not cascade into failures in an independent part. Parts in the critical path have as few dependencies as possible."

And on redundancy, the principle Englander pairs with isolation:

"Each part is copied multiple times, so if one part fails, its copies continue doing its work. Copies of each part are themselves isolated from each other."

The second sentence is load-bearing: redundancy without isolation between copies is not fault tolerance — if the copies share a failure domain (same rack, same AZ, same power distribution, same library dependency, same deployment pipeline), they fail together.
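A toy sketch of that point (helper and domain names are hypothetical): copies only add fault tolerance when they sit in distinct failure domains.

```python
def survives(copy_domains, failed_domains):
    """A service survives if at least one copy sits outside every failed domain."""
    return any(d not in failed_domains for d in copy_domains)

# Three copies that all share one failure domain (same AZ, rack, pipeline, ...)
same_domain = ["az-1", "az-1", "az-1"]
# Three copies spread across isolated failure domains
spread = ["az-1", "az-2", "az-3"]

# A single domain failure takes out every copy that shares it.
print(survives(same_domain, {"az-1"}))  # False: redundancy without isolation
print(survives(spread, {"az-1"}))       # True: two isolated copies remain
```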

Three axes of isolation

Physical isolation

Copies placed across distinct physical failure domains — different machines, racks, AZs, regions, power feeds, network links. Canonical application: primary + ≥2 replicas across 3 AZs. Extends to different cloud providers for the strongest isolation (PlanetScale runs on both AWS and GCP).
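The primary + ≥2 replicas layout can be checked mechanically (AZ names are illustrative): any single-AZ failure removes at most one copy.

```python
# Hypothetical placement: one copy per availability zone.
placement = {"primary": "az-1", "replica-1": "az-2", "replica-2": "az-3"}

# Losing any one AZ removes at most one copy, leaving 2 of 3 alive --
# enough for a replica to be promoted and serving to continue.
for failed_az in {"az-1", "az-2", "az-3"}:
    alive = [node for node, az in placement.items() if az != failed_az]
    assert len(alive) == 2
print("single-AZ failure tolerated")
```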

Logical isolation

Copies don't share a software-level failure domain — no shared memory, no shared process, no shared library version that could cascade a bug across copies. The counter-examples are load-bearing:

  • Two replicas on the same MySQL binary version share the "MySQL bug in that version" failure domain. Mitigated by progressive-delivery per database (concepts/progressive-delivery-per-database).
  • Two processes on the same VM share the "VM crashes" failure domain. Mitigated by physical isolation across VMs.
  • Two systems depending on the same configuration service share the "config-service outage" failure domain. Mitigated by the data plane caching last-known-good control-plane state.
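The third mitigation can be sketched as a small wrapper (names and shape are illustrative, not PlanetScale's implementation): the data plane refreshes control-plane state when it can, and keeps serving the last-known-good copy when it can't.

```python
class LastKnownGood:
    """Serve cached control-plane state when the control plane is down."""

    def __init__(self, fetch):
        self.fetch = fetch   # call into the control plane
        self.state = None    # last-known-good state

    def get(self):
        try:
            self.state = self.fetch()   # refresh while the control plane is up
        except Exception:
            if self.state is None:
                raise        # never had good state; nothing to fall back on
        return self.state    # otherwise keep serving last-known-good

# Simulated control plane: one good response, then an outage.
responses = iter([{"routing": "v1"}, RuntimeError("config service down")])
def control_plane():
    r = next(responses)
    if isinstance(r, Exception):
        raise r
    return r

cache = LastKnownGood(control_plane)
print(cache.get())  # fresh state from the control plane
print(cache.get())  # control plane down: last-known-good state still served
```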

Dependency minimisation in the critical path

The critical path has as few dependencies as possible. Each dependency is a potential failure mode — adding a dependency to the critical path subtracts availability from it. PlanetScale's framing of the data plane:

"The most critical plane, with fewer dependencies than the control plane. Does not depend on the control plane."

This inverts the naive "more-critical = more-redundant" intuition into "more-critical = fewer-dependencies". Redundancy only helps against failure modes where the redundant copies aren't also taking the same dependency; shedding a dependency from the critical path is strictly more reliable than making that dependency redundant.
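The arithmetic behind "each dependency subtracts availability": a critical path is only up when all of its serial dependencies are up, so (assuming independent dependencies) their availabilities multiply.

```python
def path_availability(*deps):
    """Availability of a critical path that needs ALL dependencies up."""
    a = 1.0
    for dep_availability in deps:
        a *= dep_availability
    return a

# Each added 99.9%-available dependency drags the whole path down.
print(round(path_availability(0.999), 6))                # ~0.999
print(round(path_availability(0.999, 0.999, 0.999), 6))  # ~0.997
```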

Composition with redundancy

Isolation + redundancy together produce fault tolerance:

  • Redundancy alone: N copies of a part, all sharing a single failure domain. N copies fail together; no benefit.
  • Isolation alone: one part with few dependencies. If the part itself fails (without redundant copies), the system fails.
  • Both: N copies across N distinct failure domains, each with minimal dependencies. A failure in one copy's failure domain — or in one of its dependencies — does not affect the other copies.
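Under a simple independence assumption (each failure domain fails independently with probability p; numbers hypothetical), the cases above work out as:

```python
def p_outage(n_copies, shared_domain, p):
    """Probability the whole service is down.

    shared_domain=True  : all copies in one failure domain (redundancy alone)
    shared_domain=False : one isolated domain per copy (redundancy + isolation)
    """
    if shared_domain:
        return p              # one domain failure kills every copy at once
    return p ** n_copies      # all n independent domains must fail together

p = 0.01  # hypothetical per-domain failure probability
print(p_outage(3, True, p))   # redundancy alone: no better than one copy
print(p_outage(3, False, p))  # isolation + redundancy: roughly p cubed
```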


Applied to PlanetScale

Englander's essay maps the principles directly onto PlanetScale's architecture:

  • Control plane / data plane split — the data plane (query serving) has "extremely few dependencies" so its criticality is matched by dependency minimisation. The control plane (billing, DB creation, metadata) is allowed more dependencies (including a PlanetScale database for its own metadata — a deliberate circular dependency safe only because the data plane survives control-plane failure). Canonicalised as concepts/control-plane-data-plane-separation.
  • Regional + zonal redundancy of both planes — not just the data plane. Both planes are multi-AZ.
  • Database clusters = primary + ≥2 replicas across 3 AZs — the concrete embodiment of a multi-AZ Vitess cluster.
  • Static stability (concepts/static-stability) as the behaviour isolation + redundancy enables: "survive the isolation-broken-part by continuing on the other copies with their last-known-good state".
