
PATTERN

Always-be-failing-over drill

Problem

A failover code path that is only invoked during actual incidents accumulates three failure modes silently:

  1. Calcification — the path drifts out of sync with the rest of the system as dependencies, APIs, and runbooks evolve around it.
  2. Fear of invocation — operators hesitate to use a path they haven't exercised recently, adding minutes of user-facing downtime to every incident.
  3. Undiscovered regressions — failover depends on many orthogonal subsystems (replication, topology server, query router, connection management, monitoring). A change to any one can break failover without breaking normal operation, invisible until the next real failover.

Fleet scale compounds this: a database vendor managing hundreds of thousands of clusters sees actual failures continuously, yet any specific cluster's failover path may stay dormant for months.

Solution

Exercise the failover path routinely, on every cluster, using the shipping cadence itself as the drill mechanism. Turn the ship-a-new-version event and the failover event into the same event: upgrade replicas first, then promote one, which simultaneously ships the new version and exercises failover. The schedule forces weekly (or faster) exercise on every customer cluster.

Max Englander's verbatim framing (sources/2026-04-21-planetscale-the-principles-of-extreme-fault-tolerance):

"Very mature ability to fail over from a failing database primary to a healthy replica. Exercise this ability every week on every customer database as we ship changes. In the event of failing hardware or a network failure — fairly common in a big system running on the cloud — we automatically and aggressively fail over. Query buffering minimizes or eliminates disruption during failovers."

The pattern comprises three composed elements:

1. Release-as-failover

The new software version (MySQL binary, Vitess operator, cluster config) is rolled out via the failover mechanism itself. Upgrade the replicas first (so they're running the new version); promote one of them to primary; the old primary, now stale, is re-imaged and rejoins as a replica. The act of shipping IS the act of failing over.
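The sequence can be sketched as a toy, in-memory simulation. The Node/Cluster model and every name below are illustrative assumptions, not the Vitess Operator API:

```python
from dataclasses import dataclass

# Toy simulation of the release-as-failover sequence. The Node/Cluster
# model and function names are illustrative, not the Vitess Operator API.

@dataclass
class Node:
    name: str
    role: str               # "primary" or "replica"
    version: str
    replication_lag: float = 0.0

@dataclass
class Cluster:
    nodes: list

    @property
    def primary(self):
        return next(n for n in self.nodes if n.role == "primary")

def release_as_failover(cluster, new_version):
    """Ship new_version by driving it through the failover path."""
    old_primary = cluster.primary
    replicas = [n for n in cluster.nodes if n.role == "replica"]
    # 1. Upgrade replicas first, so a promotable node already runs new_version.
    for r in replicas:
        r.version = new_version
    # 2. Promote the most caught-up upgraded replica (a planned reparent):
    #    shipping the version and exercising failover are the same event.
    candidate = min(replicas, key=lambda r: r.replication_lag)
    candidate.role = "primary"
    # 3. The stale old primary is re-imaged onto new_version and rejoins
    #    the cluster as a replica.
    old_primary.role = "replica"
    old_primary.version = new_version

cluster = Cluster([
    Node("db-0", "primary", "v1"),
    Node("db-1", "replica", "v1", replication_lag=0.2),
    Node("db-2", "replica", "v1", replication_lag=0.9),
])
release_as_failover(cluster, "v2")
print(cluster.primary.name)                  # db-1
print({n.version for n in cluster.nodes})    # {'v2'}
```

After the call there is exactly one primary, it runs the new version, and so does every other node — the upgrade and the failover were one event.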

2. Weekly cadence, full fleet coverage

Every ship cycle exercises every customer cluster — not a canary subset, not a test fleet. At PlanetScale's scale this means "every week on every customer database". The full-fleet coverage catches regressions that only surface on specific customer configurations (e.g., a workload-specific bug).

3. Safe substrate

The drill is only safe because three substrates preserve correctness across the topology change:

  • Semi-sync replication so the promoted replica has all acknowledged writes. Without it, weekly drills would have weekly data loss.
  • Query buffering at the VTGate proxy tier so in-flight queries survive the promotion without client-visible errors.
  • Automated orchestration via Vitess Operator so no human is in the loop — operators aren't paged for routine drills.
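As a toy illustration of the second substrate, here is a single-threaded stand-in for proxy-tier query buffering. In the real VTGate tier the client call blocks until the buffer drains; here a "buffered" sentinel marks held queries so the flow stays visible. This is a sketch of the idea, not VTGate's implementation:

```python
# Single-threaded stand-in for proxy-tier query buffering during a
# reparent. Not VTGate's implementation: there, the client call blocks
# until the buffer drains; here held queries return a sentinel.

class BufferingProxy:
    def __init__(self, backend):
        self.backend = backend      # callable: query -> result
        self.buffering = False
        self.held = []

    def start_failover(self):
        # Primary is about to be reparented: start holding queries.
        self.buffering = True

    def execute(self, query):
        if self.buffering:
            self.held.append(query)
            return "buffered"
        return self.backend(query)

    def finish_failover(self, new_backend):
        # Point at the promoted replica, then replay everything held, so
        # in-flight queries survive promotion without client-visible errors.
        self.backend = new_backend
        self.buffering = False
        replayed, self.held = self.held, []
        return [self.backend(q) for q in replayed]

proxy = BufferingProxy(lambda q: "old:" + q)
print(proxy.execute("SELECT 1"))                     # old:SELECT 1
proxy.start_failover()
print(proxy.execute("SELECT 2"))                     # buffered
print(proxy.finish_failover(lambda q: "new:" + q))   # ['new:SELECT 2']
```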

Applicability

  • Database-as-a-service vendors managing large fleets of isolated clusters (PlanetScale, managed-Postgres, managed-Kafka, managed-MongoDB).
  • Active/passive replication topologies where one replica is the write target and others are promotable.
  • Systems with automated-failover orchestrators — humans can't drill hundreds of thousands of clusters weekly.
  • Systems with low-disruption failover substrate — query buffering, connection preservation, fast reconnect.

Known uses

Distinguished from adjacent patterns

  • vs Simian Army continuous fault injection. Netflix's Simian Army injects random, unexpected faults on a sampled fleet; the ABFO (always-be-failing-over) drill exercises a specific expected fault (primary replacement) on the full fleet. The two are complementary — Simian Army verifies the system survives surprises; ABFO drill verifies the planned failover path is still healthy.
  • vs generic game days / chaos days. Chaos days schedule failover exercises periodically (monthly, quarterly); ABFO drill exercises continuously (weekly or faster). The cadence difference is load-bearing — code that ships weekly via failover is tested weekly; code that ships via a separate mechanism drifts out of sync with the failover path.
  • vs blue-green deployment. Blue-green swaps whole environments; ABFO drill swaps a single role (primary ↔ replica) within a cluster. Blue-green costs 2× steady-state capacity; ABFO drill uses the existing replica headroom.
  • vs Vitess planned-reparent. Planned-reparent is the mechanism ABFO rides on. ABFO is the discipline of invoking planned-reparent on schedule rather than only reactively.

Trade-offs

  • Requires the substrate investment first. A well-worn failover path still loses writes if semi-sync isn't in place, and still disrupts clients if query buffering isn't in place. Without the substrate, ABFO drill is reckless.
  • Adds a weekly cost to every cluster. Each failover is a brief disruption (even if query-buffered), plus the ambient cost of running a replica at primary-sized capacity. Absorbed by static stability's overprovisioning principle.
  • Requires fleet-disaggregated telemetry to surface per-customer regressions that only appear on specific workloads. Aggregate-fleet metrics hide a one-customer broken failover in the noise.
  • Requires per-database progressive delivery for the release-as-failover composition — weekly failover of every customer to the same buggy version is worse than never failing over. The per-database flag-gate caps the blast radius of a bad ship to 1-2 customers.
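A minimal sketch of that flag-gated progressive delivery, assuming exponentially growing waves (the wave sizes and growth factor are illustrative assumptions, not PlanetScale's actual policy):

```python
# Illustrative sketch of per-database progressive delivery: the drill for
# a new version reaches clusters in growing waves, so a bad ship is caught
# while its blast radius is still 1-2 customers. Wave sizes and growth
# factor are assumptions, not PlanetScale's actual policy.

def rollout_waves(cluster_ids, first_wave=2, growth=10):
    """Yield cluster IDs in exponentially growing waves."""
    i, size = 0, first_wave
    while i < len(cluster_ids):
        yield cluster_ids[i:i + size]
        i += size
        size *= growth

waves = list(rollout_waves([f"db-{n}" for n in range(50)]))
print([len(w) for w in waves])   # [2, 20, 28]
print(waves[0])                  # ['db-0', 'db-1']
```

Each wave only proceeds once the previous wave's failovers complete cleanly, so the first (smallest) wave absorbs a bad ship.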

