CONCEPT
Always be failing over¶
Definition¶
Always be failing over is the reliability-process principle that failover is exercised routinely, not reserved for emergencies. The failover code path is kept hot — continuously tested, continuously relied on in routine operations — so that when an unplanned failure forces a failover, it takes a path exercised within the last week rather than one dormant since the last incident.
Max Englander's canonical framing (sources/2026-04-21-planetscale-the-principles-of-extreme-fault-tolerance):
"Very mature ability to fail over from a failing database primary to a healthy replica. Exercise this ability every week on every customer database as we ship changes. In the event of failing hardware or a network failure — fairly common in a big system running on the cloud — we automatically and aggressively fail over. Query buffering minimizes or eliminates disruption during failovers."
The load-bearing operational datum: every week on every customer database. Not a percentage, not a canary fleet, not a test cluster: the full production customer fleet. New software versions ship using failover as the roll mechanism, so every weekly ship cycle exercises every customer's failover path.
Why exercise routinely¶
A failover code path that is only used during emergencies has three compounding problems:
- Calcification. Code that isn't exercised drifts out of sync with the rest of the system — API changes, dependency updates, monitoring changes, runbook drift.
- Fear of invocation. An operator looking at an unexercised failover path may hesitate to invoke it during an incident because "we haven't done this in six months". That hesitation is measured in minutes of user-facing downtime.
- Undiscovered regressions. Failover typically depends on many orthogonal subsystems (replication state, topology server, query router, connection management, monitoring). A change to any one can break failover without breaking normal operation, and the breakage is invisible until the next failover.
Routine exercise addresses all three: the path stays synchronised with current code, operators are comfortable invoking it, and regressions surface weekly on routine ship cycles instead of monthly or annually on incidents.
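The drill loop this implies can be sketched in a few lines. Everything below (`Cluster`, `planned_failover`, `weekly_drill`) is a hypothetical illustration of the shape, not PlanetScale's or Vitess's actual API:

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    primary: str
    replicas: list[str]


def planned_failover(cluster: Cluster) -> str:
    """Promote the first replica; the old primary rejoins as a replica."""
    if not cluster.replicas:
        raise RuntimeError("no replica to promote")
    new_primary = cluster.replicas.pop(0)
    cluster.replicas.append(cluster.primary)
    cluster.primary = new_primary
    return new_primary


def weekly_drill(fleet: list[Cluster]) -> list[str]:
    """Exercise failover on EVERY cluster, collecting regressions rather than hiding them."""
    regressions = []
    for cluster in fleet:
        try:
            planned_failover(cluster)
        except Exception as exc:
            # A regression surfaces here, on a routine ship cycle,
            # not months later during an unplanned incident.
            regressions.append(f"{cluster.name}: {exc}")
    return regressions
```

The key design choice is that the loop covers the whole fleet and reports failures as a first-class output: a broken failover path becomes a weekly finding instead of an incident-time surprise.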
Substrate requirements¶
Always-be-failing-over only works on top of substrates that make failover safe:
- Semi-sync replication (or Postgres synchronous commit) so that any replica can be promoted immediately without data-loss risk. Englander's verbatim framing: "Enables us to treat replicas as potential primaries, and fail over to them immediately as needed." Without semi-sync, every failover is a data-loss gamble, and exercising a data-loss gamble weekly is irresponsible.
- Query buffering at the proxy layer so in-flight queries survive the topology change without the client seeing a connection error. Canonicalised in Vitess's graceful leader demotion flow.
- Automated failover orchestration (Vitess Operator, VTOrc) so no human is in the critical path — failovers can happen off-hours without paging an engineer.
- Pre-provisioned replica capacity (static stability) so the promoted replica doesn't have to wait for capacity-provisioning to finish.
Absent any of these, "always be failing over" becomes "always be risking data loss" or "always be paging someone at 2am".
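The substrate requirements above amount to a preflight gate the drill must pass before it is allowed to run. As a sketch, with hypothetical field and function names chosen for illustration:

```python
from dataclasses import dataclass


@dataclass
class SubstrateStatus:
    semi_sync_acked: bool   # a replica has durably acked the latest commit
    query_buffering: bool   # proxy can park in-flight queries across the cutover
    orchestrator_up: bool   # automated failover orchestration is running
    spare_replicas: int     # pre-provisioned capacity to absorb the promotion


def drill_allowed(s: SubstrateStatus) -> tuple[bool, list[str]]:
    """Refuse the drill unless every substrate requirement is met."""
    blockers = []
    if not s.semi_sync_acked:
        blockers.append("no semi-sync ack: failover risks data loss")
    if not s.query_buffering:
        blockers.append("no query buffering: clients see connection errors")
    if not s.orchestrator_up:
        blockers.append("no orchestrator: a human lands in the critical path")
    if s.spare_replicas < 1:
        blockers.append("no spare replica: promotion waits on provisioning")
    return (not blockers, blockers)
```

Each blocker string maps one missing substrate to the failure mode named above, which keeps the refusal actionable.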
Distinguished from adjacent concepts¶
- vs chaos engineering. Chaos engineering injects unexpected faults (random instance termination, network partitions, disk fills) to verify the system survives. Always-be-failing-over exercises a specific expected fault (primary replacement) on a predictable schedule (every ship cycle). The two compose — chaos engineering verifies surprises are survivable; always-be-failing-over verifies the planned failover path is still healthy.
- vs Netflix Simian Army. Same altitude as chaos engineering. Netflix's Chaos Monkey injects random instance termination; PlanetScale's "always be failing over" exercises every customer database's failover deliberately. Netflix samples randomly; PlanetScale covers the fleet deterministically.
- vs blue-green deployment. Blue-green swaps a whole environment; always-be-failing-over swaps one role within a single cluster (the primary role moves to a replica). The ship-as-failover pattern at PlanetScale uses always-be-failing-over as the mechanism to roll a new version onto the primary — replicas are upgraded first, then one is promoted, which makes the upgrade ship and the failover exercise the same event.
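The ship-as-failover roll can be sketched as follows. `ship_version` and its parameters are hypothetical, but the ordering (upgrade replicas, promote one, demote and upgrade the old primary last) follows the pattern described above:

```python
def ship_version(primary: str, replicas: list[str],
                 versions: dict[str, str],
                 new_version: str) -> tuple[str, list[str]]:
    """Roll a new version onto a cluster using failover as the mechanism."""
    # 1. Upgrade every replica while the old primary keeps serving writes.
    for r in replicas:
        versions[r] = new_version
    # 2. Promote one upgraded replica. This promotion IS the failover
    #    drill: the ship and the exercise are the same event.
    new_primary, *rest = replicas
    # 3. The old primary rejoins as a replica and is upgraded last.
    versions[primary] = new_version
    return new_primary, rest + [primary]
```

Because step 2 is an ordinary planned failover, every weekly ship cycle doubles as a fleet-wide failover exercise with no separate drill machinery.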
Sibling: automatic + aggressive¶
Englander pairs the weekly deliberate exercise with automatic + aggressive reactive failover: "In the event of failing hardware or a network failure — fairly common in a big system running on the cloud — we automatically and aggressively fail over." The two modes reinforce each other: the deliberate mode keeps the path hot, and the reactive mode uses the path constantly because cloud hardware fails constantly. The deliberate-mode cadence is therefore a lower bound on exercise frequency, not an upper bound.
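A minimal sketch of the reactive mode, assuming a probe-and-threshold design; the names and the two-strike threshold are illustrative, not Englander's described implementation:

```python
from typing import Callable, Iterable


def reactive_failover(probe_results: Iterable[bool],
                      failover: Callable[[], None],
                      threshold: int = 2) -> bool:
    """Trigger failover after `threshold` consecutive failed health probes.

    Returns True if failover was triggered, with no human in the loop.
    """
    consecutive = 0
    for healthy in probe_results:
        if healthy:
            consecutive = 0
            continue
        consecutive += 1
        if consecutive >= threshold:
            failover()   # the same hot path the weekly drill exercises
            return True
    return False
```

The point of the sketch is the last comment: the reactive trigger invokes exactly the code path the deliberate drill keeps warm, so neither mode exercises a path the other has not.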
Seen in¶
- PlanetScale, Max Englander, The principles of extreme fault tolerance, 2025-07-03 — canonical framing; named as one of three reliability processes (alongside synchronous replication and progressive delivery). Canonicalised as patterns/always-be-failing-over-drill.
Related¶
- concepts/mysql-semi-sync-replication — substrate that makes the drill safe
- concepts/query-buffering-cutover — substrate that makes the drill invisible to clients
- concepts/static-stability — principle enabling the pre-provisioned headroom failover lands on
- concepts/chaos-engineering — adjacent discipline
- patterns/always-be-failing-over-drill — canonical pattern embodying this concept
- patterns/continuous-fault-injection-in-production — Netflix Simian Army sibling discipline
- patterns/graceful-leader-demotion — the Vitess mechanism the drill rides on
- systems/vitess-operator — the operator executing the drill