Unplanned failover playbook¶
Definition¶
An unplanned-failover playbook is the operational runbook for promoting a replica to primary when the current primary has crashed, become unreachable, or otherwise failed unexpectedly. Unlike planned failover (software rollout, maintenance window), which can use graceful leader demotion with query buffering and zero application errors, unplanned failover accepts a brief window of write unavailability in exchange for correctness — specifically, for avoiding split-brain when the downed primary returns.
The canonical framing — Morrison on PlanetScale¶
Brian Morrison II's best-practices post (2023-11-15) lays out the canonical four-step procedure, quoted verbatim:
"One of the major benefits of using replication is the increase in resiliency by having more than one server containing your data online at any given time. Your team should have a good strategy ready in case the primary data source fails. The following is an example of what an unplanned failover might look like:
1. Take measures to ensure the downed source won't come back online. This could cause replication issues if it happens unexpectedly.
2. Identify the replica you want to choose as the new source and unset the read_only option. If semi-sync is used, this would be the replica you've configured with the plugin along with the source.
3. Update your application to direct queries to the newly promoted source.
4. Update the other replicas to start replicating from the new source." (Source: sources/2026-04-21-planetscale-mysql-replication-best-practices-and-considerations)
The four steps, annotated¶
Step 1 — Fence the downed primary¶
"Take measures to ensure the downed source won't come back online. This could cause replication issues if it happens unexpectedly."
Why it's first: if the downed primary recovers after step 2 or 3, you have two primaries accepting writes to the same dataset — split-brain. Every write on either side is now suspect. Recovery requires choosing one side's writes and discarding the other's — data loss with no algorithmic way to pick the right side.
Mechanisms: power off the host; deny network traffic (security group / firewall rule); remove from the load balancer; stop the mysqld process if reachable; detach the EBS volume. Any mechanism that guarantees the old primary cannot accept a write is valid — the ordering is load-bearing but the implementation is not.
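A sketch of the fencing options as shell commands — the interface, port, unit name, and instance ID below are placeholders, and which layer you fence at depends on what is still reachable:

```shell
# Network-layer fence: drop inbound MySQL traffic on the downed host
# (run on the old primary, if it is still reachable over SSH).
iptables -A INPUT -p tcp --dport 3306 -j DROP

# Process-layer fence: stop mysqld so it cannot accept writes on restart
# of the network.
systemctl stop mysqld

# Infrastructure-layer fence (AWS example; placeholder instance ID).
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
```

Belt-and-braces is common: apply two independent fences so a single failed command does not leave the old primary writable.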
Step 2 — Promote a replica¶
"Identify the replica you want to choose as the new source and unset the read_only option. If semi-sync is used, this would be the replica you've configured with the plugin along with the source."
Candidate selection is the hard part of this step, and the one that motivates the mixed sync + async replication topology. In a pure-async cluster, replicas drift by varying amounts; picking the "furthest-ahead" requires querying each replica's GTID position / binlog position. In a mixed-mode cluster with one semi-sync-flagged replica, that replica is the answer by construction — it has every transaction the primary acknowledged, because semi-sync blocked the primary's ack until the relay-log persisted.
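In the pure-async case, the "furthest-ahead" comparison can be done with MySQL's built-in GTID set functions; the server UUID and transaction ranges below are illustrative:

```sql
-- On each surviving replica, read its executed GTID set:
SELECT @@global.gtid_executed;

-- GTID_SUBSET(a, b) returns 1 when every transaction in a is also in b.
-- A candidate is safe to promote when every other replica's executed set
-- is a subset of the candidate's:
SELECT GTID_SUBSET('3e11fa47-71ca-11e1-9e33-c80aa9429562:1-50',
                   '3e11fa47-71ca-11e1-9e33-c80aa9429562:1-56');  -- 1

-- GTID_SUBTRACT shows exactly which transactions a lagging replica lacks:
SELECT GTID_SUBTRACT('3e11fa47-71ca-11e1-9e33-c80aa9429562:1-56',
                     '3e11fa47-71ca-11e1-9e33-c80aa9429562:1-50');
-- -> '3e11fa47-71ca-11e1-9e33-c80aa9429562:51-56'
```

The semi-sync topology makes this arithmetic unnecessary: the flagged replica is the superset by construction.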
Mechanism: SET GLOBAL read_only = OFF on the chosen replica + SET GLOBAL super_read_only = OFF in modern MySQL (the super_read_only flag prevents even SUPER-privileged users from writing and is the stricter version).
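A minimal promotion sequence on the chosen replica might look like the following (in MySQL, disabling read_only implicitly disables super_read_only, but setting both is explicit and harmless):

```sql
-- Stop applying from the dead primary first, so the promotion point
-- is well-defined (STOP REPLICA in MySQL 8.0.22+):
STOP SLAVE;

-- Open the server for writes:
SET GLOBAL super_read_only = OFF;
SET GLOBAL read_only = OFF;
```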
Step 3 — Re-point the application¶
"Update your application to direct queries to the newly promoted source."
The application-layer cutover. Morrison's post is at the MySQL-vanilla altitude and elides the hard parts:
- Connection pooling — existing connections to the old primary must drain or reset; connection poolers (PgBouncer, systems/hyperdrive, ProxySQL, Vitess vtgate) are the natural enforcement point.
- DNS vs VIP vs proxy — a DNS-based primary endpoint has TTL-bounded cutover latency; a Virtual IP move is faster but platform-specific; a proxy tier (Vitess, RDS Proxy) can flip atomically at the proxy layer with zero app-side change.
- Write-side unavailability window — the brief period between "old primary fenced" and "new primary accepting writes + app aware of it" is write-unavailable. Typical: seconds to tens of seconds.
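The DNS-cutover bound is directly observable: the remaining TTL on the primary endpoint's record is the worst case for a client that resolved it just before the failover. The hostname and output below are placeholders:

```shell
# The second field of each answer line is the remaining TTL in seconds;
# a client that cached this record may keep writing to the old address
# for up to that long after the DNS flip.
dig +noall +answer primary.db.example.com
# example output (placeholder):
# primary.db.example.com.  30  IN  A  10.0.1.17
```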
Managed substrates (Vitess via systems/vtorc, MySQL orchestrator via systems/orchestrator, AWS RDS) automate steps 1-3 as a single operation.
Step 4 — Re-point remaining replicas¶
"Update the other replicas to start replicating from the new source."
Every surviving replica must be told the new upstream:
```sql
STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST = 'new-primary.example.com',
  MASTER_USER = 'repl',
  MASTER_PASSWORD = '...',
  MASTER_AUTO_POSITION = 1;  -- GTID-based
START SLAVE;
```
(Or STOP REPLICA; CHANGE REPLICATION SOURCE TO ...; START REPLICA; in MySQL 8.0.23+.)
GTIDs make this safe: with GTID auto-positioning, each replica's gtid_executed set tells the new primary exactly which transactions to stream — no risk of skipping or re-applying. Without GTIDs (file+position replication), the operator must know the exact binlog file + offset on the new primary corresponding to the replica's progress — fragile and error-prone, especially under crash-and-restore.
The ordering is load-bearing¶
Step 1 must precede step 2: if you promote a replica while the old primary is still reachable, both will accept writes. Step 2 must precede step 3: routing the app to a replica that hasn't unset read_only yields write errors. Step 4 can happen in parallel with step 3 (replicas don't affect write availability), but most runbooks run it after step 3 to minimise load on the new primary during the chaotic application-cutover moment.
Contrast with planned failover¶
Planned failover — software rollout, scheduled maintenance — uses a completely different path: patterns/graceful-leader-demotion (the Vitess PRS path, canonical via sources/2026-04-21-planetscale-consensus-algorithms-at-scale-part-4-establishment-and-revocation):
- Ask the current primary to step down (it completes in-flight transactions during a lameduck drain).
- Buffer new writes at the proxy tier during the drain.
- Promote the new primary.
- Flush buffered writes to the new primary.
Application sees no errors. The two paths are duals on the revoke-and-establish axis:
| Axis | Planned | Unplanned |
|---|---|---|
| Revocation | Graceful demotion (ask nicely) | Fence (forcibly isolate) |
| Drain | Lameduck + query buffering | Accept brief write unavailability |
| Application visibility | Zero errors | Brief error/retry window |
| Common case | Daily (software rollouts) | Monthly-or-less (crashes) |
| Optimised for | UX continuity | Correctness under hostile failure |
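On a managed Vitess substrate the two paths surface as distinct reparent operations — a sketch with a placeholder keyspace/shard and tablet alias (exact flag spellings may vary across vtctldclient versions):

```shell
# Planned path: graceful demotion with query buffering (PRS).
vtctldclient PlannedReparentShard commerce/0 --new-primary zone1-0000000101

# Unplanned path: fence the dead primary and force-promote a survivor (ERS).
vtctldclient EmergencyReparentShard commerce/0 --new-primary zone1-0000000101
```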
Seen in¶
- sources/2026-04-21-planetscale-mysql-replication-best-practices-and-considerations — canonical wiki four-step procedure with the ordering rationale (fence first to prevent split-brain; semi-sync replica is the safe promotion candidate).
- sources/2026-04-21-planetscale-consensus-algorithms-at-scale-part-4-establishment-and-revocation — companion source on the planned-failover (graceful demotion) path; this page's dual.
Related¶
- concepts/active-passive-replication — the topology this playbook applies to.
- concepts/mysql-semi-sync-replication — the posture that makes candidate-selection deterministic.
- concepts/split-brain — the failure mode step 1 prevents.
- concepts/leader-revocation + concepts/leader-establishment — the generic revocation/establishment primitives this playbook composes.
- patterns/graceful-leader-demotion — the planned-failover dual.
- patterns/mixed-sync-replication-topology — the topology that makes step 2 unambiguous.
- patterns/zero-downtime-reparent-on-degradation — the managed-substrate version (Vitess ERS / vtorc).
- systems/vtorc + systems/orchestrator — automators that run this playbook for you.
- systems/mysql + companies/planetscale — substrate and first-party voice.