PATTERN Cited by 1 source
Zonal reparenting to healthy AZ¶
Problem¶
A partial network partition between availability zones has left some database primaries connectable from the Internet but unable to communicate with their replicas, the application, or both across AZ boundaries. Automatic failover detection is ambiguous — the primary looks up from some vantage points and down from others — so the automated VTOrc-style health-check-driven failover may not fire, or may fire inconsistently.
Meanwhile, customer queries routed to an AZ whose primary is unreachable across the partition fail. The goal is to move each affected primary to an AZ where it has reliable cross-AZ connectivity — ideally to the AZ colocated with the customer's application.
Solution¶
Operator-driven, per-cluster reparent to an AZ known to be healthier, using the Vitess PlannedReparentShard (or EmergencyReparentShard) mechanism via an operator interface.
Selection of the target AZ is informed by:
- Which AZs are reachable from the customer's application. If the app runs in us-east-1a and us-east-1b is partition-isolated, reparent the primary to us-east-1a.
- Which AZ has the fewest observed partition symptoms. Connectivity matrices built from replica-to-primary health signals and per-AZ error rates.
- Which AZ is colocated with the largest share of customer traffic. Minimises cross-AZ hops during the partition.
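A minimal sketch of how these three signals could be folded into a ranking. The field names, the hard reachability filter, and the tie-break order are illustrative assumptions, not anything PlanetScale has disclosed:

```python
from dataclasses import dataclass

@dataclass
class AzSignal:
    az: str
    reachable_from_app: bool      # can the customer's application reach replicas here?
    partition_error_rate: float   # per-AZ error rate observed during the partition (0.0-1.0)
    traffic_share: float          # fraction of customer traffic originating in this AZ

def rank_reparent_targets(signals: list[AzSignal]) -> list[AzSignal]:
    """Rank candidate AZs for a reparent target.

    Hard filter: the customer's application must be able to reach the AZ.
    Soft ordering: fewest partition symptoms first, then largest traffic share.
    """
    candidates = [s for s in signals if s.reachable_from_app]
    return sorted(candidates, key=lambda s: (s.partition_error_rate, -s.traffic_share))

# Example: us-east-1b is partition-isolated from the app, so it is filtered out;
# us-east-1a wins on both health and colocation.
ranking = rank_reparent_targets([
    AzSignal("us-east-1a", reachable_from_app=True,  partition_error_rate=0.01, traffic_share=0.70),
    AzSignal("us-east-1b", reachable_from_app=False, partition_error_rate=0.40, traffic_share=0.20),
    AzSignal("us-east-1c", reachable_from_app=True,  partition_error_rate=0.15, traffic_share=0.10),
])
print([s.az for s in ranking])  # ['us-east-1a', 'us-east-1c']
```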
Verbatim from PlanetScale's 2025-10-20 incident post:
Where possible, we manually sent reparent requests to move primary databases to availability zones known to be healthier or known to be colocated with the customer's application.
Two decision criteria disclosed: healthier (AZ-level connectivity signal) and colocated with the customer's application (proximity signal). The post frames this as "where possible" — the mechanism wasn't universally applicable, because the partial partition didn't always leave a clean "healthier" AZ to reparent into.
Mechanics¶
- Build a per-AZ health picture. From the existing monitoring fabric (VTOrc health observations, per-tablet connectivity metrics, per-query error rates by AZ) construct a signal that ranks AZs by current health for the cluster in question.
- Check application-side locality. Where is the customer's application running? This usually requires operator-level knowledge of the customer's topology, not automated signals.
- Initiate a planned reparent. Run vtctld PlannedReparentShard against the selected new-primary replica. Vitess handles the demotion-of-old-primary, promotion-of-new-primary, query-buffering-during-cutover sequence.
- Fall back to EmergencyReparentShard when the old primary is genuinely unreachable. PRS requires the old primary to be reachable for clean demotion; during a partial partition, ERS may be the only option.
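A hedged sketch of the PRS-then-ERS escalation, assuming a vtctldclient binary on the operator's path. Flag names follow recent vtctldclient usage but vary across Vitess versions, so verify them against the local --help output before use:

```python
import subprocess

def reparent_to(keyspace_shard: str, new_primary_alias: str) -> None:
    """Try a planned reparent; escalate to an emergency reparent if it fails.

    Flag names are assumptions based on recent vtctldclient releases; check
    `vtctldclient PlannedReparentShard --help` for the version in use.
    """
    prs = ["vtctldclient", "PlannedReparentShard",
           f"--new-primary={new_primary_alias}", keyspace_shard]
    try:
        subprocess.run(prs, check=True, timeout=120)
        return
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        # PRS needs the old primary reachable for a clean demotion; during a
        # partial partition that may not hold, so fall back to ERS.
        ers = ["vtctldclient", "EmergencyReparentShard",
               f"--new-primary={new_primary_alias}", keyspace_shard]
        subprocess.run(ers, check=True, timeout=120)

# Hypothetical invocation: move commerce/0's primary onto a replica in us-east-1a.
# reparent_to("commerce/0", "zone1a-0000000101")
```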
The move is reactive and operator-driven, not automatic — because automatic detection of which AZ is healthier requires per-path visibility most health systems don't have, and the wrong automated reparent during a partial partition could make things worse.
When this is right¶
- The partition is persistent enough to justify a reparent. Flapping partitions that heal in seconds should not trigger operator reparents; the cost of a primary move exceeds the brief unavailability.
- A clean target AZ exists. If all AZs show some partition symptoms, the reparent may just move the failure surface rather than escape it.
- The operator has enough visibility to choose the target. Per-AZ health signals + application-locality signals need to be available.
- The cluster uses a topology amenable to reparent. Vitess / Aurora / Orchestrator-style topologies support planned reparents natively; simpler primary-replica setups may not.
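Those criteria are easy to encode as a pre-flight check before an operator issues the reparent; the thresholds and messages below are illustrative assumptions layered on the ranking sketch above:

```python
def should_reparent(partition_age_s: float,
                    ranked_targets: list,
                    current_az: str) -> tuple[bool, str]:
    """Pre-flight check mirroring the criteria above (thresholds are illustrative)."""
    if partition_age_s < 120:
        return False, "partition may still be flapping; wait before moving a primary"
    if not ranked_targets:
        return False, "no clean target AZ; a reparent would only move the failure surface"
    best = ranked_targets[0]
    if best.az == current_az:
        return False, "primary is already in the best-ranked AZ"
    return True, f"reparent to {best.az}"
```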
When this is wrong¶
- The partition is fleet-wide. If every AZ is partially partitioned from every other, reparent doesn't help — move the primary, and the new AZ has its own partial-partition problems.
- Automatic failover is already reparenting well. If the automated health-check system correctly identifies the right AZ, don't second-guess it; operator intervention adds risk.
- The customer's application is multi-AZ-balanced. If the customer runs in all AZs equally, there's no "colocated" target to prefer — pure health signal is the only criterion.
Composition with automation boundaries¶
Zonal reparenting to healthy AZ is an escape hatch from automatic failover during partial partitions. The automated stack (systems/vtorc + health-check-driven PRS/ERS) handles the "primary is clearly broken" case well; partial partitions produce ambiguous health signals that may not trigger the automation, or may trigger it inconsistently per node.
PlanetScale's full incident playbook composes:
- Automatic failover handles unambiguous primary failure.
- Operator zonal reparent handles partial-partition ambiguity.
- Multi-AZ Vitess cluster topology provides the replica inventory the reparent target has to come from.
- Weekly failover drill is the substrate that makes both the automatic and the manual reparent well-tested.
Recovery quirk: stuck processes after healing¶
After the partition heals and traffic returns to normal, expect some processes to need manual restart — the partial-partition page notes PlanetScale's specific observation: "Once the network partitions healed, we found a small number of processes (PlanetScale's edge load balancer as well as vtgate) which were not able to recover on their own due to the way they experienced the network partition. We restarted these and restored service." Zonal reparent plus post-heal restart sweep is the complete response.
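A hedged sketch of that post-heal sweep, assuming per-process health endpoints and systemd-managed units; the port, path, and restart command are placeholders, since the post does not say how the stuck processes were identified:

```python
import subprocess
import urllib.request

def restart_stuck(instances: list[tuple[str, str]], health_path: str = "/debug/health") -> list[str]:
    """After the partition heals, restart processes that still fail health checks.

    `instances` is a list of (hostname, systemd_unit) pairs; the port, health
    path, and ssh/systemctl restart are illustrative placeholders.
    """
    restarted = []
    for host, unit in instances:
        try:
            with urllib.request.urlopen(f"http://{host}:15001{health_path}", timeout=5) as resp:
                if resp.status == 200:
                    continue  # recovered on its own
        except OSError:
            pass  # unreachable or unhealthy: fall through to restart
        subprocess.run(["ssh", host, "sudo", "systemctl", "restart", unit], check=False)
        restarted.append(f"{host}/{unit}")
    return restarted
```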
Seen in¶
- sources/2025-11-03-planetscale-aws-us-east-1-incident-2025-10-20 — PlanetScale, Richard Crowley, 2025-11-03. Canonical wiki application. Phase 2 of the 2025-10-20 AWS us-east-1 incident, during the ~14:30–19:30 UTC network-partition window. Two decision criteria disclosed: "healthier AZ" and "colocated with customer application." Executed as operator-driven manual reparent requests, not automatic failover. No disclosure on how many reparents were performed or on per-customer impact duration; the post explicitly caveats "where possible" — partial partitions sometimes left no clean healthy-AZ target.
Related¶
- concepts/partial-network-partition — the fault class this pattern responds to.
- concepts/availability-zone-balance — the placement property that makes reparent-targets available.
- concepts/blast-radius — zonal reparent bounds blast radius by moving the primary away from the affected AZ.
- systems/vitess, systems/vttablet, systems/vtorc — the runtime that supports PRS/ERS natively.
- systems/planetscale — the product that operationalises this pattern during partial-partition incidents.
- patterns/multi-az-vitess-cluster — the topology that gives the reparent a healthy replica to promote.
- patterns/always-be-failing-over-drill — the weekly-failover discipline that keeps reparent well-tested.