PATTERN Cited by 1 source

Zero-downtime reparent on degradation

Problem

When the health monitor (see patterns/automated-volume-health-monitoring) classifies an EBS volume as degraded, the database process attached to that volume is degraded too — read/write latency becomes customer-visible. The mitigation has to:

  1. Drain customer traffic off the sick node in seconds, because every second of degraded operation surfaces to users as errors (HTTP 500s) and latency.
  2. Maintain data safety — no lost commits, no torn writes, no split-brain with the sick node still accepting writes.
  3. Restore fleet capacity — bring up a replacement volume on a healthy node so the cluster goes back to full replica count.
  4. Require no human in the loop, because at fleet scale these events happen multiple times per day.

Manual failover playbooks don't meet (1) or (4); dumb instance-level health checks don't meet (2).

Solution

Automated reparent in seconds on a topology-aware cluster, triggered by the health monitor. The steps:

  1. Elect a new primary (or promote a replica) on a node with a healthy EBS volume.
  2. Flip traffic at the cluster topology layer — in Vitess, this is PlannedReparentShard or EmergencyReparentShard.
  3. Fence the sick node so it stops accepting writes, avoiding split-brain if the degraded volume recovers mid-reparent.
  4. Provision replacement volume on a fresh node to restore replica count, and let the standard replication protocol catch up the replacement.
  5. Retire the degraded volume once confirmed drained.
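The five steps can be sketched as a single orchestration routine. A minimal Python sketch under loose assumptions: `Node`, `Cluster`, and the step bodies here are hypothetical stand-ins for the real topology operations (in Vitess, steps 2 and 3 are what PlannedReparentShard / EmergencyReparentShard perform), not an actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Hypothetical stand-in for a database node with an attached volume."""
    name: str
    az: str
    healthy: bool = True
    accepting_writes: bool = False

@dataclass
class Cluster:
    primary: Node
    replicas: list[Node] = field(default_factory=list)

    def reparent_on_degradation(self) -> Node:
        """Orchestrate steps 1-5 after the monitor flags the primary's volume."""
        sick = self.primary
        # 1. Elect a new primary among replicas on healthy volumes.
        candidates = [r for r in self.replicas if r.healthy]
        if not candidates:
            raise RuntimeError("no healthy replica to promote")
        new_primary = candidates[0]
        # 2. Flip traffic at the topology layer (EmergencyReparentShard in Vitess).
        new_primary.accepting_writes = True
        self.replicas.remove(new_primary)
        # 3. Fence the sick node so a recovering volume can't cause split-brain.
        sick.accepting_writes = False
        self.primary = new_primary
        # 4. Provision a replacement replica; normal replication catches it up.
        replacement = Node(name=f"{sick.name}-replacement", az=sick.az)
        self.replicas.append(replacement)
        # 5. Retire the degraded volume (modeled here as dropping it from topology).
        return new_primary
```

Note the ordering: traffic flips before fencing completes only if the topology layer guarantees the old primary's writes are rejected, which is why step 3 is part of the reparent rather than cleanup.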

On PlanetScale's production fleet:

When we detect that an EBS volume is in a degraded state using these heuristics, we can perform a zero-downtime reparent in seconds to another node in the cluster, and automatically bring up a replacement volume.

"Zero-downtime" is explicit: no application-visible write failures during the reparent. "In seconds" clamps the impact window.

Structure

  T=0      T=Δ detect      T=Δ+r reparent      T=Δ+r+p provision
  ───────  ──────────────  ──────────────────  ───────────────────
  healthy  volume-health   cluster reparent    replacement
  stable   monitor         → promote replica   volume up
           classifies      → fence primary     → replication
           degraded        → traffic moves       catches up
                                               → cluster at
                                                 full replica count
  ↑                        ↑
  customer-visible p99     customer-visible p99
  normal                   normal (on new primary)

Customer-visible impact window = Δ (detection) + r (reparent). Both are measured in seconds.
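As a worked example of that window: if the monitor polls every few seconds and requires several consecutive bad samples before classifying (an assumed heuristic shape, not the article's exact one), Δ is the product of the two, and the window is Δ + r.

```python
def impact_window_s(poll_interval_s: float,
                    consecutive_bad_samples: int,
                    reparent_s: float) -> float:
    """Worst-case customer-visible window: detection delay Δ plus reparent time r.

    Assumes the monitor classifies a volume as degraded only after
    `consecutive_bad_samples` bad polls in a row (a hypothetical
    detection rule used for illustration).
    """
    detect_s = poll_interval_s * consecutive_bad_samples  # Δ
    return detect_s + reparent_s                          # Δ + r

# e.g. 5 s polls, 3 consecutive bad samples, 10 s reparent → 25 s window
print(impact_window_s(5, 3, 10))
```

Tightening thresholds shrinks Δ but raises the false-positive rate, which feeds directly into the churn trade-off below.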

When to use

  • Sharded replicated cluster. Any architecture where a replica is already live, caught up, and ready to promote — Vitess/MySQL, Postgres streaming replication, Cassandra, MongoDB replica sets, CockroachDB. Reparent is cheap.
  • Substrate with variance-floor failure mode. EBS + gp3 / io2 on OLTP. The pattern is overkill on storage substrates that don't produce frequent partial-failure events.
  • Automated-health-monitoring pattern deployed. Reparent is the downstream action; detection is the upstream prerequisite.

When not to use

  • Single-master topologies with no live replica. Reparent requires a replica to promote. Single-master workloads have to wait for a fresh node boot + catch-up.
  • Workloads where reparent itself is expensive. If reparenting disrupts the application (topology pinning, connection-pinning to primary) more than the degradation event, the fix is worse than the disease.
  • Already on a storage substrate without variance floor. On direct-attached NVMe with replication, the trigger rarely fires; running unused reparent machinery is cheap, but it's operationally quieter if the health-monitor thresholds rarely trip. See systems/planetscale-metal + patterns/shared-nothing-storage-topology.

Trade-offs

  • Reparent adds cluster churn. Each event = topology change + connection reset + a period of lag on the new primary. Keep event rate low by tuning health-monitor thresholds.
  • Correlated failure breaks the pattern. If the reparent target is in the same AZ as the degraded primary and the event is a correlated-AZ-failure (see concepts/correlated-ebs-failure), the replica is also degraded. Mitigation: cross-AZ replication, or structural fix via patterns/shared-nothing-storage-topology.
  • Detect-then-act latency is a floor. The pattern can only shrink the impact window; it cannot eliminate it. The degradation event is already happening when the monitor classifies, so Δ seconds of degraded traffic always leak through.
  • Requires live, caught-up replicas at all times. The replication-lag SLO becomes part of the reparent SLO.
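The last two trade-offs (correlated-AZ failure and the lag SLO) both constrain which replica is a valid reparent target. A hedged Python sketch of candidate selection; the `(name, az, lag_s)` tuple shape and the helper itself are hypothetical, not a real Vitess API:

```python
def pick_promotion_target(replicas, degraded_az, lag_slo_s=5.0):
    """Choose a reparent target that respects the replication-lag SLO and
    prefers a different AZ than the degraded primary, to sidestep
    correlated-AZ failure. `replicas` is a list of (name, az, lag_s)
    tuples (an assumed shape for illustration).
    """
    # A replica violating the lag SLO can't be promoted without losing commits.
    eligible = [r for r in replicas if r[2] <= lag_slo_s]
    if not eligible:
        return None  # lag SLO blown: the reparent SLO is blown with it
    # Prefer cross-AZ candidates; among those, the least-lagged one.
    eligible.sort(key=lambda r: (r[1] == degraded_az, r[2]))
    return eligible[0]
```

This is why the replication-lag SLO is part of the reparent SLO: when every replica is over the lag budget, the function has no safe answer and the automation must stop and page.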

Relationship to structural fix

Reparent-on-degradation + automated volume health monitoring is the customer-side operational mitigation for running OLTP on network-attached block storage. The structural fix — used by systems/planetscale-metal — is to put each database instance on direct-attached NVMe, so that one node's hardware failure doesn't affect others. Reparent is still needed on Metal for node failure, but the event rate drops dramatically because the variance floor does.

Seen in
