PATTERN
# Zero-downtime reparent on degradation
## Problem
When the health monitor (see patterns/automated-volume-health-monitoring) classifies an EBS volume as degraded, the database process attached to that volume is degraded along with it: read/write latency becomes customer-visible. The mitigation has to:
1. Drain customer traffic off the sick node in seconds, because every second of degraded operation means elevated latency or 5xx errors served to users.
2. Maintain data safety: no lost commits, no torn writes, no split-brain with the sick node still accepting writes.
3. Restore fleet capacity: bring up a replacement volume on a healthy node so the cluster goes back to full replica count.
4. Require no human in the loop, because at fleet scale these events happen multiple times per day.
Manual failover playbooks don't meet (1) or (4); dumb instance-level health checks don't meet (2).
## Solution
Automated reparent in seconds on a topology-aware cluster, triggered by the health monitor. The steps:
- Elect a new primary (or promote a replica) on a node with a healthy EBS volume.
- Flip traffic at the cluster topology layer. In Vitess, this is `PlannedReparentShard` or `EmergencyReparentShard`.
- Fence the sick node so it stops accepting writes, avoiding split-brain if the degraded volume recovers mid-reparent.
- Provision replacement volume on a fresh node to restore replica count, and let the standard replication protocol catch up the replacement.
- Retire the degraded volume once confirmed drained.
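The steps above can be sketched as a toy orchestration routine. Everything here (`Node`, `Cluster`, the method names) is a hypothetical in-memory model for illustration, not PlanetScale's or Vitess's actual API:

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    healthy: bool = True
    fenced: bool = False        # fenced nodes reject all writes
    replica_lag_s: float = 0.0  # replication lag behind the primary

@dataclass
class Cluster:
    primary: Node
    replicas: list

    def reparent_on_degradation(self, provision_replacement):
        # 1. Elect: pick the healthiest, most caught-up replica.
        candidates = [r for r in self.replicas if r.healthy]
        if not candidates:
            raise RuntimeError("no healthy replica to promote")
        new_primary = min(candidates, key=lambda r: r.replica_lag_s)

        # 2. Flip traffic at the topology layer.
        old_primary = self.primary
        self.primary = new_primary
        self.replicas.remove(new_primary)

        # 3. Fence the sick node so it cannot accept writes if the
        #    degraded volume recovers mid-reparent (split-brain guard).
        old_primary.fenced = True

        # 4. Provision a replacement replica on a fresh node; standard
        #    replication catches it up, restoring full replica count.
        self.replicas.append(provision_replacement())

        # 5. Retire: hand the drained, fenced node back for teardown.
        return old_primary
```

In a real system the traffic flip and the fence must be atomic with respect to client routing (Vitess handles this inside the reparent operation); the sketch only shows the ordering and the fencing invariant.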
On PlanetScale's production fleet:

> When we detect that an EBS volume is in a degraded state using these heuristics, we can perform a zero-downtime reparent in seconds to another node in the cluster, and automatically bring up a replacement volume.
"Zero-downtime" is explicit: no application-visible write failures during the reparent. "In seconds" clamps the impact window.
## Structure
```
T=0            T=Δ detect           T=Δ+r reparent        T=Δ+r+p provision
───────        ──────────────       ────────────────      ──────────────────
healthy,       volume-health        → promote replica     replacement
stable         monitor classifies   → fence primary         volume up
               volume degraded      → traffic moves       → replication
                                                            catches up
                                                          → cluster at
                                                            full replica
                                                            count

  ↑                                       ↑
customer-visible                   customer-visible
p99 normal                         p99 normal (on new primary)
```
Customer-visible impact window = Δ (detection) + r (reparent). Both are measured in seconds.
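The window arithmetic can be made explicit in a few lines. The numbers in the usage comment are illustrative assumptions, not PlanetScale measurements:

```python
def impact_window_s(detect_s: float, reparent_s: float) -> float:
    """Customer-visible window = Δ (detect) + r (reparent).

    Replacement provisioning (p) is off the critical path: traffic
    has already moved to the new primary before it starts, so it
    does not appear in the sum.
    """
    return detect_s + reparent_s

# e.g. an assumed 5 s to classify the volume degraded + 3 s to reparent
# gives an 8 s window of degraded traffic; tuning either term shrinks it.
```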
## When to use
- Sharded replicated cluster. Any architecture where a replica is already live, caught up, and ready to promote — Vitess/MySQL, Postgres streaming replication, Cassandra, MongoDB replica sets, CockroachDB. Reparent is cheap.
- Substrate with a variance-floor failure mode. EBS gp3/io2 under OLTP workloads. The pattern is overkill on storage substrates that don't produce frequent partial-failure events.
- Automated-health-monitoring pattern deployed. Reparent is the downstream action; detection is the upstream prerequisite.
## When not to use
- Single-master topologies with no live replica. Reparent requires a replica to promote. Single-master workloads have to wait for a fresh node boot + catch-up.
- Workloads where reparent itself is expensive. If reparenting disrupts the application (topology pinning, connection-pinning to primary) more than the degradation event, the fix is worse than the disease.
- Already on a storage substrate without variance floor. On direct-attached NVMe with replication, the trigger rarely fires; running unused reparent machinery is cheap, but it's operationally quieter if the health-monitor thresholds rarely trip. See systems/planetscale-metal + patterns/shared-nothing-storage-topology.
## Trade-offs
- Reparent adds cluster churn. Each event = topology change + connection reset + a period of lag on the new primary. Keep event rate low by tuning health-monitor thresholds.
- Correlated failure breaks the pattern. If the reparent target is in the same AZ as the degraded primary and the event is a correlated-AZ-failure (see concepts/correlated-ebs-failure), the replica is also degraded. Mitigation: cross-AZ replication, or structural fix via patterns/shared-nothing-storage-topology.
- Detect-then-act latency is a floor. The pattern can only shrink the impact window; it cannot eliminate it. The degradation event is already happening when the monitor classifies, so Δ seconds of degraded traffic always leak through.
- Requires live, caught-up replicas at all times. The replication-lag SLO becomes part of the reparent SLO.
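The last trade-off can be expressed as a gate: a replica is only a valid promotion target while its lag is inside the SLO, so the reparent SLO is only as good as the lag SLO. A minimal sketch, with a made-up threshold (not a PlanetScale number) and hypothetical function names:

```python
MAX_PROMOTABLE_LAG_S = 2.0  # assumed lag SLO; illustrative only

def promotable(replica_lags_s):
    """Indices of replicas that are safe reparent targets.

    A replica past the lag SLO either loses commits if promoted
    (unsafe) or must catch up first, which extends the customer-
    visible impact window.
    """
    return [i for i, lag in enumerate(replica_lags_s)
            if lag <= MAX_PROMOTABLE_LAG_S]

def reparent_ready(replica_lags_s):
    """The cluster can absorb a degradation event only while at
    least one replica is within the lag SLO."""
    return len(promotable(replica_lags_s)) > 0
```

An operator would alert on `reparent_ready` going false, since at that moment the pattern silently stops working even though nothing has failed yet.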
## Relationship to structural fix
Reparent-on-degradation + automated volume health monitoring is the customer-side operational mitigation for running OLTP on network-attached block storage. The structural fix — used by systems/planetscale-metal — is to put each database instance on direct-attached NVMe, so that one node's hardware failure doesn't affect others. Reparent is still needed on Metal for node failure, but the event rate drops dramatically because the variance floor does.
## Seen in
- sources/2025-03-18-planetscale-the-real-failure-rate-of-ebs — canonical wiki instance. PlanetScale's production deployment across millions of EBS volumes; reparent in seconds, replacement volume auto-provisioned, majority of events resolved before users notice.
## Related
- patterns/automated-volume-health-monitoring
- patterns/shared-nothing-storage-topology
- patterns/direct-attached-nvme-with-replication
- patterns/automatic-provider-failover
- concepts/slow-is-failure
- concepts/performance-variance-degradation
- concepts/correlated-ebs-failure
- concepts/blast-radius-multiplier-at-fleet-scale
- systems/aws-ebs
- systems/planetscale-metal
- systems/vitess