CONCEPT
Correlated EBS failure within an availability zone
Definition
Correlated EBS failure within an AZ is the observation that EBS volume performance degradation is not independent across volumes — a single underlying networking, control-plane, or storage-fabric event can push many volumes in the same availability zone into degraded-performance mode simultaneously.
"We also see these frequently as correlated failure inside of a single zone, even using io2 volumes." (Source: sources/2025-03-18-planetscale-the-real-failure-rate-of-ebs; post includes a correlated-failure screenshot.)
Why it contradicts the naive design assumption
The default replication design for OLTP-on-AWS is:
- 3 volumes (1 primary + 2 replicas) inside an AZ for fast local replication and low write latency.
- Replicas on different volumes (implying different failure domains at the block-storage layer).
The naive assumption is that "different volumes" ≈ "independent failure". Correlated-AZ-failure breaks that assumption: a correlated degradation event hits all three volumes in the same AZ at once, so the primary plus both replicas drop into degraded mode together. Switching a query from a degraded primary to a degraded replica buys nothing.
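The gap between the two models can be sketched with a toy Monte Carlo (a hedged illustration only — `p_vol` and `p_event` are invented numbers; the post discloses no measured rates):

```python
import random

def p_all_degraded(trials: int, p_vol: float, p_event: float, correlated: bool) -> float:
    """Monte Carlo estimate of P(primary AND both replicas degraded at once)."""
    hits = 0
    for _ in range(trials):
        if correlated:
            # Shared-fabric model: one AZ-wide event degrades all three volumes.
            hits += random.random() < p_event
        else:
            # Naive model: each volume degrades independently.
            hits += all(random.random() < p_vol for _ in range(3))
    return hits / trials

random.seed(0)
naive = p_all_degraded(200_000, p_vol=0.01, p_event=0.01, correlated=False)
shared = p_all_degraded(200_000, p_vol=0.01, p_event=0.01, correlated=True)
print(naive, shared)  # analytically: naive ~ p_vol**3 = 1e-6, shared ~ p_event = 1e-2
```

Under independence the cluster-wide hit rate is the per-volume rate cubed; under a shared fabric it collapses to the event rate itself — which is why "fail over to a replica" buys nothing during the event.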
Why io2 doesn't fix it
io2 is AWS's top-tier single-volume offering (4×–10× gp3 price, 99.999% durability SLA). Its guarantees cover single-volume durability and per-volume IOPS ceiling. It does not re-architect the shared storage-fabric path — the control plane, networking, and target adapters are still shared across the AZ. A fabric-layer event lands on io2 and gp3 alike.
"even using io2 volumes" is the load-bearing observation in the PlanetScale post. It rules out the "just pay for io2" escape hatch.
Structural implication
If replicas on different volumes in the same AZ are not independent failure domains, then fault isolation on network-attached block storage requires one of:
- Cross-AZ replication — pay the write latency tax for synchronous replicas in another AZ (classic Multi-AZ RDS).
- Cross-region replication — even more latency, operational complexity.
- Don't share the storage fabric. Use direct-attached NVMe per instance, replicate at the cluster application layer. See patterns/direct-attached-nvme-with-replication + patterns/shared-nothing-storage-topology + systems/planetscale-metal.
PlanetScale picks option 3 for Metal: each node has its own local drive; the only shared substrate between primary and replicas is the EC2 instance fleet + network, not a block-storage fabric.
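The topology difference reduces to a fault-domain check (a sketch; domain labels like "az-fabric" are mine, not AWS terms): a replica set survives a single-domain event only if a majority of replicas don't sit on the failed domain.

```python
def survives(replicas: list[set[str]], failed_domain: str) -> bool:
    """True iff a majority of replicas do not include the failed domain."""
    healthy = sum(failed_domain not in r for r in replicas)
    return healthy > len(replicas) // 2

# Three EBS volumes in one AZ: every replica shares the zone's storage fabric.
ebs_cluster = [{"ec2-fleet", "az-fabric"} for _ in range(3)]
# Metal-style: each node's local NVMe drive is its own failure domain.
nvme_cluster = [{"ec2-fleet", f"node-{i}-nvme"} for i in range(3)]

print(survives(ebs_cluster, "az-fabric"))     # False: fabric event hits all three
print(survives(nvme_cluster, "node-0-nvme"))  # True: quorum survives one drive
```

The shared substrates that remain in the Metal layout ("ec2-fleet", the network) are exactly the ones the page says primary and replicas still have in common.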
Sibling primitives on the wiki
- concepts/correlated-failure — general concept; this page is the EBS-fabric-layer instance.
- concepts/blast-radius-multiplier-at-fleet-scale — fleet-scale probability multiplier. Correlated-AZ-failure is a positive correlation term that makes that multiplier worse.
- concepts/noisy-neighbor — the within-volume version (shared-fabric variance from a co-tenant workload).
- concepts/availability-zone-failure-drill — the AZ-wide full-outage framing; this page is the partial-failure analogue within AZ.
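How the correlation term worsens the fleet-scale multiplier can be shown with back-of-envelope arithmetic (all numbers invented for illustration; neither the post nor this page states rates or fleet sizes):

```python
# M clusters share one AZ's EBS fabric; each cluster runs on 3 volumes.
M = 500
p_vol = 0.001    # assumed independent per-volume degradation probability
p_event = 0.001  # assumed probability of one shared-fabric event

# Naive model: a cluster is fully degraded only if all 3 of its volumes
# independently degrade; chance that ANY of the M clusters is fully degraded:
p_cluster_naive = p_vol ** 3
p_any_naive = 1 - (1 - p_cluster_naive) ** M

# Correlated model: one fabric event degrades every cluster in the zone at
# once, so the same probability is just the event rate — with blast radius M.
p_any_correlated = p_event

print(p_any_naive, p_any_correlated)
```

Same per-volume rate, but the correlated term dominates by orders of magnitude and arrives as one zone-wide incident rather than scattered single-volume blips.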
Seen in
- sources/2025-03-18-planetscale-the-real-failure-rate-of-ebs — canonical wiki instance. Observed "frequently" across PlanetScale's production fleet; stated verbatim to hold on io2 as well as gp3. No cross-AZ / cross-region correlation figures disclosed; no per-AZ incidence rate.
Caveats
- PlanetScale doesn't publish a measured frequency or an enumeration of affected clusters. The screenshot is a single example. The qualitative claim, "we see these frequently", is the only quantifier in the post.
- The correlation mechanism isn't identified. Candidates are control-plane failover, shared networking-gear failover, or storage-fabric hot-spotting. The 1–10-minute typical event duration PlanetScale reports is consistent with all three mechanisms.
Related
- concepts/correlated-failure
- concepts/performance-variance-degradation
- concepts/blast-radius-multiplier-at-fleet-scale
- concepts/slow-is-failure
- concepts/noisy-neighbor
- concepts/availability-zone-failure-drill
- systems/aws-ebs
- systems/planetscale-metal
- patterns/shared-nothing-storage-topology
- patterns/direct-attached-nvme-with-replication