

Correlated EBS failure within an availability zone

Definition

Correlated EBS failure within an AZ is the observation that EBS volume performance degradation is not independent across volumes — a single underlying networking, control-plane, or storage-fabric event can push many volumes in the same availability zone into degraded-performance mode simultaneously.

"We also see these frequently as correlated failure inside of a single zone, even using io2 volumes." (Source: sources/2025-03-18-planetscale-the-real-failure-rate-of-ebs; post includes a correlated-failure screenshot.)

Why it contradicts the naive design assumption

The default replication design for OLTP-on-AWS is:

  • 3 volumes (1 primary + 2 replicas) inside an AZ for fast local replication and low write latency.
  • Replicas on different volumes (implied different failure domains at the block-storage layer).

The naive assumption is that "different volumes" implies "independent failure". Correlated AZ failure breaks that assumption: a correlated degradation event hits all three volumes in the same AZ at once, so the primary plus both replicas drop into degraded mode together. Switching a query from a degraded primary to a degraded replica buys nothing.
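The shift in failure math can be sketched with a toy probability model. All numbers here are hypothetical placeholders, not figures from the PlanetScale post:

```python
# Illustrative probabilities for one observation window; values are made up.
p_volume = 0.001   # chance an individual volume degrades on its own
p_event = 0.0005   # chance of an AZ-wide correlated fabric/control-plane event

# Naive model: volumes fail independently, so losing primary + both
# replicas requires three coincident independent failures.
p_all_three_independent = p_volume ** 3

# Correlated model: a single shared-fate event degrades every volume in
# the AZ, so the whole replica set is lost whenever the event fires
# (plus the now-negligible independent-coincidence term).
p_all_three_correlated = p_event + (1 - p_event) * p_volume ** 3

print(p_all_three_independent)  # 1e-09
print(p_all_three_correlated)   # ~5e-04, dominated by the shared-fate term
```

The point of the sketch: once a shared-fate term exists, it dominates the replica-set loss probability by orders of magnitude, and adding more same-AZ replicas only shrinks the already-negligible independent term.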

Why io2 doesn't fix it

io2 is AWS's top-tier single-volume offering (4×–10× the gp3 price, 99.999% durability SLA). Its guarantees cover single-volume durability and a per-volume IOPS ceiling. It does not re-architect the shared storage-fabric path — the control plane, networking, and target adapters are still shared across the AZ. A fabric-layer event lands on io2 and gp3 alike.

"even using io2 volumes" is the load-bearing observation in the PlanetScale post. It rules out the "just pay for io2" escape hatch.

Structural implication

If replicas on different volumes in the same AZ are not independent, then fault isolation on network-attached block storage requires one of:

  1. Cross-AZ replication — pay the write latency tax for synchronous replicas in another AZ (classic Multi-AZ RDS).
  2. Cross-region replication — even more latency, operational complexity.
  3. Don't share the storage fabric. Use direct-attached NVMe per instance, replicate at the cluster application layer. See patterns/direct-attached-nvme-with-replication + patterns/shared-nothing-storage-topology + systems/planetscale-metal.

PlanetScale picks option 3 for Metal: each node has its own local drive; the only shared substrate between primary and replicas is the EC2 instance fleet + network, not a block-storage fabric.
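The difference between the three options can be sketched as a toy placement model. The class, the `storage` labels, and the AZ strings are hypothetical illustrations, not AWS API objects:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Replica:
    az: str
    storage: str  # "ebs" = shared per-AZ fabric; "local-nvme" = per-instance drive

def shares_storage_fabric(replicas):
    """True if two or more replicas sit on the same network-attached
    block-storage fabric, i.e. share a correlated-failure domain."""
    fabrics = [r.az for r in replicas if r.storage == "ebs"]
    return len(fabrics) != len(set(fabrics))

same_az_ebs = [Replica("us-east-1a", "ebs")] * 3
cross_az_ebs = [Replica(az, "ebs") for az in ("us-east-1a", "us-east-1b", "us-east-1c")]
local_nvme = [Replica("us-east-1a", "local-nvme")] * 3

print(shares_storage_fabric(same_az_ebs))   # True: one fabric event hits all three
print(shares_storage_fabric(cross_az_ebs))  # False: each replica is on a different AZ's fabric
print(shares_storage_fabric(local_nvme))    # False: no shared block-storage fabric at all
```

The last case is the Metal-style topology: the storage-fabric failure domain disappears entirely rather than being spread across AZs, which is why it avoids both the correlated-degradation mode and the cross-AZ write-latency tax.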

Caveats

  • PlanetScale doesn't publish a measured frequency or a cluster enumeration of the events. The screenshot is a single example. The qualitative claim — "we see these frequently" — is the only quantifier in the post.
  • The correlation mechanism isn't identified. Candidates are control-plane failover, shared networking-gear failover, or storage-fabric hot-spotting. The 1–10-minute typical event duration PlanetScale reports is consistent with all three mechanisms.