

Correlated EBS failure within an availability zone

Definition

Correlated EBS failure within an AZ is the observation that EBS volume performance degradation is not independent across volumes — a single underlying networking, control-plane, or storage-fabric event can push many volumes in the same availability zone into degraded-performance mode simultaneously.

"We also see these frequently as correlated failure inside of a single zone, even using io2 volumes." (Source: sources/2025-03-18-planetscale-the-real-failure-rate-of-ebs; post includes a correlated-failure screenshot.)

Why it contradicts the naive design assumption

The default replication design for OLTP-on-AWS is:

  • 3 volumes (1 primary + 2 replicas) inside an AZ for fast local replication and low write latency.
  • Replicas on different volumes (implied different failure domains at the block-storage layer).

The naive assumption is that "different volumes" implies "independent failure". Correlated AZ failure breaks that assumption: a correlated degradation event hits all three volumes in the same AZ at once, so the primary plus both replicas drop into degraded mode together. Switching a query from a degraded primary to a degraded replica buys nothing.
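The shift in failure math can be sketched with a toy probability model. All numbers here are hypothetical placeholders, not figures from the PlanetScale post:

```python
# Illustrative probabilities for one observation window; values are made up.
p_volume = 0.001   # chance an individual volume degrades on its own
p_event = 0.0005   # chance of an AZ-wide correlated fabric/control-plane event

# Naive model: volumes fail independently, so losing primary + both
# replicas requires three coincident independent failures.
p_all_three_independent = p_volume ** 3

# Correlated model: a single shared-fate event degrades every volume in
# the AZ, so the whole replica set is lost whenever the event fires
# (plus the now-negligible independent-coincidence term).
p_all_three_correlated = p_event + (1 - p_event) * p_volume ** 3

print(p_all_three_independent)  # 1e-09
print(p_all_three_correlated)   # ~5e-04, dominated by the shared-fate term
```

The point of the sketch: once a shared-fate term exists, it dominates the replica-set loss probability by orders of magnitude, and adding more same-AZ replicas only shrinks the already-negligible independent term.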

Why io2 doesn't fix it

io2 is AWS's top-tier single-volume offering (4×–10× the gp3 price, 99.999% durability SLA). Its guarantees cover single-volume durability and a per-volume IOPS ceiling. It does not re-architect the shared storage-fabric path — the control plane, networking, and target adapters are still shared across the AZ. A fabric-layer event lands on io2 and gp3 alike.

"even using io2 volumes" is the load-bearing observation in the PlanetScale post. It rules out the "just pay for io2" escape hatch.

Structural implication

If replicas on different volumes in the same AZ are not independent, then fault isolation on network-attached block storage requires one of:

  1. Cross-AZ replication — pay the write latency tax for synchronous replicas in another AZ (classic Multi-AZ RDS).
  2. Cross-region replication — even more latency, operational complexity.
  3. Don't share the storage fabric. Use direct-attached NVMe per instance, replicate at the cluster application layer. See patterns/direct-attached-nvme-with-replication + patterns/shared-nothing-storage-topology + systems/planetscale-metal.

PlanetScale picks option 3 for Metal: each node has its own local drive; the only shared substrate between primary and replicas is the EC2 instance fleet + network, not a block-storage fabric.
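The difference between the three options can be sketched as a toy placement model. The class, the `storage` labels, and the AZ strings are hypothetical illustrations, not AWS API objects:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Replica:
    az: str
    storage: str  # "ebs" = shared per-AZ fabric; "local-nvme" = per-instance drive

def shares_storage_fabric(replicas):
    """True if two or more replicas sit on the same network-attached
    block-storage fabric, i.e. share a correlated-failure domain."""
    fabrics = [r.az for r in replicas if r.storage == "ebs"]
    return len(fabrics) != len(set(fabrics))

same_az_ebs = [Replica("us-east-1a", "ebs")] * 3
cross_az_ebs = [Replica(az, "ebs") for az in ("us-east-1a", "us-east-1b", "us-east-1c")]
local_nvme = [Replica("us-east-1a", "local-nvme")] * 3

print(shares_storage_fabric(same_az_ebs))   # True: one fabric event hits all three
print(shares_storage_fabric(cross_az_ebs))  # False: each replica is on a different AZ's fabric
print(shares_storage_fabric(local_nvme))    # False: no shared block-storage fabric at all
```

The last case is the Metal-style topology: the storage-fabric failure domain disappears entirely rather than being spread across AZs, which is why it avoids both the correlated-degradation mode and the cross-AZ write-latency tax.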

Caveats

  • PlanetScale doesn't publish a measured frequency or a cluster enumeration of the events. The screenshot is a single example. The qualitative claim — "we see these frequently" — is the only quantifier in the post.
  • The correlation mechanism isn't identified. Candidates are control-plane failover, shared networking-gear failover, or storage-fabric hot-spotting. The 1–10-minute typical event duration PlanetScale reports is consistent with all three mechanisms.