
PLANETSCALE 2025-03-18


PlanetScale — The Real Failure Rate of EBS

Summary

PlanetScale's Nick Van Wiggeren publishes a production retrospective on the real-world failure rate of Amazon EBS, observed across "millions of volumes … tens of thousands created and destroyed every day". The core argument is that EBS's dominant failure mode at scale is performance degradation, not data loss, and performance degradation with no lower bound on delivered throughput is, to a customer-serving OLTP database, indistinguishable from a full outage.

AWS's own gp3 documentation promises "at least 90 percent of provisioned IOPS 99 percent of the time in a given year", which works out to 14 minutes per day or 86 hours per year of potential degraded operation. Production systems "are not built to handle this level of sudden variance", and even overprovisioning doesn't solve the problem because the floor is unbounded.

Van Wiggeren then walks through the fleet-scale blast-radius multiplier: on a 256-shard database with one primary and two replicas per shard (768 gp3 volumes), and the charitable assumptions that each volume degradation lasts 10 minutes with a random throughput loss between 1% and 89%, and that the application tolerates a 50% loss, "there is a 99.65% chance you have at least one node experiencing a production-impacting event at any given time." io2, AWS's 4×–10× premium SSD tier, doesn't save you either: you'd still expect to be "in a failure condition roughly one third of the time in any given year on just that one database!" On top of independent-volume failures, PlanetScale observes correlated failures across an entire availability zone, even on io2.

PlanetScale's mitigations are automated volume-health monitoring (read/write latency, idle %, simple write-a-file smoke tests), zero-downtime reparenting in seconds to a healthy node in the same cluster, and automatic replacement-volume provisioning, which together clamp the maximum impact window so the event is usually over before a human intervenes.
The structural fix is PlanetScale Metal: a shared-nothing architecture on local storage (direct-attached NVMe) with cluster-level replication providing durability — the rest of the shards and nodes stay healthy when one volume degrades, because they don't share a volume. The post is complementary to Dicken's 2025-03-13 IO devices and latency piece on the same product launch: that post makes the latency argument against network-attached storage; this one makes the reliability argument.
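The summary's numbers can be sanity-checked with a few lines of arithmetic. A minimal sketch under the post's stated assumptions — note the exact inputs behind the 99.65% figure aren't published, so this lands near, not exactly on, that number:

```python
# Back-of-envelope reproduction of the post's arithmetic, under its stated
# assumptions: gp3 may run below 90% of provisioned IOPS 1% of the time,
# event severity is uniform in 1%-89%, the app errors past a 50% throughput
# loss, and each event lasts 10 minutes.

degraded_frac = 0.01                          # gp3 SLO: degraded 1% of the time
print(degraded_frac * 24 * 60)                # 14.4 min/day of potential impact
print(degraded_frac * 24 * 365)               # 87.6 h/year (the post rounds to 86)

event_min = 10
events_per_month = degraded_frac * 30 * 24 * 60 / event_min   # 43.2
impacting = events_per_month * 0.5            # ~21.6: roughly half of a 1%-89%
                                              # uniform draw exceeds 50%
print(events_per_month, impacting)

# Fleet view: P(at least one of n volumes impacting at a given instant).
# The post quotes 99.65% for 768 volumes; its exact per-volume input isn't
# published, but the same shape of calculation lands in the same range:
n = 768
p = degraded_frac * 0.5                       # per-volume P(impacting now)
p_fleet = 1 - (1 - p) ** n
print(round(p_fleet, 4))                      # ~0.98 with these inputs
```

Strictly, 39/88 ≈ 44% of a uniform 1%–89% draw exceeds 50%; using that factor instead of 0.5 lowers the per-volume impacting count to about 19 per month. Either way the conclusion is the same order.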

Key takeaways

  1. EBS's documented SLO encodes its dominant failure mode. "When attached to an EBS-optimized instance, General Purpose SSD (gp2 and gp3) volumes are designed to deliver at least 90 percent of their provisioned IOPS performance 99 percent of the time in a given year." "This means a volume is expected to experience under 90% of its provisioned performance 1% of the time. That's 14 minutes of every day or 86 hours out of the year of potential impact. This rate of degradation far exceeds that of a single disk drive or SSD." (Source: article §"Here's what 'slow' looks like" quoting EBS docs.) Canonicalised on the wiki as concepts/performance-variance-degradation.

  2. "Slow" is the failure mode, not "broken". "While full failure and data loss is very rare with EBS, 'slow' is often as bad as 'failed', and that happens much much more often." Worked example: a volume steady for 10 hours at 67% idle + single-digit-ms write latency suddenly spikes to 200ms–500ms/op at ~16:00, idle races to zero, "the volume is effectively blocked from reading and writing data." To the application = failure; to the user = a 500 after a 10-second wait; to the operator = an incident. "At PlanetScale, we consider this full failure because our customers do." Canonicalised as concepts/slow-is-failure.

  3. Even short degradation windows break real-time workloads. PlanetScale reports typical event duration of 1–10 minutes: "This is likely the time needed for a failover in a network or compute component." A few-second blip on a steady volume is enough to create a database blip (the post shows both AWS Console + database-side graphs). Canonicalised as concepts/blip-induced-incident (via concepts/slow-is-failure).

  4. Expected-events-per-volume arithmetic: ~43 events/month, ~21 impacting. Van Wiggeren's back-of-envelope: assume each degradation event is random with 1%–89% reduction, your app tolerates 50% throughput loss before erroring, and each event lasts 10 minutes — "every volume would experience about 43 events per month, with at least 21 of them causing downtime!" (Source: article §"if each individual failure event lasts 10 minutes".) Canonical wiki arithmetic for concepts/performance-variance-degradation at single-volume granularity.

  5. Fleet-scale multiplier: 768 volumes → 99.65% chance of an active impacting event at any moment. "In a large database composed of many shards, this failure compounds. Assume a 256 shard database where each shard has one primary and two replicas: a total of 768 gp3 EBS volumes provisioned. If we take the 50% threshold from above, there is a 99.65% chance you have at least one node experiencing a production-impacting event at any given time." Canonicalised as concepts/blast-radius-multiplier-at-fleet-scale.

  6. io2 buys nines but not immunity. "Even if you use io2, which AWS sells at 4x to 10x the price, you'd still be expected to be in a failure condition roughly one third of the time in any given year on just that one database!" io2 is AWS's most durable single-volume tier (99.999% durability SLA); performance-tier upgrade does not flatten the variance floor on a large fleet.

  7. Correlated-failure-within-AZ happens on io2 too. "To make matters worse, we also see these frequently as correlated failure inside of a single zone, even using io2 volumes." Post includes a correlated-failure screenshot. Canonicalised as concepts/correlated-ebs-failure — complements the existing wiki primitive concepts/correlated-failure on the EBS axis, and contradicts the naive "replicas across volumes in the same AZ eliminate shared fate" assumption.

  8. At fleet scale the rate is 100%. "With enough volumes, the rate of experiencing EBS failure is 100%: our automated mitigations are consistently recycling underperforming EBS volumes to reduce customer-impact, and we expect to see multiple events on a daily basis." Canonicalised as the load-bearing fleet-operations datum on systems/aws-ebs + companies/planetscale.

  9. Mitigations clamp the maximum impact window, not the failure rate. PlanetScale's automated volume-health monitoring watches read/write latency + idle % + simple write-a-file smoke tests; when a volume crosses heuristics, PlanetScale performs a zero-downtime reparent in seconds to another node in the cluster and brings up a replacement volume automatically. "This doesn't reduce the impact to zero, as it's impossible to detect this failure before it happens, but it does ensure the majority of the cases don't require a human to remediate and are over before users notice." Canonical wiki instance of patterns/automated-volume-health-monitoring + patterns/zero-downtime-reparent-on-degradation.

  10. Metal is the structural fix: shared-nothing on local NVMe. "This is why we built PlanetScale Metal. With a shared-nothing architecture that uses local storage instead of network-attached storage like EBS, the rest of the shards and nodes in a database are able to continue to operate without problem." Canonical wiki instance of patterns/shared-nothing-storage-topology + patterns/direct-attached-nvme-with-replication. Complements the 2025-03-13 Metal announcement's latency argument with the reliability argument.
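PlanetScale doesn't publish its monitor's implementation; a minimal sketch of the heuristics the post describes (write-a-file smoke test plus a sustained-latency check) might look like the following. All thresholds, names, and paths here are illustrative assumptions, not PlanetScale's actual values:

```python
import os
import tempfile
import time

# Hypothetical sketch of the volume-health heuristics described in the post:
# a simple write-a-file smoke test plus a check on recent write latency.
# SMOKE_TIMEOUT_S and LATENCY_BUDGET_MS are illustrative assumptions.

SMOKE_TIMEOUT_S = 2.0      # assumed: a healthy volume fsyncs far faster
LATENCY_BUDGET_MS = 50.0   # assumed: flag sustained per-op latency above this

def smoke_test_write(mount_point: str) -> float:
    """Write and fsync a small file on the volume; return elapsed seconds."""
    start = time.monotonic()
    fd, path = tempfile.mkstemp(dir=mount_point)
    try:
        os.write(fd, b"x" * 4096)
        os.fsync(fd)                      # force the write through to the device
    finally:
        os.close(fd)
        os.unlink(path)
    return time.monotonic() - start

def volume_is_healthy(mount_point: str, recent_write_ms: list[float]) -> bool:
    """Combine the post's heuristics: smoke test plus observed write latency."""
    if smoke_test_write(mount_point) > SMOKE_TIMEOUT_S:
        return False
    if recent_write_ms:
        # Sustained-latency check: median of recent per-op write latencies.
        median = sorted(recent_write_ms)[len(recent_write_ms) // 2]
        if median > LATENCY_BUDGET_MS:
            return False
    return True
```

In production this check would feed an orchestrator that reparents to a healthy replica and provisions a replacement volume — the part the sketch deliberately omits, since the post gives no detail on it.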

Architectural numbers

  • gp3 SLO: "at least 90% of provisioned IOPS 99% of the time" — 14 min/day or 86 h/year of potential degraded operation.
  • Typical event duration: 1–10 minutes ("likely the time needed for a failover in a network or compute component").
  • Observed event severity example: steady → 200–500 ms/op write latency, 67% idle → 0% idle.
  • Per-volume expected event count: ~43/month, ~21 impacting under 50% tolerance + 10-min events assumption.
  • Fleet size in example: 256 shards × (1 primary + 2 replicas) = 768 gp3 volumes.
  • Fleet-wide probability of at least one active impacting event: 99.65% under the above assumptions.
  • io2 baseline impact rate on same fleet: "roughly one third of the time in any given year on just that one database" — i.e. ~33%.
  • io2 price premium: 4×–10× gp3.
  • Mitigation SLO: zero-downtime reparent completes in seconds.
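The io2 "roughly one third of the time" figure can be approximately reproduced if you assume io2's documented performance consistency of 99.9% (versus gp3's 99%) and keep the same 50% tolerance filter — an assumption on our part, since the post doesn't show its inputs:

```python
# Fleet-wide impact fraction for io2 on the same 768-volume fleet.
# Assumption: io2 delivers >= 90% of provisioned IOPS 99.9% of the time,
# and roughly half of degradation events exceed the 50% tolerance.
n = 768
p_io2 = 0.001 * 0.5               # per-volume P(impacting) under the assumptions
p_fleet = 1 - (1 - p_io2) ** n
print(round(p_fleet, 3))          # ~0.319: close to the post's "one third"
```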

Caveats

  • Back-of-envelope, not measured numbers. The 43 events/month, 21 impacting, and 99.65% figures are arithmetic under stated assumptions (50% app tolerance, 10-minute event length, 1%–89% uniform severity). Real EBS events don't follow that uniform distribution, and PlanetScale doesn't publish measured distributions — these are pedagogical upper bounds.
  • AWS has specific engineering answers to this. AWS's own EBS team has spent a decade shrinking tail latency via Nitro, SRD, and Nitro SSDs — see sources/2024-08-22-allthingsdistributed-continuous-reinvention-block-storage-at-aws. The structural 1%/14-min-per-day SLO is AWS's published lower bound on guaranteed performance; actual observed performance on the median volume is much better. Van Wiggeren's post is the view from "fleet-scale customer facing impact", not average-volume telemetry.
  • 50% application-tolerance threshold is an input assumption. A database that tolerates 80% throughput loss before erroring sees a much smaller fraction of events as customer-impacting; the 99.65% figure depends on the chosen tolerance.
  • Correlated-AZ-failure frequency not quantified. Post says "we see these frequently" with a screenshot but gives no rate, distribution across AZs/regions, or trigger analysis. The claim that this holds "even on io2" is stated verbatim without a sample-size figure.
  • Metal's own failure modes not discussed. The post argues Metal is structurally better than EBS for failure isolation but does not discuss Metal's own failure surface: local NVMe drive failure, instance termination, noisy neighbours on the EC2 side, replication-lag under a correlated-AZ-power event, cross-replica consistency during zero-downtime reparent. The complementary IO devices and latency post argues durability comes from replication (3-node cluster → 1-in-a-million), but doesn't cover correlated failure modes.
  • Not a comparison against hyperscaler-internal storage. Internal AWS services like S3 and DynamoDB sit on similar underlying hardware but run their own storage software stacks — EBS is the customer-facing block device, not AWS's internal storage primitive.
  • Post is PlanetScale marketing-adjacent. The piece is published against a PlanetScale Metal launch window and explicitly ends with the Metal pitch. The EBS-reliability framing is accurate (and the AWS docs quote is verbatim), but the framing choice ("is 14 min/day the ceiling or the floor?") is shaped by the launch context.
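The "1-in-a-million" reference above (from the companion latency post) is independent-failure arithmetic; a hedged sketch, where the per-replica loss probability is an illustrative assumption rather than a published figure:

```python
# Independence arithmetic behind a "3 copies -> ~1-in-a-million" durability
# claim. p_copy_loss is an illustrative assumption, not a PlanetScale number.
p_copy_loss = 0.01                  # assumed per-replica loss probability
p_all_three = p_copy_loss ** 3      # all three replicas lost together
print(p_all_three)                  # ~1e-06
# As the caveat notes, this assumes independent failures; a correlated
# AZ-level event (power, storage fabric) breaks that assumption.
```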

Cross-source continuity

  • Sibling of sources/2025-03-13-planetscale-io-devices-and-latency — same Metal-launch window, same thesis framing (network-attached storage is the wrong default for OLTP) from the complementary reliability angle vs that post's latency angle. Together they canonicalise the full Metal argument on the wiki: one post on the latency floor, one on the reliability floor, and both pointing at local NVMe + replication as the answer. These are the third and fourth first-party PlanetScale ingests on the wiki (after the 2024-09-09 Dicken B-trees post and 2024-10-22 Vectors beta + 2025-02-04 Lambert Slotted Counter).
  • Counterpoint to sources/2024-08-22-allthingsdistributed-continuous-reinvention-block-storage-at-aws — AWS's own 13-year EBS retrospective is the inside view of shrinking this exact failure window via queueing-theory rigour, Nitro offload, SRD replacing TCP, and custom Nitro SSDs. Van Wiggeren's post is the outside view — fleet-scale customer observation that even after all that engineering, the variance floor is still customer-visible on a large database. The two together canonicalise the "close the gap" vs "skip the gap" architectural debate on the wiki as a named axis.
  • Complements concepts/noisy-neighbor — EBS degradation at the client-visible layer is ultimately a noisy-neighbor problem at the block-storage fabric, constrained by AWS's performance-isolation architecture. PlanetScale's patterns/automated-volume-health-monitoring is the customer-side mitigation for noisy neighbours on a substrate the customer can't isolate itself.
  • Extends concepts/correlated-failure to network-attached block storage — prior wiki instances cover AZ-level application failures, region-level DNS failures, and power-domain failures. This adds the correlated-performance-degradation-within-AZ variant at the storage-fabric layer, with the surprising datum that io2 does not eliminate it.
  • Extends systems/aws-ebs — adds the customer-fleet operational view of gp3/io2 failure frequency.
