
CONCEPT

Performance-variance degradation

Definition

Performance-variance degradation is a failure mode where a storage / compute / network substrate delivers only a fraction of its rated performance for some fraction of time — and that fraction is built into the SLO (guaranteed by the provider): not a bug, not an outage. The canonical instance on this wiki is AWS EBS gp3:

When attached to an EBS-optimized instance, General Purpose SSD (gp2 and gp3) volumes are designed to deliver at least 90 percent of their provisioned IOPS performance 99 percent of the time in a given year. (EBS gp3 docs, quoted by sources/2025-03-18-planetscale-the-real-failure-rate-of-ebs.)

Translated: for 1% of the year, the volume can deliver as little as 0% of provisioned IOPS and still meet the SLO. That is about 14 minutes per day, or roughly 88 hours per year, of potential degraded operation.
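The allowed-degradation budget is straightforward arithmetic; a minimal sketch:

```python
# Degradation time budget implied by a 99%-of-the-time performance SLO.
SLO_FRACTION = 0.99

minutes_per_day = 24 * 60
hours_per_year = 365 * 24

budget_min_per_day = (1 - SLO_FRACTION) * minutes_per_day
budget_hr_per_year = (1 - SLO_FRACTION) * hours_per_year

print(f"{budget_min_per_day:.1f} min/day")  # 14.4 min/day
print(f"{budget_hr_per_year:.1f} hr/year")  # 87.6 hr/year
```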

The structural problem

The SLO floor is unbounded on the low side: the contract caps the fraction of time degradation is allowed, but sets no floor on delivered performance during those windows. In the PlanetScale post's framing:

This is not a secret, it's from the documentation. AWS doesn't describe how failure is distributed for gp3 volumes, but in our experience it tends to last 1–10 minutes at a time. This is likely the time needed for a failover in a network or compute component.

During the 1% of the year a volume is in the allowed-degradation window, delivered IOPS can be 1% of provisioned, 10%, 50%, or 89% — all inside the SLO. A workload sized for 100% of provisioned IOPS sees 1% delivered as a full outage.

Why overprovisioning doesn't fix it

"When there are no guarantees, even overprovisioning doesn't solve the problem." If you provision 2× the IOPS you need, and a degradation window drops delivery to 10% of provisioned, you see 20% of your original target — still below the 50%-of-target threshold at which the app errors. The substrate's floor doesn't scale with the ceiling.
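A sketch of that arithmetic under the post's illustrative assumptions (2× overprovisioning, a window delivering 10% of provisioned, an app that errors below 50% of its own target — the specific figures are illustrative, not a general rule):

```python
target_iops = 10_000           # what the application needs (hypothetical figure)
provisioned = 2 * target_iops  # 2x overprovisioned volume

severity = 0.10                # degradation window delivers 10% of provisioned
delivered = provisioned * severity

fraction_of_target = delivered / target_iops
print(f"{fraction_of_target:.0%} of target")  # 20% of target -- below the 50% error threshold
```

Because the degraded floor is a fraction of *provisioned* capacity and that fraction is unbounded below, no fixed overprovisioning multiple can guarantee staying above a fixed fraction of the workload's target.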

Expected-events arithmetic

PlanetScale's back-of-envelope for a single volume, assuming each degradation event is independent, lasts 10 minutes, has severity uniformly distributed between 1% and 89%, and that the application errors at a 50% throughput loss:

  • ~43 events/month total.
  • ~21 events/month cross the application tolerance threshold — i.e. customer-impacting.

This is the per-volume basis figure that drives the fleet-scale multiplier on concepts/blast-radius-multiplier-at-fleet-scale.
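The events-per-month figure falls out of the time budget directly; a sketch under the post's stated assumptions (1% of time degraded, 10 minutes per event):

```python
# Per-volume degradation event rate implied by the SLO time budget.
DEGRADED_FRACTION = 0.01      # SLO allows 1% of time out of spec
EVENT_MINUTES = 10            # assumed event duration (from the post)
MINUTES_PER_MONTH = 30 * 24 * 60

events_per_month = DEGRADED_FRACTION * MINUTES_PER_MONTH / EVENT_MINUTES
print(round(events_per_month))  # ~43

# The post reports ~21 of these crossing the application's 50% threshold --
# roughly half, with the exact split depending on the severity distribution.
```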

io2: more nines, same shape

AWS's io2 volumes are sold at 4×–10× the price of gp3 and carry a 99.999% durability SLA. PlanetScale's post says you'd still be "in a failure condition roughly one third of the time in any given year on just that one database" under the same fleet assumptions (256 shards × 3 = 768 volumes). Moving tiers buys more nines on data loss; it does not flatten the performance-variance floor enough to eliminate customer-visible events at fleet scale.
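The fleet multiplier is the usual independence arithmetic: if each volume is outside its performance guarantee a fraction (1 − p) of the time, a fleet of n independent volumes has at least one degraded volume 1 − pⁿ of the time. A sketch (p = 0.9995 is an illustrative per-volume compliance figure chosen to match the post's "one third" claim, not AWS's published io2 number):

```python
def fleet_degraded_fraction(per_volume_compliance: float, n_volumes: int) -> float:
    """Fraction of time at least one of n independent volumes is degraded."""
    return 1 - per_volume_compliance ** n_volumes

# 256 shards x 3 replicas = 768 volumes (the post's fleet)
print(f"{fleet_degraded_fraction(0.9995, 768):.0%}")  # ~32% -- roughly one third
```

This is why more nines per volume never fully cancels fleet scale: the exponent n keeps amplifying whatever per-volume variance remains.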

Why it exists

"This rate of degradation far exceeds that of a single disk drive or SSD. This is the cost of separating storage and compute and the sheer complexity of the software and networking components between the client and the backing disks for the volume."

The primitive underneath is concepts/compute-storage-separation + network hop: variance is the queueing-theory tail of the path client → storage adapter → storage fabric → target adapter → media. AWS has spent a decade shrinking this variance via Nitro offload, SRD replacing TCP, and custom Nitro SSDs — see sources/2024-08-22-allthingsdistributed-continuous-reinvention-block-storage-at-aws. The variance floor has shrunk; it has not vanished.
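One way to see why the networked path widens the tail: if each component on the path independently has a slow moment some small fraction of the time, the chance a request traverses the whole path cleanly shrinks geometrically with path length. A toy illustration (the per-hop 1% figure is illustrative, not measured):

```python
def p_any_hop_slow(p_slow_per_hop: float, hops: int) -> float:
    """Probability at least one hop on the path is in a slow moment."""
    return 1 - (1 - p_slow_per_hop) ** hops

# client -> storage adapter -> storage fabric -> target adapter -> media
print(f"{p_any_hop_slow(0.01, 5):.1%}")  # ~4.9%: five 1% tails compound
```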

Customer-side mitigations
