PATTERN
Automated volume-health monitoring¶
Problem¶
Network-attached block storage (EBS and its equivalents) exhibits performance-variance degradation as a first-class failure mode — not an outage. The provider's health checks report only catastrophic events: volume unmounted, EC2 instance gone. The customer-visible failure mode — a 100 ms → 500 ms/op latency spike that lasts 1–10 minutes — is not surfaced by provider-side health checks.
Because at fleet scale the probability of at least one active degradation event is near 1 (see concepts/blast-radius-multiplier-at-fleet-scale), manual human paging does not scale — multiple events per day, each lasting minutes, on a fleet of hundreds of volumes.
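The fleet-scale claim follows from the standard independence approximation. A minimal sketch, using the 1%-per-volume event rate mentioned under "When to use" and a hypothetical 300-volume fleet (illustrative numbers, not figures from the post):

```python
def p_any_event(p_per_volume: float, n_volumes: int) -> float:
    """Probability that at least one volume in the fleet has an active
    degradation event, assuming independent per-volume events."""
    return 1.0 - (1.0 - p_per_volume) ** n_volumes

# With a 1% per-volume event rate and 300 volumes, some volume is
# degraded almost every period: p_any_event(0.01, 300) ≈ 0.95.
```

At that probability, "page a human per event" stops being an operational model and becomes a standing queue.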
The customer needs a mitigation that runs in-band with the database / application fleet, detects partial failure faster than the provider reports it, and triggers automated recovery.
Solution¶
Run in-application heuristic volume-health monitoring that watches the volume's client-side metrics — the ones that correlate with customer-facing performance — and makes a binary "this volume is degraded" classification on a timescale of seconds to tens of seconds.
PlanetScale's disclosed heuristics:
- Read/write latency. The post's example: 200–500 ms/op write latency on a volume that had been doing single-digit-ms is the strong signal.
- Idle %. A healthy volume at 67% idle dropping to 0% idle is corroborating evidence.
- Simple synthetic smoke tests. "basic tests like making sure we can write to a file" — a 100-byte write to a test file that normally completes in <1 ms suddenly taking >100 ms is a volume-is-sick signal.
These heuristics are noisy individually but cheap and specific in combination. They intentionally use client-side latency (what the customer sees) rather than provider-reported metrics (what AWS defines its SLOs against).
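The smoke-test heuristic can be sketched as a timed synchronous write. The probe path, 100-byte payload, and thresholds are illustrative assumptions, not PlanetScale's disclosed implementation:

```python
import os
import time

def smoke_test_write(path: str, payload: bytes = b"x" * 100) -> float:
    """Time a small synchronous write to the volume under test.
    Returns the observed latency in milliseconds."""
    start = time.monotonic()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    try:
        os.write(fd, payload)
        os.fsync(fd)  # force the write through the page cache to the block device
    finally:
        os.close(fd)
    return (time.monotonic() - start) * 1000.0
```

A healthy volume completes this in well under a millisecond; the signal the post describes is the same write suddenly taking >100 ms.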
Once the volume is classified degraded, the monitor triggers the companion mitigation pattern patterns/zero-downtime-reparent-on-degradation — which actually moves traffic off the degraded volume.
Structure¶
┌─────────────────────────────────────────────────┐
│ Database process on EC2 instance with EBS vol │
│ │
│ ┌─────────────────────────────────────────┐ │
│ │ Volume-health monitor (in-process or │ │
│ │ sidecar, sampling 1/s): │ │
│ │ - moving-window p99 write latency │ │
│ │ - moving-window read latency │ │
│ │ - idle % via iostat │ │
│ │ - synthetic smoke test (write 100 B, │ │
│ │ time it) │ │
│ │ │ │
│ │ If (p99 > thresh) && (smoke > thresh)  │ │
│ │ for W seconds:                          │ │
│ │ → classify DEGRADED │ │
│ │ → trigger reparent │ │
│ └─────────────────────────────────────────┘ │
└─────────────────────────────────────────────────┘
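The classifier in the diagram can be sketched as follows. Thresholds, window size, and the sustain interval W are illustrative placeholders; PlanetScale has not disclosed its exact implementation:

```python
import time
from collections import deque

class VolumeHealthMonitor:
    """Binary degraded/healthy classifier over client-side latency samples.
    All thresholds are hypothetical tuning knobs."""

    def __init__(self, p99_threshold_ms: float = 100.0,
                 smoke_threshold_ms: float = 100.0,
                 window_samples: int = 60,
                 sustain_seconds: float = 10.0):
        self.p99_threshold_ms = p99_threshold_ms
        self.smoke_threshold_ms = smoke_threshold_ms
        self.samples = deque(maxlen=window_samples)  # moving window of write latencies
        self.sustain_seconds = sustain_seconds       # W in the diagram
        self._breach_started = None                  # when both signals first breached

    def _p99(self) -> float:
        ordered = sorted(self.samples)
        return ordered[int(0.99 * (len(ordered) - 1))]

    def observe(self, write_latency_ms: float, smoke_latency_ms: float,
                now: float = None) -> bool:
        """Feed one sample (nominally 1/s); returns True once the volume
        should be classified DEGRADED and the reparent triggered."""
        now = time.monotonic() if now is None else now
        self.samples.append(write_latency_ms)
        breached = (self._p99() > self.p99_threshold_ms
                    and smoke_latency_ms > self.smoke_threshold_ms)
        if not breached:
            self._breach_started = None  # any healthy sample resets the clock
            return False
        if self._breach_started is None:
            self._breach_started = now
        return (now - self._breach_started) >= self.sustain_seconds
```

Requiring both signals to breach for W consecutive seconds is what keeps the individually noisy heuristics from firing on a single slow write.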
When to use¶
- OLTP / real-time-serving workloads on network-attached block storage. Batch / analytic workloads can absorb a 10-minute slowdown; OLTP cannot.
- Fleet large enough that manual paging is not viable. A single-database workload with a 1%-per-volume event rate can usually be handled by paging; hundreds of volumes cannot.
- Workload owns the volume lifecycle. The pattern assumes the operator can provision a replacement volume + reparent traffic to a healthy node — not applicable to managed services that hide the volume layer from the customer.
When not to use¶
- Read-heavy, cache-heavy workloads. If the hot working set lives in RAM, short-window EBS degradation rarely reaches the customer. Paging works.
- Already on a substrate without a variance floor. On direct-attached NVMe, the monitor's trigger condition never fires (see systems/planetscale-metal + patterns/shared-nothing-storage-topology).
Trade-offs¶
- Recycling a volume isn't free. Each detected event leads to a reparent + replacement-volume provision, which has its own cost and risk (brief replication lag, cluster topology churn). Tune thresholds to balance false positives against impact-window length.
- Cannot detect events before they happen. "it's impossible to detect this failure before it happens" (Van Wiggeren, PlanetScale). The pattern clamps the impact window, it doesn't eliminate the window.
- Heuristics drift. As the underlying substrate changes (AWS fleet rollouts, instance-type migrations), latency and idle-% thresholds need retuning.
- Structural fix usually beats monitoring. The wiki's canonical structural answer to EBS variance is to move off EBS for OLTP workloads — see patterns/direct-attached-nvme-with-replication + systems/planetscale-metal. Monitoring is the operational bandage while running on EBS.
Seen in¶
- sources/2025-03-18-planetscale-the-real-failure-rate-of-ebs — canonical wiki instance. PlanetScale's production deployment on millions of EBS volumes; heuristics named (latency, idle %, write-file smoke test); reparent in seconds.
Related¶
- patterns/zero-downtime-reparent-on-degradation
- patterns/shared-nothing-storage-topology
- patterns/direct-attached-nvme-with-replication
- concepts/slow-is-failure
- concepts/partial-failure
- concepts/performance-variance-degradation
- concepts/blast-radius-multiplier-at-fleet-scale
- concepts/noisy-neighbor
- systems/aws-ebs
- systems/planetscale-metal