
PATTERN

Automated volume-health monitoring

Problem

Network-attached block storage (EBS and equivalents) exhibits performance-variance degradation as a first-class failure mode distinct from an outage. The provider's health checks report only catastrophic events — volume unmounted, EC2 instance gone. The customer-visible failure mode — a 100 ms → 500 ms/op latency spike that lasts 1–10 minutes — is not surfaced by provider-side health checks at all.

Because at fleet scale the probability of at least one active degradation event approaches 1 (see concepts/blast-radius-multiplier-at-fleet-scale), manual paging does not scale: multiple events per day, each lasting minutes, across a fleet of hundreds of volumes.

The customer needs a mitigation that runs in-band with the database/application fleet, detects partial failure faster than the provider reports it, and triggers automated recovery.

Solution

Run in-application heuristic volume-health monitoring that watches the volume's client-side metrics — the ones that correlate with customer-facing performance — and makes a binary "this volume is degraded" classification on a timescale of seconds to tens of seconds.

PlanetScale's disclosed heuristics:

  • Read/write latency. The post's example: 200–500 ms/op write latency on a volume that had been doing single-digit-ms writes is the strongest signal.
  • Idle %. A healthy volume at 67% idle dropping to 0% idle is corroborating evidence.
  • Simple synthetic smoke tests. "basic tests like making sure we can write to a file" — a 100-byte write to a test file that normally completes in <1 ms suddenly taking >100 ms is a volume-is-sick signal.
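The smoke test in the last bullet can be sketched as a timed write-plus-fsync. The 100-byte payload and 100 ms budget come from the text; the function name, monotonic-clock timing, and fsync-to-force-device-I/O choice are illustrative assumptions, not PlanetScale's disclosed code:

```python
import os
import time

def smoke_test_write(path, payload=b"x" * 100, budget_s=0.1):
    """Write ~100 bytes and fsync; return (elapsed_s, ok).

    A write that normally completes in <1 ms suddenly taking
    >100 ms (the budget) is a volume-is-sick signal.
    """
    start = time.monotonic()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, payload)
        os.fsync(fd)  # force the write through the page cache to the device
    finally:
        os.close(fd)
    elapsed = time.monotonic() - start
    return elapsed, elapsed <= budget_s
```

The fsync matters: without it the write lands in the page cache and never exercises the volume at all.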

These heuristics are noisy individually but cheap and specific in combination. They intentionally use client-side latency (what the customer sees) rather than provider-reported metrics (the metrics AWS sets SLOs against).
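One way to combine these noisy signals into the binary classification is an AND over a short sliding window, with thresholds loosely drawn from the post's numbers. A minimal sketch — the class name, window length, default thresholds, and exact combination rule are assumptions, not PlanetScale's disclosed implementation:

```python
from collections import deque

class VolumeHealthClassifier:
    """Binary degraded/healthy call from noisy per-second samples."""

    def __init__(self, window_s=30, p99_ms=100.0, idle_pct=5.0, smoke_ms=100.0):
        # one sample per second; the deque drops samples older than the window
        self.window = deque(maxlen=window_s)
        self.p99_ms = p99_ms        # p99 write latency threshold
        self.idle_pct = idle_pct    # idle % floor (healthy volumes sit well above it)
        self.smoke_ms = smoke_ms    # synthetic smoke-test budget

    def observe(self, p99_write_ms, idle_pct, smoke_ms):
        """Record one sample; return the current degraded classification."""
        bad = (p99_write_ms > self.p99_ms
               and idle_pct < self.idle_pct
               and smoke_ms > self.smoke_ms)
        self.window.append(bad)
        return self.degraded()

    def degraded(self):
        # classify DEGRADED only when every sample in a full window is bad,
        # trading a little detection latency for fewer false positives
        return len(self.window) == self.window.maxlen and all(self.window)
```

Requiring all three signals and a full bad window is what keeps cheap, individually-noisy heuristics specific in combination.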

Once a volume is classified as degraded, the monitor triggers the companion mitigation pattern patterns/zero-downtime-reparent-on-degradation — which actually moves traffic off the degraded volume.

Structure

 ┌─────────────────────────────────────────────────┐
 │ Database process on EC2 instance with EBS vol   │
 │                                                 │
 │  ┌─────────────────────────────────────────┐    │
 │  │ Volume-health monitor (in-process or    │    │
 │  │   sidecar, sampling 1/s):               │    │
 │  │   - moving-window p99 write latency     │    │
 │  │   - moving-window read latency          │    │
 │  │   - idle % via iostat                   │    │
 │  │   - synthetic smoke test (write 100 B,  │    │
 │  │     time it)                            │    │
 │  │                                         │    │
 │  │ If (p99 > threshold) &&                 │    │
 │  │    (smoke > threshold) for W seconds:   │    │
 │  │    → classify DEGRADED                  │    │
 │  │    → trigger reparent                   │    │
 │  └─────────────────────────────────────────┘    │
 └─────────────────────────────────────────────────┘
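The "idle % via iostat" input in the diagram can also be derived directly from /proc/diskstats, whose io_ticks column counts milliseconds the device had I/O in flight. A sketch under that assumption (helper names are illustrative; sampling cadence and device selection are left to the caller):

```python
def read_io_ticks(device, diskstats_text):
    """Extract io_ticks for one device from /proc/diskstats text.

    Per-line layout: major, minor, device name (col 3), then I/O
    counters; col 13 is io_ticks, total milliseconds the device
    spent with I/O in flight.
    """
    for line in diskstats_text.splitlines():
        parts = line.split()
        if len(parts) >= 13 and parts[2] == device:
            return int(parts[12])
    raise KeyError(device)

def idle_pct(prev_io_ticks_ms, cur_io_ticks_ms, interval_s):
    """Idle % over an interval from two io_ticks samples.

    busy fraction = (delta io_ticks in ms) / (interval in ms);
    a healthy volume at ~67% idle collapsing to ~0% corroborates
    the latency signals.
    """
    busy = (cur_io_ticks_ms - prev_io_ticks_ms) / 1000.0 / interval_s
    return max(0.0, 100.0 * (1.0 - busy))
```

In the monitor loop, `read_io_ticks(dev, open("/proc/diskstats").read())` would be sampled once per second and the delta fed to `idle_pct`.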

When to use

  • OLTP / real-time-serving workloads on network-attached block storage. Batch / analytic workloads can absorb a 10-minute slowdown; OLTP cannot.
  • Fleet large enough that manual paging is not viable. A single-database workload with a 1%-per-volume event rate can usually be handled by paging; hundreds of volumes cannot.
  • Workload owns the volume lifecycle. The pattern assumes the operator can provision a replacement volume + reparent traffic to a healthy node — not applicable to managed services that hide the volume layer from the customer.

When not to use

  • Read-heavy, cache-heavy workloads. If the hot working set lives in RAM, short-window EBS degradation rarely reaches the customer. Paging works.
  • Already on a substrate without a variance floor. On direct-attached NVMe, the monitor's trigger condition never fires (see systems/planetscale-metal + patterns/shared-nothing-storage-topology).

Trade-offs

  • Recycling a volume isn't free. Each detected event leads to a reparent + replacement-volume provision, which has its own cost and risk (brief replication lag, cluster topology churn). Tune thresholds to balance false positives against impact-window length.
  • Cannot detect events before they happen. "it's impossible to detect this failure before it happens" (Van Wiggeren, PlanetScale). The pattern clamps the impact window; it doesn't eliminate it.
  • Heuristics drift. As the underlying substrate changes (AWS fleet rollouts, instance-type migrations), latency and idle-% thresholds need retuning.
  • Structural fix usually beats monitoring. The wiki's canonical structural answer to EBS variance is to move off EBS for OLTP workloads — see patterns/direct-attached-nvme-with-replication + systems/planetscale-metal. Monitoring is the operational bandage while running on EBS.

Seen in
