
PATTERN

Automated volume-health monitoring

Problem

Network-attached block storage (EBS and equivalents) exhibits performance-variance degradation as a first-class failure mode distinct from an outage. The provider's health checks report only catastrophic events — volume unmounted, EC2 instance gone. The customer-visible failure mode — a 100 ms → 500 ms/op latency spike that lasts 1–10 minutes — is not surfaced by provider-side health checks at all.

Because at fleet scale the probability of at least one active degradation event approaches 1 (see concepts/blast-radius-multiplier-at-fleet-scale), manual paging does not scale: multiple events per day, each lasting minutes, across a fleet of hundreds of volumes.

The customer needs a mitigation that runs in-band with the database/application fleet, detects partial failure faster than the provider reports it, and triggers automated recovery.

Solution

Run in-application heuristic volume-health monitoring that watches the volume's client-side metrics — the ones that correlate with customer-facing performance — and makes a binary "this volume is degraded" classification on a timescale of seconds to tens of seconds.

PlanetScale's disclosed heuristics:

  • Read/write latency. The post's example: 200–500 ms/op write latency on a volume that had been doing single-digit-ms writes is the strongest signal.
  • Idle %. A healthy volume at 67% idle dropping to 0% idle is corroborating evidence.
  • Simple synthetic smoke tests. "basic tests like making sure we can write to a file" — a 100-byte write to a test file that normally completes in <1 ms suddenly taking >100 ms is a volume-is-sick signal.
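The smoke test in the last bullet can be sketched as a timed write-plus-fsync. The 100-byte payload and 100 ms budget come from the text; the function name, monotonic-clock timing, and fsync-to-force-device-I/O choice are illustrative assumptions, not PlanetScale's disclosed code:

```python
import os
import time

def smoke_test_write(path, payload=b"x" * 100, budget_s=0.1):
    """Write ~100 bytes and fsync; return (elapsed_s, ok).

    A write that normally completes in <1 ms suddenly taking
    >100 ms (the budget) is a volume-is-sick signal.
    """
    start = time.monotonic()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, payload)
        os.fsync(fd)  # force the write through the page cache to the device
    finally:
        os.close(fd)
    elapsed = time.monotonic() - start
    return elapsed, elapsed <= budget_s
```

The fsync matters: without it the write lands in the page cache and never exercises the volume at all.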

These heuristics are noisy individually but cheap and specific in combination. They intentionally use client-side latency (what the customer sees) rather than provider-reported metrics (the metrics AWS sets SLOs against).
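One way to combine these noisy signals into the binary classification is an AND over a short sliding window, with thresholds loosely drawn from the post's numbers. A minimal sketch — the class name, window length, default thresholds, and exact combination rule are assumptions, not PlanetScale's disclosed implementation:

```python
from collections import deque

class VolumeHealthClassifier:
    """Binary degraded/healthy call from noisy per-second samples."""

    def __init__(self, window_s=30, p99_ms=100.0, idle_pct=5.0, smoke_ms=100.0):
        # one sample per second; the deque drops samples older than the window
        self.window = deque(maxlen=window_s)
        self.p99_ms = p99_ms        # p99 write latency threshold
        self.idle_pct = idle_pct    # idle % floor (healthy volumes sit well above it)
        self.smoke_ms = smoke_ms    # synthetic smoke-test budget

    def observe(self, p99_write_ms, idle_pct, smoke_ms):
        """Record one sample; return the current degraded classification."""
        bad = (p99_write_ms > self.p99_ms
               and idle_pct < self.idle_pct
               and smoke_ms > self.smoke_ms)
        self.window.append(bad)
        return self.degraded()

    def degraded(self):
        # classify DEGRADED only when every sample in a full window is bad,
        # trading a little detection latency for fewer false positives
        return len(self.window) == self.window.maxlen and all(self.window)
```

Requiring all three signals and a full bad window is what keeps cheap, individually-noisy heuristics specific in combination.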

Once a volume is classified as degraded, the monitor triggers the companion mitigation pattern patterns/zero-downtime-reparent-on-degradation — which actually moves traffic off the degraded volume.

Structure

 ┌─────────────────────────────────────────────────┐
 │ Database process on EC2 instance with EBS vol   │
 │                                                 │
 │  ┌─────────────────────────────────────────┐    │
 │  │ Volume-health monitor (in-process or    │    │
 │  │   sidecar, sampling 1/s):               │    │
 │  │   - moving-window p99 write latency     │    │
 │  │   - moving-window read latency          │    │
 │  │   - idle % via iostat                   │    │
 │  │   - synthetic smoke test (write 100 B,  │    │
 │  │     time it)                            │    │
 │  │                                         │    │
 │  │ If (p99 > threshold) &&                 │    │
 │  │    (smoke > threshold) for W seconds:   │    │
 │  │    → classify DEGRADED                  │    │
 │  │    → trigger reparent                   │    │
 │  └─────────────────────────────────────────┘    │
 └─────────────────────────────────────────────────┘
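The "idle % via iostat" input in the diagram can also be derived directly from /proc/diskstats, whose io_ticks column counts milliseconds the device had I/O in flight. A sketch under that assumption (helper names are illustrative; sampling cadence and device selection are left to the caller):

```python
def read_io_ticks(device, diskstats_text):
    """Extract io_ticks for one device from /proc/diskstats text.

    Per-line layout: major, minor, device name (col 3), then I/O
    counters; col 13 is io_ticks, total milliseconds the device
    spent with I/O in flight.
    """
    for line in diskstats_text.splitlines():
        parts = line.split()
        if len(parts) >= 13 and parts[2] == device:
            return int(parts[12])
    raise KeyError(device)

def idle_pct(prev_io_ticks_ms, cur_io_ticks_ms, interval_s):
    """Idle % over an interval from two io_ticks samples.

    busy fraction = (delta io_ticks in ms) / (interval in ms);
    a healthy volume at ~67% idle collapsing to ~0% corroborates
    the latency signals.
    """
    busy = (cur_io_ticks_ms - prev_io_ticks_ms) / 1000.0 / interval_s
    return max(0.0, 100.0 * (1.0 - busy))
```

In the monitor loop, `read_io_ticks(dev, open("/proc/diskstats").read())` would be sampled once per second and the delta fed to `idle_pct`.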

When to use

  • OLTP / real-time-serving workloads on network-attached block storage. Batch / analytic workloads can absorb a 10-minute slowdown; OLTP cannot.
  • Fleet large enough that manual paging is not viable. A single-database workload with a 1%-per-volume event rate can usually be handled by paging; hundreds of volumes cannot.
  • Workload owns the volume lifecycle. The pattern assumes the operator can provision a replacement volume + reparent traffic to a healthy node — not applicable to managed services that hide the volume layer from the customer.

When not to use

  • Read-heavy, cache-heavy workloads. If the hot working set lives in RAM, short-window EBS degradation rarely reaches the customer. Paging works.
  • Already on a substrate without a variance floor. On direct-attached NVMe, the monitor's trigger condition never fires (see systems/planetscale-metal + patterns/shared-nothing-storage-topology).

Trade-offs

  • Recycling a volume isn't free. Each detected event leads to a reparent + replacement-volume provision, which has its own cost and risk (brief replication lag, cluster topology churn). Tune thresholds to balance false positives against impact-window length.
  • Cannot detect events before they happen. "it's impossible to detect this failure before it happens" (Van Wiggeren, PlanetScale). The pattern clamps the impact window; it doesn't eliminate it.
  • Heuristics drift. As the underlying substrate changes (AWS fleet rollouts, instance-type migrations), latency and idle-% thresholds need retuning.
  • Structural fix usually beats monitoring. The wiki's canonical structural answer to EBS variance is to move off EBS for OLTP workloads — see patterns/direct-attached-nvme-with-replication + systems/planetscale-metal. Monitoring is the operational bandage while running on EBS.

Seen in
