Direct-attached NVMe with replication¶
Pattern¶
Run each database instance on a direct-attached NVMe drive (local, fast, ephemeral) and solve the "instance dies, data dies" durability problem with application-layer replication (primary + N replicas + automated failover + backups) rather than with network-attached block storage.
┌──────────────────────┐    ┌──────────────────────┐    ┌──────────────────────┐
│  Primary instance    │───►│  Replica 1           │───►│  Replica 2           │
│  ┌────────────────┐  │    │  ┌────────────────┐  │    │  ┌────────────────┐  │
│  │ DB engine      │  │    │  │ DB engine      │  │    │  │ DB engine      │  │
│  └───────┬────────┘  │    │  └───────┬────────┘  │    │  └───────┬────────┘  │
│          │ 50μs      │    │          │ 50μs      │    │          │ 50μs      │
│  ┌───────▼────────┐  │    │  ┌───────▼────────┐  │    │  ┌───────▼────────┐  │
│  │ Direct NVMe    │  │    │  │ Direct NVMe    │  │    │  │ Direct NVMe    │  │
│  └────────────────┘  │    │  └────────────────┘  │    │  └────────────────┘  │
└──────────────────────┘    └──────────────────────┘    └──────────────────────┘
           ▲                                                       ▲
           │                   Frequent backups                    │
           └───────────────► Object storage ◄──────────────────────┘
When to use it¶
- OLTP workloads where IOPS + latency matter. Database commits, session stores, real-time analytics: any workload where every millisecond shows up in user-visible tail latency.
- Workloads that routinely saturate the cloud IOPS budget. If you're paying a lot for provisioned IOPS and still queueing, direct NVMe removes the cap.
- When you control the DB layer. You need to run your own replication, failover, and backup infrastructure.
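Owning the failover layer means, at minimum, deciding which replica to promote when the primary's instance (and its local drive) disappears. The following is a minimal sketch of one common rule, promoting the healthy replica that has applied the most of the primary's log. The `Replica` type, its `applied_position` field, and the function name are hypothetical and are not taken from any particular database or orchestrator.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Replica:
    name: str
    healthy: bool
    # Hypothetical monotonically increasing replication position
    # (e.g. how far into the primary's log this replica has applied).
    applied_position: int


def pick_promotion_candidate(replicas: Sequence[Replica]) -> Optional[Replica]:
    """Return the healthy replica that is most caught up, or None if none is healthy."""
    healthy = [r for r in replicas if r.healthy]
    if not healthy:
        # No replica can take over; the remaining option is a backup restore.
        return None
    return max(healthy, key=lambda r: r.applied_position)


# Illustrative data only.
replicas = [
    Replica("replica-1", healthy=True, applied_position=1042),
    Replica("replica-2", healthy=True, applied_position=1040),
    Replica("replica-3", healthy=False, applied_position=1050),
]
candidate = pick_promotion_candidate(replicas)
print(candidate.name if candidate else "restore from backup")
```

A real failover controller also has to fence the old primary and re-point clients; the sketch covers only the selection step.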
When not to use it¶
- Storage layer isn't yours to design. If you consume a managed service built on EBS / PD / similar, the storage substrate isn't negotiable; the pattern operates a layer below you.
- Live-resize is critical. The pattern requires provisioning a new instance + migrating to grow capacity (zero-downtime but not instant). If you need minute-scale volume expansion at 3am, network-attached wins.
- Geographic replication requirements exceed local-DC replicas. Local direct-attached + cross-AZ replica still requires a working AZ failover story.
- Small / bursty workloads. At low QPS, the difference between 50 μs and 250 μs is invisible, and the operational complexity isn't worth it.
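To see when the 50 μs vs 250 μs gap matters, it helps to turn per-IO latency into two numbers: the ceiling on dependent (one-at-a-time) IOs a single thread can issue per second, and the IO wait accumulated by a request that performs several IOs in sequence. This is a back-of-the-envelope sketch; the figure of 20 sequential IOs per request is an illustrative assumption, not a number from the source.

```python
def serial_iops(latency_us: float) -> float:
    """Upper bound on dependent IOs per second for one thread at a given per-IO latency."""
    return 1_000_000 / latency_us


def io_wait_ms(ios_per_request: int, latency_us: float) -> float:
    """Total IO wait, in milliseconds, for a request issuing its IOs one after another."""
    return ios_per_request * latency_us / 1000


IOS_PER_REQUEST = 20  # hypothetical: sequential IOs behind one user-visible request

for label, latency_us in (("direct NVMe", 50), ("network-attached", 250)):
    print(
        f"{label}: {serial_iops(latency_us):,.0f} serial IOPS per thread, "
        f"{io_wait_ms(IOS_PER_REQUEST, latency_us):.1f} ms IO wait per request"
    )
```

Under these assumptions a single thread tops out at 20,000 vs 4,000 dependent IOs per second, and a 20-IO request waits 1 ms vs 5 ms. At low QPS with few IOs per request, both figures sit well below what users notice.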
What it trades off¶
| Property | Direct NVMe + replication | Network-attached (EBS-class) |
|---|---|---|
| IO latency | ~50 μs | ~250 μs |
| IOPS ceiling | Hardware limit (hundreds of thousands) | Provisioned cap (3,000 by default on gp3; higher limits cost extra) |
| Instance loss recovery | Replica takes over; primary re-provisioned | Volume re-attaches to new instance |
| Live-resize | No (migrate to bigger node) | Yes |
| Durability floor | P^N (all N copies lost at once) via replication + backups | Provider internal replication |
| Operational surface | Replication, failover, backups in your stack | Hidden behind volume API |
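The P^N entry in the durability row can be illustrated with a deliberately simplified model: if each copy is lost within the same recovery window with independent probability P, all N copies are lost together with probability P^N. The sketch below assumes independence and uses a hypothetical P of 1e-4. It is not the calculation from the cited sources, which layer further assumptions (AZ placement, semi-sync acknowledgement, restore times) on top.

```python
def simultaneous_loss_probability(p_copy_lost: float, copies: int) -> float:
    """Probability that all `copies` independent copies are lost in the same window."""
    if not 0.0 <= p_copy_lost <= 1.0:
        raise ValueError("p_copy_lost must be a probability in [0, 1]")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return p_copy_lost ** copies


P_COPY_LOST = 1e-4  # hypothetical per-copy loss probability within one recovery window

for copies in (1, 2, 3):
    probability = simultaneous_loss_probability(P_COPY_LOST, copies)
    print(f"{copies} copies: {probability:.1e}")
```

Each added replica multiplies the loss probability by P, which is why a primary plus two replicas gives a much lower floor than a single instance on a local drive. Correlated failures (a shared AZ, a bad deploy) break the independence assumption, which is where backups and cross-AZ placement come in.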
Implementations¶
PlanetScale Metal¶
"Each Metal cluster comes with a primary and two replicas by default for extremely durable data. […] Behind the scenes, we handle spinning up new nodes and migrating your data from your old instances to the new ones with zero downtime. […] With a Metal database, there is no artificial cap on IOPS." (Source: sources/2025-03-13-planetscale-io-devices-and-latency)
Canonical wiki instance. Three-node primary+2-replicas shape on direct-attached NVMe with Vitess or Postgres.
Other instances (informational)¶
- CockroachDB on local SSD: similar structure, with direct-attached local NVMe + Raft replication. Different consistency model (Raft quorum) but same durability substrate logic.
- MongoDB replica sets on local SSD — same shape at the document-database layer.
- Bare-metal MySQL + streaming replication: the pre-cloud default; replaced in most deployments by managed services on network-attached or provider-managed storage (RDS / Aurora / Cloud SQL / etc.).
Why the pattern didn't dominate the cloud era¶
Early cloud-database services optimised for the average application workload (stateless app server + modest DB), where EBS's latency cost was acceptable and elastic resizing was operationally valuable. As storage-heavy SaaS workloads scaled, the IOPS cap + latency penalty became visible in p99.9 and in monthly bills. The pattern reappears now because:
- Local-NVMe instance families (i3 / i4 / i7 / im4gn) became general-purpose, not just for Lambda-style ephemeral storage.
- Kubernetes / orchestration made primary-failover + data-migration workflows operationally feasible.
- Customers got tired of paying for provisioned IOPS.
Seen in¶
- First self-reported production-migration datum for the pattern on PlanetScale's own internal workload. Rafer Hazen (2025-03-11) migrated the Insights backing database (8 MySQL/Vitess shards serving 10k UPDATE/INSERT/sec from 800 concurrent writer threads) from EBS (with provisioned-IOPS upgrades) to Metal direct-attached NVMe. The migration applied canary-shard substrate-migration in its busiest-first variant: the worst shard was upgraded first and soaked for "a few days", then the remaining 7 were rolled out to "nearly identical improvement in performance." Outcome: "substantial decrease in latency across all the measured percentiles" (p50/p90/p95/p99) + lower Kafka-consumer backlog + capacity headroom. Canonical wiki statement of Metal paying off for an [[concepts/io-latency-sensitive-workload|I/O-latency-sensitive]] workload: 8-shard IOPS-sharding + EBS provisioned-IOPS was not enough; latency (not IOPS throughput) was the binding constraint. Load-bearing substrate-swap framing: "Without making any changes to our application, architecture, or sharding configuration, we were able to realize substantial performance improvements by upgrading to PlanetScale Metal."
- Canonical durability-math instance for the pattern. Richard Crowley (PlanetScale, 2025-03-11) canonicalises the pattern's durability argument with explicit assumptions and probabilities: primary + 2 replicas across 3 AZs, MySQL semi-sync cross-2-AZ ack, daily tested-restore backups, automated replica replacement. Under self-described-unfair-to-Metal assumptions (1%-monthly-instance-failure, 5-minute EBS re-attach, 5-hour backup restore), write-availability loss ≈ 0.000001% monthly and data loss ≈ 0.00000000003% monthly. First wiki canonical statement of the pattern's durability probabilities with explicit assumptions. Crowley calls out the single structural disadvantage vs network-attached storage: "the ability to re-attach a storage volume is a significant advantage over having to restore a backup, purely in terms of wall-clock time". This matters only for the single-replica-loss case, not the multi-replica cases that dominate the math. Price-performance canonicalisation: 58.4-58.5 IOPS/$ uniform on `i4i` (local NVMe) vs 0.84-13.2 IOPS/$ on `r6a` + EBS, a 13-17× price-performance advantage at `4xlarge` scale, with additional discount runway from Reserved Instances / Savings Plans that EBS structurally lacks. Canonical production-migration datum: million-QPS workload, p99 query latency 9ms → 4ms from flipping the storage substrate alone.
- sources/2025-03-13-planetscale-io-devices-and-latency: canonical article-length treatment covering latency hierarchy, durability math, IOPS-cap critique, and the explicit Metal embodiment.
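The busiest-first canary rollout described in the first entry above (migrate the most loaded shard, let it soak, then migrate the rest) can be sketched as follows. All names are hypothetical: `migrate` and `canary_is_healthy` stand in for whatever migration tooling and soak-period checks a team actually has, and the load figures are illustrative.

```python
from typing import Callable, Dict, List


def busiest_first_order(load_by_shard: Dict[str, float]) -> List[str]:
    """Order shard names from highest to lowest observed load."""
    return sorted(load_by_shard, key=lambda shard: load_by_shard[shard], reverse=True)


def rollout(
    load_by_shard: Dict[str, float],
    migrate: Callable[[str], None],
    canary_is_healthy: Callable[[str], bool],
) -> List[str]:
    """Migrate the busiest shard first; continue only if it passes its soak check."""
    order = busiest_first_order(load_by_shard)
    if not order:
        return []
    canary, rest = order[0], order[1:]
    migrate(canary)
    migrated = [canary]
    if not canary_is_healthy(canary):
        # Stop here: the remaining shards stay on the old storage substrate.
        return migrated
    for shard in rest:
        migrate(shard)
        migrated.append(shard)
    return migrated


# Illustrative usage with made-up load numbers and no-op hooks.
loads = {"shard-a": 900.0, "shard-b": 1500.0, "shard-c": 700.0}
done = rollout(loads, migrate=lambda shard: None, canary_is_healthy=lambda shard: True)
print(done)
```

Starting with the busiest shard means the canary exercises the new substrate under the heaviest load available, so a clean soak is strong evidence for the remaining shards.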
Related¶
- systems/planetscale-metal
- systems/nvme-ssd
- systems/aws-ebs
- concepts/network-attached-storage-latency-penalty
- concepts/iops-throttle-network-storage
- concepts/storage-replication-for-durability
- concepts/storage-latency-hierarchy
- concepts/leader-follower-replication
- patterns/leader-based-partition-replication