PATTERN Cited by 1 source

Direct-attached NVMe with replication¶

Pattern¶

Run each database instance on a direct-attached NVMe drive (local, fast, ephemeral) and solve the "instance dies, data dies" durability problem with application-layer replication (primary + N replicas + automated failover + backups) rather than with network-attached block storage.

┌──────────────────────┐    ┌──────────────────────┐    ┌──────────────────────┐
│  Primary instance    │───►│  Replica 1           │───►│  Replica 2           │
│  ┌────────────────┐  │    │  ┌────────────────┐  │    │  ┌────────────────┐  │
│  │ DB engine      │  │    │  │ DB engine      │  │    │  │ DB engine      │  │
│  └───────┬────────┘  │    │  └───────┬────────┘  │    │  └───────┬────────┘  │
│          │ 50μs      │    │          │ 50μs      │    │          │ 50μs      │
│  ┌───────▼────────┐  │    │  ┌───────▼────────┐  │    │  ┌───────▼────────┐  │
│  │ Direct NVMe    │  │    │  │ Direct NVMe    │  │    │  │ Direct NVMe    │  │
│  └────────────────┘  │    │  └────────────────┘  │    │  └────────────────┘  │
└──────────────────────┘    └──────────────────────┘    └──────────────────────┘
         ▲                                                       ▲
         │           Frequent backups                            │
         └───────────────► Object storage ◄──────────────────────┘
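The failover half of the pattern can be sketched as choosing which replica to promote once the primary is lost. This is a minimal illustration under assumptions, not how any particular product implements it: the `Replica` type, the `applied_position` field, and `choose_new_primary` are hypothetical names, and a real failover also has to fence the old primary and repoint clients.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Replica:
    name: str
    healthy: bool
    # Hypothetical: a monotonically increasing replication position.
    applied_position: int


def choose_new_primary(replicas: Sequence[Replica]) -> Optional[Replica]:
    """Pick the healthy replica that has applied the most of the primary's log."""
    candidates = [r for r in replicas if r.healthy]
    if not candidates:
        # No safe promotion target: the fallback is restoring from backup.
        return None
    return max(candidates, key=lambda r: r.applied_position)


replicas = [
    Replica("replica-1", healthy=True, applied_position=1042),
    Replica("replica-2", healthy=True, applied_position=1040),
]
new_primary = choose_new_primary(replicas)
print(new_primary.name if new_primary else "restore from backup")
```

Promoting the most caught-up replica minimises the writes that could be missing after failover; the lost node is then re-provisioned and rebuilt from a peer or from the object-storage backup.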

When to use it¶

OLTP workloads where IOPS + latency matter. Database commits, session stores, real-time analytics: any workload where every millisecond shows up in user-visible tail latency.
Workloads that routinely saturate the cloud IOPS budget. If you're paying a lot for provisioned IOPS and still queueing, direct NVMe removes the cap.
When you control the DB layer. You need to run your own replication, failover, and backup infrastructure.
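To see why a workload that exceeds its IOPS budget keeps queueing, here is a rough sketch of backlog growth under a fixed cap. The 3,000 IOPS cap is the gp3 default quoted on this page; the 5,000 IOPS demand and the 200,000 IOPS stand-in for a local device are illustrative assumptions, and `backlog_after` is a hypothetical helper.

```python
def backlog_after(seconds: int, demand_iops: int, cap_iops: int) -> int:
    """Requests still queued after `seconds` of steady demand against a fixed cap."""
    backlog = 0
    for _ in range(seconds):
        backlog += demand_iops             # requests arriving this second
        backlog -= min(backlog, cap_iops)  # the device serves at most cap_iops
    return backlog


# Assumed demand of 5,000 IOPS against the 3,000 IOPS gp3 default.
print(backlog_after(10, demand_iops=5_000, cap_iops=3_000))    # 20000
# Same demand against an assumed 200,000 IOPS local device.
print(backlog_after(10, demand_iops=5_000, cap_iops=200_000))  # 0
```

Whenever demand exceeds the cap, the queue grows by the difference every second, which is what surfaces as tail latency.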

When not to use it¶

Storage layer isn't yours to design. A managed service on EBS / PD / similar isn't negotiable — the pattern happens a layer below you.
Live-resize is critical. The pattern requires provisioning a new instance + migrating to grow capacity (zero-downtime but not instant). If you need minute-scale volume expansion at 3am, network-attached wins.
Geographic replication requirements exceed local-DC replicas. Local direct-attached + cross-AZ replica still requires a working AZ failover story.
Small / bursty workloads. At low QPS, the difference between 50 μs and 250 μs is invisible, and the operational complexity isn't worth it.
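A back-of-the-envelope way to judge whether the 50 μs versus 250 μs gap matters is to multiply it by the number of dependent (sequential) I/Os a query performs. The I/O counts below are illustrative assumptions, and `storage_time_ms` is a hypothetical helper.

```python
def storage_time_ms(sequential_ios: int, io_latency_us: float) -> float:
    """Time spent waiting on storage when each I/O must finish before the next starts."""
    return sequential_ios * io_latency_us / 1000.0


for ios in (1, 20):  # assumed numbers of dependent I/Os per query
    local = storage_time_ms(ios, 50)     # direct-attached NVMe, ~50 us per I/O
    network = storage_time_ms(ios, 250)  # network-attached, ~250 us per I/O
    print(f"{ios} dependent I/Os: local {local:.2f} ms, "
          f"network {network:.2f} ms, gap {network - local:.2f} ms")
```

With a single I/O the gap is a fifth of a millisecond; it only becomes visible to users when queries chain many dependent I/Os or when the device is busy enough for requests to queue.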

What it trades off¶

Property	Direct NVMe + replication	Network-attached (EBS-class)
IO latency	~50 μs	~250 μs
IOPS ceiling	Hardware limit (hundreds of thousands)	Provisioned cap (3,000 IOPS gp3 default; more costs extra)
Instance loss recovery	Replica takes over; primary re-provisioned	Volume re-attaches to new instance
Live-resize	No (migrate to bigger node)	Yes
Durability floor	All N copies must fail together (≈ p^N for per-copy failure probability p), plus backups	Provider internal replication
Operational surface	Replication, failover, backups in your stack	Hidden behind volume API
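One reading of the durability-floor row is that data is lost only if every copy fails inside the same recovery window. The sketch below assumes independent failures and an illustrative per-window failure probability of 0.01; it is not the calculation from the cited source, and it ignores correlated failures, replica replacement, and backups. `all_copies_lost` is a hypothetical helper.

```python
def all_copies_lost(p_single: float, copies: int) -> float:
    """Probability that every copy fails in the same window, assuming independence."""
    return p_single ** copies


# Assumed per-copy failure probability within one recovery window.
P_SINGLE = 0.01

for n in (1, 2, 3):
    print(f"{n} copies: {all_copies_lost(P_SINGLE, n):.0e}")
```

Each added replica multiplies the loss probability by `p_single`, which is why a primary plus two replicas can match or beat a single volume with provider-internal replication, provided failed replicas are replaced quickly.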

Implementations¶

PlanetScale Metal ¶

"Each Metal cluster comes with a primary and two replicas by default for extremely durable data. […] Behind the scenes, we handle spinning up new nodes and migrating your data from your old instances to the new ones with zero downtime. […] With a Metal database, there is no artificial cap on IOPS." (Source: sources/2025-03-13-planetscale-io-devices-and-latency)

Canonical wiki instance. Three-node primary+2-replicas shape on direct-attached NVMe with Vitess or Postgres.

Other instances (informational)¶

CockroachDB on local SSD: similar structure, with direct-attached local NVMe + Raft replication. Different consistency model (Raft quorum) but same durability substrate logic.
MongoDB replica sets on local SSD: same shape at the document-database layer.
Bare-metal MySQL + streaming replication: the pre-cloud default, replaced in most deployments by managed services on network-attached storage (RDS / Aurora / Cloud SQL / etc.).

Why the pattern didn't dominate the cloud era¶

Early cloud-database services optimised for the average application workload (stateless app server + modest DB), where EBS's latency cost was acceptable and elastic resizing was operationally valuable. As storage-heavy SaaS workloads scaled, the IOPS cap + latency penalty became visible in p99.9 and in monthly bills. The pattern reappears now because:

Local-NVMe instance families (i3 / i4 / i7 / im4gn) became general-purpose, not just for Lambda-style ephemeral storage.
Kubernetes / orchestration made primary-failover + data-migration workflows operationally feasible.
Customers got tired of paying for provisioned IOPS.

Seen in¶

First self-reported production-migration datum for the pattern on PlanetScale's own internal workload. Rafer Hazen (2025-03-11) migrated the Insights backing database (8 MySQL/Vitess shards serving 10k UPDATE/INSERT/sec from 800 concurrent writer threads) from EBS (with provisioned-IOPS upgrades) to Metal direct-attached NVMe. The migration used canary-shard substrate migration in its busiest-first variant: the worst shard was upgraded first and soaked for "a few days", then the remaining 7 were rolled out with "nearly identical improvement in performance." Outcome: "substantial decrease in latency across all the measured percentiles" (p50/p90/p95/p99), lower Kafka-consumer backlog, and capacity headroom. Canonical wiki statement of Metal paying off for an I/O-latency-sensitive workload: 8-shard IOPS-sharding plus EBS provisioned IOPS was not enough, because latency (not IOPS throughput) was the binding constraint. Load-bearing substrate-swap framing: "Without making any changes to our application, architecture, or sharding configuration, we were able to realize substantial performance improvements by upgrading to PlanetScale Metal."
Canonical durability-math instance for the pattern. Richard Crowley (PlanetScale, 2025-03-11) canonicalises the pattern's durability argument with explicit assumptions and probabilities: primary + 2 replicas across 3 AZs, MySQL semi-sync with cross-2-AZ acknowledgement, daily tested-restore backups, automated replica replacement. Under self-described unfair-to-Metal assumptions (1% monthly instance failure, 5-minute EBS re-attach, 5-hour backup restore), write-availability loss ≈ 0.000001% monthly and data loss ≈ 0.00000000003% monthly. First wiki canonical statement of the pattern's durability probabilities with explicit assumptions. Crowley calls out the single structural disadvantage vs network-attached storage: "the ability to re-attach a storage volume is a significant advantage over having to restore a backup, purely in terms of wall-clock time". This matters only for the single-replica-loss case, not the multi-replica cases that dominate the math. Price-performance canonicalisation: 58.4-58.5 IOPS/$ uniform on i4i (local NVMe) vs 0.84-13.2 IOPS/$ on r6a + EBS, a 13-17× price-performance advantage at 4xlarge scale, with additional discount runway from Reserved Instances / Savings Plans that EBS structurally lacks. Canonical production-migration datum: million-QPS workload, p99 query latency 9 ms → 4 ms from flipping the storage substrate alone.
sources/2025-03-13-planetscale-io-devices-and-latency — canonical article-length treatment: latency hierarchy, durability math, IOPS-cap critique, and the explicit Metal embodiment.
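The busiest-first canary rollout described in the first entry above can be sketched as an ordering problem: migrate the most heavily loaded shard alone, let it soak, then migrate the rest. The shard names, load figures, and `busiest_first_plan` helper are hypothetical; the source does not publish per-shard numbers or tooling.

```python
from typing import Dict, List


def busiest_first_plan(load_by_shard: Dict[str, float]) -> List[List[str]]:
    """Return rollout waves: the busiest shard alone first, then all remaining shards."""
    ordered = sorted(load_by_shard, key=lambda name: load_by_shard[name], reverse=True)
    if not ordered:
        return []
    canary, rest = ordered[0], ordered[1:]
    return [[canary]] + ([rest] if rest else [])


# Hypothetical per-shard write rates (writes/sec).
load = {"shard-a": 900.0, "shard-b": 2100.0, "shard-c": 1500.0}
for wave_number, wave in enumerate(busiest_first_plan(load), start=1):
    print(wave_number, wave)
# 1 ['shard-b']
# 2 ['shard-c', 'shard-a']
```

Starting with the worst shard means the canary exercises the new storage substrate under the heaviest load first, so a clean soak period there is strong evidence for the remaining shards.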