
PATTERN Cited by 1 source

Direct-attached NVMe with replication

Pattern

Run each database instance on a direct-attached NVMe drive (local, fast, ephemeral) and solve the "instance dies, data dies" durability problem with application-layer replication (primary + N replicas + automated failover + backups) rather than with network-attached block storage.

┌──────────────────────┐    ┌──────────────────────┐    ┌──────────────────────┐
│  Primary instance    │───►│  Replica 1           │───►│  Replica 2           │
│  ┌────────────────┐  │    │  ┌────────────────┐  │    │  ┌────────────────┐  │
│  │ DB engine      │  │    │  │ DB engine      │  │    │  │ DB engine      │  │
│  └───────┬────────┘  │    │  └───────┬────────┘  │    │  └───────┬────────┘  │
│          │ 50μs      │    │          │ 50μs      │    │          │ 50μs      │
│  ┌───────▼────────┐  │    │  ┌───────▼────────┐  │    │  ┌───────▼────────┐  │
│  │ Direct NVMe    │  │    │  │ Direct NVMe    │  │    │  │ Direct NVMe    │  │
│  └────────────────┘  │    │  └────────────────┘  │    │  └────────────────┘  │
└──────────────────────┘    └──────────────────────┘    └──────────────────────┘
         ▲                                                       ▲
         │           Frequent backups                            │
         └───────────────► Object storage ◄──────────────────────┘
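
The failover half of the pattern can be illustrated with a minimal sketch. The `Node` type, its `applied_position` field, and both function names are hypothetical and not taken from any particular database; the sketch only assumes that each replica exposes some monotonically increasing replication position and a health signal.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    role: str              # "primary", "replica", or "failed"
    healthy: bool
    applied_position: int  # hypothetical replication position (higher = more writes applied)


def choose_new_primary(nodes: list[Node]) -> Optional[Node]:
    """Return the healthy replica that has applied the most writes, or None."""
    candidates = [n for n in nodes if n.role == "replica" and n.healthy]
    if not candidates:
        return None  # nothing to promote; the remaining option is a backup restore
    return max(candidates, key=lambda n: n.applied_position)


def fail_over(nodes: list[Node]) -> Optional[Node]:
    """Promote a replica if the current primary is unhealthy; otherwise keep it."""
    primary = next((n for n in nodes if n.role == "primary"), None)
    if primary is not None and primary.healthy:
        return primary
    promoted = choose_new_primary(nodes)
    if promoted is None:
        return None
    if primary is not None:
        primary.role = "failed"  # its local NVMe data is gone; re-provision a fresh node
    promoted.role = "primary"
    return promoted


cluster = [
    Node("primary", "primary", healthy=False, applied_position=1000),
    Node("replica-1", "replica", healthy=True, applied_position=998),
    Node("replica-2", "replica", healthy=True, applied_position=1000),
]
new_primary = fail_over(cluster)
print(new_primary.name)  # replica-2
```

A real failover controller also has to fence the old primary and repoint clients; the sketch covers only the promotion choice.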

When to use it

  • OLTP workloads where IOPS + latency matter. Database commits, session stores, real-time analytics: any workload where every millisecond shows up in user-visible tail latency.
  • Workloads that routinely saturate the cloud IOPS budget. If you're paying a lot for provisioned IOPS and still queueing, direct NVMe removes the cap.
  • When you control the DB layer. You need to run your own replication, failover, and backup infrastructure.

When not to use it

  • Storage layer isn't yours to design. If you consume a managed service that runs on EBS / PD / similar, the storage choice isn't negotiable: the pattern applies a layer below you.
  • Live-resize is critical. The pattern requires provisioning a new instance + migrating to grow capacity (zero-downtime but not instant). If you need minute-scale volume expansion at 3am, network-attached wins.
  • Geographic replication requirements exceed local-DC replicas. Local direct-attached + cross-AZ replica still requires a working AZ failover story.
  • Small / bursty workloads. At low QPS, the difference between 50 μs and 250 μs is invisible, and the operational complexity isn't worth it.
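
To see where the 50 μs and 250 μs figures start to matter, a small worked calculation helps. It assumes IOs issued strictly one after another by a single thread, and the count of 10 dependent IOs per transaction is an illustrative assumption, not a measurement.

```python
def max_serial_iops(latency_us: float) -> float:
    """Upper bound on IOs per second when each IO must finish before the next starts."""
    return 1_000_000 / latency_us


def serial_io_time_ms(io_count: int, latency_us: float) -> float:
    """Total time spent waiting on storage for io_count dependent IOs."""
    return io_count * latency_us / 1000


for label, latency_us in (("direct NVMe", 50), ("network-attached", 250)):
    print(label, max_serial_iops(latency_us), serial_io_time_ms(10, latency_us))
# direct NVMe 20000.0 0.5
# network-attached 4000.0 2.5
```

At a handful of queries per second, an extra 2 ms per transaction is hard to notice; the gap becomes visible once many writers queue behind those serial IOs.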

What it trades off

| Property | Direct NVMe + replication | Network-attached (EBS-class) |
|---|---|---|
| IO latency | ~50 μs | ~250 μs |
| IOPS ceiling | Hardware limit (hundreds of thousands) | Cap (3,000 GP3 default; paid upward) |
| Instance loss recovery | Replica takes over; primary re-provisioned | Volume re-attaches to new instance |
| Live-resize | No (migrate to bigger node) | Yes |
| Durability floor | P^N via replication + backups | Provider internal replication |
| Operational surface | Replication, failover, backups in your stack | Hidden behind volume API |
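
The "P^N" durability floor can be read as follows. This reading is an assumption: P is taken as the probability that one copy is lost within some window, N as the number of copies, and failures are treated as independent. Correlated failures (for example a whole-AZ outage) break that independence, which is why replicas are spread out and backed up. The function name and the 1% input are illustrative.

```python
def p_all_copies_lost(p_single: float, copies: int) -> float:
    """Probability that every copy is lost in the same window, assuming independent failures."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be a probability")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return p_single ** copies


# Illustrative input only: a 1% chance of losing any single copy within the window.
for n in (1, 2, 3):
    print(f"{n} copies: {p_all_copies_lost(0.01, n):.0e}")
```

Each additional independent copy multiplies the loss probability by P again, so going from one copy to three moves the floor by four orders of magnitude at P = 1%.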

Implementations

PlanetScale Metal

"Each Metal cluster comes with a primary and two replicas by default for extremely durable data. […] Behind the scenes, we handle spinning up new nodes and migrating your data from your old instances to the new ones with zero downtime. […] With a Metal database, there is no artificial cap on IOPS." (Source: sources/2025-03-13-planetscale-io-devices-and-latency)

Canonical wiki instance. Three-node primary+2-replicas shape on direct-attached NVMe with Vitess or Postgres.

Other instances (informational)

  • CockroachDB on local SSD: similar structure, with direct-attached local NVMe + Raft replication. Different consistency model (Raft quorum) but same durability substrate logic.
  • MongoDB replica sets on local SSD: same shape at the document-database layer.
  • Bare-metal MySQL + streaming replication: the pre-cloud default, replaced in most deployments by managed services on network-attached storage (RDS / Aurora / Cloud SQL / etc.).

Why the pattern didn't dominate the cloud era

Early cloud-database services optimised for the average application workload (stateless app server + modest DB), where EBS's latency cost was acceptable and elastic resizing was operationally valuable. As storage-heavy SaaS workloads scaled, the IOPS cap + latency penalty became visible in p99.9 and in monthly bills. The pattern reappears now because:

  • Local-NVMe instance families (i3 / i4 / i7 / im4gn) became general-purpose, not just for Lambda-style ephemeral storage.
  • Kubernetes / orchestration made primary-failover + data-migration workflows operationally feasible.
  • Customers got tired of paying for provisioned IOPS.

Seen in

  • First self-reported production-migration datum for the pattern, on PlanetScale's own internal workload. Rafer Hazen (2025-03-11) migrated the Insights backing database (8 MySQL/Vitess shards serving 10k UPDATE/INSERT statements per second from 800 concurrent writer threads) from EBS (with provisioned-IOPS upgrades) to Metal direct-attached NVMe. The migration used the busiest-first variant of canary-shard substrate migration: the worst shard was upgraded first and soaked for "a few days", then the remaining 7 were rolled out with "nearly identical improvement in performance." Outcome: "substantial decrease in latency across all the measured percentiles" (p50/p90/p95/p99), plus lower Kafka-consumer backlog and more capacity headroom. Canonical wiki statement of Metal paying off for an [[concepts/io-latency-sensitive-workload|I/O-latency-sensitive]] workload: 8-shard IOPS-sharding plus EBS provisioned IOPS was not enough, because latency (not IOPS throughput) was the binding constraint. Load-bearing substrate-swap framing: "Without making any changes to our application, architecture, or sharding configuration, we were able to realize substantial performance improvements by upgrading to PlanetScale Metal."

  • Canonical durability-math instance for the pattern. Richard Crowley (PlanetScale, 2025-03-11) canonicalises the pattern's durability argument with explicit assumptions and probabilities: primary + 2 replicas across 3 AZs, MySQL semi-sync acknowledgement across 2 AZs, daily tested-restore backups, and automated replica replacement. Under assumptions he describes as unfair to Metal (1% monthly instance failure, 5-minute EBS re-attach, 5-hour backup restore), write-availability loss ≈ 0.000001% monthly and data loss ≈ 0.00000000003% monthly. First wiki canonical statement of the pattern's durability probabilities with explicit assumptions. Crowley calls out the single structural disadvantage vs network-attached storage: "the ability to re-attach a storage volume is a significant advantage over having to restore a backup, purely in terms of wall-clock time". This matters only for the single-replica-loss case, not the multi-replica cases that dominate the math. Price-performance canonicalisation: 58.4-58.5 IOPS/$ uniform on i4i (local NVMe) vs 0.84-13.2 IOPS/$ on r6a + EBS, a 13-17× price-performance advantage at 4xlarge scale, with additional discount runway from Reserved Instances / Savings Plans that EBS structurally lacks. Canonical production-migration datum: a million-QPS workload whose p99 query latency went from 9 ms to 4 ms from flipping the storage substrate alone.

  • sources/2025-03-13-planetscale-io-devices-and-latency — canonical article-length treatment: latency hierarchy, durability math, IOPS-cap critique, and the explicit Metal embodiment.
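
The busiest-first canary rollout described in the first item above can be sketched as a simple ordering step. The function name and the per-shard write rates are hypothetical; the source gives only the 8-shard total of roughly 10k writes per second, not a per-shard breakdown.

```python
def plan_busiest_first(shard_load: dict[str, float]) -> tuple[str, list[str]]:
    """Pick the most loaded shard as the canary; return it and the remaining shards."""
    if not shard_load:
        raise ValueError("no shards to migrate")
    ordered = sorted(shard_load, key=lambda shard: shard_load[shard], reverse=True)
    return ordered[0], ordered[1:]


# Hypothetical per-shard write rates that sum to 10,000/sec.
rates = [900, 1400, 1100, 2100, 1000, 1300, 1200, 1000]
load = {f"shard-{i}": float(rate) for i, rate in enumerate(rates)}

canary, rest = plan_busiest_first(load)
print(canary)     # shard-3
print(len(rest))  # 7
```

Migrating the worst shard first means the soak period tests the new substrate under the heaviest load available, so a clean result there is strong evidence for the remaining shards.
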