
PLANETSCALE 2025-03-11


PlanetScale — PlanetScale Metal: There's no replacement for displacement

Summary

Richard Crowley (PlanetScale, 2025-03-11) is PlanetScale's canonical architectural launch post for PlanetScale Metal: the product tier that substitutes Amazon EBS and Google Persistent Disk with fast, local NVMe drives on EC2 storage-optimised instance types. The post's title borrows the drag-racing idiom "there's no replacement for displacement" to frame its thesis: no amount of EBS engineering can mask the physics that network-attached storage is "very far away", and the path to low-variance, high-IOPS, low-dollar OLTP is to put the SSD next to the CPU and solve durability with application-layer semi-synchronous replication rather than with network-attached block replication. The article bundles three previously separate PlanetScale arguments into one architectural pitch: (1) the ~1 ms → μs-scale I/O-latency step (Dicken's 2025-03-13 IO-devices post), (2) the IOPS + throughput ceiling that EBS can't escape, and (3) a worked durability-math argument based on MySQL semi-sync replication across three AZs + backups + automated replica replacement. It then lands a price-performance punchline: an i4i.4xlarge Metal configuration delivers 58.4-58.5 IOPS per dollar on-demand vs 3.35-4.45 IOPS/$ on equivalent r6a + EBS configurations — a 13-17× price-performance advantage, with additional discount runway via Reserved Instances / Savings Plans that EBS does not offer.

The post is Tier-3 borderline — it's a product-launch frame — but clears scope decisively: the body is dense with distributed-systems internals (semi-sync replication across AZs, durability probabilities, EBS pricing-tier shape, instance-type-to-IOPS mapping) and concrete numbers (40,000 vs 220,000/400,000 IOPS, $80-$2,573 per TB EBS pricing, 9ms → 4ms p99 migration result, IOPS-per-dollar table). The durability math is illustrative (assumes an EC2 instance failure rate of 1% in 30 days) and arguably conservative on MTTR, but the post states its assumptions explicitly rather than hand-waving.

Key takeaways

  • Metal's architectural delta is one substitution. "Metal differs from the PlanetScale you already know well in exactly one way: We've substituted Amazon EBS and Google Persistent Disk with the fast, local NVMe drives available from the cloud providers." Everything else — Vitess, MySQL, Postgres, replication, backups, semi-sync — is unchanged. This is the canonical one-sentence statement of Metal's substrate swap. (Source: Crowley, verbatim.)

  • The network-attached-storage critique is physics, not engineering. "No amount of engineering from the cloud providers can mask the physical reality that network-attached storage is very far away." EBS latency "varies wildly" because "writes to these network-attached volumes pass through a NIC, network gear, and another machine before landing on a hard drive." This canonicalises the concepts/network-attached-storage-latency-penalty as a structural, not tunable floor — reinforcing the Olson-side "close the gap" vs PlanetScale-side "skip the gap" framing on systems/aws-ebs.

  • Canonical IOPS gap citation: r6i.4xlarge vs i4i.4xlarge. "Take the EBS performance of an r6i.4xlarge EC2 instance, for example. It can perform 40,000 IOPS if the volume or volumes can keep up. (Some EC2 instance types require striping multiple EBS volumes to achieve their maximum performance.) By contrast, an i4i.4xlarge EC2 instance can perform 220,000 random write or 400,000 random read IOPS using local NVMe SSDs!" This is the first wiki citation with an explicit instance-type pairing: r6i.4xlarge (compute/EBS) vs i4i.4xlarge (local NVMe): 5.5× more random write IOPS and 10× more random read IOPS on the same vCPU class. Wider than the 5× latency gap from the 2025-03-13 post.

  • EBS pricing ladder is a canonical staircase. "Network-attached storage is also expensive — $80 per TB for the slowest configuration to $2,573 per TB for the highest-performance EBS io2 volumes most instances can support." $80/TB → $2,573/TB is a ~32× price spread across the EBS volume-type ladder — a cleaner single-datum citation than Dicken's gp3 default + io2 premium splits. The i4i.4xlarge is positioned as "less expensive than an equivalently sized EBS gp3 volume with only 16,000 IOPS attached to an r6i.4xlarge" — a price inversion vs intuition.
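The $80/TB and $2,573/TB endpoints of the pricing ladder can be reproduced from published AWS us-east-1 EBS list prices (gp3: $0.08/GB-month with 3,000 IOPS included; io2: $0.125/GB-month plus tiered per-IOPS charges). A minimal sketch, assuming those list prices and a 1 TB volume; the 40,000-IOPS figure matches the io2 ceiling "most instances can support" that Crowley cites:

```python
# Sketch: deriving the EBS pricing-ladder endpoints from AWS us-east-1
# list prices. Prices are AWS's published per-GB / per-IOPS rates, not
# figures from the post itself.
def gp3_monthly(tb):
    # gp3: $0.08/GB-month, 3,000 IOPS and 125 MB/s included at no charge
    return tb * 1000 * 0.08

def io2_monthly(tb, iops):
    # io2: $0.125/GB-month storage, plus tiered provisioned-IOPS charges:
    # $0.065/IOPS-month up to 32,000, $0.046/IOPS-month for 32,001-64,000
    storage = tb * 1000 * 0.125
    tier1 = min(iops, 32_000) * 0.065
    tier2 = max(iops - 32_000, 0) * 0.046
    return storage + tier1 + tier2

print(gp3_monthly(1))          # the "$80 per TB" floor
print(io2_monthly(1, 40_000))  # the "$2,573 per TB" ceiling
```

Note the shape of the staircase: at the top of the ladder the provisioned-IOPS charge (~$2,448/month at 40,000 IOPS) dwarfs the storage charge ($125/month for the terabyte itself).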

  • Canonical production migration datum: 9ms → 4ms p99 on million-QPS workload. "Consider a real, million-QPS, production workload on PlanetScale. Its network-attached storage volumes report I/O latency around 1ms. We recently migrated it to PlanetScale Metal, using NVMe drives with I/O latency on the order of microseconds. As a result, its 99th percentile query latency dropped from 9ms to 4ms." Query-level p99 fell from 9ms to 4ms by flipping the storage substrate alone — zero schema change, zero query rewrite, zero topology change. Canonical wiki datum showing the storage-latency penalty propagates all the way to application-visible p99. Pairs with the 2025-10-14 Postgres 17 vs 18 benchmark's vendor-neutral EC2 finding that the i7i wins every scenario.

  • Metal's durability argument is MySQL semi-sync across 3 AZs + replica replacement + tested restores. "The basis for any distributed system's durability claim is replication. PlanetScale and PlanetScale Metal are no different. The replication that matters here is semi-synchronous, row-based, MySQL replication from a primary to two replicas distributed across three availability zones within a cloud region. Semi-synchronous replication ensures every write has reached stable storage in two availability zones before it's acknowledged to the client. Row-based replication integrates logically into transaction processing which allows readable replicas and backups. PlanetScale databases, Metal or not, are backed up at least daily. More importantly, each and every backup taken is tested by actually restoring it and starting up MySQL. This allows us to automatically and quickly replace failed replicas." This is the canonical three-pillar durability spec for Metal: (a) semi-sync cross-AZ replication (patterns/cross-dc-semi-sync-for-durability), (b) automated replica replacement, (c) tested backups (every backup restored and booted before being relied on).

  • Durability probability math: 99.999999% available, 99.99999999997% durable under stated assumptions. Assumes: 1% EC2 instance monthly failure rate, 5-minute re-attach + re-launch for EBS recovery, 5-hour backup-restore MTTR. Crowley flags these assumptions as "unfair to Metal": "this is far more often than we observe in production" (1% monthly), "we think this is on the fast side of fair" (5-min re-attach), "this is wildly conservative compared to restore times we see in production for even terabyte-scale databases" (5-hour restore). Under these assumptions: loss of write availability ≈ 0.000001%; data loss (lose all three replicas) ≈ 0.00000000003%. Canonical wiki datum for the patterns/direct-attached-nvme-with-replication pattern's durability math — independent failures only; correlated failure (AZ power event, instance-type retirement) not in the formula.
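The post doesn't show its exact model, but the stated figures can be reproduced with a simple steady-state approximation: each replica is independently down for a (failure rate × MTTR) fraction of the month, and the bad events are two or three replicas down at once. A hedged sketch under those assumptions:

```python
# Sketch: one independent-failure model that reproduces the post's
# stated durability figures. The post's exact derivation is not shown;
# this is a reconstruction under its published assumptions.
from math import comb

p_fail_month = 0.01        # P(an instance fails within 30 days), "unfair to Metal"
mttr_hours = 5.0           # backup-restore MTTR, "wildly conservative"
month_hours = 30 * 24

# Fraction of time any one replica is down: failure rate x repair time
q = p_fail_month * (mttr_hours / month_hours)   # ~6.9e-5

# Lose write availability: any 2 of the 3 replicas down simultaneously
p_write_loss = comb(3, 2) * q**2                # ~1.4e-8  -> ~0.000001%

# Lose data: all 3 replicas down simultaneously
p_data_loss = q**3                              # ~3.3e-13 -> ~0.00000000003%

print(f"write-availability loss: {p_write_loss*100:.7f}%")
print(f"data loss:               {p_data_loss*100:.13f}%")
```

Because the model multiplies independent per-replica downtimes, any correlated failure mode (AZ power event, replication-path bug) invalidates the exponent, which is exactly the caveat noted below.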

  • EBS volume re-attach is Metal's only structural disadvantage — but only on MTTR for a single-replica loss. "When one of the virtual machines serving one of the three replicas has a fault, the ability to re-attach a storage volume is a significant advantage over having to restore a backup, purely in terms of wall-clock time." Metal trades volume-reattach-in-minutes for backup-restore-in-hours — but this matters only for the single-replica-loss case; the multi-replica cases (loss of write availability, data loss) dominate the durability math and favour Metal's independent-failure-domain shape.

  • IOPS-per-dollar table canonicalises Metal's price-performance moat. r6a + EBS configurations at xlarge/2xlarge/4xlarge scale range 0.84 to 13.2 IOPS/$ on-demand; i4i (local NVMe) sits at 58.41-58.50 IOPS/$ uniformly across all three sizes. That's 13-17× on-demand price-performance at 4xlarge (58.50 vs 3.31-4.45 on provisioned-IOPS EBS). Plus: "Amazon EBS cannot be discounted by either Reserved Instances or Savings Plans but the instance storage that comes with the instances PlanetScale Metal uses can be." Metal's discount runway is structurally larger than EBS's — EBS Reserved pricing does not exist.

  • Workload-shape guidance: who Metal is for. "PlanetScale Metal is for high-I/O databases. It's for the most demanding, most critical workloads. It's for databases where microseconds matter." Explicit use cases: (a) "Constant, wide-ranging, random reads" — random reads are the worst case on EBS (concepts/sequential-vs-random-io); (b) "Working sets that don't fit into the InnoDB buffer pool" — Metal can replace memcached or a fleet of read replicas when the working set overflows RAM (concepts/working-set-memory); (c) "Massive write throughput? Replicas can't even keep up with the primary?" — Metal's higher IOPS ceiling reduces sharding pressure; (d) "Low tolerance for high latency" — p99 reduction propagates to user-visible latency.

  • Hardware-evolution argument: "Hardware is really good now." "The magnetic hard drives that were common when MySQL earned its production stripes could do maybe hundreds of IOPS. They were I/O-bound if you so much as looked at them funny. SSDs, first connected via SATA, then SAS, and nowadays NVMe, changed the equation. I/O latency is lower now because SSDs don't need to seek and because the interconnects have gotten faster, too. Throughput is higher because the interconnects have higher bandwidth. And that's really the secret of PlanetScale Metal: Hardware is really good now. The rest is PlanetScale doing everything it takes to let that hardware shine." This is the philosophical frame for Metal: the historical reasons to pay the network-storage tax (HDD-era "storage is slow anyway; the network hop is free" logic) are obsolete when local NVMe ships 220-400k IOPS at μs-scale latency.

Systems / concepts / patterns surfaced

Operational numbers

  • EBS IOPS ceilings: r6i.4xlarge + EBS → 40,000 IOPS ("if the volume or volumes can keep up"). "Some EC2 instance types require striping multiple EBS volumes to achieve their maximum performance."
  • Local-NVMe IOPS ceilings: i4i.4xlarge → 220,000 random write IOPS or 400,000 random read IOPS using local NVMe SSDs. Read/write asymmetry noted verbatim; the asymmetry is an NVMe hardware property (writes incur erase + program cycles vs reads' pure read operations).
  • IOPS ratio Metal-vs-EBS on the same vCPU class: ~5.5× writes, ~10× reads on comparable .4xlarge instances.
  • EBS volume-type price spread: "$80 per TB for the slowest configuration to $2,573 per TB for the highest-performance EBS io2 volumes most instances can support" → ~32× price spread across EBS volume types.
  • EBS provisioned-IOPS on-demand threshold: "A high-performance network-attached storage volume capable of even 20,000 IOPS usually costs more than the virtual machine it's attached to."
  • Production migration datum: million-QPS workload, EBS I/O latency ~1ms → Metal NVMe latency ~microseconds, p99 query latency 9ms → 4ms (~56% reduction).
  • IOPS / $ (on-demand) table (AWS, r6a + EBS vs i4i + local NVMe):
    Config                         xlarge   2xlarge   4xlarge
    r6a + EBS gp3 (3,000 IOPS)      3.35      1.68      0.84
    r6a + EBS gp3 (16,000 IOPS)    13.2       7.57      4.11
    r6a + EBS io2 (20,000 IOPS)     3.80      3.18      2.40
    r6a + EBS io2 (40,000 IOPS)     4.45      3.99      3.31
    i4i + instance storage         58.41     58.48     58.50

Note that r6a + gp3 (16k) at xlarge (13.2 IOPS/$) is the closest EBS configuration to Metal's IOPS/$ — still ~4.4× worse price-performance than Metal. At 4xlarge scale, Metal's advantage widens to 13-17×.

  • Discount runway: "Amazon EBS cannot be discounted by either Reserved Instances or Savings Plans but the instance storage that comes with the instances PlanetScale Metal uses can be." Metal's effective price-performance advantage widens under RI/SP commitments because EBS has no equivalent discount vehicle.
  • Durability math assumptions: (a) 1% EC2 instance failure rate within 30 days ("far more often than we observe in production"); (b) 5 minutes to detach EBS, launch a new instance, and attach the volume ("on the fast side of fair"); (c) 5 hours to restore a backup ("wildly conservative compared to restore times we see in production for even terabyte-scale databases").
  • Durability math results: (a) write-availability loss (lose 2 of 3 replicas within MTTR) ≈ 0.000001% monthly probability; (b) data loss (lose all 3 replicas) ≈ 0.00000000003% monthly probability. Independent-failure assumption — correlated AZ/instance-type/software-bug failure is not in the formula.
  • Backup cadence: "PlanetScale databases, Metal or not, are backed up at least daily." Every backup is tested by restoring it and booting MySQL (patterns/backup-restore-tested-periodically, implicit canonical instance).
  • Cluster shape: primary + two replicas distributed across three availability zones within a cloud region (semi-sync ack after two AZs have persisted the write). Same primary + 2 replicas shape as earlier Metal posts.
  • Cloud availability: "PlanetScale Metal is available today in both AWS and GCP."

Caveats

  • Durability math is illustrative, not audited. The 1%-monthly-independent-failure model is a teaching construct, explicitly called out as "unfair to Metal" (failures are rarer in practice) but it also assumes independent failures. Correlated failures — AZ-wide power events, instance-type retirements, software bugs in the replication path — are not in the formula. The 0.00000000003% data-loss figure should be read as "under independent-failure assumptions", not as a bound on all failure modes.
  • Post is a launch-with-architecture, not a production-incident retrospective. Crowley writes from the pitch side: Metal's own failure modes (local-NVMe drive failure rates on i4i / im4gn, instance-termination handling, noisy-neighbour behaviour on EC2 shared hardware, reparent consistency on Metal) are not architecturally detailed. The i4i.4xlarge IOPS headline assumes the drive is healthy; drive-failure rate on local NVMe is not cited. The production p99 migration datum (9ms → 4ms on million-QPS workload) is a single case study — sample size 1.
  • Price-performance table is AWS-only. GCP pricing is not reproduced in the IOPS/$ table. "PlanetScale Metal is available today in both AWS and GCP" but the economics analysis on this page is AWS-specific.
  • Instance-type naming is MySQL-weighted. The r6i.4xlarge / i4i.4xlarge datum is MySQL / Vitess-era. The sources/2026-04-21-planetscale-benchmarking-postgres post (later) names i8g M-320 as the Metal-for-Postgres reference instance — the Crowley datum should be read as "AWS x86 family on 2025 pricing", not as the universal Metal SKU.
  • "Primary + two replicas" shape is cross-3-AZ but replication mode is semi-sync to 2 replicas, not a quorum. Semi-sync with rpl_semi_sync_master_wait_for_slave_count = 1 means a commit acks after one replica has persisted — not two. Crowley's "every write has reached stable storage in two availability zones before it's acknowledged" implies the primary's AZ + the acking replica's AZ count as "two AZs"; it does not mean two replicas have acked. See concepts/mysql-semi-sync-replication for the semi-sync contract and concepts/minority-quorum-writeability for the failover implications.
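For reference, the ack contract this caveat hinges on is controlled by the MySQL semi-sync plugin variables. A hedged illustration — these are the real MySQL 5.7/8.0 variable names, but PlanetScale's actual production values are not disclosed in the post:

```sql
-- Illustrative only: the semi-sync knobs behind the ack contract.
SET GLOBAL rpl_semi_sync_master_enabled = ON;
-- Number of replica acks required before a commit returns. With the
-- default of 1, "stable storage in two availability zones" means the
-- primary's AZ plus one acking replica's AZ -- not two replica acks.
SET GLOBAL rpl_semi_sync_master_wait_for_slave_count = 1;
-- AFTER_SYNC ("lossless" semi-sync): the primary waits for the replica
-- ack after syncing its binlog but before the engine commit is visible.
SET GLOBAL rpl_semi_sync_master_wait_point = AFTER_SYNC;
```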
  • Backup-tested restore is stated but not quantified. "each and every backup taken is tested by actually restoring it and starting up MySQL" — canonical claim, but no datum on restore-success rate, restore-time distribution, or how failed restore-tests propagate back to alerting/remediation. The backup-restore testing pattern is load-bearing here; its mechanics are not disclosed.
  • Metal competes with itself on price below the high-I/O bar. "Some workloads that don't stress EBS (cached read-heavy, small-volume OLTP) don't see enough benefit to justify the pricing delta" — the post's own framing ("Metal is for high-I/O databases") acknowledges that workloads under 3,000 IOPS won't see Metal's economic advantage, because r6a + gp3 (3,000 IOPS) starts cheaper at xlarge scale even if IOPS/$ is worse.
  • Tier-3 product-launch framing. This is a Tier-3 vendor blog post with the architecture-launch shape — the post exists to sell Metal. It clears scope because the architectural content is >60% of body (numbers, instance-type pairings, pricing table, durability math, production migration datum). Marketing copy is factored out; load-bearing technical claims are reproduced verbatim with citation.

Cross-source continuity

This post is the architectural launch capstone of a trilogy + retrofit network on Metal across the wiki:

  • 2025-03-13 Ben Dicken — IO devices and latency — the latency argument (50 μs local NVMe vs 250 μs EBS, GP3 3,000-IOPS cap, durability-by-replication framing).
  • 2025-03-18 Nick Van Wiggeren — The real failure rate of EBS — the reliability argument (gp3 SLO = 14 min/day potential degradation, io2 correlated failure, 99.65% fleet-scale blast-radius probability).
  • 2025-03-11 (this post) Richard Crowley — No replacement for displacement — the economics + durability math capstone. Bundles the latency and reliability cases with the IOPS-per-dollar table and the probabilistic durability math. Most accessible single-article framing of Metal's architectural thesis.

Plus retrofit notes and adjacent canon:

  • 2024-08-19 Ben Dicken — Increase IOPS and throughput with sharding retrofits a link to Metal ("Since this article was written, we have released PlanetScale Metal…") — canonical wiki statement that Metal and sharding are two answers to the same IOPS-cost problem.
  • 2025-10-14 Ben Dicken — Benchmarking Postgres 17 vs 18 provides the vendor-agnostic empirical backing for the IOPS/$ table (i7i wins every scenario; io2 at $1,513.82/mo loses to i7i at $551.15/mo).
  • 2026-04-21 Multi-vendor Postgres benchmark names i8g M-320 as the Metal-for-Postgres AWS reference SKU, extending Crowley's AWS-x86 analysis to AWS-ARM-for-Postgres.

Crowley's post is the wiki's highest-leverage single-citation source for Metal's architectural thesis — the latency argument, the IOPS/throughput critique, the durability math, and the price-performance table all live in one piece. Dicken and Van Wiggeren's posts go deeper on single axes; Crowley's integrates.
