PLANETSCALE 2025-03-13

PlanetScale — IO devices and latency

Summary

PlanetScale's Ben Dicken publishes a pedagogical history of non-volatile storage devices — tape → HDD → SSD → cloud network-attached storage — framed as a sequence of latency step-changes punctuated by one regression: the cloud-era move to network-attached storage (EBS and similar), on which the modern cloud-database deployment is built, and which pays an order-of-magnitude round-trip penalty versus a locally attached NVMe SSD. The post is published in celebration of PlanetScale Metal, a new product tier that runs each database instance on a directly-attached NVMe drive in a primary + two-replica topology, inverting the industry default of network-attached storage (EBS on Amazon Aurora, Amazon RDS, Google Cloud SQL, and — previously — PlanetScale itself).

The article walks through the physical mechanics of each medium: tape cartridges with hundreds of meters of magnetic tape passing over a single read/write head (seconds-scale random access, excellent sequential throughput); spinning HDD platters with ~100k tracks per platter and a single seekable head (1–3 ms random reads, sequential-friendly); and SSDs built from NAND flash transistors organised as targets → blocks → pages with per-page read/write + per-block erase semantics (~16 μs random reads, no mechanical seek). It also covers the hidden-but-load-bearing details of SSD parallelism (a dedicated data line per target, only one page in flight per line) and garbage collection (dirty pages can't be overwritten; whole blocks must be erased after live pages are compacted off them).

The post closes with a cloud-era storage-latency table:

Hop                                 Round-trip
CPU → RAM                           ~100 ns
CPU → local NVMe SSD                ~50,000 ns (50 μs)
CPU → network-attached SSD (EBS)    ~250,000 ns (250 μs)

A 500× gap between RAM and local NVMe; a 5× gap between local NVMe and network-attached EBS. On top of the latency penalty, cloud network-attached storage throttles IOPS (GP3 EBS defaults to 3,000 IOPS per volume; GP2 pools and bursts credits). PlanetScale's position: replication-based durability + active capacity monitoring can close the durability + elasticity gap that drove the industry to network-attached storage in the first place, without paying the latency and IOPS penalty — hence Metal.
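
The gaps are simple division over the article's rounded figures — a sketch over teaching numbers, not a benchmark:

```python
# Round-trip latencies from the article's table, in nanoseconds.
ram_ns        = 100       # CPU → RAM
local_nvme_ns = 50_000    # CPU → local NVMe SSD
ebs_ns        = 250_000   # CPU → network-attached SSD (EBS)

ram_vs_nvme = local_nvme_ns / ram_ns   # 500× — RAM to local NVMe
nvme_vs_ebs = ebs_ns / local_nvme_ns   # 5×  — local NVMe to network-attached EBS
```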

Key takeaways

  1. Tape is still alive as an archive tier. "CERN has a tape storage data warehouse with over 400 petabytes of data under management. AWS also offers tape archiving as a service." Tape's durability + cost/GB + shelf-life still beats SSD and HDD for cold, massive, infrequently-read data, despite 10s-of-seconds random-access latency. (Source: article §"Tape".)

  2. Random HDD reads take 1–3 ms. The platter spins at ~7200 RPM and the head must physically seek to the correct track. "A typical random read can be performed in 1-3 milliseconds." HDDs dominate over tape because "the entire surface area of the bits is available 100% of the time" — no need to unwind a cartridge. (Source: article §"Hard disk drive".) The post's numbers align with the existing concepts/hard-drive-physics wiki page's 120-IOPS-per-drive flat ceiling from Warfield's FAST'23 framing.
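
The ~120-IOPS ceiling follows from rotation speed alone: at 7200 RPM a full revolution takes ~8.3 ms, so a drive paying roughly one revolution per random IO tops out near 120 IOPS. A back-of-envelope sketch (seek time and IO scheduling shift the exact figure):

```python
rpm = 7200
ms_per_revolution = 60_000 / rpm                # ~8.33 ms per full platter rotation
avg_rotational_wait_ms = ms_per_revolution / 2  # ~4.17 ms average wait for the sector
iops_ceiling = 1_000 / ms_per_revolution        # ~120 IOPS if each random IO costs ~one revolution
```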

  3. SSD random reads take ~16 μs — roughly 100× faster than HDD. "A random read on an SSD varies by model, but can execute as fast as 16μs (μs = microsecond, which is one millionth of a second)." The step-change comes from removing all mechanical motion: "all data is read, written, and erased electronically using a special type of non-volatile transistor known as NAND flash. This means that each 1 or 0 can be read or written without the need to move any physical components, but 100% through electrical signaling." (Source: article §"Solid State Drives".)

  4. SSDs are organised as targets → blocks → pages, with asymmetric read/write/erase semantics. "SSDs are organized into one or more targets, each of which contains many blocks which each contain some number of pages. SSDs read and write data at the page level, meaning they can only read or write full pages at a time. […] After a page is written to, it cannot be overwritten with new data until the old data has been explicitly erased. The tricky part is, individual pages cannot be erased. When you need to erase data, the entire block must be erased." Concrete capacity arithmetic: "say each page holds 4096 bits of data (4k). Now, say each block stores 16k pages, each target stores 16k blocks, and our device has 8 targets. This comes out to 4k * 16k * 16k * 8 = 8,796,093,022,208 bits, or 8 terabytes." This three-level hierarchy + the page/block asymmetry is captured in concepts/nand-flash-page-block-erasure. (Source: article §"Solid State Drives" + §"Data layout".)
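
The quoted capacity arithmetic checks out exactly, using the article's own illustrative sizes:

```python
page_bits         = 4096        # "say each page holds 4096 bits of data (4k)"
pages_per_block   = 16 * 1024   # 16k pages per block
blocks_per_target = 16 * 1024   # 16k blocks per target
targets           = 8

total_bits = page_bits * pages_per_block * blocks_per_target * targets
# 8,796,093,022,208 — the article's "8 terabytes"
```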

  5. SSD parallelism is a layout problem. "Typically, each target has a dedicated line going from the control unit to the target. This line is what processes reads and writes, and only one page can be communicated by each line at a time." Spread 8 writes across 4 targets → 2 time slices in parallel; write all 8 to one target → 8 sequential slices on one line while "all the other lines sat dormant." "This demonstrates that the order in which we read and write data matters for performance. Many software engineers don't have to think about this on a day-to-day basis, but those designing software like MySQL need to pay careful attention to what structures data is being stored in and how data is laid out on disk." Canonical wiki instance of concepts/ssd-parallelism-via-targets. (Source: article §"Parallelism".)
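
The time-slice counting above can be sketched directly: with one page in flight per target line, the slice count is the writes divided across targets, rounded up (a hypothetical helper, not PlanetScale code):

```python
import math

def write_time_slices(pages_to_write: int, targets_used: int) -> int:
    # One dedicated line per target; each line moves one page per time slice.
    return math.ceil(pages_to_write / targets_used)
```

write_time_slices(8, 4) gives 2; write_time_slices(8, 1) gives 8 — the case where "all the other lines sat dormant."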

  6. SSD garbage collection pays the hidden bill. Dirty pages accumulate; to reclaim them the drive must move live pages off the block, then erase the block. "When SSDs have a lot of reads, writes, and deletes, we can end up with SSDs that have degraded performance due to garbage collection. Though you may not be aware, busy SSDs do garbage collection tasks regularly, which can slow down other operations." This is the hidden latency tax that makes SSD write performance non-deterministic under load. See concepts/ssd-garbage-collection. (Source: article §"Garbage collection".)
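
A minimal model of the erase constraint: pages are written individually but only whole blocks erase, so reclaiming a block first copies its live pages elsewhere — those copies are the hidden write amplification (an illustrative sketch, not real firmware logic):

```python
def reclaim_block(block: list) -> int:
    """Erase one block: relocate live pages first, then erase the whole block."""
    live = [p for p in block if p == "live"]   # pages that must be copied out first
    block.clear()                              # pages can't be erased individually; the block can
    return len(live)                           # extra page writes = GC's hidden cost
```

A block holding two live and two dirty pages costs two extra page writes to reclaim — work that competes with foreground reads and writes.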

  7. Cloud went backward on storage latency. "A round trip from the CPU to a locally-attached NVMe SSD takes about 50,000 nanoseconds (50 microseconds). […] Read and write requires a short network round trip within a data center. The round trip time is significantly worse, taking about 250,000 nanoseconds (250 microseconds, or 0.25 milliseconds). Using the same cutting-edge SSD now takes an order of magnitude longer to fulfill individual read and write requests." The default cloud-database architecture (Aurora, RDS, Cloud SQL, prior PlanetScale) is built on this 5× slower hop. Canonicalised as concepts/network-attached-storage-latency-penalty. (Source: article §"Moving to the cloud".)

  8. Cloud also caps IOPS artificially. "Many cloud providers that use this model, including AWS and Google Cloud, limit the amount of IO operations you can send over the wire. By default, a GP3 EBS instance on Amazon allows you to send 3000 IOPS per-second. This can be configured higher, but comes at extra cost. The older GP2 EBS volumes operate with a pool of IOPS that can be built up to allow for occasional bursts." Direct-attached NVMe has no such cap: "you can read and write as fast as the hardware will allow for." See concepts/iops-throttle-network-storage. (Source: article §"Moving to the cloud".)
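
The GP2 pool-and-burst behaviour can be sketched as a credit bucket. The figures here come from AWS's GP2 documentation, not the article (assumption: 3 IOPS/GiB baseline with a 100-IOPS floor, burst to 3,000 IOPS, a 5.4-million-credit bucket); treat the helper as illustrative:

```python
def gp2_burst_seconds(volume_gib: int,
                      bucket: int = 5_400_000,
                      burst_iops: int = 3_000) -> float:
    baseline = max(100, 3 * volume_gib)        # credits refill at the baseline rate
    if baseline >= burst_iops:
        return float("inf")                    # large volumes sustain 3,000 IOPS outright
    return bucket / (burst_iops - baseline)    # seconds of full-speed burst from a full bucket
```

A 100 GiB volume (300-IOPS baseline) bursts at 3,000 IOPS for 5.4M / 2,700 ≈ 2,000 seconds before falling back to baseline — the throttling cliff a direct-attached drive never hits.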

  9. Replication closes the durability gap that drove the move to network storage. "Say in a given month, there is a 1% chance of a server failing. With a single server, this means we have a 1% chance of losing our data each month. […] However, with three servers, this goes down to 1% × 1% × 1% = 0.0001% chance (1 in one million). At PlanetScale the protection is actually far stronger than even this, as we automatically detect and replace failed nodes in your cluster. We take frequent and reliable backups of the data in your database for added protection." This is the thesis argument for why direct-attached storage is now viable again in the cloud: replication + automated failover + frequent backups solves the "server goes down, data gone" fear that made network-attached storage the default. See concepts/storage-replication-for-durability + patterns/direct-attached-nvme-with-replication. (Source: article §"How do we overcome issue 1".)
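
The quoted durability math, worked out under the same independence assumption (correlated rack/power/network failures, which the post doesn't model, would raise the result):

```python
p_fail_month = 0.01                    # the article's illustrative 1%/month per-server figure
p_lose_all_three = p_fail_month ** 3   # all three replicas fail in the same month
# ≈ 1e-06, i.e. 0.0001% — the article's "1 in one million"
```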

  10. Metal is the product embodiment. "With Metal, you get a full-fledged database cluster set up (Vitess or Postgres), with each database instance running with a direct-attached NVMe SSD drive. Each Metal cluster comes with a primary and two replicas by default for extremely durable data. We allow you to resize your servers with larger drives with just a few clicks of a button when you run up against storage limits. Behind the scenes, we handle spinning up new nodes and migrating your data from your old instances to the new ones with zero downtime. Perhaps most importantly, with a Metal database, there is no artificial cap on IOPS." Canonical wiki instance: systems/planetscale-metal. (Source: article §"PlanetScale Metal".)

Storage-latency hierarchy (article's key table)

Tier                                Typical round-trip                   Notes
CPU ↔ RAM                           ~100 ns                              Volatile; expensive by the GB.
CPU ↔ local NVMe SSD                ~50 μs (50,000 ns)                   "As fast as it gets for modern storage."
CPU ↔ network-attached SSD (EBS)    ~250 μs (250,000 ns)                 Same underlying SSD; ~5× slower due to network hop.
CPU ↔ random HDD read               ~1,000–3,000 μs (1–3 ms)             Head seek + platter rotation.
CPU ↔ random tape read              ~1,000,000–10,000,000 μs (1–10 s)    Unwind cartridge to position.

Canonicalised on the wiki as concepts/storage-latency-hierarchy.

Architectural numbers

  • SSD capacity example: 4k page × 16k pages/block × 16k blocks/target × 8 targets ≈ 8 TB (the article counts each page as 4,096 bits).
  • HDD tracks: "a single disk will often have well over 100,000 tracks. Each track contains hundreds of thousands of pages, and each page containing 4k (or so) of data."
  • HDD rotation: "7200 RPM is common, for example."
  • HDD random read: 1–3 ms.
  • SSD random read: ~16 μs.
  • RAM round-trip: ~100 ns.
  • Local NVMe round-trip: ~50 μs.
  • EBS round-trip: ~250 μs.
  • GP3 EBS default: 3,000 IOPS/volume (configurable higher at extra cost).
  • Durability math: 1% single-server monthly failure → replicated 3× → 1 in 1,000,000.
  • CERN tape archive: "over 400 petabytes" of data under management on tape.

Caveats

  • Numbers are illustrative, not measured. The latency values are rounded teaching numbers, not PlanetScale benchmarks — the article's own framing is pedagogical.
  • No Metal benchmarks published. The post announces Metal but doesn't disclose QPS / p99 / IOPS numbers against EBS-backed Aurora or RDS. Those live in the separate Announcing Metal post.
  • Network-attached storage framing ignores EBS's own engineering. EBS has spent a decade shrinking this gap (systems/nitro, systems/srd, systems/aws-nitro-ssd — see sources/2024-08-22-allthingsdistributed-continuous-reinvention-block-storage-at-aws); "an order of magnitude longer" is the comparison against average EBS, not io2 Block Express sub-ms. The direction is right, the magnitude is workload-specific.
  • Durability math is illustrative. 1%/month/server is a placeholder; real cloud MTBF varies by region, instance class, and age-of-hardware. The 3-server math is correct given independent failures; correlated failures (rack, power, network) change the answer — not discussed.
  • Tape coverage is historical/contextual. The article's treatment of tape is pedagogical, not architectural — PlanetScale does not ship tape.
  • No internal Metal architecture. Replication protocol, failover mechanism, cross-replica consistency model, and latency trade-offs of the three-node topology are all unspecified in this post.

Cross-source continuity

  • Complements Ben Dicken's prior sources/2024-09-09-planetscale-b-trees-and-database-indexes — that post established "primary-key choice determines the on-disk layout of every row"; this post zooms out to "the storage medium itself determines the latency floor every index design sits on top of." Both are Dicken's pedagogical-deep-dive voice on MySQL / InnoDB internals.
  • Complements the wiki's existing concepts/hard-drive-physics page — adds Dicken's 1–3 ms HDD random-read figure + the tape layer the physics page doesn't cover.
  • Complements the wiki's existing concepts/hdd-sequential-io-optimization page — the Dicken article restates the same "sequential is fast, random is slow" observation from a teaching-from-physics angle rather than the Kafka-on-HDD architectural angle.
  • Complements the 2024-08-22 AWS / Werner Vogels sources/2024-08-22-allthingsdistributed-continuous-reinvention-block-storage-at-aws / systems/aws-ebs page — provides the customer-side argument against EBS that Vogels's post answers from inside. The two posts together canonicalise the network-attached-vs-local-NVMe debate as a real architectural axis on the wiki, with two named positions: AWS's "close the gap with Nitro + SRD + custom SSDs" vs PlanetScale's "skip the gap via direct-attached NVMe + replication."
  • Complements Meta's sources/2025-03-04-meta-a-case-for-qlc-ssds-in-the-data-center on NAND-media economics — that post is the hyperscaler bandwidth/endurance-tier framing; this post is the application-database end-user framing of the same underlying NAND-flash physics.
  • Complements Dropbox's sources/2025-08-08-dropbox-seventh-generation-server-hardware on direct-attached storage + drive-physics realism — same thesis (disk physics still rules; architectures that pay attention to it win), different application domain (file sync service vs OLTP database).
