

Amazon EBS (Elastic Block Store)

Definition

Amazon Elastic Block Store (EBS) is AWS's network-attached block-storage service for EC2 instances — launched Aug 20, 2008, ~two years after EC2. EBS's primary workload is system disks for EC2 instances (not object archival), so its design goal is an unusual three-way balance of durability + performance + availability: because EC2's runtime experience tracks EBS's, EBS outliers become EC2 outliers become customer-application outliers.

Today EBS runs a distributed SSD fleet doing >140 trillion operations/day, and a single EC2 instance can consume more IOPS than an entire Availability Zone received in the HDD-era launch.

Evolution (as told by Marc Olson)

| Era | Media | Notable | Typical latency | Notes |
| --- | --- | --- | --- | --- |
| 2008 launch | HDDs (shared) | | 10s of ms avg | Net latency was a rounding error; HDD physics ruled. |
| Aug 2012 | SSDs (new server type + new volume type: Provisioned IOPS) | 1,000 IOPS max | ~2-3 ms avg | 10× the IOPS and 5-10× lower latency than HDD volumes. |
| 2013 | HDD servers retrofit with SSDs | write-staging SSD + async HDD flush | much improved writes | See patterns/hot-swap-retrofit. |
| 2013-17 | First systems/nitro offload card (network), then second (EBS + encryption) | hypervisor queues removed from IO path | falling | CPU no longer stolen for IO; hardware-isolated encryption keys. |
| 2014+ | systems/srd replaces TCP for storage traffic | multi-path, out-of-order, offload-friendly | tighter outliers | Same protocol later offered as systems/ena-express. |
| Later | systems/aws-nitro-ssd (custom SSD) | | | Stack tailored specifically to EBS. |
| Today (io2 Block Express) | Custom SSDs + Nitro + SRD | hundreds of thousands of IOPS / instance | sub-ms consistent | Up from >10 ms in 2008. |

Design themes

  • Noisy-neighbor elimination is the product. EBS customers don't pay for average latency; they pay for isolation. Early EBS learned that spreading a noisy tenant across many spindles widens the blast radius instead of containing it. See concepts/noisy-neighbor, concepts/performance-isolation.
  • Layered queueing everywhere. OS kernel ↔ storage adapter ↔ storage fabric ↔ target adapter ↔ media. EBS's perf work has been, for 15+ years, a coordinated attack on every queue in that path.
  • Measure-first: no change ships without multi-point IO instrumentation (client, network stack, durability engine, OS) plus continuous canary workloads for regression detection. See patterns/full-stack-instrumentation.
  • Divide-and-conquer by both code and org. The monolithic storage server was split into replication / durability / snapshot-hydration teams that deploy independently under shared contracts; cross-org cohorts spanning storage-server, client, hypervisor, and network-perf drive stack-wide improvements.
  • Hardware offload as queue reduction. systems/nitro moves VPC and EBS processing off the hypervisor, collapsing OS queues and freeing customer CPU.
  • SRD instead of TCP for storage. Storage IOs can arrive out-of-order; barriers resolve client-side; multi-path uses full data-center fabric. See systems/srd.
  • Non-disruptive maintenance is a compounding asset. Because volumes can be migrated transparently between servers and HW generations, every future upgrade — data-layout changes, new SSD servers, Nitro, custom silicon — gets delivered in flight. Volumes from Aug 2008 are still live after hundreds of underlying-server moves. See patterns/nondisruptive-migration.
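The layered-queueing and measure-first themes above combine into one habit: timestamp an IO at every layer boundary and attribute its latency to each hop. The following is a minimal sketch of that idea, not EBS's actual instrumentation. The layer names are borrowed from the queue path listed above, and the timestamps are invented for illustration.

```python
from typing import Dict, List, Tuple

# Layer boundaries in the order an IO crosses them. The names mirror the
# queue path described above; they are illustrative, not EBS identifiers.
LAYERS: List[str] = [
    "os_kernel", "storage_adapter", "storage_fabric", "target_adapter", "media",
]


def per_layer_latency_us(stamps_us: Dict[str, float], done_us: float) -> List[Tuple[str, float]]:
    """Attribute one IO's latency to each layer.

    stamps_us maps a layer name to the time (in microseconds) at which the IO
    entered that layer; done_us is when the media completed it. Time spent in
    a layer is the gap until the IO enters the next layer (or completes).
    """
    entries = [stamps_us[name] for name in LAYERS] + [done_us]
    return [(name, entries[i + 1] - entries[i]) for i, name in enumerate(LAYERS)]


def slowest_layer(breakdown: List[Tuple[str, float]]) -> Tuple[str, float]:
    """Return the (layer, microseconds) pair with the largest share."""
    return max(breakdown, key=lambda item: item[1])


# Hypothetical trace for a single write (all numbers are made up).
trace = {"os_kernel": 0.0, "storage_adapter": 15.0, "storage_fabric": 40.0,
         "target_adapter": 240.0, "media": 260.0}
breakdown = per_layer_latency_us(trace, done_us=330.0)
print(breakdown)
print("slowest:", slowest_layer(breakdown))
```

With per-layer numbers like these collected continuously, a regression shows up as a shift in one hop's share rather than as an unexplained change in end-to-end latency.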

Signature hot-swap story (2013)

Rather than field-replace thousands of HDD storage servers, EBS taped an SSD into every existing one — using industrial heat-resistant hook-and-loop tape in the only chassis slot that didn't disturb airflow (between the motherboard and fans). Writes landed on SSD (ack-on-SSD to the application), async flush to HDD. Zero customer disruption. This is the prototype of "fix the fleet in place" engineering culture at AWS storage. See patterns/hot-swap-retrofit.
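The write path described here (acknowledge once the data is on the SSD, flush to HDD later) is a write-staging scheme. Below is a toy in-memory sketch of that shape. The class name, the read path, and the oldest-first flush order are assumptions made for illustration; the source states only that writes were acknowledged on SSD and flushed asynchronously to HDD.

```python
from collections import OrderedDict
from typing import Dict, Optional


class StagedVolume:
    """Toy write-staging model: acknowledge on the fast tier, flush later."""

    def __init__(self) -> None:
        # block number -> data that has been acknowledged but not yet flushed
        self.ssd_staging: "OrderedDict[int, bytes]" = OrderedDict()
        self.hdd: Dict[int, bytes] = {}

    def write(self, block: int, data: bytes) -> str:
        # The write is acknowledged as soon as it sits on the staging SSD.
        self.ssd_staging[block] = data
        self.ssd_staging.move_to_end(block)
        return "ack"

    def read(self, block: int) -> Optional[bytes]:
        # The newest copy may still be in staging, so look there first.
        if block in self.ssd_staging:
            return self.ssd_staging[block]
        return self.hdd.get(block)

    def flush(self, max_blocks: int) -> int:
        """Move up to max_blocks staged blocks to the HDD, oldest first."""
        moved = 0
        while self.ssd_staging and moved < max_blocks:
            block, data = self.ssd_staging.popitem(last=False)
            self.hdd[block] = data
            moved += 1
        return moved


vol = StagedVolume()
vol.write(7, b"v1")
vol.write(7, b"v2")   # overwrites the staged copy before it ever reaches the HDD
vol.write(9, b"x")
print(vol.read(7), vol.hdd)
print(vol.flush(max_blocks=10), vol.hdd)
```

In this sketch the application-visible write latency depends only on the staging tier, while the HDD absorbs the flush in the background.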

Xen default that cost EBS years

The Xen hypervisor's default ring-queue parameters, inherited from the Cambridge lab's 2000s-era storage hardware, capped each EC2 host at 64 outstanding IO requests total across all devices, not per device. It took loopback isolation of each queue layer to surface this. See patterns/loopback-isolation. A canonical "always question your assumptions" case.
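Little's Law (concurrency = throughput × latency) shows why a host-wide cap on outstanding requests matters more as media gets faster: with at most N requests in flight and a per-request latency L, throughput cannot exceed N / L. A minimal sketch follows, using the 64-request cap from the text. The latencies are round illustrative figures loosely based on the evolution table, and the model ignores every other limit in the stack.

```python
def iops_ceiling(max_outstanding: int, latency_us: int) -> float:
    """Upper bound on IOPS from Little's Law: in-flight requests / latency."""
    return max_outstanding * 1_000_000 / latency_us


HOST_QUEUE_CAP = 64  # outstanding IO requests per host, per the text above

# Illustrative per-IO latencies in microseconds (assumed round numbers).
scenarios = [
    ("HDD era, ~10 ms", 10_000),
    ("SSD volumes, ~2 ms", 2_000),
    ("sub-ms target, assumed 0.5 ms", 500),
]

for label, latency_us in scenarios:
    print(f"{label}: at most {iops_ceiling(HOST_QUEUE_CAP, latency_us):,.0f} IOPS per host")
```

Under these assumptions, the same cap that allows a few thousand IOPS at HDD latencies limits an entire host to roughly 128,000 IOPS at 0.5 ms, shared across all devices.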

Seen in

  • PlanetScale's canonical architectural launch-post critique of EBS (Richard Crowley, 2025-03-11). Complements Olson's inside-AWS story (close the gap), Dicken's latency critique (skip the gap via local NVMe), and Van Wiggeren's fleet-scale reliability critique with two new canonical datums: (a) the EBS volume-type price spread verbatim: "$80 per TB for the slowest configuration to $2,573 per TB for the highest-performance EBS io2 volumes most instances can support", a ~32× price spread across the EBS product ladder (cleaner single-citation form than Dicken's gp3-default + io2-premium split); (b) instance-type-paired IOPS ceilings: "an r6i.4xlarge EC2 instance … can perform 40,000 IOPS if the volume or volumes can keep up" vs "an i4i.4xlarge EC2 instance can perform 220,000 random write or 400,000 random read IOPS using local NVMe SSDs", a 5.5-10× IOPS ratio on the same vCPU class. Canonical network-physics framing of the throughput throttle: "Even at the very expensive upper end — EBS io2, for example — the network holds the storage hardware back"; i.e., the physical network fabric is the binding constraint even on io2, not just the administrative IOPS/throughput caps. Canonical pricing inversion: "a high-performance network-attached storage volume capable of even 20,000 IOPS usually costs more than the virtual machine it's attached to", which canonicalises that on provisioned-IOPS tiers the volume costs more than the compute. Canonical Reserved-Instance asymmetry: "Amazon EBS cannot be discounted by either Reserved Instances or Savings Plans", a pricing-structure datum for EBS. The post's IOPS/$ table sits at 0.84-13.2 IOPS/$ across on-demand r6a + EBS configurations (vs 58.4 IOPS/$ uniform on i4i + local NVMe), canonicalising EBS's price-performance disadvantage as a durable quantitative datum.

  • sources/2026-04-21-planetscale-increase-iops-and-throughput-with-sharding: Canonical customer-facing pricing-pedagogy view of EBS's volume-type taxonomy. Ben Dicken (PlanetScale, 2024-08-19) canonicalises (a) the gp3 single-IOP semantics ("a single operation is measured as a one 64 KiB disk read or write"); (b) the independent IOPS + throughput caps (gp3 default 3,000 IOPS / 125 MiB/s, max 16,000 / 1,000 MiB/s; io2 max 256,000 / 4,000 MiB/s); (c) the sequential vs random I/O accounting ("For sequential reads, EBS will bundle requests together … For random reads, each read counts as a full IOP, even if it is less than 64k. … a single random read of a 4k block from disk will count as a full 64k IOP"), with the counterintuitive caveat that "this applies to workloads on both HDDs and SSDs on EBS"; (d) the burst bucket model ("EBS volumes also allow you to bank unused IOPS, up to a fixed limit", the canonical token-bucket-at-the-I/O-layer formulation). The architectural frame is pricing: the gp3 default is cheap, but once a workload crosses the gp3 ceiling the regime-shift to io1 / io2 provisioned-IOPS produces a super-linear cost multiplier (8× workload → 11-13× cost on RDS with io1). Canonical wiki statement that EBS's per-volume IOPS + throughput caps function as pricing staircases, not gradients, and that horizontal sharding is the architectural lever that keeps each shard below the premium-tier threshold (8× PlanetScale sharded = linear $13,992/mo vs unsharded RDS's $20,520-$24,197/mo). Complementary to the existing Olson / Dicken / Van Wiggeren perspectives on this page: they canonicalise EBS's latency and reliability from inside (Olson) and outside (Dicken IO-devices, Van Wiggeren failure-rate); this post canonicalises EBS's pricing shape from the customer side.

  • sources/2024-08-22-allthingsdistributed-continuous-reinvention-block-storage-at-aws — Marc Olson, 13-year insider retrospective: queueing theory, HDD→SSD, Xen teardown, Nitro, SRD, SSD retrofit, org-shape, scaling people.

  • sources/2025-03-13-planetscale-io-devices-and-latency: external customer-side critique of the EBS architecture. Ben Dicken frames the network-attached-storage model as a ~5× latency regression over local NVMe ("about 250,000 nanoseconds (250 microseconds)" for EBS vs "about 50,000 nanoseconds (50 microseconds)" for local NVMe) plus a default 3,000 IOPS/volume cap on GP3 + pool-and-burst on GP2. The Dicken + Olson pair canonicalise the two-side debate on network-attached block storage: AWS's "close the gap with Nitro + SRD + Nitro SSDs + io2 Block Express" vs customer-side "skip the gap by going back to local NVMe + replication" (systems/planetscale-metal, patterns/direct-attached-nvme-with-replication). The direction of the gap is right; its magnitude, for average volumes versus io2, is workload-specific. See concepts/network-attached-storage-latency-penalty + concepts/iops-throttle-network-storage for the generalised framings.

  • sources/2025-03-13-planetscale-io-devices-and-latency: PlanetScale (Ben Dicken) external-customer framing of EBS's network-attached latency penalty (~50 μs local NVMe vs ~250 μs EBS round-trip) and of the GP3 3,000-IOPS default cap as a throttle that direct-attached NVMe doesn't carry. Canonical customer-side argument that motivates PlanetScale Metal, the shared-nothing-on-local-NVMe architecture.

  • PlanetScale (Nick Van Wiggeren) fleet-scale reliability view of EBS. Load-bearing points:

    • The gp3 SLO literally: "at least 90% of provisioned IOPS 99% of the time in a given year", canonicalised as the wiki's load-bearing performance-variance datum (14 min/day or 86 h/year of potential degraded operation).
    • The fleet-scale blast-radius multiplier: 256 shards × 3 replicas = 768 gp3 volumes → 99.65% chance of at least one active impacting event at any given moment, under stated assumptions. See concepts/blast-radius-multiplier-at-fleet-scale.
    • io2 is not immune: "roughly one third of the time in any given year" on the same fleet.
    • Correlated-AZ-failure observed on io2 too, contradicting the naive "replicas on different volumes in same AZ → fate independence" assumption. See concepts/correlated-ebs-failure.
    • Customer-side mitigation stack: PlanetScale runs automated volume-health monitoring (read/write latency + idle % + write-file smoke test) + zero-downtime reparent in seconds to a healthy node (see patterns/zero-downtime-reparent-on-degradation) + automated replacement-volume provisioning.
    • Structural fix: Metal — shared-nothing direct-attached NVMe. Complementary to the AWS-side story on this page: Marc Olson's post is "close the gap", PlanetScale's is "skip the gap".
  • Benchmark-measured EBS behaviour across gp3-3k / gp3-10k / io2-16k vs a local-NVMe i7i.2xlarge on Postgres 17 and Postgres 18 (sync / worker / io_uring) under sysbench oltp_read_only. Empirical findings load-bearing for this page: (a) at 50 connections IOPS and throughput are the dominant bottleneck for each EBS variant — Dicken: "IOPS and throughput are clear bottlenecks for each of the EBS-backed instances. The different versions / I/O settings don't make a huge difference in such cases" — QPS scales in lockstep with the EBS capability tier; (b) the local-NVMe i7i consistently beats every EBS option across every Postgres config and concurrency level, often by a wide margin — workload-side empirical backing for the 2025-03-13 IO-devices post's ~50 μs vs ~250 μs latency-hop framing; (c) io2-16k costs $1,513.82/mo vs i7i's $551.15/mo for the same vCPU/RAM tier — io2 loses the price-performance comparison on this workload, supporting the PlanetScale Metal thesis with vendor-agnostic EC2 pricing. Postgres 18 async I/O does not rescue EBS — neither worker nor io_uring closes the gap to local NVMe. See concepts/iops-throttle-network-storage for the throttle concept + concepts/async-io-concurrency-threshold for why io_uring in particular underperforms on EBS at low concurrency.

  • sources/2025-05-08-yelp-nrtsearch-100-incremental-backups-lucene-10: EBS retirement story. Yelp Nrtsearch 1.0.0 moves the primary off EBS onto ephemeral local SSD. Three load-bearing pre-1.0 EBS drawbacks are enumerated in the post: (1) EBS was the source of truth for committed data, so EBS loss = full reindex; (2) "EBS movement was not as smooth as expected … the EBS volume would not be correctly dismounted from the old node, and then the new node would take some time to mount it"; (3) ingestion-heavy clusters needed frequent full backups "so that replicas did not have to spend too much time catching up with the primary after downloading the index." All three are resolved by shifting durability to S3 via incremental-backup-on-commit and bootstrapping replicas via parallel S3 download to local SSD. Canonical wiki datum for EBS's mount-boundary unreliability as a real operational concern at primary-restart time.

  • 2023-era "EBS is durable enough" framing that the 2025-era PlanetScale posts subsequently reframe. Sam Lambert (PlanetScale CEO, 2023-06-28) names EBS as the tier-4 layer in PlanetScale's seven-layer data-safety envelope. Verbatim: "we mount the MySQL data volume on cloud block storage, such as Amazon Web Services (AWS) Elastic Block Store (EBS) and Google Cloud Persistent Disk (GCPD), which are designed to be highly durable and reliable. Both EBS and GCPD use data replication to ensure that data is stored redundantly across multiple drives … self-healing, meaning they can detect and repair data inconsistencies automatically without user intervention." Canonical wiki framing: EBS as a trusted substrate primitive whose multi-drive replication and self-healing property the database layer above relies on. Framing shift with 2025-era PlanetScale posts: Dicken (2025-03-13 I/O devices and latency), Van Wiggeren (2025-03-18 Real failure rate of EBS), and Crowley (2025-03-11 PlanetScale Metal launch) all invert this 2023-era trust-the-substrate framing, arguing EBS is less reliable than Lambert's 2023 claim and motivating local-NVMe-with-application-layer-replication (PlanetScale Metal). Both positions coexist: the 2023 frame remains accurate for the default PlanetScale tier (EBS-backed); the 2025 posts establish that a higher-performance and higher-reliability tier needs a different substrate. The self-healing framing in particular is the novel 2023 datum that the 2025 posts subsequently revise by pointing at partial-failure / tail-latency-degradation modes that the self-healing doesn't catch.
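Several figures quoted in the entries above reduce to short calculations. The sketch below reproduces three of them: the effect of gp3's independent IOPS and throughput caps, the time budget implied by a "99% of the time" SLO, and how a per-volume impairment probability compounds across a fleet. The function names are hypothetical and the fleet calculation assumes volumes fail independently. The per-volume probability is a free parameter here; it is not a figure taken from the cited posts, whose own assumptions are not reproduced.

```python
KIB = 1024
MIB = 1024 * KIB


def effective_rate(iops_cap: float, throughput_cap_mib_s: float, io_size_kib: float):
    """Return (IOPS, MiB/s) achievable when both caps apply independently."""
    io_bytes = io_size_kib * KIB
    iops = min(iops_cap, throughput_cap_mib_s * MIB / io_bytes)
    return iops, iops * io_bytes / MIB


def slo_time_budget(fraction_outside_slo: float):
    """Return (minutes per day, hours per year) that may fall outside the SLO."""
    return fraction_outside_slo * 24 * 60, fraction_outside_slo * 365 * 24


def p_any_impaired(n_volumes: int, p_single: float) -> float:
    """Chance that at least one of n independent volumes is impaired."""
    return 1 - (1 - p_single) ** n_volumes


def implied_p_single(n_volumes: int, p_any: float) -> float:
    """Per-volume probability that would produce p_any across n volumes."""
    return 1 - (1 - p_any) ** (1 / n_volumes)


# gp3 defaults quoted above: 3,000 IOPS and 125 MiB/s.
print(effective_rate(3_000, 125, io_size_kib=4))    # small random reads: IOPS-bound
print(effective_rate(3_000, 125, io_size_kib=64))   # full-size IOs: throughput-bound

# "99% of the time" leaves 1% of the time outside the SLO.
print(slo_time_budget(0.01))

# 256 shards x 3 replicas = 768 volumes.
print(f"{p_any_impaired(768, 0.01):.4%}")        # hypothetical 1% per volume
print(f"{implied_p_single(768, 0.9965):.3%}")    # per-volume rate implied by 99.65%
```

Under the independence assumption, the quoted 99.65% fleet-level figure corresponds to a per-volume probability of well under 1%, which is why fleet size rather than any single volume dominates the outcome.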
