
Amazon EBS (Elastic Block Store)

Definition

Amazon Elastic Block Store (EBS) is AWS's network-attached block-storage service for EC2 instances, launched Aug 20, 2008, roughly two years after EC2. EBS's primary workload is system disks for EC2 instances (not object archival), so its design goal is an unusual three-way balance of durability + performance + availability: because EC2's runtime experience tracks EBS's, EBS latency outliers become EC2 outliers, which become customer-application outliers.

Today EBS runs a distributed SSD fleet handling more than 140 trillion operations per day, and a single EC2 instance can now drive more IOPS than an entire Availability Zone handled at the HDD-era launch.

Evolution (as told by Marc Olson)

Era | Media | Notable | Typical latency | Notes
--- | --- | --- | --- | ---
2008 launch | HDDs (shared) | — | 10s of ms avg | Network latency was a rounding error; HDD physics ruled.
Aug 2012 | SSDs (new server type + new volume type: Provisioned IOPS) | 1,000 IOPS max | ~2-3 ms avg | 10× IOPS, 5-10× lower latency vs HDD volumes.
2013 | HDD servers retrofit with SSDs | write-staging SSD + async HDD flush | much-improved writes | See patterns/hot-swap-retrofit.
2013-17 | First systems/nitro offload card (network), then second (EBS + encryption) | hypervisor queues removed from IO path | falling | CPU no longer stolen for IO; hardware-isolated encryption keys.
2014+ | systems/srd replaces TCP for storage traffic | multi-path, out-of-order, offload-friendly | tighter outliers | Same protocol later offered as systems/ena-express.
Later | systems/aws-nitro-ssd (custom SSD) | — | — | Stack tailored specifically to EBS.
Today (io2 Block Express) | Custom SSDs + Nitro + SRD | hundreds of thousands of IOPS / instance | sub-ms, consistent | Up from >10 ms in 2008.

Design themes

  • Noisy-neighbor elimination is the product. EBS customers don't pay for average latency; they pay for isolation. Early EBS learned that spreading a noisy tenant across many spindles widens the blast radius instead of containing it. See concepts/noisy-neighbor, concepts/performance-isolation.
  • Layered queueing everywhere. OS kernel ↔ storage adapter ↔ storage fabric ↔ target adapter ↔ media. EBS's perf work has been, for 15+ years, a coordinated attack on every queue in that path.
  • Measure-first: no change ships without multi-point IO instrumentation (client, network stack, durability engine, OS) plus continuous canary workloads for regression detection. See patterns/full-stack-instrumentation.
  • Divide-and-conquer by both code and org. The monolithic storage server was split into replication / durability / snapshot-hydration teams that deploy independently under shared contracts; cross-org cohorts spanning storage-server, client, hypervisor, and network-perf drive stack-wide improvements.
  • Hardware offload as queue reduction. systems/nitro moves VPC and EBS processing off the hypervisor, collapsing OS queues and freeing customer CPU.
  • SRD instead of TCP for storage. Storage IOs can arrive out-of-order; barriers resolve client-side; multi-path uses full data-center fabric. See systems/srd.
  • Non-disruptive maintenance is a compounding asset. Because volumes can be migrated transparently between servers and HW generations, every future upgrade — data-layout changes, new SSD servers, Nitro, custom silicon — gets delivered in flight. Volumes from Aug 2008 are still live after hundreds of underlying-server moves. See patterns/nondisruptive-migration.
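The measure-first theme can be sketched as per-hop timestamping: each IO carries a timeline of when it crossed each layer boundary, and latency is then attributed to the spans between adjacent layers instead of being reported as one opaque total. This is an illustrative sketch only; the layer names below stand in for whatever instrumentation points a real stack exposes.

```python
import time

# Illustrative layer boundaries, not EBS's actual instrumentation points.
LAYERS = ["client", "network_stack", "durability_engine", "media"]

def timestamp(timeline, layer):
    """Record when an IO crosses a layer boundary."""
    timeline[layer] = time.perf_counter()

def attribute_latency(timeline):
    """Break total IO latency into per-hop spans between adjacent layers."""
    spans = {}
    for upper, lower in zip(LAYERS, LAYERS[1:]):
        spans[f"{upper}->{lower}"] = timeline[lower] - timeline[upper]
    return spans

# Simulated IO: pretend each layer does ~1 ms of work before handing off.
tl = {}
for layer in LAYERS:
    timestamp(tl, layer)
    time.sleep(0.001)  # stand-in for real work at this layer

for hop, seconds in attribute_latency(tl).items():
    print(f"{hop}: {seconds * 1000:.2f} ms")
```

With per-hop spans in hand, a regression in (say) the network stack shows up as growth in exactly one span rather than a vague increase in end-to-end latency, which is what makes multi-point instrumentation actionable.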

Signature hot-swap story (2013)

Rather than field-replace thousands of HDD storage servers, EBS taped an SSD into every existing one — using industrial heat-resistant hook-and-loop tape in the only chassis slot that didn't disturb airflow (between the motherboard and fans). Writes landed on SSD (ack-on-SSD to the application), async flush to HDD. Zero customer disruption. This is the prototype of "fix the fleet in place" engineering culture at AWS storage. See patterns/hot-swap-retrofit.
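The write path described above (ack once the data lands on SSD, flush to HDD in the background) can be sketched as a two-tier store with an async flusher. This is a minimal illustration under assumed names, not AWS code; real systems must also handle flush ordering, crash recovery, and staging-tier capacity.

```python
import threading
import queue

class WriteStagingVolume:
    """Illustrative sketch: acknowledge a write once it lands on the fast
    staging tier; flush to the slow backing tier asynchronously."""

    def __init__(self, ssd, hdd):
        self.ssd = ssd              # fast staging store (block -> data)
        self.hdd = hdd              # slow backing store (block -> data)
        self._dirty = queue.Queue() # blocks awaiting flush
        self._flusher = threading.Thread(target=self._flush_loop, daemon=True)
        self._flusher.start()

    def write(self, block, data):
        self.ssd[block] = data      # land the write on the SSD first
        self._dirty.put(block)      # schedule the async HDD flush
        return "ack"                # ack as soon as the SSD copy exists

    def read(self, block):
        # The staging tier wins if it holds a copy not yet flushed.
        return self.ssd.get(block, self.hdd.get(block))

    def _flush_loop(self):
        while True:
            block = self._dirty.get()
            self.hdd[block] = self.ssd[block]  # slow background flush
            self._dirty.task_done()

vol = WriteStagingVolume(ssd={}, hdd={})
vol.write(0, b"hello")  # returns immediately after the SSD landing
vol._dirty.join()       # demo only: wait for the background flush
```

The design choice mirrors the retrofit: the application sees SSD write latency while the HDD continues to provide the bulk capacity, with the flush happening off the critical path.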

Xen default that cost EBS years

The Xen hypervisor's default ring-queue parameters, inherited from the Cambridge lab's 2000s-era storage hardware, capped each EC2 host at 64 outstanding IO requests total across all devices, not per device. It took loopback isolation of each queue layer to surface this. See patterns/loopback-isolation. A canonical "always question your assumptions" case.
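Little's Law shows why a shared 64-request cap is so punishing: sustainable throughput is bounded by outstanding requests divided by per-request latency. A back-of-envelope sketch (latency figures are the rough per-era averages quoted earlier, purely illustrative):

```python
def iops_ceiling(outstanding, latency_ms):
    """Little's Law: concurrency = throughput * latency,
    so throughput <= outstanding / latency."""
    return outstanding * 1000 / latency_ms

# One host, 64 outstanding IOs shared across every attached volume:
hdd_era = iops_ceiling(64, 10)  # ~10 ms HDD -> 6,400 IOPS ceiling per host
ssd_era = iops_ceiling(64, 2)   # ~2 ms SSD  -> 32,000 IOPS ceiling per host
```

At HDD latencies the cap was invisible; once SSDs cut latency by 5-10×, the same 64-slot ring became the binding constraint on the whole host, which is why it took queue-by-queue loopback isolation to find it.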
