
ALLTHINGSDISTRIBUTED 2024-08-22 Tier 1


Continuous reinvention: A brief history of block storage at AWS (Marc Olson, guest post on Werner Vogels' blog)

Summary

Marc Olson, a ~13-year veteran of the EBS team, narrates EBS's arc from a 2008 HDD-backed shared-disk service into a distributed SSD fleet doing >140 trillion operations/day, with single instances now getting more IOPS than entire AZs got in the HDD era. The post is a long retrospective on queueing theory, instrumentation, noisy-neighbor elimination, and the rolling stack of architectural changes — SSDs, Xen IO-path fixes, Nitro offload cards, SRD replacing TCP, org-shape "divide and conquer," and a human-effort hot-swap that stuffed an SSD into every existing storage server mid-flight. Also a first-person account of Olson realizing he had become his team's performance bottleneck and shifting to peer debugging / empowerment as a scaling lever.

Architecture (evolution)

  • 2008 launch — HDD-backed shared disks: network-attached block storage for EC2. End-to-end latency in the tens of ms, dominated by the HDDs (~120-150 IOPS/drive, 6-8 ms avg, tail into the hundreds of ms); network latency was "10s of microseconds" — a rounding error. Early mitigation for bad per-drive variance: spreading customers across many disks reduced the peak outlier but spread the inconsistency across more tenants, so the noisy-neighbor problem became critical.
  • 2012 — Provisioned IOPS on SSDs. New storage-server type, new volume type. 1,000 IOPS max, ~2-3 ms avg (10× IOPS, 5-10× latency improvement, far better outlier control). Shipping SSDs did not auto-fix noisy neighbors — it shifted the bottleneck up the stack (network + software).
  • 2012-13 — instrumentation push. Instrumented every IO at multiple points: client initiator, network stack, storage durability engine, OS. Added canary workloads running continuously so changes could be measured positive/negative.
  • 2013 — divide-and-conquer org-shape + architectural blueprint. The monolithic storage-server team was decomposed into small component-focused teams (replication, durability, snapshot hydration). A systems/physalia-style move: remove the control plane from the IO path (whiteboard shown in the post).
  • 2013 — SSD retrofit into the existing HDD fleet. See patterns/hot-swap-retrofit. SSDs tucked behind the motherboard with heat-resistant industrial hook-and-loop tape. Writes land on the SSD first (ack to the app), flushed async to HDD. "Converting a propeller aircraft to a jet while it was in flight."
  • 2013+ — Xen IO-path teardown. Isolating each queue layer over loopback revealed that Xen's default ring-queue parameters (inherited from Cambridge's early storage hardware config) capped the host at 64 outstanding IOs total, across all devices — a host-wide noisy-neighbor source. Fixed first by tuning; then began the larger move off Xen for IO.
  • ~2013-17 — Nitro offload cards. First card: VPC network processing moved from Xen dom0 → dedicated hardware pipeline. Second card: EBS storage processing + hardware-accelerated EBS encryption (key material isolated from the hypervisor). Hypervisor queue layers collapsed; CPU no longer stolen from customer instances for IO.
  • 2014+ — SRD (Scalable Reliable Datagram) replaces TCP for storage traffic. Key observations: (1) we own the data-center network, we don't need internet-generality; (2) in-flight storage IOs can be reordered — barriers handled at the client before the network. SRD uses multiple paths, lets requests arrive out-of-order, recovers/reroutes around failures. Same protocol later offered as systems/ena-express (ENA Express) to accelerate guest TCP.
  • Later — systems/aws-nitro-ssd (custom SSDs). The team now builds its own SSDs with a stack tailored to EBS's needs.
  • Today — io2 Block Express. Sub-ms IO operations (from >10 ms avg in 2008); delivered without a cutover — long-running volumes have lived on hundreds of servers across generations.
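The noisy-neighbor tradeoff in the 2008 bullet can be sketched with the textbook M/M/1 latency curve — a toy model; the 7 ms service time echoes the post's HDD numbers, but the load fractions and disk counts are illustrative, not from the post:

```python
def mm1_latency(service_ms, utilization):
    """Mean residence time in an M/M/1 queue: service time inflated by 1/(1 - rho)."""
    assert 0.0 <= utilization < 1.0
    return service_ms / (1.0 - utilization)

SERVICE_MS = 7.0   # one HDD's ~6-8 ms average service time (from the post)
BASELINE = 0.30    # assumed per-disk load from quiet tenants (illustrative)
NOISY = 0.60       # assumed extra load from one noisy tenant (illustrative)

# Noisy tenant concentrated on one disk: that disk runs at 90% and queues badly.
concentrated = mm1_latency(SERVICE_MS, BASELINE + NOISY)     # ~70 ms

# Spread across 10 disks, each takes only 6% extra: the noisy tenant's
# own latency collapses...
spread = mm1_latency(SERVICE_MS, BASELINE + NOISY / 10)      # ~10.9 ms

# ...but every quiet tenant on all 10 disks now sees that same slowdown,
# instead of only the co-tenants of one disk: a wider blast radius.
quiet_before = mm1_latency(SERVICE_MS, BASELINE)             # ~10 ms
quiet_after = spread
```

The asymmetry is the whole story: spreading is great for the noisy tenant and mildly bad for everyone else, which is exactly the "reduced peak outlier, wider inconsistency" outcome the early fleet hit.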

Numbers to remember

  • 140 trillion ops/day across the distributed SSD fleet today.
  • HDD physics ceiling: ~120-150 IOPS/drive, 6-8 ms avg, tail into hundreds of ms.
  • Aug 2012 Provisioned IOPS launch: 1,000 IOPS max, ~2-3 ms avg — 10× IOPS, 5-10× latency vs HDD volumes.
  • Xen default: 64 outstanding IO requests per host, not per device — a scaling accident inherited from Cambridge's 2000s-era lab storage.
  • Latency arc: >10 ms avg IO (2008) → sub-ms consistent (io2 Block Express today).
  • Live volumes: some EBS volumes still active from the first few months of 2008 have been migrated across hundreds of servers and multiple hardware generations with zero customer-visible disruption.

Key takeaways

  1. Noisy neighbor is the central quality problem in multi-tenant storage. Spreading a noisy tenant across many disks reduces their worst case but widens the blast radius; early EBS learned this the hard way. Strong performance isolation (not just averages) is what customers actually pay for. See concepts/noisy-neighbor, concepts/performance-isolation. (Source: body, "Queueing theory, briefly".)

  2. If you can't measure it, you can't manage it. 2012 EBS had only rudimentary telemetry; the turnaround started by instrumenting every IO at every layer plus canary workloads for continuous regression detection. This is what made every subsequent optimization falsifiable. See patterns/full-stack-instrumentation. (Source: body, "If you can't measure it…")
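The per-layer approach can be sketched as a timing wrapper: stamp the IO as it crosses each layer, then report per-hop deltas so a regression is attributable to one layer. The layer names come from the post's list; the function names and the trivial canary are illustrative:

```python
import time

# The instrumentation points named in the post.
LAYERS = ("client_initiator", "network", "durability_engine", "os")

def trace_io(io_fn):
    """Record a monotonic timestamp as a (simulated) IO crosses each layer,
    then return per-hop latencies in nanoseconds."""
    stamps = []
    for layer in LAYERS:
        stamps.append((layer, time.monotonic_ns()))
        io_fn(layer)                      # do this layer's share of the work
    stamps.append(("complete", time.monotonic_ns()))
    # Delta between consecutive stamps = time attributed to that layer.
    return {stamps[i][0]: stamps[i + 1][1] - stamps[i][1]
            for i in range(len(stamps) - 1)}

hops = trace_io(lambda layer: None)       # a trivial canary "IO"
```

A canary in this scheme is just `trace_io` run continuously against a known workload, with the per-hop series alarmed on — which is what makes a change's effect measurable as positive or negative.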

  3. SSDs don't auto-fix distributed storage — they move the bottleneck. Replacing HDDs with SSDs collapsed disk latency but spotlighted queues, the network, and the hypervisor as the new hot spots. "We thought dropping SSDs in would solve almost all our problems… noisy neighbors weren't automatically fixed." (Source: body, "Set long term goals…")

  4. Incrementalism beats big-bang rewrites. The 2013 architectural blueprint "ended up looking nothing like what EBS looks like today" — but it gave direction. The org committed to small, observable, reversible changes and to shipping customer value along the way. See concepts/incremental-delivery. (Source: body, "Set long term goals…" and "Divide and conquer".)

  5. Org design is software design. Amazon's "small teams owning well-defined APIs" approach was applied to a data plane, not just retail microservices: the monolithic EBS storage server was decomposed into replication / durability / snapshot-hydration teams that could iterate and deploy independently. Cohorts of cross-stack experts were also stood up (storage server + client + EC2 hypervisor + AWS-wide network perf). Reference cited: Introduction to Algorithms (CLR). (Source: body, "Divide and conquer".)

  6. Always question your assumptions — the biggest wins are defaults. The Xen ring-queue defaults capping the whole host at 64 outstanding IOs were inherited from the Cambridge lab's 2000s hardware. Nobody questioned them for years. Finding them came from loopback-isolating each queue layer and measuring interference (see patterns/loopback-isolation). (Source: body, "Always question your assumptions!")
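Why a host-wide cap is itself a noisy-neighbor source can be shown with a toy admission counter — purely a model of the shared-budget failure mode, not Xen's actual ring code:

```python
HOST_WIDE_SLOTS = 64   # the inherited default: one IO budget for the whole host

def admit(inflight_by_device, device, shared_limit=HOST_WIDE_SLOTS):
    """Host-wide cap: a new IO is admitted only if total in-flight IOs
    across *all* devices is under the shared limit."""
    if sum(inflight_by_device.values()) >= shared_limit:
        return False
    inflight_by_device[device] = inflight_by_device.get(device, 0) + 1
    return True

inflight = {}
# One busy volume fills the entire host budget...
for _ in range(64):
    admit(inflight, "busy_volume")
# ...and an unrelated quiet volume on the same host is now refused admission:
starved = not admit(inflight, "quiet_volume")
```

With a per-device limit instead (`sum(...)` replaced by `inflight_by_device.get(device, 0)`), the quiet volume is unaffected by its neighbor — the isolation boundary moves from host to device.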

  7. Hardware offload is a queue-reduction strategy. Nitro moved VPC and then EBS processing off the hypervisor onto dedicated cards. This (a) removed several OS queues from the IO path; (b) stopped stealing CPU from customer instances; (c) let EBS encryption run at line rate with key material isolated from the hypervisor. See concepts/hardware-offload, systems/nitro. (Source: body, "Always question your assumptions!")

  8. TCP is not the right protocol for in-data-center storage. systems/srd drops TCP's strict in-order delivery (storage IOs can be reordered — barriers resolved at the client), uses multiple network paths, and is offload-friendly. Counterintuitive earlier finding the post mentions: adding small random latency to storage requests reduced average and outlier latency by smoothing the network. (Source: body, "Always question your assumptions!")
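The "barriers resolved at the client" idea can be sketched as a sequence-number tracker — a minimal model under assumed semantics; the post does not describe SRD's actual client logic:

```python
class BarrierTracker:
    """Client-side ordering over an out-of-order transport (SRD-style sketch).

    IOs carry sequence numbers and may complete in any order; a barrier at
    sequence B is released only once every IO with seq < B has completed,
    so ordering never has to be enforced by the network itself.
    """
    def __init__(self):
        self.completed = set()   # out-of-order completions not yet contiguous
        self.next_needed = 0     # lowest sequence number not yet completed

    def complete(self, seq):
        self.completed.add(seq)
        # Advance the contiguous frontier as far as completions allow.
        while self.next_needed in self.completed:
            self.completed.remove(self.next_needed)
            self.next_needed += 1

    def barrier_released(self, barrier_seq):
        return self.next_needed >= barrier_seq

t = BarrierTracker()
for seq in (2, 0, 3, 1):          # completions arrive out of order
    t.complete(seq)
released = t.barrier_released(4)  # all of 0..3 are done, so a barrier at 4 releases
```

Because only the frontier matters, the network is free to race requests down multiple paths — the reordering cost is paid once, at the client.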

  9. "Tail at scale" forced this design family. EBS touches the whole fleet on failure recovery and data replication; variance anywhere shows up as customer-visible jitter. The team's attention to outliers — hedging, smoothing, network path diversity, offload, eventually custom SSDs — maps cleanly onto the tail-at-scale framing. See concepts/tail-latency-at-scale. (Source: body — the post's whole arc is about outliers, not averages.)
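The "hedging" mentioned above can be sketched as a deadline-triggered duplicate request; this is a toy latency model (the threshold and replica setup are illustrative, not from the post):

```python
def hedged_read(latencies_by_replica, hedge_after_ms=5.0):
    """Tail-cutting sketch: issue a read to the primary; if no reply arrives
    within hedge_after_ms, issue a duplicate to a second replica. The
    effective latency is whichever completion comes back first."""
    primary, secondary = latencies_by_replica
    if primary <= hedge_after_ms:
        return primary                           # fast path: no hedge issued
    return min(primary, hedge_after_ms + secondary)

fast = hedged_read((1.2, 1.0))    # healthy primary: answered in 1.2 ms
slow = hedged_read((80.0, 1.0))   # stalled primary: the hedge answers at 6.0 ms
```

The common case pays nothing; the rare stall is bounded near the hedge threshold instead of the straggler's full latency — which is why hedging attacks outliers rather than averages.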

  10. Constraints breed innovation — the SSD hot-swap. Rather than field-replace thousands of servers (too expensive), EBS identified the one empty spot in the chassis that didn't break airflow (between motherboard and fans), used industrial hook-and-loop tape, and physically taped an SSD into every existing storage server over a few months in 2013. Software staged writes to the SSD (ack-on-SSD, async flush to HDD). Zero customer disruption because the system was designed from day one for non-disruptive maintenance events — retarget volumes, update software, rebuild empty servers in place. See patterns/hot-swap-retrofit, patterns/nondisruptive-migration. (Source: body, "Constraints breed innovation".)
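The ack-on-SSD, async-flush-to-HDD staging can be sketched as a write-back tier; names and structure are illustrative, since the post gives no implementation details:

```python
class StagedWriter:
    """Write-staging sketch of the 2013 retrofit: acknowledge a write as soon
    as it lands on the SSD, then drain to the HDD asynchronously."""
    def __init__(self):
        self.ssd_log = []   # fast staging tier (the taped-in SSD)
        self.hdd = {}       # slow backing store, block -> data

    def write(self, block, data):
        self.ssd_log.append((block, data))   # land on the SSD first...
        return "ack"                         # ...ack at SSD latency

    def flush(self):
        """Runs asynchronously in the real system; drains staged writes in order."""
        while self.ssd_log:
            block, data = self.ssd_log.pop(0)
            self.hdd[block] = data

w = StagedWriter()
ack = w.write(7, b"payload")   # caller sees the SSD-speed ack
w.flush()                      # the HDD catches up later
```

The customer-visible latency becomes the SSD's, while the HDD keeps its role as the durable bulk tier — consistent with the caveat that the SSDs were a staging tier, not a replacement.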

  11. Non-disruptive migration is a compounding asset. Because EBS could move volumes between servers and HW generations without customer visibility, the same mechanism paid off across many later upgrades (data-layout changes, new storage-server types, Nitro offload, custom SSDs). Volumes from Aug 2008 are still live after crossing hundreds of servers. (Source: body, "Constraints breed innovation".)

  12. Scaling people is a different problem from scaling systems. Olson describes a pivot: he was personally on every escalation and every CR, becoming the org's perf bottleneck. Moving to peer-debugging sessions (shared-terminal, shared-systems-knowledge) and "remove roadblocks but leave guardrails" leadership surfaced a latent locking/jitter bug the group caught together. See patterns/peer-debugging. (Source: body, "Reflecting on scaling performance".)

Caveats

  • This is a retrospective essay, not a design paper. Specific internals (storage-server replication protocol, Physalia-vs-classic control plane boundary, Crossbar-equivalents for EBS, exact SRD wire format) are gestured at, not described. Pair with:
      • The 2023 Andy Warfield piece on HDD physics at scale (linked in the post).
      • "A Cloud-Optimized Transport Protocol for Elastic and Scalable HPC" (2020) for SRD.
      • AWS's reinventing-virtualization post on Nitro for the offload-card family context.
  • The SSD-retrofit story is a photograph-with-captions — no wear-leveling, capacity planning, or SSD-lifetime analysis is given. The design clearly assumed the SSDs were a staging tier, not a replacement.
  • No Physalia details here; EBS's control-plane-out-of-the-IO-path story is represented by a whiteboard photo only.
  • The "adding small random latency reduces average latency" claim is stated as a tuning experience, not an analyzed result; treat as a hint, not a recipe.
  • Timing: the earlier sources/2024-11-15-allthingsdistributed-aws-lambda-prfaq-after-10-years and this post both frame AWS-wide moves ("Nitro offload," "custom silicon") as the same arc — EBS was an early beneficiary, not the origin, of the Nitro design family.