
ALLTHINGSDISTRIBUTED 2025-02-25 Tier 1


Building and operating a pretty big storage system called S3

Author: Andy Warfield (VP / Distinguished Engineer, S3), guest post hosted by Werner Vogels on All Things Distributed. Based on Warfield's USENIX FAST '23 keynote.

Raw: raw/allthingsdistributed/2025-02-25-building-and-operating-a-pretty-big-storage-system-called-s3-5b84cf78.md

Summary

Andy Warfield's FAST keynote examines S3 through three lenses of scale:

  1. Physical — the physics of storing data on millions of hard drives: HDD capacity keeps growing while seek time has flatlined, and the resulting I/O-per-TB ceiling forces "heat management" to be a first-class placement problem.
  2. Human/organizational — S3 is hundreds of microservices; "AWS ships its org chart," and durability is maintained through explicit mechanisms: durability reviews (threat-model your change) and lightweight formal verification (an executable spec of ShardStore, S3's rewritten per-disk layer, ~1% the size of the real code, tested against it on every commit).
  3. Individual — Warfield's shift from university professor and startup founder to senior Amazon engineer, framed through Amazon's ownership tenet: teams (and people) go faster when they own their problems end-to-end, and a senior engineer's leverage comes from articulating problems, not dispensing solutions.

Together: S3 is "software + hardware + people," and every layer is a scaling variable.

Key takeaways

  1. HDD physics is the hard constraint in S3's architecture. Capacity has grown 7.2M× (1956 RAMAC → 26 TB Ultrastar 2023) and is on track to 200 TB/drive this decade, but seek time has improved only 150×. A fully-random-access drive delivers about 120 IOPS — and that number has been flat since before S3 launched in 2006. At 200 TB/drive, that's 1 I/O per second per 2 TB. S3's placement and replication strategies are fundamentally about hiding this ratio from customers.
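The arithmetic behind that ceiling is worth making explicit. A quick back-of-envelope in Python using the post's figures:

```python
# Back-of-envelope on the IOPS-per-TB squeeze, using the post's numbers.
IOPS_PER_DRIVE = 120  # random-access IOPS per HDD, roughly flat since 2006

for capacity_tb in (2, 26, 200):  # early S3 era, 2023 Ultrastar, roadmap
    iops_per_tb = IOPS_PER_DRIVE / capacity_tb
    print(f"{capacity_tb:>4} TB drive: {iops_per_tb:5.2f} IOPS per TB")
```

At 200 TB that is 0.6 IOPS/TB, i.e. about one I/O per second for every ~1.7 TB stored, which the post rounds to "1 I/O per second per 2 TB."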

  2. Heat management is the central placement problem. "Heat" = requests per drive per unit time. Hotspots don't cause failures — they cause queueing, and that queueing amplifies through every layer (metadata lookups, erasure-coding reconstructs) into tail latency. "Hotspots at individual hard disks create tail latency, and ultimately, if you don't stay on top of them, they grow to eventually impact all request latency." See concepts/heat-management, concepts/tail-latency-at-scale.
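A toy queueing model makes the queueing claim concrete (the model choice is my illustration; the post does not specify one). In an M/M/1 queue the mean time in system is W = 1/(μ − λ), which blows up nonlinearly as a drive's offered load approaches its ~120 IOPS ceiling:

```python
# Toy M/M/1 queue: mean time in system W = 1 / (mu - lambda).
# mu is the drive's ~120 IOPS service ceiling; real drive queues are
# more complex, but the nonlinear blow-up near saturation is the point.
MU = 120.0  # service rate, IOPS

for offered_iops in (60, 100, 110, 118):
    w_ms = 1000.0 / (MU - offered_iops)  # mean time in system, in ms
    print(f"load {offered_iops:>3} IOPS -> mean latency {w_ms:6.1f} ms")
```

Half-loaded, the drive answers in ~17 ms; at 98% load the same drive takes half a second per request, and that delay propagates into every metadata lookup and erasure-coded reconstruct that touches it.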

  3. Aggregation of millions of workloads flattens peak demand. Individual storage workloads are bursty (mostly idle, occasional large peaks). But "as we aggregate millions of workloads a really, really cool thing happens: the aggregate demand smooths and it becomes way more predictable. … once you aggregate to a certain scale you hit a point where it is difficult or impossible for any given workload to really influence the aggregate peak at all." This is the math behind scale being a quality lever, not just a cost lever. See concepts/aggregate-demand-smoothing.
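The smoothing effect can be reproduced with a few lines of simulation (synthetic workload numbers, not the post's): each tenant is mostly idle with rare large bursts, and the aggregate's peak-to-mean ratio collapses as tenants are added.

```python
import random

random.seed(0)

def bursty_demand(steps=250):
    # One tenant: mostly idle (1 unit), occasional 100-unit burst.
    # Synthetic parameters for illustration only.
    return [100.0 if random.random() < 0.01 else 1.0 for _ in range(steps)]

def peak_to_mean(n_workloads):
    agg = [0.0] * 250
    for _ in range(n_workloads):
        for t, v in enumerate(bursty_demand()):
            agg[t] += v
    return max(agg) / (sum(agg) / len(agg))

for n in (1, 100, 10_000):
    print(f"{n:>6} workloads: aggregate peak/mean = {peak_to_mean(n):.2f}")
```

A single workload's peak is tens of times its mean; at thousands of workloads the aggregate is nearly flat, and no individual burst moves the peak. That is why S3 can provision for the (predictable) aggregate rather than the sum of per-tenant peaks.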

  4. Redundancy is a heat-management tool, not just a durability tool. Replication pays capacity overhead but lets you read from any replica — a free steering degree of freedom away from hot drives. Erasure coding (Reed-Solomon, k identity + m parity shards, read any k of k+m) reduces the capacity tax while keeping the same steering flexibility. S3 uses both. See concepts/erasure-coding, patterns/redundancy-for-heat.
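The steering freedom is easy to sketch: with k-of-(k+m) erasure coding, a reader can satisfy a GET from the k coolest shard-holding drives and simply skip the m hottest. The (k, m) values and queue depths below are illustrative; S3's real parameters are not public.

```python
# Sketch of erasure-coded read steering: any k of the k+m shards
# reconstruct the object, so reads can avoid the hottest drives.
# k, m, and the queue depths are made-up illustrative values.
K, M = 5, 4  # need any K shards out of K + M stored

# Hypothetical per-drive queue depths ("heat") for the 9 shard holders.
shard_heat = {f"drive-{i}": depth for i, depth in
              enumerate([3, 41, 2, 17, 5, 88, 1, 9, 4])}

# Read from the K coolest drives; the M hottest are simply not touched.
coolest = sorted(shard_heat, key=shard_heat.get)[:K]
print("read shards from:", coolest)
```

Replication gives the same degree of freedom (read any one replica) at a higher capacity tax; erasure coding keeps the steering choice while paying only m/k extra capacity.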

  5. Spread placement gives individual customers access to the whole fleet. S3 places each of a bucket's objects on a different set of drives. "Today, we have tens of thousands of customers with S3 buckets that are spread across millions of drives." A single Lambda-parallel genomics burst is served by over a million individual disks — a scale no single-tenant system could economically provision. See patterns/data-placement-spreading.
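A minimal sketch of spread placement, assuming a hash-seeded pseudo-random choice of drive set per object (an illustration of the idea, not S3's actual placement algorithm; the fleet here is scaled down from millions of drives):

```python
import hashlib
import random

# Illustrative spread placement: every object gets its own
# pseudo-random drive set, so one bucket fans out across the fleet.
FLEET = [f"drive-{i}" for i in range(100_000)]  # scaled-down fleet
SHARDS = 9  # shards per object; illustrative, not S3's real number

def placement(bucket: str, key: str) -> list[str]:
    # Seed a PRNG from the object identity and sample a drive set.
    seed = hashlib.sha256(f"{bucket}/{key}".encode()).digest()
    return random.Random(seed).sample(FLEET, SHARDS)

a = placement("genomics", "sample-001.bam")
b = placement("genomics", "sample-002.bam")
print("shared drives between the two objects:", len(set(a) & set(b)))
```

Two objects in the same bucket almost never share a drive, so one customer's burst is absorbed by the whole fleet rather than by any one drive's ~120 IOPS budget.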

  6. "AWS ships its org chart" — and in S3's case it's deliberate. The top-level S3 diagram (frontend fleet + namespace service + storage fleet + background data-services fleet) maps 1:1 to organizational groups. Sub-components recursively decompose into their own teams with their own fleets. Interactions between services are "literally API-level contracts" — modularity mistakes become inter-team friction, and fixing them is real engineering work on both the code and the org. In all, S3 comprises hundreds of microservices.

  7. Durability reviews: a threat-model for durability changes. Borrowed from security's threat-modeling discipline. When a change can affect durability, engineers write: (a) a summary of the change, (b) a comprehensive list of threats, (c) how the change is resilient. "It encourages authors and reviewers to really think critically about the risks we should be protecting against" and "it separates risk from countermeasures" so the two can be debated independently. Preference is for coarse-grained guardrails — broad mechanisms that defeat whole classes of risks — over per-threat mitigations. See patterns/durability-review.

  8. ShardStore + lightweight formal verification as a guardrail. ShardStore is S3's rewritten lowest-level per-disk storage layer, implemented in Rust for type safety (including libraries that extend type safety to on-disk structures). Alongside the production code, the team keeps an executable Rust model of ShardStore's logic — about 1% the size of the real system — in the same repo. Property-based testing generates scenarios that validate that implementation behavior matches specification behavior. "We even managed to publish a paper about this work at SOSP." The key organizational win: "we managed to kind of 'industrialize' verification, taking really cool, but kind of research-y techniques for program correctness, and get them into code where normal engineers who don't have PhDs in formal verification can contribute." See systems/shardstore, concepts/lightweight-formal-verification, patterns/executable-specification.
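The pattern is easy to miniaturize. Below is a Python analogue (the real artifact is a Rust model living beside ShardStore): a dict-backed model acts as the executable spec, a log-structured store plays the implementation, and randomly generated operation sequences assert the two agree.

```python
import random

class ModelStore:
    """Executable spec: an obviously-correct plain dict."""
    def __init__(self):
        self.d = {}
    def put(self, k, v):
        self.d[k] = v
    def get(self, k):
        return self.d.get(k)

class LogStructuredStore:
    """'Implementation': append-only log plus a location index."""
    def __init__(self):
        self.log = []
        self.index = {}
    def put(self, k, v):
        self.index[k] = len(self.log)
        self.log.append((k, v))
    def get(self, k):
        pos = self.index.get(k)
        return None if pos is None else self.log[pos][1]

rng = random.Random(42)
model, impl = ModelStore(), LogStructuredStore()
for _ in range(10_000):  # property-based: generated scenarios
    k = rng.randrange(50)
    if rng.random() < 0.6:
        v = rng.randrange(1000)
        model.put(k, v)
        impl.put(k, v)
    assert impl.get(k) == model.get(k)  # implementation matches spec
print("implementation matched spec on 10,000 generated operations")
```

The organizational point survives the miniaturization: the spec is tiny relative to the implementation, lives in the same repo, and any engineer can extend either side without formal-methods training.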

  9. Ownership as a scaling lever for people and teams. Teams (and engineers) who own a problem end-to-end — API contracts, durability, performance, 3-AM pages, post-incident improvements — move faster than teams who execute someone else's plan. Warfield's corollary from his professor days: "my most successful research projects were never mine. They were my students'." His conscious strategy at Amazon: articulate problems, not solutions. "There are often multiple ways to solve a problem, and picking the right one is letting someone own the solution." See concepts/ownership.

  10. S3 is software + hardware + people as a single organism. "S3 is effectively a living, breathing organism. Everything, from developers writing code running next to the hard disks at the bottom of the software stack, to technicians installing new racks of storage capacity in our data centers, to customers tuning applications for performance, everything is one single, continuously evolving system." Design decisions trade off against all three axes.

Operational numbers (as of the FAST '23 keynote / post publication)

| Metric | Value |
| --- | --- |
| Hard drives in S3 | "Millions" |
| Microservices composing S3 | "Hundreds" |
| Typical random-access IOPS per HDD | ~120 (flat since 2006) |
| HDD bit error rate (marketing) | 1 in 10¹⁵ reads |
| Largest HDD available (2023) | 26 TB (WD Ultrastar DC HC670) |
| HDD capacity improvement since 1956 RAMAC | 7.2M× |
| HDD physical shrink since 1956 RAMAC | 5,000× |
| HDD cost improvement per byte (inflation-adj.) | 6 billion× |
| HDD seek-time improvement since 1956 | 150× (the mechanical tax) |
| Drive-size roadmap this decade | 200 TB |
| I/O budget at 200 TB/drive | 1 IOPS per 2 TB |
| Individual customer burst capacity | >1,000,000 individual disks (e.g., Lambda-parallel genomics) |
| ShardStore spec size vs. production | ~1% |
| Plane-flying-over-grass analogy | 747 at 75 mph, paper-thickness air gap; track = 4.6 blades wide, bit = 1 blade long; one miss per 25,000 Earth circumnavigations |

Systems / concepts / patterns surfaced

Systems

  • systems/aws-s3 — the subject. Enriches the existing page with physical-scale, heat-management, org-structure, and ShardStore angles.
  • systems/shardstore — new. S3's rewritten per-disk storage layer, in Rust, with an adjacent executable spec. Published at SOSP.

Concepts

  • concepts/heat-management — new. S3's term for managing request distribution across disks.
  • concepts/hard-drive-physics — new. Capacity-vs-seek-time divergence; the IOPS-per-TB decline.
  • concepts/erasure-coding — new. Reed-Solomon (k, m) redundancy; rationale for EC over replication at scale.
  • concepts/aggregate-demand-smoothing — new. Law-of-large-numbers behavior at the multi-tenant workload level.
  • concepts/lightweight-formal-verification — new. The executable-spec flavor of formal methods that scales to a production team.
  • concepts/threat-modeling — new. Security-originated method repurposed for durability reviews.
  • concepts/ownership — new. Amazon's organizational primitive as a people-scaling lever.
  • concepts/noisy-neighbor — updated with the S3 heat-management angle (spread placement, aggregate smoothing).
  • concepts/tail-latency-at-scale — updated: hotspot → queueing → amplification through metadata + EC reconstruct layers → tail latency everywhere.
  • concepts/performance-isolation — updated with S3's spread-placement approach (a bucket's objects on disjoint drive sets).

Patterns

  • patterns/durability-review — new. Gated review process with a threat-model artifact for durability-affecting changes.
  • patterns/executable-specification — new. Specification authored in the same language as the implementation, committed alongside it, exercised on every build.
  • patterns/data-placement-spreading — new. Place different objects on different drive sets so any one workload reaches the whole fleet and no one drive inherits one customer's burst.
  • patterns/redundancy-for-heat — new. Use replication / erasure coding as I/O-steering mechanisms, not only durability mechanisms.

Caveats and context

  • Same author, different post. This is a FAST '23 keynote, republished/surfaced in 2025 via ATD; Warfield's 2025-03-14 post (sources/2025-03-14-allthingsdistributed-s3-simplicity-is-table-stakes) is the S3-at-19 "simplicity" retrospective. The two pair naturally: this one is the physical / operational story, the other is the API / feature-properties story.
  • The 120 IOPS number is a round figure for random-access read/write IOPS on a high-capacity SATA/SAS HDD. Sequential access is far higher; the relevance is that S3's workload sees random access.
  • Erasure-coding parameters (k, m) are not disclosed. The post describes the scheme generically. The real S3 parameters are not in the public literature.
  • "Hundreds of microservices" is the only direct count of S3's internal service graph Warfield gives. The top-level diagram is 4 boxes; each recurses.
  • ShardStore's role. ShardStore is "the bottom-most layer of S3's storage stack – the part that manages the data on each individual disk." It is not the full S3. It was chosen for rewrite because disk-layer bugs are the most durability-critical and hardest to test against a real drive at 120 IOPS.
  • The published SOSP paper is "Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3" — useful companion reading.
