High Scalability — Behind AWS S3's Massive Scale¶
Summary¶
Third-party explainer by Stanislav Kozlovski (Apache Kafka committer, writing as a guest for High Scalability, 2024-03-06) distilling AWS's public material on Amazon S3 into a single tour of the system's architectural substrate. The post surfaces the 18-year feature timeline (2006 launch → 2023 S3 Express), the 100M req/sec, 400 Tbps, 280 trillion objects scale numbers, and the now-canonical Warfield framing: S3 is >300 microservices built on millions of HDDs, solving the heat management problem via aggregate demand smoothing + erasure coding + spread placement. Adds a few datapoints not already on the wiki — S3's architecture as a textbook instance of Conway's Law, ShardStore as an LSM-tree-with-out-of-tree-shards, the cache-coherency witness component behind 2020's strong-consistency launch, and the durable chain of custody checksum-as-HTTP-Trailer pattern at the SDK boundary. Tier-1 aggregator re-telling of AWS-published material — the underlying claims trace back to Warfield's FAST '23 keynote (see sources/2025-02-25-allthingsdistributed-building-and-operating-s3) and the SOSP 2021 ShardStore paper. Canonical as a compact index into S3's architectural design principles for readers who haven't yet read the primary sources.
Key takeaways¶
- S3 at 2024 scale, in numbers: 100M requests/sec · 400 Tbps · 280 trillion objects · 31 regions · 99 AZs · >300 microservices · millions of hard drives. (Kozlovski/HighScalability, 2024)
- S3 is a textbook instance of Conway's Law. The four top-level boxes — frontend REST fleet, namespace service, storage fleet, storage-management background fleet — each correspond to a distinct S3 organization with its own leaders and teams. Each sub-box recurses into its own nested component-and-team structure. Inter-team interactions are literal API-level contracts. (Source: sources/2024-03-06-highscalability-behind-aws-s3s-massive-scale).
- ShardStore is an LSM tree with out-of-tree shard data. Kozlovski adds detail to the Warfield framing: the rewritten per-disk storage layer is a log-structured merge tree (see concepts/lsm-compaction) with shard data stored outside the tree to reduce write amplification, soft-updates-based crash consistency, designed ground-up for concurrency + HDD IO scheduling/coalescing. Originally ~40k lines of Rust. Validated by lightweight formal verification integrated into CI/CD.
- HDDs are becoming slower per byte. 1956 RAMAC ($9k, 3.75MB) → 2024 26TB at $15/TB is 6 billion× cheaper per byte, 7.2M× capacity, 5000× smaller, 1235× lighter — but still ~120 IOPS/drive, a number that's been flat for decades. See concepts/hard-drive-physics.
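The "slower per byte" claim is just the ratio of two trends — capacity grows, random IOPS per drive stays ~120 — so IOPS available per stored terabyte keeps falling. A back-of-envelope with illustrative capacities:

```python
# Random IOPS per drive has been roughly flat for decades while
# capacities grew, so each stored terabyte gets fewer IOPS over time.
IOPS_PER_DRIVE = 120

for capacity_tb in (1, 4, 10, 26):
    iops_per_tb = IOPS_PER_DRIVE / capacity_tb
    print(f"{capacity_tb:>2} TB drive: {iops_per_tb:5.1f} random IOPS per TB")
```

At 26 TB that is under 5 random IOPS per terabyte — the physics behind the heat-management problem.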
- Erasure coding with (k, m) is S3's storage scheme. Kozlovski gives the Reed-Solomon primer — 10 data ("identity") shards + 6 parity shards lets you lose any 6 of the 16, at 1.6× raw storage versus 3× for triple replication (which tolerates only 2 losses). And critically: the extra shards give you scheduling flexibility for heat management, not only durability. See patterns/redundancy-for-heat.
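The overhead comparison is one line of arithmetic: systematic (k, m) coding stores (k + m) / k raw bytes per logical byte and survives any m shard losses, while n-way replication stores n bytes per byte and survives n − 1. A quick check with the (10, 6) example:

```python
# Raw-storage overhead per logical byte, at a given loss tolerance.
def ec_overhead(k: int, m: int) -> float:
    return (k + m) / k

def replication_overhead(losses_tolerated: int) -> int:
    return losses_tolerated + 1

k, m = 10, 6
assert ec_overhead(k, m) == 1.6         # 1.6x raw bytes, survives any 6 losses
assert replication_overhead(2) == 3     # 3x replication survives only 2
assert replication_overhead(m) == 7     # matching 6 losses would need 7 copies
```

So (10, 6) is both cheaper than 3× replication and tolerates three times as many losses.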
- Spread placement is the write-time heat lever. Tens of thousands of customers' data spread across millions of drives: (1) no single customer can hot-spot a drive, (2) bursts parallelize across drives they'd never be able to afford stand-alone, (3) more spread = more durability, (4) no read amplification on a single drive. See patterns/data-placement-spreading.
- Workload decorrelation is the scale trick that makes heat management tractable. Individual workloads are idle-then-bursty; millions aggregated smooth into a predictable throughput curve. Single-tenant storage arrays cannot do this — it's a property of scale + multi-tenancy, not algorithms.
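The smoothing effect can be seen in a toy simulation: one tenant is idle-then-bursty, but the sum of many independent such tenants has far lower relative variance (roughly 1/√n). Bernoulli bursts here are an invented stand-in for real access patterns:

```python
import random
import statistics

random.seed(42)

def tenant(steps: int, p_burst: float = 0.05, burst_iops: int = 100) -> list[int]:
    """A bursty tenant: usually idle, occasionally demanding burst_iops."""
    return [burst_iops if random.random() < p_burst else 0 for _ in range(steps)]

def cv(series) -> float:
    """Coefficient of variation: stdev relative to the mean."""
    return statistics.pstdev(series) / statistics.mean(series)

steps, n_tenants = 500, 2000
tenants = [tenant(steps) for _ in range(n_tenants)]
aggregate = [sum(t[i] for t in tenants) for i in range(steps)]

print(f"single tenant CV: {cv(tenants[0]):.2f}")   # wildly bursty
print(f"aggregate CV:     {cv(aggregate):.2f}")    # smooth, predictable
```

A single-tenant array only ever sees the first curve; the second exists only at multi-tenant scale.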
- Parallelism has aligned incentives: the same client-side fan-out that maximizes a customer's throughput also spreads load across drives for S3's heat management. Kozlovski crystallizes the S3 best-practice guidance: use many clients × many connections × many endpoints; within a single op, use multipart upload for PUT and the HTTP Range header for GET. See patterns/multipart-upload-parallelism.
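The GET side of that guidance reduces to splitting an object into byte ranges, each of which becomes its own `Range: bytes=start-end` request on its own connection. A minimal sketch (the 8 MiB part size is an arbitrary choice, not an S3 recommendation):

```python
import math

def byte_ranges(object_size: int, part_size: int) -> list[tuple[int, int]]:
    """Inclusive (start, end) byte ranges covering the whole object."""
    return [(start, min(start + part_size, object_size) - 1)
            for start in range(0, object_size, part_size)]

size, part = 100 * 2**20, 8 * 2**20            # 100 MiB object, 8 MiB parts
ranges = byte_ranges(size, part)
assert len(ranges) == math.ceil(size / part)   # 13 requests, runnable in parallel
assert ranges[0] == (0, part - 1)
assert ranges[-1][1] == size - 1               # last byte covered exactly once

headers = [f"bytes={s}-{e}" for s, e in ranges]  # one Range header per request
```

Multipart upload is the mirror image on the PUT path: same split, with each part uploaded independently.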
- Strong read-after-write consistency (2020) launched with a cache-coherency witness. S3's metadata tier used a resilient cache where writes and reads could flow through different cache partitions — that was the main source of pre-2020 eventual consistency. The fix: per-object replication logic for write ordering, plus a new component that acts as a write-witness and read-barrier, invalidating a cache view the moment it may be stale. Delivered with zero hit to performance, availability, or cost.
- Durability = failure rate / repair rate, continuously. S3 offers 11 nines (one object lost per 10M years per 10k stored). Hardware: detectors track drive failure rates and scale the repair fleet accordingly; a durability model runs in the background to verify actual durability ≥ target. Software: durability reviews as threat-model-style gates for any code change that touches durability, plus "coarse-grained guardrails" preferred over per-bug mitigations. See concepts/threat-modeling.
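The "failure rate / repair rate" framing can be made concrete with a crude model: an object is lost only if more than m of its k+m shards fail inside one repair window, so shrinking the window (a bigger repair fleet) directly buys durability. The AFR and repair times below are invented, illustrative numbers, and independence of failures is assumed:

```python
from math import comb

def loss_probability(k: int, m: int, afr: float, repair_hours: float) -> float:
    """P(more than m of k+m shards fail within one repair window)."""
    n = k + m
    p = afr * repair_hours / 8760    # per-shard failure probability in the window
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(m + 1, n + 1))

fast = loss_probability(10, 6, afr=0.01, repair_hours=1)
slow = loss_probability(10, 6, afr=0.01, repair_hours=24)
assert fast < slow                   # faster repair -> higher durability
```

This is the shape of the background durability model: watch the measured failure rate, and scale repair so the computed loss probability stays under the 11-nines target.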
- Durable chain of custody. To close the network-corruption gap before data reaches S3, the SDK computes a checksum and appends it as an HTTP Trailer (avoids scanning the payload twice). The checksum travels with the request through the entire pipeline, so any byte flip en route is caught and rejected.
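The single-pass trick is that a running checksum updates as each chunk streams out, and the final value is emitted after the body, the way an HTTP trailer follows a chunked payload. CRC32 and the trailer rendering below are simplifications of what the real SDKs send:

```python
import zlib

def stream_with_trailer(chunks):
    """Yield body chunks as they go on the wire, then the checksum trailer."""
    crc = 0
    for chunk in chunks:
        crc = zlib.crc32(chunk, crc)   # updated incrementally: no second scan
        yield ("body", chunk)
    yield ("trailer", f"{crc:08x}")    # known only once the last byte has passed

frames = list(stream_with_trailer([b"hello ", b"s3 ", b"world"]))
body = b"".join(data for kind, data in frames if kind == "body")
assert frames[-1][1] == f"{zlib.crc32(body):08x}"  # receiver recomputes and compares
```

Because the value trails the payload, the client never has to buffer or re-read the object to checksum it — which is the whole point of using a trailer rather than a header.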
Numbers surfaced¶
| Quantity | Value | Year |
|---|---|---|
| Requests/sec | 100 million | 2024 |
| Bandwidth | 400 Tbps | 2024 |
| Objects stored | 280 trillion | 2024 |
| Regions × AZs | 31 × 99 | 2024 |
| Microservices in S3 | >300 | 2024 |
| Hard drives | millions | 2024 |
| HDD IOPS (random) | ~120 | (flat since <2006) |
| HDD: 1956 → 2024 price/byte | ÷ 6,000,000,000 | — |
| HDD: 1956 → 2024 capacity | × 7,200,000 | — |
| HDD: 1956 → 2024 physical size | ÷ 5,000 | — |
| HDD: 1956 → 2024 weight | ÷ 1,235 | — |
| ShardStore original LOC | ~40k lines (Rust) | — |
| Durability (AZ-redundant standard) | 11 × 9s | — |
| Example EC (k, m) | (10, 6) | — |
| Imbalance #1 | 3.7 PB / 2.3M IOPS → 143 drives for capacity vs 19,166 for IOPS (13,302% more) | — |
| Imbalance #2 | 28 PB / 8,500 IOPS → 71 drives for IOPS vs 1,076 for capacity (1,415% more) | — |
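The two imbalance rows can be reproduced from the drive specs above (~26 TB, ~120 random IOPS): count the drives needed to hold the bytes versus the drives needed to serve the IOPS. Ceiling division is used here, so counts can differ by one from the post's quoted figures:

```python
from math import ceil

DRIVE_TB, DRIVE_IOPS = 26, 120

def drives_needed(capacity_pb: float, iops: int) -> tuple[int, int]:
    """(drives for capacity, drives for IOPS) for a given workload profile."""
    for_capacity = ceil(capacity_pb * 1000 / DRIVE_TB)
    for_iops = ceil(iops / DRIVE_IOPS)
    return for_capacity, for_iops

print(drives_needed(3.7, 2_300_000))   # IOPS-bound: ~134x more drives than the bytes need
print(drives_needed(28, 8_500))        # capacity-bound: the imbalance flips
```

Spreading both workloads across one shared fleet lets each borrow the dimension the other leaves idle.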
Systems named¶
- systems/aws-s3 — the subject. 2006 launch, now the storage backbone of AWS / many cloud-native data infrastructures.
- systems/shardstore — S3's per-disk storage layer. LSM tree, Rust, soft-updates crash consistency, lightweight formal verification, ~40k LOC original.
- S3 Glacier (2012) — low-cost archival tier, retrieval minutes-to-hours. (No dedicated wiki page yet.)
- S3 Intelligent-Tiering (2018) — auto-moves objects between tiers by access pattern. (No dedicated wiki page yet.)
- S3 Standard-Infrequent Access (2015) — cheaper tier for infrequently-accessed-but-fast-retrieval data. (No page.)
- S3 Glacier Deep Archive (2018) — cheaper Glacier, 12-hour retrieval. (No page.)
- S3 Object Lambda (2021) — customer code modifies GET responses. (No page.)
- systems/s3-express-one-zone — 2023, 10× lower latency, 50% cheaper requests, single-digit-ms, single-AZ SSD.
Concepts named¶
- concepts/conways-law — the architecture-mirrors-org-chart theory, applied to S3's top-level decomposition.
- concepts/hard-drive-physics — the 120 IOPS / capacity-vs-seek divergence.
- concepts/heat-management — the request-per-drive load balancing problem at S3's scale.
- concepts/erasure-coding — (k, m) Reed-Solomon encoding, used as both durability scheme and heat-steering mechanism.
- concepts/aggregate-demand-smoothing — workload decorrelation at multi-tenant scale.
- concepts/strong-consistency — 2020 read-after-write guarantee.
- concepts/cache-coherency-witness — the new mechanism that made strong consistency possible without perf hit.
- concepts/lsm-compaction — ShardStore's backbone data structure.
- concepts/write-amplification — what out-of-tree shard storage is designed to reduce.
- concepts/lightweight-formal-verification — how S3 validates ShardStore.
- concepts/immutable-object-storage — S3's low-level primitive.
- concepts/threat-modeling — the security-review framing ported to durability.
Patterns named¶
- patterns/data-placement-spreading — write-time placement across many drives.
- patterns/redundancy-for-heat — replicas/EC-shards as I/O-steering degrees of freedom.
- patterns/durability-review — threat-model-style gate on durability-touching code changes.
- patterns/multipart-upload-parallelism — client-side parallelism for PUT/GET throughput (multipart + Range).
- patterns/durable-chain-of-custody — SDK-side checksum in HTTP Trailer propagated end-to-end.
Caveats¶
- Third-party explainer, not a primary source. Kozlovski is summarizing AWS's public material — the Warfield FAST '23 keynote, the SOSP 2021 ShardStore paper, AWS documentation, and Pi-Day 2024 S3 facts. For the authoritative versions of these claims, prefer the ATD posts themselves (see systems/aws-s3 "Seen in"). This post's value is as a compact index into those claims, with some extra detail (ShardStore internals, cache-coherency witness, chain of custody) that isn't in any one primary source.
- The "300+ microservices" number is restated, not newly disclosed — Warfield's FAST '23 keynote uses "hundreds of microservices".
- The 100M req/sec / 400 Tbps / 280T objects numbers are 2024-era. S3 scale has grown since (see the 2025 ATD posts for newer ceilings — "hundreds of trillions of objects across 36 regions").
- ShardStore soft-updates-based crash consistency is a Kozlovski-level-of-detail claim; the SOSP paper has the authoritative treatment. Same for the LSM-with-out-of-tree shards structure and the ~40k LOC figure.
- The cache-coherency witness mechanism for strong consistency is restated at a conceptual level; primary-source details of the component (name, fanout, failure modes) aren't public.
- The "imbalance" numeric examples (143 vs 19,166 drives etc) are Kozlovski's illustrative math on public AWS performance numbers, not AWS-published workload profiles.
- Glacier / Intelligent-Tiering / Standard-IA / Deep Archive / Object Lambda get timeline mentions only; no architectural detail here. If one of them becomes architecturally important later, a dedicated page can be created then.
Source¶
- Original: https://highscalability.com/behind-aws-s3s-massive-scale/
- Raw markdown:
raw/highscalability/2024-03-06-behind-aws-s3s-massive-scale-dd0bd8c4.md
Related¶
- systems/aws-s3
- systems/shardstore
- concepts/conways-law
- concepts/cache-coherency-witness
- patterns/multipart-upload-parallelism
- patterns/durable-chain-of-custody
- sources/2025-02-25-allthingsdistributed-building-and-operating-s3 — Warfield's FAST '23 keynote (primary source for most of what Kozlovski restates).
- sources/2025-03-14-allthingsdistributed-s3-simplicity-is-table-stakes — Warfield's 19th-birthday retrospective.
- companies/highscalability