ALLTHINGSDISTRIBUTED 2025-03-14 Tier 1

All Things Distributed: In S3 simplicity is table stakes (S3 at 19)

Summary

On S3's 19th birthday (Pi Day 2025), Andy Warfield (VP / Distinguished Engineer, S3) reframes what "simple" means for a storage system operating at hundreds of trillions of objects across 36 regions. Simplicity is not the 4-verb API (PUT/GET/DELETE/LIST); it is the property of "working with your data and not having to think about anything else" — i.e., the elimination of distractions like capacity provisioning, performance tuning, request-retry strategy, bucket accounting, and now table maintenance (compaction/GC) for tabular data. The post traces several feature arcs where S3 walked back self-imposed "sharp edges": strong read-after-write consistency (2020), conditional writes (2024), per-account bucket limit 100 → 1M (2024), throughput elasticity via the Common Runtime (CRT), SSD/low-latency via S3 Express One Zone (2023), and — the headline architectural move of 2024 — S3 Tables, which lifts Apache Iceberg from a customer-managed layer over objects to a first-class S3 primitive alongside the object. Implicit thesis: the properties of S3 storage (security, elasticity, availability, durability, performance), not the object API, are what actually define it — so when customers started running multi-petabyte Iceberg tables on S3, the "table" had to become a first-class construct with the same properties.

Key takeaways

  1. "Simple" is a property of the experience, not the API surface. "A lot of people associate the term simple with the API itself… I'm not sure this is the aspect of S3 that we'd really use 'simple' to describe. Instead, we've come to think about making S3 simple as something that turns out to be a much trickier problem — we want S3 to be about working with your data and not having to think about anything other than that." The architectural lever for this is concepts/elasticity: "you never have to do up front provisioning of capacity or performance, and you don't worry about running out of space… we are successful only when these things can be taken for granted, because it means they aren't distractions."
  2. Strong consistency and conditional writes are felt as code-deletion features. On the 2020 move to concepts/strong-consistency: "in meeting after meeting, builders spoke about deleting code and simplifying their systems." The 2024 rollout of conditional operations (compare-and-set against object metadata/version) produced the same reaction. Simplicity wins are measured by what they delete from customer code — see patterns/conditional-write.
  3. The bucket-limit bump (100 → up to 1M) was a small number with a large blast radius. Buckets were historically a human construct (an admin creates them in the console and tracks them in a spreadsheet). Customers wanted them as a programmatic one — per-dataset, per-tenant, per-customer buckets for policy and sharing. Lifting the limit required rewriting Metabucket (the bucket-metadata system, separate from the object-metadata namespace, already rewritten for scale more than once), a new paged ListBuckets API, an opt-in 10K soft-limit guardrail, and cross-service fixes: console widgets in other AWS services (e.g. Athena) that called ListBuckets plus a HeadBucket per bucket on render could take tens of minutes at very large bucket counts. "There were more subtle aspects of addressing this scale… we had to work across tens of services on this rendering issue."
  4. Performance elasticity mirrors capacity elasticity. "Any customer should be entitled to use the entire performance capability of S3, as long as it didn't interfere with others." Two legs: (a) be transparent about S3's design — "request parallelization and retry strategies" documented, then baked into the Common Runtime (CRT) library; today "individual GPU instances using the CRT to drive hundreds of gigabits per second in and out of S3"; (b) launch systems/s3-express-one-zone (2023) — first SSD storage class, single-AZ by design to minimise latency. "Anthropic driving tens of terabytes per second" cited as the high-water workload.
  5. The flywheel of performance → demand → performance. "Improvements in performance drove demand for even more performance, and any limitations became yet another source of friction that distracted developers from their core work." Entertainment streaming directly from S3 and ML training reading at TB/s are cited as the "interactive-workload" frontier for an originally-archival-tier service.
  6. The simplicity / velocity tension is explicit and deliberate. "There's actually a really important tension between simplicity and velocity, and it's a tension that kind of runs both ways." One side: over-designing for perfection prevents shipping. Other side: racing to ship painful gaps backloads work that is more expensive to simplify later. "The improvements that we make toward simplicity are really improvements against an initial feature that wasn't simple enough… we launch things that need, over time, to become simpler." This is a first-class engineering concept on the S3 team — see concepts/simplicity-vs-velocity.
  7. S3 Tables: Iceberg becomes a first-class S3 construct, not a library pattern. Background: Parquet (2013) is the de-facto tabular object format — "S3 stores exabytes of parquet data and serves hundreds of petabytes of Parquet data every day." Iceberg (2017) adds a metadata / snapshot layer over Parquet for row-level updates, schema evolution, and time-travel. But because Iceberg's table structure is externalised — customer code owns the data-to-metadata object relationships, compaction, GC, tiering — existing S3 features don't apply cleanly: Intelligent-Tiering and cross-region replication don't understand Iceberg's logical structure, and small snapshot-based updates fragment tables, requiring application-level compaction to recover performance. The quote from large customers: "with Iceberg what they were really doing was building their own table primitive over S3 objects, and they asked us why S3 wasn't able to do more of the work." See systems/s3-tables and concepts/open-table-format.
  8. S3 Tables' three architectural moves (re:Invent 2024). (a) Each table has its own endpoint and is a first-class policy resource — you set access on the table, not on its constituent objects. (b) New APIs for table creation and snapshot commit so creating / committing a snapshot is a single S3 call rather than a customer-side Iceberg sequence. (c) S3 understands the Iceberg layout internally and performs compaction / GC / tiering as managed operations — same way it manages object-level placement and tiering. Post-launch deltas (in the first 14 weeks): Iceberg REST Catalog (IRC) API support, in-console query. Collaboration called out: DuckDB Iceberg support acceleration.
  9. The real definition of S3 is "these storage properties", not "objects". "Historically, we've always talked about S3 as an object store… I think one thing that we've learned from the work on Tables is that it's these properties of storage that really define S3 much more than the object API itself." Expectation customers voiced: "all the things that S3 is for objects, but for a table." Forward bet: the same properties will eventually apply to other non-object structures (e.g. embedded-DB files like SQLite) where analytics engines and application code share one live dataset.
  10. Culture note — customers guess features by probing. Before a new S3 REST verb is launched, dashboards often start showing traffic to it. Some customers (e.g. Turbopuffer's CEO Simon Hørup Eskildsen) run hourly scripts monitoring the S3 "What's new" feed. This customer-driven feature-request loop is explicit in how S3 prioritises work — see patterns/customer-driven-prioritization.
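
The conditional-write semantics in takeaway 2 are compare-and-set against an object's ETag. A minimal local model of those semantics (this is a sketch, not the S3 API — the real calls pass `If-Match` / `If-None-Match` headers on PUT and S3 answers a losing write with HTTP 412):

```python
class PreconditionFailed(Exception):
    """Models S3's 412 response when a conditional write loses the race."""

class Store:
    """In-memory stand-in for a bucket: key -> (etag, body)."""

    def __init__(self):
        self._objects = {}
        self._next_etag = 0

    def put(self, key, body, if_match=None, if_none_match=None):
        current = self._objects.get(key)
        # If-None-Match "*": succeed only if the key does not exist yet
        # (the "create exactly once" / lock-acquisition pattern).
        if if_none_match == "*" and current is not None:
            raise PreconditionFailed(key)
        # If-Match: succeed only if the caller saw the latest version.
        if if_match is not None and (current is None or current[0] != if_match):
            raise PreconditionFailed(key)
        self._next_etag += 1
        etag = f"etag-{self._next_etag}"
        self._objects[key] = (etag, body)
        return etag
```

The "code deletion" the post describes is exactly this: two writers racing to create a lock object now get a 412 instead of a silent overwrite, so the external lock service (and its failure modes) can be removed from customer code.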
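
The paged ListBuckets API in takeaway 3 follows the standard continuation-token loop. A local model of that loop (the token and page-size names here are illustrative, not the exact API parameters):

```python
def list_buckets_page(all_buckets, token=None, max_buckets=1000):
    """Return one page of buckets plus a continuation token, or None when done."""
    start = token or 0
    page = all_buckets[start:start + max_buckets]
    next_token = start + max_buckets if start + max_buckets < len(all_buckets) else None
    return page, next_token

def list_all_buckets(all_buckets, max_buckets=1000):
    """Drain every page — the loop every caller with >1 page must now write."""
    out, token = [], None
    while True:
        page, token = list_buckets_page(all_buckets, token, max_buckets)
        out.extend(page)
        if token is None:
            return out
```

The cross-service rendering issue follows from the same shape: a widget that drains all pages and then issues one HeadBucket per result is O(N) serial round trips, which is why it degraded to tens of minutes at very large bucket counts.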
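
The "request parallelization" leg of takeaway 4 is, at its core, ranged GETs issued concurrently. A sketch of the range-splitting arithmetic (the part size is an illustrative assumption, not a documented CRT internal):

```python
def split_ranges(object_size: int, part_size: int = 8 * 1024 * 1024):
    """Yield (start, end) inclusive byte ranges covering an object,
    suitable for HTTP 'Range: bytes=start-end' headers on parallel GETs."""
    ranges = []
    for start in range(0, object_size, part_size):
        end = min(start + part_size, object_size) - 1
        ranges.append((start, end))
    return ranges
```

Each range can then be fetched on its own connection; aggregate throughput scales with the number of in-flight parts, which is the mechanism behind a single GPU instance driving hundreds of gigabits per second.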
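
The fragmentation problem in takeaway 7 — each small snapshot commit appends a small data file until reads must touch thousands of files — and the compaction S3 Tables now performs can be modeled crudely. The threshold and greedy merge policy below are illustrative assumptions, not S3 Tables' actual algorithm:

```python
def commit(files, new_file_size):
    """An Iceberg-style snapshot commit appends a data file;
    a read of the table must touch every live file."""
    return files + [new_file_size]

def compact(files, target_size=128):
    """Greedily merge small files into ~target_size files,
    preserving total bytes while cutting the file count."""
    out, acc = [], 0
    for f in sorted(files):
        acc += f
        if acc >= target_size:
            out.append(acc)
            acc = 0
    if acc:
        out.append(acc)
    return out
```

A hundred 1-unit commits leave a hundred files to open per read; compaction to 25-unit files cuts that to four while holding the data constant — the maintenance work that previously lived in customer applications.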

Architectural diagrams / numbers

  • Scale (early 2025): hundreds of trillions of objects, 36 regions.
  • Parquet-on-S3 workload: "exabytes" stored; "hundreds of petabytes of Parquet data" served per day.
  • Bucket limits: 100/account → up to 1M/account (opt-in beyond 10K).
  • Consistency milestone: Dec 2020 → strong read-after-write consistency, displacing the prior eventual-consistency-for-overwrites model.
  • Conditional writes: GA 2024 (general-purpose buckets).
  • S3 Express One Zone: single-AZ, SSD, launched 2023 as first low-latency storage class.
  • CRT throughput anecdote: single GPU instance pushes "hundreds of gigabits per second" to/from S3; Anthropic at "tens of terabytes per second" at the account level.
  • Metabucket: rewritten for scale "more than once" before this round; distinct from the object-metadata namespace.
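
The two throughput figures above imply a fan-out worth noting: tens of terabytes per second at the account level over hundreds of gigabits per second per instance is on the order of hundreds of concurrent instances. Rough order-of-magnitude arithmetic on the low ends of both quotes, not a disclosed figure:

```python
account_bytes_per_s = 10e12        # "tens of TB/s" — take the low end, 10 TB/s
instance_bits_per_s = 200e9        # "hundreds of Gb/s" — take 200 Gb/s
instance_bytes_per_s = instance_bits_per_s / 8
instances = account_bytes_per_s / instance_bytes_per_s
print(round(instances))            # ~400 instances at these assumed figures
```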

Caveats

  • No internal block diagrams — the post describes architectural decisions and properties, not the current control-plane or data-plane topology. There is no equivalent of the published Dynamo or Spanner papers here.
  • Numbers are order-of-magnitude rhetorical ("tens of terabytes per second", "hundreds of gigabits per second", "hundreds of petabytes per day"), not measured SLOs.
  • S3 Tables is new (14 weeks at time of writing); post is explicit that many simplicity gaps remain (the velocity side of the simplicity/velocity tension is winning for now).
  • The "first-class table abstraction" framing glosses over that Iceberg itself is an open format — a risk is that S3 Tables' managed semantics diverge subtly from customer-managed Iceberg over time. Post doesn't address this.