Amazon S3 (Simple Storage Service)¶
Amazon S3 is AWS's foundational object storage service, launched March
14, 2006 as the first public AWS service. By early 2025 it holds
hundreds of trillions of objects across 36 regions and serves as the
primary storage for nearly every AWS service and a large share of the
public internet's data lakes. Originally a 4-verb HTTP API
(PUT/GET/DELETE/LIST) over immutable objects grouped into
buckets, it has evolved into a platform whose defining characteristic —
per its own team — is not the object API but the properties of the
storage: elasticity, durability, availability, security, and
performance. "Making S3 simple" is treated as an ongoing program of
removing distractions so builders "work with their data and not have
to think about anything else."
(Source: sources/2025-03-14-allthingsdistributed-s3-simplicity-is-table-stakes)
Defining properties (Warfield, 2025)¶
- Elastic capacity — no upfront provisioning, no per-bucket capacity limits, no notion of "running out of space". See concepts/elasticity.
- Elastic performance — "any customer should be entitled to use the entire performance capability of S3, as long as it didn't interfere with others." Throughput discipline is exposed via the systems/aws-crt (Common Runtime) library: a GPU instance can drive "hundreds of gigabits per second in and out of S3".
- Strong read-after-write consistency (since 2020) — customers report "deleting code and simplifying their systems" after the guarantee landed. See concepts/strong-consistency.
- Very high durability and availability — taken for granted by design ("we are successful only when these things can be taken for granted").
- Immutable objects as the low-level primitive — see concepts/immutable-object-storage. Higher-level primitives (versioning, object lock, cross-region replication, S3 Tables) are layered on top.
Evolution arcs called out in the 2025 post¶
API surface: 4 verbs, many properties¶
"A lot of people associate the term simple with the API itself — that an HTTP-based storage system for immutable objects with four core verbs… is a pretty simple thing to wrap your head around. But looking at how our API has evolved… I'm not sure this is the aspect of S3 that we'd really use 'simple' to describe." The architectural claim is that the properties of S3 storage, not the verbs, define it.
Consistency: eventual → strong (Dec 2020)¶
Moving to strong read-after-write consistency for overwrites and LIST was not marketed as a performance feature; its value was measured in customer code deleted. Retrofit of this guarantee onto a globally distributed object store is a rare example of a public cloud service trading hard internal engineering for a net-simpler customer API.
Conditional operations (2024)¶
PUT-If-Match / compare-and-swap semantics against object
metadata/version enable atomic multi-writer coordination — see
patterns/conditional-write. Rollout pattern matched consistency:
customers used it to delete external locking / versioning code.
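The compare-and-swap semantics can be sketched against a toy in-memory store. This is not the real S3 API — `MiniStore` and its method names are invented for illustration; on S3 itself this maps to `PutObject` with `If-Match` / `If-None-Match` conditional headers, with a mismatch answered by HTTP 412:

```python
import uuid

class MiniStore:
    """In-memory model of compare-and-swap PUTs (If-Match on ETag)."""
    def __init__(self):
        self._objects = {}  # key -> (etag, body)

    def get(self, key):
        return self._objects.get(key)  # (etag, body) or None

    def put_if_match(self, key, body, expected_etag):
        """Write only if the current ETag matches; None models HTTP 412.

        expected_etag=None means "create only if absent" (If-None-Match: *).
        """
        current = self._objects.get(key)
        current_etag = current[0] if current else None
        if current_etag != expected_etag:
            return None  # precondition failed: caller must re-read and retry
        new_etag = uuid.uuid4().hex
        self._objects[key] = (new_etag, body)
        return new_etag

store = MiniStore()
etag1 = store.put_if_match("lock", b"writer-a", None)    # create wins
stale = store.put_if_match("lock", b"writer-b", None)    # loses the race -> None
etag2 = store.put_if_match("lock", b"writer-a2", etag1)  # CAS with fresh ETag succeeds
```

A writer that gets the 412-equivalent re-GETs, merges, and retries — exactly the external locking and versioning code the post says customers deleted.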
Bucket limits: 100 → up to 1M per account (Nov 2024)¶
Historically buckets were a human construct — created in the console, tracked by an admin — and capped at 100/account. Customers wanted buckets as a programmatic per-dataset / per-tenant resource, for policy and sharing. The rewrite required:
- Scaling Metabucket (S3's bucket-metadata system, distinct from the object-metadata namespace) — already rewritten for scale more than once before this round. See systems/metabucket.
- A new paged ListBuckets API.
- An opt-in soft limit of 10K beyond the old 100, to prevent the very problem a test account with millions of buckets surfaced — AWS console widgets in other services (e.g. Athena) that call ListBuckets plus a HeadBucket per bucket on render could take tens of minutes at high bucket counts.
- Cross-service fixes across "tens of services" on that rendering pattern.
This is the canonical example of limit removal as a cross-team engineering project, not a config change.
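The continuation-token paging pattern the new ListBuckets API follows can be sketched locally. Function names and the token encoding here are invented for illustration — the real API returns an opaque continuation token, not an index:

```python
def make_lister(all_buckets, page_size=1000):
    """Toy continuation-token pager over a sorted bucket list."""
    def list_page(token=None):
        start = int(token) if token else 0
        page = all_buckets[start:start + page_size]
        nxt = str(start + page_size) if start + page_size < len(all_buckets) else None
        return page, nxt
    return list_page

def drain(list_page):
    """Client-side loop: follow continuation tokens until exhausted."""
    names, token = [], None
    while True:
        page, token = list_page(token)
        names.extend(page)
        if token is None:
            return names

buckets = [f"tenant-{i:04d}" for i in range(2500)]
assert drain(make_lister(buckets)) == buckets  # drained in 3 bounded pages
```

The contrast with the console anti-pattern above: an unbounded listing plus one per-bucket metadata call on render is O(buckets) round trips at draw time; paging bounds each call, but the per-bucket fan-out still had to be fixed service by service.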
Performance: throughput then latency¶
- Throughput elasticity — publish S3's request-parallelization and retry strategies; bake them into systems/aws-crt so any language's bindings get the same performance. Individual GPU instances drive "hundreds of gigabits per second"; Anthropic runs at "tens of terabytes per second" at the account level.
- Latency tier — systems/s3-express-one-zone (2023), first SSD storage class, single-AZ by design to minimise latency. Trades multi-AZ resilience for tail latency on hot data.
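The CRT's core throughput move — splitting one logical GET into many parallel ranged requests and reassembling — can be sketched with a local stand-in. `ranged_get` here simulates a GET with a `Range: bytes=start-end` header against an in-memory object; part size and worker count are illustrative, not CRT defaults:

```python
from concurrent.futures import ThreadPoolExecutor

OBJECT = bytes(range(256)) * 4096  # 1 MiB stand-in for an S3 object

def ranged_get(start, end):
    """Stand-in for GET with Range: bytes=start-end (end inclusive)."""
    return OBJECT[start:end + 1]

def parallel_get(size, part_size=256 * 1024, workers=8):
    """Fetch [0, size) as parallel ranged parts, then reassemble in order."""
    ranges = [(off, min(off + part_size, size) - 1)
              for off in range(0, size, part_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda r: ranged_get(*r), ranges)
    return b"".join(parts)

assert parallel_get(len(OBJECT)) == OBJECT
```

Against the real service, each part is an independent request that can be retried and hedged on its own — which is why encoding this once in a common runtime, rather than per-SDK, is the leverage point.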
S3 Tables (re:Invent 2024) — the object→table move¶
Until 2024, tables on S3 were a customer-managed open table format over Parquet (typically systems/apache-iceberg). Large customers pointed out they were "building their own table primitive over S3 objects" and asked S3 to own the cross-object structure. S3 Tables lifts Iceberg to a first-class S3 resource:
- Own endpoint per table.
- Table is the policy resource (not the constituent objects).
- Managed compaction, GC, and tiering at the Iceberg layout level.
- New APIs for table creation and snapshot commit.
See systems/s3-tables and concepts/open-table-format. The architectural claim behind Tables: "it's these properties of storage that really define S3 much more than the object API itself" — therefore tables can be a first-class S3 construct alongside objects without contradicting what S3 is.
Internal systems referenced¶
- systems/metabucket — bucket metadata store (separate from the object-namespace metadata).
- systems/aws-crt — Common Runtime library exposing S3 best-practice request parallelization / retry to SDKs.
- systems/s3-express-one-zone — SSD, single-AZ low-latency class (2023).
- systems/s3-tables — managed-Iceberg first-class table resource (2024).
- systems/shardstore — rewritten per-disk storage layer (Rust, executable-spec validated). See FAST '23 keynote + SOSP paper.
Physical + operational story (Warfield FAST '23 keynote, 2025)¶
The 2025-02-25 ATD post gives the operational story that pairs with the "simplicity" retrospective above. See sources/2025-02-25-allthingsdistributed-building-and-operating-s3 for the full write-up.
Built out of millions of hard drives¶
- S3 is composed of "hundreds of microservices" and "millions" of hard drives.
- A single HDD delivers about 120 random-access IOPS, and that number has been flat since before S3 launched in 2006. Capacity has grown 7.2M× since the 1956 RAMAC, while seek time has improved only 150×. See concepts/hard-drive-physics.
- Industry HDD roadmap: 200 TB/drive this decade. At that point a drive supports 1 IOPS per 2 TB. S3 will use them anyway.
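The roadmap arithmetic is worth making explicit (numbers from the post; the "1 IOPS per 2 TB" figure is a rounding of the ratio below):

```python
iops_per_drive = 120   # flat random-IOPS budget per HDD since before 2006
capacity_tb = 200      # roadmap drive size for this decade
tb_per_iops = capacity_tb / iops_per_drive
# ~1.7 TB of capacity sits behind every available IOPS; the post rounds to 2 TB
```

The ratio is what makes heat management (below) a first-class placement problem rather than an optimization: the IOPS budget per byte keeps shrinking.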
Heat management¶
- Heat = requests per disk per unit time. Hotspots queue requests and that queueing amplifies through dependent layers (metadata lookups, erasure-coding reconstructs) into concepts/tail-latency-at-scale.
- Two levers: (1) Spread each bucket's objects across different drive sets — any one customer's data is a tiny fraction of any one drive, and a single customer's burst reaches millions of drives (patterns/data-placement-spreading). (2) Use redundancy as a steering tool: replication gives N read sources per logical read; concepts/erasure-coding (Reed-Solomon, k identity + m parity, read any k of k+m) gives both capacity efficiency and steering flexibility. See patterns/redundancy-for-heat.
- Why this works: concepts/aggregate-demand-smoothing. Millions of bursty tenants aggregate into a smooth demand curve no single workload can move — so the placement problem reduces to translating smooth aggregate into smooth per-drive load.
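Reed-Solomon with general k + m is beyond a sketch, but the steering property — read any k of k + m shards — already shows up in the minimal case of k = 2 identity shards plus m = 1 XOR parity shard. Shard layout and function names here are illustrative:

```python
def encode(a: bytes, b: bytes):
    """k=2 identity shards plus m=1 XOR parity shard (equal-length inputs)."""
    parity = bytes(x ^ y for x, y in zip(a, b))
    return [a, b, parity]

def decode(shards):
    """Reconstruct (a, b) from any 2 of the 3 shards (None = skipped drive)."""
    a, b, p = shards
    if a is None:
        a = bytes(x ^ y for x, y in zip(b, p))
    if b is None:
        b = bytes(x ^ y for x, y in zip(a, p))
    return a, b

a, b = b"hot-data", b"morehot!"
shards = encode(a, b)
# Steer the read away from a hot drive: skip shard 0 entirely.
assert decode([None, shards[1], shards[2]]) == (a, b)
```

Because any 2 of 3 shards suffice, a read can simply avoid whichever drive is hottest right now — redundancy doubles as a load-steering lever, at a capacity overhead of m/k instead of full replication.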
Durability as a human + organizational mechanism¶
- Durability reviews (patterns/durability-review): every durability-affecting change carries a threat-model artifact (concepts/threat-modeling) — summary, list of threats, how the change is resilient. Explicit preference for coarse-grained guardrails over per-risk mitigations.
- systems/shardstore as a canonical guardrail: S3's rewritten per-disk storage layer, in Rust, with an executable specification (~1% the code size) checked into the same repo and tested against the real implementation on every commit via property-based testing. Frames concepts/lightweight-formal-verification as an industrialized technique — normal engineers can maintain the spec without formal-methods PhDs. Published at SOSP.
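ShardStore's guardrail is Rust with a checked-in executable spec, but the shape of the technique survives a stdlib-only Python sketch: a trivially-correct dict spec, a log-structured implementation standing in for the real storage layer, and randomized operation sequences asserting observational equivalence. All names below are invented for illustration:

```python
import random

class SpecStore:
    """Executable spec: the simplest thing that could be correct."""
    def __init__(self):
        self.d = {}
    def put(self, k, v):
        self.d[k] = v
    def get(self, k):
        return self.d.get(k)

class LogStore:
    """Implementation under test: append-only log plus index,
    a toy stand-in for a log-structured per-disk store."""
    def __init__(self):
        self.log, self.index = [], {}
    def put(self, k, v):
        self.index[k] = len(self.log)
        self.log.append((k, v))
    def get(self, k):
        pos = self.index.get(k)
        return self.log[pos][1] if pos is not None else None

def check_equivalence(ops=2000, seed=0):
    """Drive both stores with the same random ops; reads must agree."""
    rng = random.Random(seed)
    spec, impl = SpecStore(), LogStore()
    for _ in range(ops):
        k = rng.choice("abcde")
        if rng.random() < 0.6:
            v = rng.randrange(1000)
            spec.put(k, v); impl.put(k, v)
        else:
            assert spec.get(k) == impl.get(k)
    return True

assert check_equivalence()
```

The spec is a fraction of the implementation's size and needs no formal-methods background to maintain — which is the "industrialized" claim: the check runs on every commit like any other test.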
"AWS ships its org chart" — applied ownership¶
- S3's top-level block diagram (frontend fleet + namespace + storage fleet + background data services) maps 1:1 to organizational groups, and every sub-component recurses into its own teams with their own fleets. Inter-team interactions are literal API contracts.
- This is an instance of concepts/ownership as a scaling primitive: teams go faster when they own their services end-to-end (API, durability, performance, 3-AM pages, post-incident improvements).
- Warfield's personal generalization: senior-engineer leverage comes from articulating problems, not dispensing solutions — "my best ideas are the ones that other people have instead of me."
Storage platform with multiple first-class data primitives (2024-2026)¶
Between re:Invent 2024 and 2026-04 the S3 team added three new first-class data primitives, each a distinct presentation over the same S3 storage properties (elasticity / durability / availability / performance / security) — see patterns/presentation-layer-over-storage. The 2026-04-07 "S3 Files" post is explicit that this now defines S3's architectural trajectory: not an object store that added features, but a storage platform whose API surface is a set of data primitives chosen to fit how applications actually want to work with data.
| Primitive | Launch | Page |
|---|---|---|
| Objects | 2006 | This page — 4-verb API over immutable blobs |
| Tables | re:Invent 2024 | systems/s3-tables — managed Iceberg; table as policy resource |
| Vectors | preview 2025-07-16 | systems/s3-vectors — elastic similarity-search indices; Cosine/Euclidean; 10K indexes/bucket × tens-of-M vectors/index; up-to-90% TCO claim |
| Files | 2026-04-07 | systems/s3-files — NFS mount over S3 data, backed by EFS |
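S3 Vectors' index internals aren't public; what the primitive exposes — similarity queries over a named index — can be conveyed with a brute-force cosine top-k sketch. The index layout and function names are invented here, and the service also supports Euclidean distance:

```python
import math

def cosine_sim(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm

def query_top_k(index, q, k=2):
    """Brute-force top-k: (vector-id, score), best first."""
    scored = [(vid, cosine_sim(vec, q)) for vid, vec in index.items()]
    return sorted(scored, key=lambda t: -t[1])[:k]

index = {"doc-1": [1.0, 0.0], "doc-2": [0.7, 0.7], "doc-3": [0.0, 1.0]}
top = query_top_k(index, [1.0, 0.1])  # query near the x-axis favors doc-1
```

At tens of millions of vectors per index the service obviously cannot brute-force like this, but the query contract — elastic index in, top-k neighbors out — is the primitive being made first-class.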
Warfield's framing (2026):
"Different ways of working with data aren't a problem to be collapsed. They're a reality to be served."
And the strategic argument for why storage gets broader as AI agents change application lifetimes:
"As the pace of application development accelerates, this property of storage has become more important than ever, because the easier data is to attach to and work with, the more that we can play, build, and explore new ways to benefit from it."
See concepts/agentic-data-access.
S3 Files: the "boundary-as-feature" design breakthrough¶
systems/s3-files is worth calling out on this page because it crystallised a design principle — concepts/boundary-as-feature — that generalises beyond storage. Key design arc:
- Six months of "EFS3" convergence design failed. Trying to fuse file and object into one unified system produced "a battle of unpalatable compromises" — the lowest common denominator, not the best of both worlds.
- The breakthrough was inverting the goal. Stop hiding the file/object boundary; make the boundary the feature. See concepts/file-vs-object-semantics for the enumerated asymmetries (mutation granularity, atomicity, authorization, namespace semantics, namespace performance).
- concepts/stage-and-commit (term borrowed from git) is the translation mechanism: file-side changes accumulate, commit back to S3 as one PUT roughly every 60 seconds. Bidirectional sync. Conflict policy: S3 wins; file-side loser → lost+found + CloudWatch metric.
- concepts/lazy-hydration: first directory access imports metadata as a background scan so mount-and-work is instantaneous even on multi-million-object buckets; file data < 128 KB co-hydrates, larger files hydrate on read; 30-day idle eviction keeps active working set proportional.
- "Read bypass": sequential-read throughput reroutes off NFS to parallel direct-GETs against S3 — 3 GB/s per client, Tbps across many clients.
- Known edges (called out explicitly): directory rename is O(objects) (mount warning above 50M objects); no programmatic explicit-commit API at launch; some S3 keys aren't valid POSIX filenames.
See patterns/explicit-boundary-translation for the generalised pattern and systems/aws-efs for the under-the-covers backing.
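The stage-and-commit cadence can be sketched without any filesystem machinery: file-side writes accumulate in a staging buffer and flush to the object side as one PUT once the commit interval elapses. Class and parameter names are invented; the real bidirectional sync, conflict policy, and lost+found path are not modeled:

```python
import time

class StagedFile:
    """Toy stage-and-commit: buffer file-side writes, commit as one PUT."""
    COMMIT_INTERVAL = 60.0  # seconds, matching the post's ~60 s cadence

    def __init__(self, put_object, now=time.monotonic):
        self.put_object = put_object   # callable(bytes): the object-side PUT
        self.now = now                 # injectable clock, for testing
        self.staged = bytearray()
        self.last_commit = self.now()

    def write(self, data: bytes):
        self.staged += data
        self.maybe_commit()

    def maybe_commit(self):
        if self.staged and self.now() - self.last_commit >= self.COMMIT_INTERVAL:
            self.put_object(bytes(self.staged))  # one PUT per commit window
            self.staged.clear()
            self.last_commit = self.now()

# Simulated clock so the example runs instantly
clock = [0.0]
puts = []
f = StagedFile(puts.append, now=lambda: clock[0])
f.write(b"hello ")
f.write(b"world")   # still staged: < 60 s elapsed
clock[0] = 61.0
f.write(b"!")       # interval elapsed -> single PUT of all staged bytes
assert puts == [b"hello world!"]
```

The point of the cadence is that many small POSIX mutations collapse into one immutable-object PUT — the boundary translation is explicit, batched, and observable, rather than hidden per-syscall.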
Design principles visible¶
- Remove limits, not feature-flag them — strong consistency, conditional writes, bucket-count ceilings: the pattern is to eliminate the sharp edge rather than expose it as a knob.
- Design for the property, not the API — concepts/elasticity and concepts/strong-consistency are the properties developers feel; the API is what changes least.
- patterns/customer-driven-prioritization — features are prioritised from direct conversations with customer-builders; customers probe unreleased REST verbs before launch.
- concepts/simplicity-vs-velocity — explicitly acknowledged tension: every simplification improves against an earlier feature that wasn't simple enough; racing to ship backloads simplification work that's more expensive later.
Seen in¶
- sources/2026-04-07-allthingsdistributed-s3-files-and-the-changing-face-of-s3 — Andy Warfield on the launch of S3 Files (2026-04-07) and the broader reframing of S3 as a multi-primitive storage platform. Introduces systems/s3-files (NFS mount over S3 data, EFS-backed), situates systems/s3-vectors (launched Nov 2025) in the lineage, and positions the three new primitives (Tables / Vectors / Files) alongside objects. Canonical articulation of concepts/boundary-as-feature — the "EFS3" convergence design dead end, and the post-holidays-2024 pivot to concepts/stage-and-commit as a programmable boundary primitive. Names concepts/agentic-data-access as an emerging design concern as agentic coding compresses application lifetimes and makes storage's decoupling role more load-bearing. Enumerates file/object semantic asymmetries (concepts/file-vs-object-semantics) in the most detail of any public AWS source, and frames the design discipline for resolving them (patterns/explicit-boundary-translation). Reported numbers: 2M+ tables in S3 Tables; 300B+ event notifications/day; 25M+ requests/sec to Parquet data alone; S3 Files read-bypass 3 GB/s per client / Tbps across clients; 60s commit cadence; 128 KB lazy-hydration threshold; 30-day file-side eviction; >50M-objects mount warning.
- sources/2025-02-25-allthingsdistributed-building-and-operating-s3 — Andy Warfield's FAST '23 keynote (republished on ATD in 2025-02-25). Complementary to the 19-birthday post: this one is the physical and operational story of S3. Surfaces (1) hard-drive physics — ~120 IOPS/drive, flat since before S3 launched; 26 TB today, 200 TB on the roadmap, so 1 IOPS per 2 TB at that point; see concepts/hard-drive-physics. (2) Heat management — requests per drive as a first-class placement problem; hotspots produce queueing → stragglers → concepts/tail-latency-at-scale; see concepts/heat-management. (3) Aggregate demand smoothing — millions of bursty tenants aggregate into a smooth curve no single one can move; see concepts/aggregate-demand-smoothing. (4) Spread placement + redundancy-for-heat — a bucket's objects on disjoint drive sets, letting one Lambda-parallel burst touch >1M disks; see patterns/data-placement-spreading, patterns/redundancy-for-heat, concepts/erasure-coding. (5) Organizational scale — "AWS ships its org chart," hundreds of microservices, durability reviews as threat-model for durability changes (patterns/durability-review, concepts/threat-modeling), systems/shardstore + concepts/lightweight-formal-verification as a guardrail, and concepts/ownership as the people-scaling lever.
- sources/2025-03-14-allthingsdistributed-s3-simplicity-is-table-stakes — Andy Warfield's 19th-birthday retrospective. Canonical post for the "properties, not API" framing, strong consistency / conditional-writes / bucket-limit / S3 Tables arcs.
- sources/2024-11-15-allthingsdistributed-aws-lambda-prfaq-after-10-years — day-one Lambda PR/FAQ points to S3 as the persistent store for stateless Lambda handlers: "persistent state should be stored in Amazon S3, Amazon DynamoDB, or another Internet-available storage service."
- sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2 — S3 as the durable substrate of Amazon Retail BDT's exabyte-scale data catalog: 50+ PB of Oracle table data landed on S3 in the 2016-2018 migration (wrapped in an systems/amazon-ion schema); swappable compute engines (Hive, Redshift, Spark, Athena, Flink, Glue, and now Ray) front the same S3 storage. Operational numbers from Q1 2024: 1.5 EiB of Parquet input on S3 compacted in a single quarter, >20 PiB/day input S3 read across >1,600 Ray jobs/day. Concrete "swap compute, keep storage" realisation of concepts/compute-storage-separation at exabyte scale.
- sources/2026-04-06-aws-unlock-efficient-model-deployment-simplified-inference-operator-setup-on-amazon-sagemaker-hyperpod — S3 in two named roles on the 2026-04-06 HyperPod Inference Operator EKS add-on: (1) TLS-certificate store for the operator's cert-manager issuance flow (a dedicated bucket named by the tlsCertificateS3Bucket parameter of the add-on config, reached via a VPC endpoint for in-VPC access); (2) model-weight store — InferenceEndpointConfig bring-your-own-model deployments reference weights on S3, loaded by the Mountpoint for Amazon S3 CSI driver bundled as a default dependency add-on. Instance in the long-running arc of S3-as-default-persistent-substrate for stateless managed-compute services.
- sources/2024-02-15-flyio-globally-distributed-object-storage-with-tigris
— S3 named in two adjacent roles in Tigris's architecture: (1) incumbent being improved on — Fly.io's framing of the single-write-region + CloudFront pattern as "no way to build a sandwich reviewing empire" for globally distributed users; (2) pluggable backend / archival tier — Tigris's QuiCK-style distribution queue propagates bytes out to "3rd party object stores… like S3", meaning Tigris can be configured with S3 as cold-tier origin while the regional NVMe / FoundationDB front handles hot distribution. Plus the S3-compatible API on Tigris's front: "If your framework can talk to S3, it can use Tigris" — the AWS SDK works unchanged via an AWS_ENDPOINT_URL_S3 override. First wiki example of S3's API shape being explicitly re-used as the presentation layer over a different-shaped backend (a concrete case of patterns/presentation-layer-over-storage at the storage-API level).
- sources/2025-05-20-flyio-litestream-revamped — CASAAS consumer entry. Fly.io's 2025-05-20 Litestream redesign cites S3's 2024-11 conditional-write launch as the load-bearing enabler for retiring Litestream's pre-existing "generations" abstraction: "Modern object stores like S3 and Tigris solve this problem for us: they now offer conditional write support. With conditional writes, we can implement a time-based lease." Canonical wiki instance of patterns/conditional-write-lease on S3 — S3's strong-consistency-plus-conditional-writes pair now substitutes for Consul / etcd in a production replication-coordination role, not just catalog-snapshot commits. Extends the customer-code-deletion framing (concepts/simplicity-vs-velocity) into a new substitution: coordination services as the thing S3 can replace for single-writer workloads.
- sources/2025-10-02-flyio-litestream-v050-is-here — CASAAS-shipped + newer-S3-APIs datapoint. Litestream v0.5.0 ships the CASAAS lease on S3 conditional writes that the 2025-05-20 post described; the post also notes "We've upgraded all our clients (S3, Google Storage, & Azure Blob Storage) to their latest versions. We've also moved our code to support newer S3 APIs" — implicit reference to the 2024-11 S3 conditional-writes feature CASAAS depends on (client-SDK currency was a precondition to shipping the revamp). Not a new S3-side disclosure — rather the first production-shipping instance of CASAAS-on-S3 in the wiki, distinct from the 2025-05-20 design-post framing.