All Things Distributed: S3 Files and the changing face of S3¶
Summary¶
Andy Warfield (VP/DE, S3) announces systems/s3-files, a new S3 feature that integrates Amazon EFS under S3 and lets any S3 bucket or prefix be mounted as a network filesystem from EC2, containers, or Lambda. The post spends most of its length on the design story: the team began with the obvious goal of fusing file and object into a single storage system ("EFS3"), spent most of 2024 finding that every convergence design produced a "battle of unpalatable compromises," and eventually inverted the problem — the boundary between file and object semantics was the feature, not a limitation to hide. The resulting architecture is stage and commit (a term borrowed from version control): changes accumulate in EFS, then batch-commit to S3 roughly every 60 seconds, with S3 as source of truth on conflict. The post also situates S3 Files as the third move in a multi-year program that reframes S3 from object store to storage platform with multiple first-class data primitives — alongside systems/s3-tables (GA Dec 2024) and systems/s3-vectors (launched Nov 2025). Thesis echoes the 2025-03-14 S3-at-19 post: "it's these properties of storage that really define S3 much more than the object API itself," and the role of storage is to abstract and decouple data from individual applications, a property that matters more as agentic coding compresses application lifetimes.
Key takeaways¶
- Data friction at the file/object boundary is the root problem, observed across genomics, M&E, ML pretraining, silicon design, and scientific computing. Warfield's UBC origin story: JS Legare at the Rieseberg sunflower-genomics lab built "bunnies," a containerized analysis layer over S3, and most of the friction was at the storage boundary — tools expected a local Linux filesystem; data lived in S3; "researchers were forever copying data back and forth, managing multiple, sometimes inconsistent copies." This friction is the generalisable problem S3 Files attacks.
- Agents amplify data friction. An agent reasoning about a dataset has to add "list S3 → copy to local disk → operate on local copies" to its reasoning chain; native POSIX access collapses that to one step. Beyond agents, every customer application with Unix-tool pipelines hits the same shape. As application lifetimes shorten and domain experts (not specialist coders) build more applications, storage's role as the stable layer beneath ephemeral applications grows. "As the pace of application development accelerates, this property of storage has become more important than ever." See concepts/agentic-data-access.
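A toy sketch of the step-count claim above — an in-memory dict stands in for S3, and the function names are illustrative, not AWS APIs:

```python
# Toy model of the access-pattern difference: object-style access adds
# list + copy steps before the read; a file mount collapses it to one read.
# "bucket" is an in-memory stand-in for S3 (illustrative, not boto3).

bucket = {"data/a.csv": b"1,2\n", "data/b.csv": b"3,4\n"}

def object_style_read(key):
    """Object path: list the prefix, copy down, then read the local copy."""
    steps = []
    keys = [k for k in bucket if k.startswith("data/")]   # 1. LIST
    steps.append("list")
    local = {k: bucket[k] for k in keys}                  # 2. copy to "local disk"
    steps.append("copy")
    data = local[key]                                     # 3. operate on the copy
    steps.append("read")
    return data, steps

def file_style_read(path):
    """File path: the mount presents the bucket directly; one read."""
    return bucket[path], ["read"]

obj_data, obj_steps = object_style_read("data/a.csv")
file_data, file_steps = file_style_read("data/a.csv")
assert obj_data == file_data                   # same bytes either way
assert len(obj_steps) == 3 and len(file_steps) == 1
```

The point for an agent's reasoning chain is the step count, not the bytes: three tool invocations shrink to one.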
- S3 has been quietly turning into a storage platform with multiple first-class data primitives. Four are now live: objects (2006 origin, immutable), tables (re:Invent 2024, managed Apache Iceberg), vectors (re:Invent 2025, similarity-search indices), and files (2026-04-07). Each is designed to hit S3's baseline properties — elasticity, durability, availability, performance, security — in the presentation best-suited to its workload.
- S3 Tables' lesson was structural, not product-tactical. Customers "told us that managing security policy was difficult, that they didn't want to have to manage table maintenance and compaction, and that they wanted working with tabular data to be easier… a lot of work on Iceberg and OTFs generally was being driven specifically for Spark. While Spark is very important…, people store data in S3 because they want to be able to work with it using any tool they want, even (and especially!) the tools that don't exist yet." i.e. S3-native primitives must not privilege one engine.
- S3 Vectors (Nov 2025) fixed a different friction shape — vectors as a data type, not just as a database feature. Existing vector databases kept indices in memory or on SSD (right for low-latency live search, wrong for storage-first workloads where "vectors themselves were often more bytes than the data being indexed, stored on media many times more expensive"). S3 Vectors' design anchors on S3-object-like cost/performance/durability, with full elasticity from hundreds to billions of vectors and a simple always-available similarity-search API endpoint. See systems/s3-vectors.
- "EFS3" was a design dead end. The initial plan — "put EFS and S3 in a giant pot, simmer it for a bit, and get the best of both worlds" — produced six months of passionate whiteboard sessions with senior engineers that never converged. Every path ended in a "battle of unpalatable compromises": file or object had to give something up. "It was really frustrating."
- The breakthrough: treat the boundary as the feature. Coming back from Christmas 2024, the team reframed: instead of hiding the file/object boundary, make it explicit, inspectable, and programmable. File access becomes a presentation layer over S3 data. The team borrowed a term from git — stage and commit — to describe how changes accumulate at the file layer and get pushed down to S3 as a batch. "Being explicit about the boundary between file and object presentations is something that I did not expect at all when the team started working on S3 Files, and it's something that I've really come to love about the design." See concepts/boundary-as-feature, concepts/stage-and-commit, patterns/presentation-layer-over-storage.
- Four hard subproblems once the boundary is explicit. Each required an explicit, asymmetric decision rather than a convergent compromise:
  (a) Consistency/atomicity — filesystems expect atomic rename and atomic directory rename; S3 has neither as a primitive. Resolution: allow file-layer mutations to coexist with S3 semantics; stage changes, commit as whole objects later. NFS close-to-open consistency on one side, full S3 atomic-PUT strong consistency on the other, with a sync layer connecting them.
  (b) Authorization — IAM is rich and expensive to evaluate; filesystem permissions are cheap, handle-based, and rely on directory permissions and inodes (hard links!). Resolution: permissions are specified on the mount, enforced in the filesystem layer, and mapped across the two worlds; IAM remains as a backstop ("you can always disable access at the S3 layer if you need to change a data perimeter"); the door is open to multiple mounts over the same data with different auth configs.
  (c) Namespace semantics — a "dreadful incongruity". Filesystems have first-class / path separators; S3's / is a suggestion (LIST accepts any delimiter), and S3 can have objects whose keys end in / (for 20 minutes the team considered "filerectories"). Resolution: admit the incongruity, let each side keep its native naming, and emit an event when an object or file can't be moved across — "clearly an example of downloading complexity onto the developer, but I think it's also a profoundly good example of that being the right thing to do."
  (d) Namespace performance asymmetry — filesystem metadata is directory-co-located and traversal-heavy; S3 metadata is flat and optimized for highly parallel point queries (billions of objects in one "directory"). Resolution: maintain a file-optimized namespace (EFS) alongside S3's, synchronised, with neither emulating the other.
- The "multiphase, not concurrent" insight unlocked adoption. "It turns out that very few applications use both file and object interfaces concurrently on the same data at the same instant. The far more common pattern is multiphase. A data processing pipeline uses filesystem tools in one stage to produce output that's consumed by object-based applications in the next." This reframes the design goal: not concurrent-coherent file+object semantics, but the same data in one place, with the right view for each access pattern, plus a synchronisation layer. It also lets existing object-semantics applications remain undisturbed — a non-negotiable given "enormous numbers of existing buckets serving applications that depend on S3's object semantics working exactly as documented."
- Mechanism: mount, lazy hydrate, stage, commit. On first directory access, S3 Files imports S3 metadata into a file-optimised namespace (EFS) — populated as a background scan, so mount-then-work is instantaneous even for multi-million-object buckets. Data is fetched on-read (<128 KB files hydrate fully on metadata import; larger files hydrate on access). File writes accumulate and commit back to S3 as a single PUT roughly every 60 seconds. S3→file sync is bidirectional; external S3 mutations show up in the filesystem view automatically. See concepts/lazy-hydration.
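A minimal in-memory sketch of stage-and-commit plus lazy hydration. The 128 KB threshold and whole-object commit come from the post; the class and its structure are illustrative, not the real implementation:

```python
# Toy model of stage-and-commit: a dict stands in for the bucket, another
# for the hydrated file layer, a third for staged (uncommitted) writes.

SMALL = 128 * 1024  # per the post: files under 128 KB hydrate at metadata import

class FileLayer:
    def __init__(self, bucket):
        self.bucket = bucket            # stand-in for S3 (source of truth)
        self.cache = {}                 # hydrated file-layer copies
        self.staged = {}                # writes not yet committed
        for key, data in bucket.items():
            if len(data) < SMALL:       # eager hydration for small objects only
                self.cache[key] = data

    def read(self, key):
        if key in self.staged:                      # staged writes win locally
            return self.staged[key]
        if key not in self.cache:                   # lazy hydration on first read
            self.cache[key] = self.bucket[key]
        return self.cache[key]

    def write(self, key, data):
        self.staged[key] = data                     # stage: file-layer visible only

    def commit(self):
        """Runs roughly every 60 s in the real system; one PUT per changed file."""
        for key, data in self.staged.items():
            self.bucket[key] = data                 # whole-object PUT
        self.staged.clear()

bucket = {"small.txt": b"hi", "big.bin": b"x" * (256 * 1024)}
fs = FileLayer(bucket)
assert "small.txt" in fs.cache and "big.bin" not in fs.cache   # eager vs lazy
fs.write("small.txt", b"edited")
assert bucket["small.txt"] == b"hi"        # object readers don't see staged writes
fs.commit()
assert bucket["small.txt"] == b"edited"    # visible after the commit batch
```

The key semantic the toy captures: between stage and commit, the file view and the object view legitimately disagree, and that window is the explicit boundary.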
- Conflict policy is asymmetric: S3 wins. "If there is ever a conflict where files are modified from both places at the same time, S3 is the source of truth and the filesystem version moves to a lost+found directory with a CloudWatch metric identifying the event." Non-negotiable for preserving S3 semantics for existing apps; lossy on the file side, but visibly so (metric + directory), never silently.
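The asymmetric rule can be sketched as a three-way sync against the state at last commit. This is a toy, not the real protocol — the post gives only the policy (S3 wins; file version → lost+found; metric emitted):

```python
# Toy version of "S3 wins on conflict". base = state at last sync.
# Returns (merged, lost_found, events); events stand in for CloudWatch metrics.

def sync(s3_side, file_side, base):
    merged, lost_found, events = dict(s3_side), {}, []
    for key, file_val in file_side.items():
        s3_changed = s3_side.get(key) != base.get(key)
        file_changed = file_val != base.get(key)
        if file_changed and s3_changed:
            # conflict: S3 is the source of truth; file version parks aside
            lost_found[key] = file_val
            events.append(("ConflictResolved", key))
        elif file_changed:
            merged[key] = file_val        # clean file-side edit commits normally
    return merged, lost_found, events

base = {"a": b"v1", "b": b"v1"}
s3   = {"a": b"v2", "b": b"v1"}           # S3 changed "a" externally
file = {"a": b"v3", "b": b"v9"}           # file layer changed both
merged, lf, ev = sync(s3, file, base)
assert merged == {"a": b"v2", "b": b"v9"} # S3 wins "a"; file edit to "b" lands
assert lf == {"a": b"v3"}
assert ev == [("ConflictResolved", "a")]
```

Note what the policy buys: existing object-side applications never observe a file-side write clobbering an object they just wrote, which is exactly the "S3 semantics exactly as documented" guarantee.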
- Tiering: active-working-set proportional. "File data that hasn't been accessed in 30 days is evicted from the filesystem view but not deleted from S3, so storage costs stay proportional to your active working set." The file layer is a hot cache; S3 is the cold storage of record. Matches S3 Files' billing model to real usage.
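A minimal sketch of working-set tiering under the 30-day rule. Day-granular timestamps and the class shape are simplifications for illustration:

```python
# Toy tiering model: file-layer copies idle >= 30 days are evicted;
# S3 (the storage of record) is never touched by eviction.

IDLE_DAYS = 30

class TieredView:
    def __init__(self, bucket):
        self.bucket = bucket      # stand-in for S3; eviction never deletes here
        self.cache = {}           # file layer: key -> (data, last_access_day)

    def read(self, key, today):
        data = self.cache[key][0] if key in self.cache else self.bucket[key]
        self.cache[key] = (data, today)       # (re)hydrate and touch
        return data

    def evict_idle(self, today):
        for key, (_, last) in list(self.cache.items()):
            if today - last >= IDLE_DAYS:
                del self.cache[key]           # drop the hot copy only

view = TieredView({"a": b"1", "b": b"2"})
view.read("a", today=0)
view.read("b", today=25)
view.evict_idle(today=40)                     # "a" idle 40 days, "b" idle 15
assert "a" not in view.cache and "b" in view.cache
assert view.read("a", today=40) == b"1"       # transparently rehydrates from S3
```

This is what "billing proportional to the active working set" means mechanically: the file layer is a cache sized by recency, and a cold read just rehydrates.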
- "Read bypass" — throughput escape hatch. High-throughput sequential reads automatically reroute off the NFS data path and instead issue parallel GETs directly to S3. Reported: 3 GB/s per client (with room to improve), terabits per second across many clients. This is the most performance-sensitive workload class (ML training, media transcoding, big sequential scans) getting the full-fat S3 throughput tier that CRT-based clients already get — without leaving the filesystem API.
- Honest about edges; the explicit boundary is what makes honesty possible. Known launch-day limitations: (a) Renames are expensive. S3 has no native rename → renaming a directory copies and deletes every object under that prefix. The mount-time warning fires when an intended mount would cover > 50 million objects. (b) No explicit commit control at launch. 60-second window "works for most workloads but we know it won't be enough for everyone" — future work on the stage-and-commit programmatic surface. (c) Some S3 keys aren't valid POSIX filenames and won't appear in the filesystem view; events surface them.
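Why directory rename is O(objects under the prefix): with no rename primitive, every key is copied then deleted. A toy count of the API calls (dict stand-in for the bucket; call names are illustrative):

```python
# Rename-as-copy-plus-delete over a prefix: 2 API calls per object,
# which is why the mount-time warning fires above ~50M objects.

def rename_prefix(bucket, src, dst):
    ops = 0
    for key in [k for k in bucket if k.startswith(src)]:
        bucket[dst + key[len(src):]] = bucket[key]   # "COPY"
        del bucket[key]                              # "DELETE"
        ops += 2
    return ops

bucket = {f"in/{i}.dat": b"x" for i in range(1000)}
ops = rename_prefix(bucket, "in/", "out/")
assert ops == 2000                                   # 2 calls per object
assert len(bucket) == 1000
assert all(k.startswith("out/") for k in bucket)
```

At 50 million objects that is 100 million operations for a single `mv` — the explicit boundary makes the cost visible up front instead of letting it surprise the caller mid-rename.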
- Nine months of customer beta shaped the GA. The edges above are explicitly called out as things the team has "continued to evolve and iterate on with early customers."
- Thesis: storage outlives applications; make storage easy to attach to. "Applications will come and go, and as always, data outlives all of them. The role of effective storage systems has always been not just to safely store data, but also to help abstract and decouple it from individual applications." The explicit boundary is a programmable surface S3 can continue to evolve (richer commit control, pipeline integration), which is the forward bet — "stage and commit gives us a surface that we can continue to evolve."
Architectural diagrams / numbers¶
- Scale context for S3 reframing: S3 "stores exabytes of parquet data and averages over 25 million requests per second to that format alone." Over 2 million tables stored in S3 Tables today (~16 months after launch). S3 sends "over 300 billion event notifications every day" to serverless listeners — cited as an example of subsystem scale, and also the mechanism downstream systems (CRR, log processing, image transcoding) rely on for at-least-once semantics.
- S3 Files mechanism:
- Mount target: any bucket or prefix, from EC2 / containers / Lambda.
- First-access hydration: < 128 KB files pull data + metadata; larger files pull metadata only, data on-read.
- Commit interval: roughly every 60 seconds (single PUT per changed object).
- Eviction: 30 days idle → filesystem view evicts; S3 keeps.
- Warning threshold for rename cost: > 50M objects under mount.
- Read bypass: 3 GB/s per client, Tbps across many clients.
- Beta duration: ~9 months in customer beta before launch.
- S3 Vectors: "elastic, meaning that you can quickly create an index with only a few hundred records in it, and scale over time to billions of records." Design anchor: "performance, cost and durability profile that is very similar to S3 objects."
Caveats¶
- Post is a launch + design essay, not a technical paper. No published SLOs on commit latency, sync lag, or rename throughput. No internal system diagram of the stage/commit pipeline.
- 60-second commit window + "S3 wins on conflict" is a real semantic difference from a classical NFS mount; applications that do need bidirectional fine-grained coherence (the rare "concurrent" case) should not use S3 Files. Post acknowledges this.
- Rename cost is O(objects under prefix) at launch; no hint of a future platform-level rename primitive.
- S3 Vectors is namedropped with performance-cost-durability framing but no measured numbers and no internal architecture sketch — this post touches it only as context for the "S3 as multi-primitive storage platform" thesis.
- "Read bypass" is described in concept but the automatic-trigger heuristic is not specified; unclear when it engages for mixed workloads.
Links¶
- Raw: raw/allthingsdistributed/2026-04-07-s3-files-and-the-changing-face-of-s3-9b9b736d.md
- Original: https://www.allthingsdistributed.com/2026/04/s3-files-and-the-changing-face-of-s3.html
- S3 Files launch page: https://aws.amazon.com/s3/features/files/
- S3 Files technical docs: https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-files.html
- pgvector — referenced Postgres vector extension.
- GATK4 — genomic analysis toolkit named in the origin story.
- bunnies — JS Legare's containerized-on-S3 genomics pipeline.
- "Sprocket" / burst-parallel computing — SoCC '18 paper cited for the burst-parallel framing.
- Companion: sources/2025-03-14-allthingsdistributed-s3-simplicity-is-table-stakes — S3-at-19 "properties not API" thesis.
- Companion: sources/2025-02-25-allthingsdistributed-building-and-operating-s3 — Warfield's FAST '23 keynote on the physical/operational side of S3.