All Things Distributed (Werner Vogels)

All Things Distributed is the personal engineering blog of Werner Vogels, CTO of Amazon.com. Tier-1 on the sysdesign-wiki: canonical commentary on AWS service history, distributed-systems culture (PR/FAQs, tenets, Working Backwards), and the engineering rationale behind foundational AWS primitives. Content often includes primary-source material — internal documents, talks, and first-person retrospectives — not available elsewhere.

Scope and why Tier 1

  • Direct CTO-level perspective on design decisions at AWS scale.
  • Cross-references: every major AWS foundational paper/product (Dynamo, S3, Lambda, DynamoDB, EC2) gets a retrospective here.
  • Companion to AWS re:Invent keynotes; "behind the curtain" material on how Amazon makes architectural decisions (narrative docs, tenets, working backwards).
  • When a claim about AWS architecture is contested, an allthingsdistributed post is often the definitive source.
  • Occasionally publishes guest posts from AWS Distinguished Engineers (e.g. Andy Warfield on S3) with primary-source technical detail not available elsewhere.

Key systems (as referenced from this blog)

  • systems/aws-ebs — network block storage for EC2 (2008); HDD→SSD→Nitro→SRD→custom-SSD arc; >140T ops/day today; sub-ms io2 Block Express.
  • systems/nitro — AWS offload card + lightweight hypervisor family; VPC, EBS, encryption all moved off Xen dom0.
  • systems/srd — data-center transport that replaces TCP for storage; multi-path, out-of-order, offload-friendly; also systems/ena-express for guest TCP.
  • systems/aws-nitro-ssd — AWS custom SSDs, EBS-tailored.
  • systems/physalia — EBS's config/control plane; "removes the control plane from the IO path."
  • systems/xen — EC2's pre-Nitro hypervisor; its defaults capped hosts at 64 outstanding IOs.
  • systems/aws-s3 — object storage; 19-year retrospective (2025) on simplicity as an architectural property, feature arcs (consistency, conditional writes, bucket limits, Tables); and FAST '23 keynote (2025-02-25) on physical/operational scale (HDD physics, heat management, ShardStore, durability reviews, ownership).
  • systems/shardstore — S3's rewritten per-disk storage layer (Rust + executable-spec lightweight formal verification).
  • systems/s3-tables — managed-Iceberg first-class table resource (re:Invent 2024).
  • systems/s3-vectors — elastic similarity-search indices as a first-class S3 primitive (re:Invent 2025); S3-object-like cost/performance/durability profile; hundreds → billions of vectors.
  • systems/s3-files — NFS mount over any S3 bucket/prefix (2026-04-07); EFS-backed filesystem presentation; stage-and-commit translation to S3 objects; design-breakthrough origin of concepts/boundary-as-feature.
  • systems/aws-efs — under-the-covers filesystem backing for S3 Files.
  • systems/s3-express-one-zone — SSD, single-AZ latency tier (2023).
  • systems/metabucket — S3's bucket-metadata subsystem.
  • systems/aws-crt — Common Runtime; S3 client best-practice library.
  • systems/apache-iceberg — open table format; the pattern S3 Tables absorbed.
  • systems/apache-parquet — columnar on-object file format.
  • systems/aws-lambda — serverless compute service; launch PR/FAQ published here at 10 years; decade-long network-topology retrofit (2026-04-22) disclosed the specific eBPF / iptables / boot-time-pre-creation techniques that unified snapshot + on-demand topologies and scaled snapshot networks 200 → 4,000 per worker.
  • systems/aws-lambda-snapstart — Firecracker-snapshot-based cold-start acceleration (2022); the forcing function for the 2026-04-22 Lambda network-topology unification.
  • systems/firecracker — Lambda's micro-VM isolation primitive; density unlock for multi-tenant serverless. The 2026-04-22 Lambda networking post shows that Firecracker's density is only realizable with a re-engineered network topology.
  • systems/ebpf — the user-space-loadable kernel-program substrate Lambda chose over DPDK and custom-kernel-patches for the 2026-04-22 Geneve-VNI-rewrite and stateless-NAT fixes (the wiki's first AWS data-plane eBPF disclosure).
  • systems/aurora-dsql — serverless distributed SQL (re:Invent 2024); single-journal-per-commit + Crossbar subscription router; 100% JVM → 100% Rust journey.
  • systems/postgresql — DSQL extends Postgres via public extension API rather than forking.
  • systems/aws-sagemaker-ai — AWS's unified managed ML platform (2017 launch); umbrella for Studio, spaces, notebooks, managed training, hosting, and HyperPod.
  • systems/aws-sagemaker-hyperpod — SageMaker's large-scale distributed-training / inference compute substrate; surface for observability + model-deployment + training-operator changes (2025).
  • systems/aws-systems-manager — SSM Session Manager is the substrate under SageMaker AI's StartSession SSH-over-SSM tunnel.
  • systems/ussd — 1990s GSM stateful session-based menu protocol; 2G, no data plan, $20 feature phones; Werner's Oct 2025 thesis-post entity.
  • systems/mpesa — Safaricom mobile-money platform on AWS; 4K TPS, real-time ML fraud detection, >$100B processed in 2024; introduced in Werner's USSD post as the flagship patterns/feature-phone-frontend instance.
  • systems/koko-networks — Sub-Saharan bioethanol cooking-fuel IoT network; 700+ cloud-connected KOKOpoint stations; same USSD/feature-phone customer edge applied to physical-goods retail.
  • systems/bedrock-guardrails-automated-reasoning-checks — Bedrock capability that verifies AI outputs against a customer-authored specification; up to 99% provable accuracy; finance / healthcare / government target industries.
  • systems/bedrock-agentcore — AWS agent runtime for mechanically enforcing capability envelopes on agentic systems; the enforcement half of patterns/envelope-and-verify.
  • systems/kiro — AWS's specification-driven development tool; flagship surface for agentic coding + formal proof combined.
  • systems/lean — interactive theorem prover founded and led by Leo de Moura (at Amazon); named by Cook as the most promising AI-reliability development; DeepSeek combines Lean + RL.
  • systems/aws-policy-interpreter — decade of automated-reasoning proof over IAM / Cedar semantics; proofs now extend to agent-generated policy changes.
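
The 64-outstanding-IO cap noted under systems/xen can be read through Little's law, in line with the queueing-theory framing of the EBS retrospective: sustained throughput is at most concurrency divided by per-IO latency. The sketch below is illustrative only. The function name and both latency values are assumptions (2.5 ms is the midpoint of the 2-3 ms cited for 2012 Provisioned IOPS, and 0.5 ms stands in for "sub-ms"); they are not measured host figures.

```python
def max_iops(outstanding_ios: int, latency_seconds: float) -> float:
    """Little's law upper bound: throughput = concurrency / latency."""
    if latency_seconds <= 0:
        raise ValueError("latency must be positive")
    return outstanding_ios / latency_seconds


XEN_OUTSTANDING_CAP = 64  # from the systems/xen entry

# Illustrative per-IO latencies (assumptions, not measurements).
for label, latency in [("2.5 ms", 2.5e-3), ("0.5 ms", 0.5e-3)]:
    bound = max_iops(XEN_OUTSTANDING_CAP, latency)
    print(f"{label} per IO: at most {bound:,.0f} IOPS per host")
```

With the queue depth fixed, the only way to raise the bound is to cut latency, which is one way to see why a queue-depth default mattered more as the media got faster.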

Key patterns / concepts

Recent articles

  • 2026-04-22 — sources/2026-04-22-allthingsdistributed-invisible-engineering-behind-lambdas-network (Werner Vogels' retrospective on the AWS Lambda networking team's decade-long silent retrofit, told through Marc Olson's propeller-to-jet-in-flight metaphor and framed around Aristotle's arete. The decade-long arc: 2019 Firecracker migration cut cold-start >10s→<1s; VPC cold-start still paid ~300 ms for Geneve tunnel + DHCP; 2022 SnapStart launch forced a second network-topology rebuild (clones need pre-created network namespaces); unified topology 2026 scales snapshot networks 200 → 4,000 per worker (20×) with 3-min boot cost and −1% fleet-wide CPU. Specific kernel + eBPF techniques disclosed: (1) eBPF-based Geneve header rewrite (patterns/ebpf-header-rewrite-on-egress, concepts/geneve-tunnel-vni) cut tunnel latency 150 ms → 200 μs (~750×) — tunnels pre-created with dummy VNIs, eBPF rewrites to real VNI once function init produces it, reverses on ingress. Lambda rejected a custom kernel driver to avoid "maintaining Lambda-specific patches upstream indefinitely" (patterns/upstream-the-fix); eBPF chosen over DPDK on lower-overhead axis, with Cilium cited as the at-scale existence proof that de-risked the bet. (2) Stateless NAT via eBPF (concepts/stateless-nat-via-ebpf) replaced the dual-stage stateful iptables + conntrack at 100× lower setup latency. (3) Per-slot iptables moved into each slot's network namespace (patterns/per-slot-iptables-in-namespace) compressed root-namespace rules from 125,000+ to 144 static slot-agnostic rules; rule-traversal cost became constant instead of scaling with slot index (worst case was ~1 ms/packet at slot 4,000). (4) All 4,000 networks pre-created at worker boot (patterns/pre-create-all-network-slots-at-boot) instead of on-demand — canonical wiki instance of Colm MacCárthaigh's constant-work principle from the AWS Builders' Library; the amortization works because worker lifetime ≫ micro-VM lifetime. 
(5) RTNL-lock-friendly ordering (concepts/rtnl-lock-contention) — pool namespaces first, create veth inside namespace, batch eBPF attachments — turned parallel-create-4,000-networks from "minutes" back to "seconds." DHCP is still open ("a multi-phase effort the team is currently working through"), so Geneve is the compressed portion of the 300 ms VPC cold-start overhead, not the whole thing. Productization arc: the full Lambda networking stack was encapsulated as an internal service that Aurora DSQL now consumes (patterns/encapsulate-optimization-as-internal-service) — DSQL requests/uses/releases network slots via a Lambda-owned service; Lambda vends new versions and every optimization flows to DSQL automatically, "saved the DSQL team months of engineering effort." Canonical wiki disclosure that DSQL consumes Lambda's networking substrate as a managed internal service, not as a copy of the stack. Thesis: success is silent; "optimizing iptables rules, working around kernel lock contention" doesn't make headlines, but "knowing what you've worked on is better today than it was a week ago, and that the next team won't run into the same constraints you just removed" is the reward. Extends Marc Olson's 2024-08 EBS retrospective voice and complements the Lambda PR/FAQ 10-year retrospective with the operational-scaling counterpart. Credited: Ravi Nagayach, Prashant Singh, Kshitij Gupta, and the entire Lambda networking team.)

  • 2026-04-07 — sources/2026-04-07-allthingsdistributed-s3-files-and-the-changing-face-of-s3 (Andy Warfield guest post, introduced by Werner Vogels. Launch of systems/s3-files — NFS mount over any S3 bucket/prefix, backed by EFS, accessible from EC2 / containers / Lambda. Most of the post is the design story: six months of attempted "EFS3" convergence in 2024 produced a "battle of unpalatable compromises"; post-Christmas-2024 the team inverted the goal — the boundary between file and object semantics IS the feature, not a limitation to hide. Origin and canonical articulation of concepts/boundary-as-feature ("we spent months trying to make it disappear, and when we finally accepted it as a first-class element of the system, everything got better"). Architecture: concepts/stage-and-commit translation layer — file-side changes accumulate in EFS, commit back to S3 as one PUT per changed object roughly every 60 seconds; bidirectional sync; conflict policy: S3 wins, filesystem-side loser → lost+found + CloudWatch metric. concepts/lazy-hydration — first access imports S3 metadata as background scan, files < 128 KB co-hydrate data, larger files hydrate on read; 30-day idle eviction keeps active working set proportional. Read bypass reroutes high-throughput sequential reads off NFS to parallel direct-GETs against S3 — 3 GB/s per client, Tbps across many clients. Enumerates five axes of concepts/file-vs-object-semantics asymmetry (mutation granularity / atomicity / auth / namespace / performance) more exhaustively than any prior AWS source. Multiphase-not-concurrent insight: "very few applications use both file and object interfaces concurrently on the same data at the same instant." Known edges called out: rename is O(objects) (warning > 50M objects mount), no programmatic explicit-commit API at launch, some S3 keys aren't valid POSIX filenames.
Multi-primitive lineage: S3 Files is the third new first-class data primitive added to S3 after systems/s3-tables (re:Invent 2024) and systems/s3-vectors (re:Invent 2025), following the patterns/presentation-layer-over-storage pattern. Named framing of concepts/agentic-data-access — as agentic coding compresses application lifetimes, storage's role as the stable data layer grows. Reported scale: 2M+ tables in S3 Tables today, 300B+ event notifications/day from S3, 25M+ req/s to Parquet data alone. 9 months of customer beta shaped the launch edges. Extends concepts/immutable-object-storage with a file-semantics escape hatch that preserves the object invariant rather than weakening it; concepts/simplicity-vs-velocity restated — "stage and commit gives us a surface that we can continue to evolve".)

  • 2026-02-17 — sources/2026-02-17-allthingsdistributed-byron-cook-automated-reasoning-trust-ai (Werner Vogels interviews Byron Cook (Amazon Distinguished Scientist + VP) three and a half years after their first automated-reasoning conversation. Thesis: trust is the production blocker for generative + agentic AI, and concepts/neurosymbolic-ai — mechanical theorem provers composed with LLMs — is the path to delivering it. Two enabling forces since 2022: LLMs are now trained over theorem-prover outputs (Isabelle/HOL-light/systems/lean) which dissolves the user-friction barrier; regulated-industry customers (finance/healthcare/government) now have concrete provability demands testing cannot answer. AWS ships systems/bedrock-guardrails-automated-reasoning-checks (up to 99% provable accuracy on AI outputs vs. a customer-supplied specification — realizes patterns/post-inference-verification), and systems/bedrock-agentcore as the runtime that mechanically enforces agent capability envelopes. Together with systems/kiro (spec authoring) these form Cook's three-part patterns/envelope-and-verify: specify the envelope, AgentCore enforces it, automated reasoning proves invariants over the composition. AWS's moat: a decade of proof over the systems/aws-policy-interpreter, cryptography, networking protocols, virtualization layer — and a 2025 pan-Amazon whole-service data-flow analyzer under CISO Amy Herzog reasoning about invariants like "data at rest is encrypted" / "credentials are never logged" — all of which now extends to reasoning about agentic-tool-generated code changes. Cook predicts specification becomes mainstream: customers will discover and demand branching-time vs linear-time, past-time vs future-time, epistemic, and causal operators from spec-driven tools — see concepts/temporal-logic-specification and concepts/specification-driven-development. Autoformalization (natural-language → formal spec) is the UX bottleneck — DARPA expMath is the public research face; Kiro + Guardrails reasoning checks are the product face.
Fundamental scaling limit — NP-complete / undecidable — addressed via distributed SAT (mallob) and LLM-guided proof search. Extends concepts/lightweight-formal-verification (S3/ShardStore case) to runtime AI-output verification and organization-wide invariant enforcement; concepts/threat-modeling shape generalizes a third time (security → durability → agent envelopes). Ecosystem: DeepSeek, DeepMind/Google pushing neurosymbolic; new startups Atalanta / Axiom Math / Harmonic.fun / Leibnitz.)

  • 2025-08-06 — sources/2025-08-06-allthingsdistributed-removing-friction-sagemaker-ai-development (Werner Vogels surveys four 2025 SageMaker AI capabilities that remove distinct friction points: StartSession API — productizes SSH-over-SSM tunnels into SageMaker Studio spaces, answering SageMaker's #1 feature request, so local VS Code attaches to managed compute without bastion hosts or hand-rolled tunnels (patterns/secure-tunnel-to-managed-compute); HyperPod observability — auto-scaling collectors replace CPU-bound single-threaded ones (patterns/auto-scaling-telemetry-collector), auto-correlate high-cardinality metrics, detect grey failures — GPU thermal throttling, NIC packet loss — not just binary ones (concepts/grey-failure); explicitly framed as an answer to the observability paradox where the monitoring stack itself becomes the failure source (concepts/monitoring-paradox); HyperPod model deployment — train + serve on the same GPU cluster, collapsing the historical training/serving infra boundary (concepts/training-serving-boundary); HyperPod training operator for Kubernetes — restart only affected resources not the whole job (patterns/partial-restart-fault-recovery); monitors stalled batches + non-numeric loss; YAML-defined recovery policies.)

  • 2025-05-27 — sources/2025-05-27-allthingsdistributed-aurora-dsql-rust-journey (Werner hosts a guest post by Sr. Principal Engineers Niko Matsakis and Marc Bowes on the engineering journey of systems/aurora-dsql: how they scaled writes without 2PC — single-journal-per-commit plus a novel Crossbar subscription router — and why DSQL moved from 100% JVM / Kotlin to 100% Rust, driven by concepts/tail-latency-at-scale math (40-host simulation: ~6K TPS vs. ~1M target, 10s tail vs. 1s) and concepts/memory-safety economics on new extension code. DSQL uses Postgres via its public extension API rather than forking. Retracts the earlier "Kotlin control plane, Rust data plane" split in favor of unified Rust.)
  • 2025-03-14 — sources/2025-03-14-allthingsdistributed-s3-simplicity-is-table-stakes (S3 at 19. Andy Warfield reframes "simple" as a property of the experience, not the API: elasticity, strong consistency, conditional writes, bucket-limit rewrite, SSD/low-latency class, and S3 Tables as the object→table-as-first-class-resource move. Canonical statement that the properties of S3 storage, not the object API, define the system.)
  • 2025-02-25 — sources/2025-02-25-allthingsdistributed-building-and-operating-s3 (Andy Warfield's FAST '23 keynote, republished on ATD. The physical/operational counterpart to the 2025-03-14 "simplicity" post. HDD physics — ~120 IOPS/drive flat since 2006, 200 TB drives incoming → 1 IOPS per 2 TB. Heat management as placement problem. Aggregate demand smooths over millions of bursty tenants. Spread placement + redundancy-for-heat → single customer bursts onto 1M+ disks. Org: hundreds of microservices, "AWS ships its org chart." Durability reviews as threat-model for durability changes. ShardStore rewritten in Rust with a ~1%-size executable spec checked into the same repo → lightweight formal verification as an industrialized guardrail, SOSP paper. Ownership as a people-scaling lever — "my best ideas are the ones that other people have instead of me.")
  • 2024-11-15 — sources/2024-11-15-allthingsdistributed-aws-lambda-prfaq-after-10-years (The internal PR/FAQ that launched AWS Lambda, re-published at 10 years with annotations — what shipped as written, what evolved, what was deferred. Canonical artefact of Amazon's PR/FAQ doc culture.)
  • 2024-08-22 — sources/2024-08-22-allthingsdistributed-continuous-reinvention-block-storage-at-aws (Marc Olson, guest post. 13-year insider retrospective on systems/aws-ebs: queueing theory framing; HDD→SSD (2012 Provisioned IOPS, 1k IOPS / 2-3ms); instrumentation turnaround; the systems/xen ring-default that capped hosts at 64 outstanding IOs; first and second systems/nitro offload cards; systems/srd replaces TCP for storage and becomes systems/ena-express for guests; custom systems/aws-nitro-ssd; the 2013 patterns/hot-swap-retrofit where SSDs were taped into every HDD server with zero disruption; patterns/nondisruptive-migration as a compounding primitive; Olson's personal shift from deep-diving-everything to patterns/peer-debugging leadership. Today: >140T ops/day, sub-ms io2 Block Express latency.)
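
The stage-and-commit behaviour summarised in the 2026-04-07 S3 Files entry (file-side changes accumulate, each changed object is committed as one PUT, and the object side wins conflicts with the file-side loser moved to lost+found) can be sketched as a toy in-memory model. Everything here is hypothetical: the class names, the version counter used to detect conflicts, and the explicit `commit()` call are illustrative assumptions, not the S3 Files implementation or API. The entry itself notes there is no programmatic explicit-commit API at launch, and the CloudWatch metric is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectStore:
    """Toy stand-in for an object store: whole-object PUTs, versioned keys."""
    data: dict = field(default_factory=dict)      # key -> bytes
    versions: dict = field(default_factory=dict)  # key -> int
    put_count: int = 0

    def put(self, key: str, body: bytes) -> None:
        self.data[key] = body
        self.versions[key] = self.versions.get(key, 0) + 1
        self.put_count += 1


@dataclass
class StagingLayer:
    """Hypothetical file-side staging area that commits back in batches."""
    store: ObjectStore
    staged: dict = field(default_factory=dict)        # key -> latest staged bytes
    base_version: dict = field(default_factory=dict)  # key -> version at first write
    lost_found: dict = field(default_factory=dict)    # key -> losing staged bytes

    def write(self, key: str, body: bytes) -> None:
        # Remember the object version only on the first staged write per key.
        self.base_version.setdefault(key, self.store.versions.get(key, 0))
        self.staged[key] = body  # later writes overwrite earlier staged ones

    def commit(self) -> int:
        """Flush staged changes; returns the number of PUTs issued."""
        puts = 0
        for key, body in self.staged.items():
            if self.store.versions.get(key, 0) != self.base_version[key]:
                # The object changed on the object side: it wins, and the
                # staged copy is set aside instead of being written.
                self.lost_found[key] = body
            else:
                self.store.put(key, body)
                puts += 1
        self.staged.clear()
        self.base_version.clear()
        return puts


store = ObjectStore()
fs = StagingLayer(store)
for i in range(5):
    fs.write("report.csv", f"rev {i}".encode())      # five file-side edits
fs.write("notes.txt", b"draft")
store.put("notes.txt", b"written via object API")    # concurrent object-side write
issued = fs.commit()
print(issued, sorted(fs.lost_found))
```

In this model the five edits to `report.csv` collapse into a single PUT, while the staged `notes.txt` loses to the object-side write and lands in `lost_found`.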
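
The rule-count figures in the 2026-04-22 Lambda entry (125,000+ root-namespace rules reduced to 144 static ones, with a worst case of about 1 ms per packet at slot 4,000) can be turned into a back-of-envelope traversal model. This sketch assumes rules were spread evenly across slots and scanned linearly, and it treats "125,000+" as exactly 125,000. Both are simplifying assumptions made for illustration, and the function names are hypothetical.

```python
SLOTS = 4_000
ROOT_RULES_BEFORE = 125_000   # "125,000+" in the entry; treated as exact here
ROOT_RULES_AFTER = 144        # static, slot-agnostic rules
WORST_CASE_SECONDS = 1e-3     # ~1 ms/packet at slot 4,000, per the entry

# Assumption: rules are evenly distributed over slots and scanned in order.
rules_per_slot = ROOT_RULES_BEFORE / SLOTS
seconds_per_rule = WORST_CASE_SECONDS / ROOT_RULES_BEFORE


def rules_scanned_before(slot: int) -> float:
    """Linear model: a packet for slot N walks past N slots' worth of rules."""
    return slot * rules_per_slot


def rules_scanned_after(slot: int) -> int:
    """Per-slot rules live in the slot's namespace; the root chain is fixed."""
    return ROOT_RULES_AFTER


for slot in (1, 2_000, 4_000):
    before = rules_scanned_before(slot) * seconds_per_rule
    after = rules_scanned_after(slot) * seconds_per_rule
    print(f"slot {slot}: ~{before * 1e6:.2f} us before, ~{after * 1e6:.2f} us after")
```

Under these assumptions the root-chain cost no longer depends on the slot index, which is the constant-versus-linear point the entry makes.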
Last updated · 542 distilled / 1,571 read