All Things Distributed (Werner Vogels)
All Things Distributed is the personal engineering blog of Werner Vogels, CTO of Amazon.com. Tier-1 on the sysdesign-wiki: canonical commentary on AWS service history, distributed-systems culture (PR/FAQs, tenets, Working Backwards), and the engineering rationale behind foundational AWS primitives. Content often includes primary-source material — internal documents, talks, and first-person retrospectives — not available elsewhere.
Scope and why Tier 1
- Direct CTO-level perspective on design decisions at AWS scale.
- Cross-references: every major AWS foundational paper/product (Dynamo, S3, Lambda, DynamoDB, EC2) gets a retrospective here.
- Companion to AWS re:Invent keynotes; "behind the curtain" material on how Amazon makes architectural decisions (narrative docs, tenets, working backwards).
- When a claim about AWS architecture is contested, an allthingsdistributed post is often the definitive source.
- Occasionally publishes guest posts from AWS Distinguished Engineers (e.g. Andy Warfield on S3) with primary-source technical detail not available elsewhere.
Key systems (as referenced from this blog)
- systems/aws-ebs — network block storage for EC2 (2008); HDD→SSD→Nitro→SRD→custom-SSD arc; >140T ops/day today; sub-ms io2 Block Express.
- systems/nitro — AWS offload card + lightweight hypervisor family; VPC, EBS, encryption all moved off Xen dom0.
- systems/srd — data-center transport that replaces TCP for storage; multi-path, out-of-order, offload-friendly; also systems/ena-express for guest TCP.
- systems/aws-nitro-ssd — AWS custom SSDs, EBS-tailored.
- systems/physalia — EBS's config/control plane; "removes the control plane from the IO path."
- systems/xen — EC2's pre-Nitro hypervisor; its defaults capped hosts at 64 outstanding IOs.
- systems/aws-s3 — object storage; 19-year retrospective (2025) on simplicity as an architectural property, feature arcs (consistency, conditional writes, bucket limits, Tables); and FAST '23 keynote (2025-02-25) on physical/operational scale (HDD physics, heat management, ShardStore, durability reviews, ownership).
- systems/shardstore — S3's rewritten per-disk storage layer (Rust + executable-spec lightweight formal verification).
- systems/s3-tables — managed-Iceberg first-class table resource (re:Invent 2024).
- systems/s3-vectors — elastic similarity-search indices as a first-class S3 primitive (re:Invent 2025); S3-object-like cost/performance/durability profile; hundreds → billions of vectors.
- systems/s3-files — NFS mount over any S3 bucket/prefix (2026-04-07); EFS-backed filesystem presentation; stage-and-commit translation to S3 objects; design-breakthrough origin of concepts/boundary-as-feature.
- systems/aws-efs — under-the-covers filesystem backing for S3 Files.
- systems/s3-express-one-zone — SSD, single-AZ latency tier (2023).
- systems/metabucket — S3's bucket-metadata subsystem.
- systems/aws-crt — Common Runtime; S3 client best-practice library.
- systems/apache-iceberg — open table format; the pattern S3 Tables absorbed.
- systems/apache-parquet — columnar on-object file format.
- systems/aws-lambda — serverless compute service; launch PR/FAQ published here at 10 years; decade-long network-topology retrofit (2026-04-22) disclosed the specific eBPF / iptables / boot-time-pre-creation techniques that unified snapshot + on-demand topologies and scaled snapshot networks 200 → 4,000 per worker.
- systems/aws-lambda-snapstart — Firecracker-snapshot-based cold-start acceleration (2022); the forcing function for the 2026-04-22 Lambda network-topology unification.
- systems/firecracker — Lambda's micro-VM isolation primitive; density unlock for multi-tenant serverless. The 2026-04-22 Lambda networking post shows that Firecracker's density is only realizable with a re-engineered network topology.
- systems/ebpf — the user-space-loadable kernel-program substrate Lambda chose over DPDK and custom-kernel-patches for the 2026-04-22 Geneve-VNI-rewrite and stateless-NAT fixes (the wiki's first AWS data-plane eBPF disclosure).
- systems/aurora-dsql — serverless distributed SQL (re:Invent 2024); single-journal-per-commit + Crossbar subscription router; 100% JVM → 100% Rust journey.
- systems/postgresql — DSQL extends Postgres via public extension API rather than forking.
- systems/aws-sagemaker-ai — AWS's unified managed ML platform (2017 launch); umbrella for Studio, spaces, notebooks, managed training, hosting, and HyperPod.
- systems/aws-sagemaker-hyperpod — SageMaker's large-scale distributed-training / inference compute substrate; surface for observability + model-deployment + training-operator changes (2025).
- systems/aws-systems-manager — SSM Session Manager is the substrate under SageMaker AI's StartSession SSH-over-SSM tunnel.
- systems/ussd — 1990s GSM stateful session-based menu protocol; 2G, no data plan, $20 feature phones; Werner's Oct 2025 thesis-post entity.
- systems/mpesa — Safaricom mobile-money platform on AWS; 4K TPS, real-time ML fraud detection, >$100B processed in 2024; introduced in Werner's USSD post as the flagship patterns/feature-phone-frontend instance.
- systems/koko-networks — Sub-Saharan bioethanol cooking-fuel IoT network; 700+ cloud-connected KOKOpoint stations; same USSD/feature-phone customer edge applied to physical-goods retail.
- systems/bedrock-guardrails-automated-reasoning-checks — Bedrock capability that verifies AI outputs against a customer-authored specification; up to 99% provable accuracy; finance / healthcare / government target industries.
- systems/bedrock-agentcore — AWS agent runtime for mechanically enforcing capability envelopes on agentic systems; the enforcement half of patterns/envelope-and-verify.
- systems/kiro — AWS's specification-driven development tool; flagship surface for agentic coding + formal proof combined.
- systems/lean — interactive theorem prover founded and led by Leo de Moura (at Amazon); named by Cook as the most promising AI-reliability development; DeepSeek combines Lean + RL.
- systems/aws-policy-interpreter — decade of automated-reasoning proof over IAM / Cedar semantics; proofs now extend to agent-generated policy changes.
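The 64-outstanding-IO cap noted under systems/xen can be put in throughput terms with Little's law (throughput = concurrency / latency). The sketch below is only illustrative arithmetic: the 2 ms and 3 ms latencies are the Provisioned IOPS-era figures quoted in the 2024-08-22 entry further down this page, while the 0.5 ms value is an assumed stand-in for "sub-ms" and the function name is hypothetical.

```python
def max_iops(outstanding_ios: int, latency_seconds: float) -> float:
    """Little's law: sustained throughput = concurrency / per-IO latency."""
    if latency_seconds <= 0:
        raise ValueError("latency must be positive")
    return outstanding_ios / latency_seconds


XEN_QUEUE_CAP = 64  # outstanding IOs per host, from the systems/xen entry

# 3.0 and 2.0 ms come from this page; 0.5 ms is an assumed "sub-ms" value.
for latency_ms in (3.0, 2.0, 0.5):
    ceiling = max_iops(XEN_QUEUE_CAP, latency_ms / 1000)
    print(f"{latency_ms} ms per IO -> at most {ceiling:,.0f} IOPS per host")
```

Under these assumptions the cap, not the media, bounds per-host throughput: lowering latency raises the ceiling, but only removing the queue limit removes it.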
Key patterns / concepts
- concepts/queueing-theory — the bank-metaphor framing of IO stacks; why spreading a hot tenant across many spindles widens the blast radius.
- concepts/noisy-neighbor, concepts/performance-isolation — the central quality problem in multi-tenant storage; 15 years of EBS design is iterative variance elimination.
- concepts/hardware-offload — Nitro as a queue-reduction + CPU-reclamation + hypervisor-isolation lever, not just a perf lever.
- concepts/incremental-delivery — "series of incremental improvements over time" as EBS's explicit delivery posture.
- patterns/full-stack-instrumentation, patterns/loopback-isolation — measurement-first engineering at the storage-IO layer.
- patterns/hot-swap-retrofit, patterns/nondisruptive-migration — fleet-upgrade-in-flight primitives; taping SSDs into every HDD server in 2013.
- patterns/peer-debugging — Marc Olson's "I had become the bottleneck" scaling-people shift.
- patterns/pr-faq-writing — Amazon's working-backwards narrative doc practice, as articulated on this blog.
- patterns/customer-driven-prioritization — the S3 team's default feature-selection posture; "almost everything we do has been in direct response to requests from S3 customers."
- patterns/conditional-write — CAS on object storage; S3 GA 2024.
- patterns/durability-review — S3's gated threat-model review for durability-affecting changes.
- patterns/executable-specification — same-language spec checked into repo, continuously validated via property-based testing (ShardStore).
- patterns/data-placement-spreading — place a bucket's objects on disjoint drive sets; one customer's data is a tiny fraction of any one drive.
- patterns/redundancy-for-heat — replicas and EC shards as I/O-steering degrees of freedom, not just durability mechanisms.
- concepts/elasticity — capacity + performance elasticity as S3's core property; scale-to-zero as its compute analogue on Lambda.
- concepts/strong-consistency — S3 read-after-write, Dec 2020; framed as a code-deletion feature.
- concepts/immutable-object-storage — S3's base data model.
- concepts/boundary-as-feature — when two abstractions differ on load-bearing semantics, design an explicit inspectable translation surface rather than a hidden convergence layer; origin lesson from S3 Files (2026).
- concepts/stage-and-commit — file-side changes accumulate, batch-commit to object-side roughly every 60s; term borrowed from git; programmable boundary primitive.
- concepts/file-vs-object-semantics — five axes of asymmetry (mutation granularity, atomicity, authorization, namespace semantics, namespace performance); S3 Files' enumerated design constraints.
- concepts/lazy-hydration — metadata-first, on-read data fetch; makes mount-and-work instantaneous on multi-million-object buckets.
- concepts/agentic-data-access — as agentic coding compresses application lifetimes, storage's role as the decoupled-from-applications stable layer grows; friction between agent and data amplifies into reasoning overhead.
- patterns/presentation-layer-over-storage — S3's 2024-2026 multi-primitive direction (objects + tables + vectors + files); one storage tier, many first-class presentations.
- patterns/explicit-boundary-translation — implementation pattern for boundary-as-feature: asymmetric consistency contracts, declared cadence + conflict policy, visible failure on non-translatable data, programmable surface.
- concepts/heat-management — S3's placement problem: minimize hotspots across millions of drives.
- concepts/hard-drive-physics — HDD capacity grows fast, seek-time flat; ~120 IOPS/drive, 200 TB/drive roadmap → 1 IOPS/2 TB.
- concepts/erasure-coding — Reed-Solomon (k, m) over Parquet/S3; dual-purpose as durability and heat-steering primitive.
- concepts/aggregate-demand-smoothing — millions of bursty tenants aggregate smooth; scale as a quality lever.
- concepts/lightweight-formal-verification — ShardStore's executable-spec approach (SOSP'21); "industrialized" verification.
- concepts/threat-modeling — security-origin; generalized to durability reviews in S3.
- concepts/ownership — Amazon's organizational primitive; "AWS ships its org chart" applied.
- concepts/open-table-format — Iceberg/Delta/Hudi as a class of metadata layer over immutable objects.
- concepts/simplicity-vs-velocity — first-class engineering concept in Warfield's 2025 S3 retrospective.
- concepts/serverless-compute, concepts/scale-to-zero, concepts/fine-grained-billing, concepts/stateless-compute, concepts/cold-start, concepts/micro-vm-isolation — the full serverless architectural vocabulary, as Amazon framed it at Lambda launch and refined over 10 years.
- patterns/launch-minimal-runtime — Node-first Lambda launch strategy.
- patterns/ebpf-header-rewrite-on-egress, patterns/pre-create-all-network-slots-at-boot, patterns/per-slot-iptables-in-namespace, patterns/encapsulate-optimization-as-internal-service — the four canonical patterns disclosed in the 2026-04-22 Lambda networking retrofit: Geneve VNI rewrite via eBPF (150 ms → 200 μs), boot-time 4,000-slot pre-creation (constant-work), per-slot iptables scoped into namespaces (125 k → 144 root rules), and Lambda networking productised as a versioned internal service that Aurora DSQL consumes unchanged.
- concepts/geneve-tunnel-vni, concepts/rtnl-lock-contention, concepts/constant-work-pattern, concepts/double-nat, concepts/stateless-nat-via-ebpf — the specific kernel / Linux concepts named in the 2026-04-22 Lambda networking post; first canonical wiki disclosure of Colm MacCárthaigh's constant-work principle as a named AWS design rule.
- patterns/pilot-component-language-migration — DSQL's Adjudicator-first Rust pilot; 10× TPS result licensed broader rewrite.
- patterns/postgres-extension-over-fork — DSQL's approach to building on Postgres without forking.
- concepts/tail-latency-at-scale — the Marc Brooker "tail at scale" result; forcing function behind DSQL's JVM → Rust move.
- concepts/memory-safety — Rust-over-C rationale for DSQL's Postgres extensions.
- concepts/grey-failure, concepts/monitoring-paradox, concepts/training-serving-boundary — vocabulary named in the SageMaker AI friction-removal post (2025-08-06); grey failure as partial/intermittent degradation (GPU thermal throttle, NIC packet loss); monitoring paradox as the observability stack causing the failure it exists to catch; train/serve boundary as a historical artefact HyperPod's model-deployment collapses.
- patterns/secure-tunnel-to-managed-compute, patterns/auto-scaling-telemetry-collector, patterns/partial-restart-fault-recovery — the three structural patterns packaged by SageMaker AI / HyperPod's 2025 friction-removal release.
- concepts/appropriate-technology — Werner's Oct 2025 "suitable not shiny" doctrine; customer's constraints as the specification; corollary of concepts/simplicity-vs-velocity read from the customer side; invisibility as highest compliment.
- patterns/feature-phone-frontend — thin USSD edge + sophisticated cloud backend; the M-Pesa / Moniepoint / KOKO Networks shape.
- patterns/post-inference-verification — LLM generate → automated-reasoning check → pass / filter; Bedrock Guardrails' automated reasoning checks is the canonical AWS realization.
- patterns/envelope-and-verify — three-part discipline for high-stakes agentic AI: (1) specify the envelope (often temporal), (2) restrict the agent to it via AgentCore, (3) reason about the composition of envelopes against global invariants.
- concepts/automated-reasoning — mechanical proof of system properties against formal specifications; the decade-of-AWS-proof portfolio (policy interpreter, crypto, networking, virtualization, pan-Amazon data flow, ShardStore).
- concepts/neurosymbolic-ai — neural + symbolic composition as Cook's named path to production AI trust; four composition shapes (RL-over-prover, post-inference filter, in-loop tool cooperation, envelope+composition-reasoning).
- concepts/specification-driven-development — specifications as first-class customer-visible artifacts; Kiro + Bedrock Guardrails checks as the productized surfaces; autoformalization as the remaining UX bottleneck.
- concepts/temporal-logic-specification — LTL / CTL / past-time / future-time / epistemic / causal operators; Cook predicts customers will learn and demand these distinctions from spec-driven tools.
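The concepts/stage-and-commit entry above, together with the conflict policy summarised in the 2026-04-07 article entry on this page (object side wins, file-side loser goes to lost+found), can be illustrated with a toy in-memory model. This is a minimal sketch under stated assumptions, not how S3 Files is implemented: the class name, the version counters used for conflict detection, and the dictionary layout are all hypothetical, and the roughly-60-second timer is left out so that the caller invokes `commit()` explicitly.

```python
from dataclasses import dataclass, field


@dataclass
class StageAndCommit:
    """Toy model: file-side edits are staged, then committed in batches."""

    objects: dict = field(default_factory=dict)         # key -> (version, data)
    staged: dict = field(default_factory=dict)          # key -> (base_version, data)
    lost_and_found: dict = field(default_factory=dict)  # key -> losing file-side data

    def put_object(self, key, data):
        """Direct object-side write; bumps the object's version."""
        version = self.objects.get(key, (0, None))[0] + 1
        self.objects[key] = (version, data)

    def write_file(self, key, data):
        """File-side write; remembers the object version it was based on."""
        base = self.staged.get(key, self.objects.get(key, (0, None)))[0]
        self.staged[key] = (base, data)

    def commit(self):
        """Flush staged edits: one put per changed key; object side wins conflicts."""
        puts = 0
        for key, (base, data) in self.staged.items():
            current = self.objects.get(key, (0, None))[0]
            if current != base:
                self.lost_and_found[key] = data  # file-side loser is set aside
            else:
                self.put_object(key, data)
                puts += 1
        self.staged.clear()
        return puts


store = StageAndCommit()
store.write_file("a.txt", "v1")
store.write_file("a.txt", "v2")         # repeated edits still cost one put
store.write_file("b.txt", "draft")
store.put_object("b.txt", "from-api")   # object-side write races the staged edit
puts = store.commit()
print(puts, store.objects["a.txt"], store.lost_and_found)
```

In this model the two edits to `a.txt` collapse into a single put, while the staged `b.txt` edit loses to the object-side write and is kept aside rather than silently dropped.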
Recent articles
- 2026-04-22 — sources/2026-04-22-allthingsdistributed-invisible-engineering-behind-lambdas-network (Werner Vogels' retrospective on the AWS Lambda networking team's decade-long silent retrofit, told through Marc Olson's propeller-to-jet-in-flight metaphor and framed around Aristotle's arete. The decade-long arc: 2019 Firecracker migration cut cold-start >10s→<1s; VPC cold-start still paid ~300 ms for Geneve tunnel + DHCP; 2022 SnapStart launch forced a second network-topology rebuild (clones need pre-created network namespaces); unified topology 2026 scales snapshot networks 200 → 4,000 per worker (20×) with 3-min boot cost and −1% fleet-wide CPU. Specific kernel + eBPF techniques disclosed: (1) eBPF-based Geneve header rewrite (patterns/ebpf-header-rewrite-on-egress, concepts/geneve-tunnel-vni) cut tunnel latency 150 ms → 200 μs (~750×) — tunnels pre-created with dummy VNIs, eBPF rewrites to real VNI once function init produces it, reverses on ingress. Lambda rejected a custom kernel driver to avoid "maintaining Lambda-specific patches upstream indefinitely" (patterns/upstream-the-fix); eBPF chosen over DPDK on lower-overhead axis, with Cilium cited as the at-scale existence proof that de-risked the bet. (2) Stateless NAT via eBPF (concepts/stateless-nat-via-ebpf) replaced the dual-stage stateful iptables + conntrack at 100× lower setup latency. (3) Per-slot iptables moved into each slot's network namespace (patterns/per-slot-iptables-in-namespace) compressed root-namespace rules from 125,000+ to 144 static slot-agnostic rules; rule-traversal cost became constant instead of scaling with slot index (worst case was ~1 ms/packet at slot 4,000). (4) All 4,000 networks pre-created at worker boot (patterns/pre-create-all-network-slots-at-boot) instead of on-demand — canonical wiki instance of Colm MacCárthaigh's constant-work principle from the AWS Builders' Library; the amortization works because worker lifetime ≫ micro-VM lifetime.
(5) RTNL-lock-friendly ordering (concepts/rtnl-lock-contention) — pool namespaces first, create veth inside namespace, batch eBPF attachments — turned parallel-create-4,000-networks from "minutes" back to "seconds." DHCP is still open — "a multi-phase effort the team is currently working through" — so Geneve is the compressed portion of the 300 ms VPC cold-start overhead, not the whole thing. Productization arc: the full Lambda networking stack was encapsulated as an internal service that Aurora DSQL now consumes (patterns/encapsulate-optimization-as-internal-service) — DSQL requests/uses/releases network slots via a Lambda-owned service; Lambda vends new versions and every optimization flows to DSQL automatically, "saved the DSQL team months of engineering effort." Canonical wiki disclosure that DSQL consumes Lambda's networking substrate as a managed internal service, not as a copy of the stack. Thesis: success is silent; "optimizing iptables rules, working around kernel lock contention" doesn't make headlines, but "knowing what you've worked on is better today than it was a week ago, and that the next team won't run into the same constraints you just removed" is the reward. Extends Marc Olson's 2024-08 EBS retrospective voice and complements the Lambda PR/FAQ 10-year retrospective with the operational-scaling counterpart. Credited: Ravi Nagayach, Prashant Singh, Kshitij Gupta, and the entire Lambda networking team.)
- 2026-04-07 — sources/2026-04-07-allthingsdistributed-s3-files-and-the-changing-face-of-s3 (Andy Warfield guest post, introduced by Werner Vogels. Launch of systems/s3-files — NFS mount over any S3 bucket/prefix, backed by EFS, accessible from EC2 / containers / Lambda. Most of the post is the design story: six months of attempted "EFS3" convergence in 2024 produced a "battle of unpalatable compromises"; post-Christmas-2024 the team inverted the goal — the boundary between file and object semantics IS the feature, not a limitation to hide. Origin and canonical articulation of concepts/boundary-as-feature ("we spent months trying to make it disappear, and when we finally accepted it as a first-class element of the system, everything got better"). Architecture: concepts/stage-and-commit translation layer — file-side changes accumulate in EFS, commit back to S3 as one PUT per changed object roughly every 60 seconds; bidirectional sync; conflict policy: S3 wins, filesystem-side loser → lost+found + CloudWatch metric. concepts/lazy-hydration — first access imports S3 metadata as background scan, files < 128 KB co-hydrate data, larger files hydrate on read; 30-day idle eviction keeps active working set proportional. Read bypass reroutes high-throughput sequential reads off NFS to parallel direct-GETs against S3 — 3 GB/s per client, Tbps across many clients. Enumerates five axes of concepts/file-vs-object-semantics asymmetry (mutation granularity / atomicity / auth / namespace / performance) more exhaustively than any prior AWS source. Multiphase-not-concurrent insight: "very few applications use both file and object interfaces concurrently on the same data at the same instant." Known edges called out: rename is O(objects) (warning > 50M objects mount), no programmatic explicit-commit API at launch, some S3 keys aren't valid POSIX filenames.
Multi-primitive lineage: S3 Files is the third new first-class data primitive added to S3 after systems/s3-tables (re:Invent 2024) and systems/s3-vectors (re:Invent 2025), following the patterns/presentation-layer-over-storage pattern. Named framing of concepts/agentic-data-access — as agentic coding compresses application lifetimes, storage's role as the stable data layer grows. Reported scale: 2M+ tables in S3 Tables today, 300B+ event notifications/day from S3, 25M+ req/s to Parquet data alone. 9 months of customer beta shaped the launch edges. Extends concepts/immutable-object-storage with a file-semantics escape hatch that preserves the object invariant rather than weakening it; concepts/simplicity-vs-velocity restated — "stage and commit gives us a surface that we can continue to evolve".)
- 2026-02-17 — sources/2026-02-17-allthingsdistributed-byron-cook-automated-reasoning-trust-ai (Werner Vogels interviews Byron Cook (Amazon Distinguished Scientist + VP) three and a half years after their first automated-reasoning conversation. Thesis: trust is the production blocker for generative + agentic AI, and concepts/neurosymbolic-ai — mechanical theorem provers composed with LLMs — is the path to delivering it. Two enabling forces since 2022: LLMs are now trained over theorem-prover outputs (Isabelle/HOL-light/systems/lean) which dissolves the user-friction barrier; regulated-industry customers (finance/healthcare/government) now have concrete provability demands testing cannot answer. AWS ships systems/bedrock-guardrails-automated-reasoning-checks (up to 99% provable accuracy on AI outputs vs. a customer-supplied specification — realizes patterns/post-inference-verification), and systems/bedrock-agentcore as the runtime that mechanically enforces agent capability envelopes. Together with systems/kiro (spec authoring) these form Cook's three-part patterns/envelope-and-verify: specify the envelope, AgentCore enforces it, automated reasoning proves invariants over the composition. AWS's moat: a decade of proof over the systems/aws-policy-interpreter, cryptography, networking protocols, virtualization layer — and a 2025 pan-Amazon whole-service data-flow analyzer under CISO Amy Herzog reasoning about invariants like "data at rest is encrypted" / "credentials are never logged" — all of which now extends to reasoning about agentic-tool-generated code changes. Cook predicts specification becomes mainstream: customers will discover and demand branching-time vs linear-time, past-time vs future-time, epistemic, and causal operators from spec-driven tools — see concepts/temporal-logic-specification and concepts/specification-driven-development. Autoformalization (natural-language → formal spec) is the UX bottleneck — DARPA expMath is the public research face; Kiro + Guardrails reasoning checks are the product face. Fundamental scaling limit — NP-complete / undecidable — addressed via distributed SAT (mallob) and LLM-guided proof search. Extends concepts/lightweight-formal-verification (S3/ShardStore case) to runtime AI-output verification and organization-wide invariant enforcement; concepts/threat-modeling shape generalizes a third time (security → durability → agent envelopes). Ecosystem: DeepSeek, DeepMind/Google pushing neurosymbolic; new startups Atalanta / Axiom Math / Harmonic.fun / Leibnitz.)
- 2025-08-06 — sources/2025-08-06-allthingsdistributed-removing-friction-sagemaker-ai-development (Werner Vogels surveys four 2025 SageMaker AI capabilities that remove distinct friction points: (1) StartSession API — productizes SSH-over-SSM tunnels into SageMaker Studio spaces, answering SageMaker's #1 feature request, so local VS Code attaches to managed compute without bastion hosts or hand-rolled tunnels (patterns/secure-tunnel-to-managed-compute); (2) HyperPod observability — auto-scaling collectors replace CPU-bound single-threaded ones (patterns/auto-scaling-telemetry-collector), auto-correlate high-cardinality metrics, detect grey failures — GPU thermal throttling, NIC packet loss — not just binary ones (concepts/grey-failure); explicitly framed as an answer to the observability paradox where the monitoring stack itself becomes the failure source (concepts/monitoring-paradox); (3) HyperPod model deployment — train + serve on the same GPU cluster, collapsing the historical training/serving infra boundary (concepts/training-serving-boundary); (4) HyperPod training operator for Kubernetes — restart only affected resources not the whole job (patterns/partial-restart-fault-recovery); monitors stalled batches + non-numeric loss; YAML-defined recovery policies.)
- 2025-05-27 — sources/2025-05-27-allthingsdistributed-aurora-dsql-rust-journey (Werner hosts a guest post by Sr. Principal Engineers Niko Matsakis and Marc Bowes on the engineering journey of systems/aurora-dsql: how they scaled writes without 2PC — single-journal-per-commit plus a novel Crossbar subscription router — and why DSQL moved from 100% JVM / Kotlin to 100% Rust, driven by concepts/tail-latency-at-scale math (40-host simulation: ~6K TPS vs. ~1M target, 10s tail vs. 1s) and concepts/memory-safety economics on new extension code. DSQL uses Postgres via its public extension API rather than forking. Retracts the earlier "Kotlin control plane, Rust data plane" split in favor of unified Rust.)
- 2025-03-14 — sources/2025-03-14-allthingsdistributed-s3-simplicity-is-table-stakes (S3 at 19. Andy Warfield reframes "simple" as a property of the experience, not the API: elasticity, strong consistency, conditional writes, bucket-limit rewrite, SSD/low-latency class, and S3 Tables as the object→table-as-first-class-resource move. Canonical statement that the properties of S3 storage, not the object API, define the system.)
- 2025-02-25 — sources/2025-02-25-allthingsdistributed-building-and-operating-s3 (Andy Warfield's FAST '23 keynote, republished on ATD. The physical/operational counterpart to the 2025-03-14 "simplicity" post. HDD physics — ~120 IOPS/drive flat since 2006, 200 TB drives incoming → 1 IOPS per 2 TB. Heat management as placement problem. Aggregate demand smooths over millions of bursty tenants. Spread placement + redundancy-for-heat → single customer bursts onto 1M+ disks. Org: hundreds of microservices, "AWS ships its org chart." Durability reviews as threat-model for durability changes. ShardStore rewritten in Rust with a ~1%-size executable spec checked into the same repo → lightweight formal verification as an industrialized guardrail, SOSP paper. Ownership as a people-scaling lever — "my best ideas are the ones that other people have instead of me.")
- 2024-11-15 — sources/2024-11-15-allthingsdistributed-aws-lambda-prfaq-after-10-years (The internal PR/FAQ that launched AWS Lambda, re-published at 10 years with annotations — what shipped as written, what evolved, what was deferred. Canonical artefact of Amazon's PR/FAQ doc culture.)
- 2024-08-22 — sources/2024-08-22-allthingsdistributed-continuous-reinvention-block-storage-at-aws (Marc Olson, guest post. 13-year insider retrospective on systems/aws-ebs: queueing theory framing; HDD→SSD (2012 Provisioned IOPS, 1k IOPS / 2-3ms); instrumentation turnaround; the systems/xen ring-default that capped hosts at 64 outstanding IOs; first and second systems/nitro offload cards; systems/srd replaces TCP for storage and becomes systems/ena-express for guests; custom systems/aws-nitro-ssd; the 2013 patterns/hot-swap-retrofit where SSDs were taped into every HDD server with zero disruption; patterns/nondisruptive-migration as a compounding primitive; Olson's personal shift from deep-diving-everything to patterns/peer-debugging leadership. Today: >140T ops/day, sub-ms io2 Block Express latency.)
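The boot-time pre-creation technique in the 2026-04-22 Lambda networking entry above (all 4,000 slots built at worker boot, none on the request path) can be illustrated with a toy pool. This is a hedged sketch, not Lambda's implementation: `SlotPool`, `_create_slot`, and the integer slot IDs are hypothetical, and the real per-slot work (namespaces, veth pairs, eBPF attachments) is replaced by a counter.

```python
from collections import deque


class SlotPool:
    """Toy pool: every slot is created at boot, none on the request path."""

    def __init__(self, size: int) -> None:
        self.setup_calls = 0
        self._free = deque(self._create_slot(i) for i in range(size))

    def _create_slot(self, index: int) -> int:
        # Stand-in for the expensive per-slot network setup (hypothetical).
        self.setup_calls += 1
        return index

    def acquire(self) -> int:
        """Hand out a ready slot; no setup work happens here."""
        if not self._free:
            raise RuntimeError("no free network slots")
        return self._free.popleft()

    def release(self, slot: int) -> None:
        """Return a slot to the pool for reuse."""
        self._free.append(slot)


pool = SlotPool(4000)   # slot count taken from the entry above
first = pool.acquire()
pool.release(first)
second = pool.acquire()
print(pool.setup_calls)  # unchanged by acquire/release
```

The point of the sketch is that the setup count is fixed by pool size at boot and does not grow with request volume, which is the amortization argument the entry makes (worker lifetime far exceeds micro-VM lifetime).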