ALLTHINGSDISTRIBUTED 2026-04-22 Tier 1

All Things Distributed: The invisible engineering behind Lambda's network

Summary

Werner Vogels tells the decade-long story of the AWS Lambda networking team: a silent infrastructure retrofit at jet-in-flight scale, converting Lambda's network topology from the original VPC-path-heavy design to a single unified topology supporting both traditional and SnapStart workloads at 4,000 micro-VMs per worker (20× the launch cap), all while the platform kept serving customer traffic at full scale. The post is notable as a primary-source disclosure of the specific kernel, eBPF, and iptables techniques Lambda used to attack the ~300 ms of VPC cold-start overhead, eliminate the RTNL-lock stall, compress 125,000 iptables rules to 144, and ship the result as a reusable service that Aurora DSQL now consumes unchanged. Architecturally it is an end-to-end case study in constant work, boot-time amortization, and eBPF as an alternative to kernel patching.

Key takeaways

  1. The VPC cold-start bottleneck was Geneve tunnel creation — 300 ms of platform overhead per VPC cold start. The Firecracker migration in 2019 cut Lambda cold-start overhead from >10 s to <1 s, but a VPC cold start still paid ~300 ms to set up the Geneve tunnel routing the micro-VM's traffic to the customer's VPC, plus DHCP. At higher micro-VM densities the team observed tail latency growing from hundreds of ms to seconds; instrumenting the full path isolated tunnel creation itself as the dominant contributor, not tunnel traversal. See concepts/cold-start.
  2. Geneve's load-bearing detail: the Virtual Network Identifier (VNI) is set at tunnel creation time and Linux offers no way to update it afterward. Every Geneve packet carries a VNI, the tunnel is created once with a fixed VNI, and Lambda doesn't know the real VNI until function initialization runs inside the micro-VM — which is too late if you want tunnels pre-created. New concept: concepts/geneve-tunnel-vni.
  3. The in-kernel fix was considered and deliberately rejected: a custom Linux kernel driver would have worked but locked Lambda into "maintaining Lambda-specific patches upstream indefinitely." Tradeoff preference is explicit — keep the kernel on the upstream path, add flexibility in user-space-loaded programs. Extension of patterns/upstream-the-fix: if you won't upstream, prefer eBPF over a fork.
  4. eBPF beat DPDK on the less-overhead / more-control axis — and on the validation-cost axis: Cilium had already cleared the "will eBPF hold up at scale + pass security review" question. Lambda was among the first AWS services to run eBPF in production for a control-plane-shape workload.
  5. The eBPF trick: create tunnels with dummy VNIs while pooled; once the function initializes and the real VNI is known, rewrite Geneve headers on egress (dummy → real) and reverse the mapping on ingress. Result: Geneve tunnel latency 150 ms → 200 μs, a ~750× reduction. Tunnel creation moved off the cold-start hot path entirely. New pattern: patterns/ebpf-header-rewrite-on-egress (a canonical shape for applying packet-header mutations that must be deferred until some later initialization step). See systems/ebpf.
  6. Secondary wins fell out of the same change. The team "had also removed a fundamental blocker for packing more micro-VMs onto each worker, and reduced a source of CPU heat during bursts of cold starts, which improved the platform's ability to absorb traffic spikes and handle scenarios like availability zone evacuations." The pattern is architectural leverage — a latency optimization relaxed density and cross-AZ evacuation as side effects.
  7. DHCP is still open — "a multi-phase effort the team is currently working through." Full transparency that not all 300 ms has been eliminated yet; Geneve is the part that closed. Honest ongoing-work disclosure.
  8. Lambda SnapStart (2022) forced a second topology rebuild. SnapStart clones a pre-initialized execution environment to serve many concurrent invocations; each clone needs its own pre-created isolated network namespace with tap, bridge, veth, and tunnel devices. The original design created these on-demand; SnapStart needed them pre-provisioned. Both topologies initially co-existed on the same workers with the 2,500-slot capacity split 200 snapshot + 2,300 on-demand. The 200 cap was a "deliberate trade-off" — snapshot networks required twice as many Linux network devices per VM, and per-device creation/destruction cost scaled with density.
  9. Scaling the SnapStart topology from 200 → 2,500+ and then to 4,000 slots per host meant pulling four distinct bottlenecks out of the critical path:
    • Linux device-creation cost grows with N: Linux traverses existing device lists for each tap/veth/namespace creation, so creating the N+1-th network costs more than the N-th. At 4,000 networks with constant VM turnover, the overhead "never stopped accumulating." Fix: stop creating networks on demand altogether — pre-create all 4,000 at worker initialization (~3 minutes of boot cost). This is named explicitly as Colm MacCárthaigh's constant work principle — the worker pays the same boot cost whether it serves 3 requests or 3 million. New concept: concepts/constant-work-pattern. New pattern: patterns/pre-create-all-network-slots-at-boot (generalizes patterns/warm-pool-zero-create-path to the network-namespace case).
    • Double NAT on the kernel conntrack table: the original stack performed stateful NAT twice per packet — once in the VM's network namespace, once on the eth0 interface. With thousands of VMs running simultaneously, conntrack contention added significant latency. Fix: stateless packet mangling using eBPF, rewriting headers from predetermined mappings instead of tracking connection state. NAT setup latency dropped 100×. New concepts: concepts/double-nat, concepts/stateless-nat-via-ebpf.
    • 125,000 iptables rules in the root network namespace were evaluated sequentially per packet. This wasn't accumulated cruft; it was a density artifact: ~30 rules × 4,000 slots plus global fixed rules. A packet for slot 0 was processed quickly, while a packet for slot 4,000 walked through thousands of rules (up to ~1 ms of connection-setup latency from rule traversal alone). Fix: move the 30 slot-specific rules into each slot's own network namespace, leaving 144 static, slot-agnostic rules in the root namespace. Per-slot performance skew disappeared. New pattern: patterns/per-slot-iptables-in-namespace.
    • Routing Netlink (RTNL) lock — Linux's single-writer lock for network-config modifications — serialized the parallel creation of 4,000 network devices + namespaces during boot. What should have taken seconds "stretched to minutes." Fix: reorder operations (pool namespaces first, create veth pairs inside the namespace before moving them to root, batch eBPF program attachments for all veth devices in a single operation). Queuing disappeared. New concept: concepts/rtnl-lock-contention.
  10. End state: one unified topology for both traditional and SnapStart workloads, 4,000 snapshot networks per worker (20× the 200 launch cap), all 4,000 created in 3 minutes at worker init, no background CPU drain during invokes, every packet traverses the same 144 iptables rules regardless of slot — and the combined optimizations lowered CPU usage by 1% ("at Lambda's scale, each percent translates to significant infrastructure savings"). The 4,000-slot density figure and the 20× scaling factor are the load-bearing quantified outcomes.
  11. The networking stack was productized for Aurora DSQL. When DSQL needed scalable Firecracker-based networking "with the right security and performance characteristics," the Lambda team encapsulated the full networking stack into a service that DSQL could install and run on their own workers — device management, firewall rules, NAT translation, security hygiene for network-slot reuse. DSQL requests a network when it needs one for a VM and releases it when done; Lambda owns the service and vends new versions, and every optimization Lambda makes flows to DSQL automatically. Disclosed to have "saved the DSQL team months of engineering effort and gave them Lambda-grade networking density from day one." New pattern: patterns/encapsulate-optimization-as-internal-service.
  12. Thesis (framed around Marc Olson's "propeller-to-jet in flight" metaphor and Aristotle's arete): the work doesn't produce press releases. "Success is silent. The reward is knowing what you've worked on is better today than it was a week ago, and that the next team won't run into the same constraints you just removed." Continuity of Marc Olson's EBS retrospective voice (2024) on companies/allthingsdistributed; complements the quantified side of Lambda's story in the Lambda PR/FAQ.
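
The VNI mechanics in takeaways 2 and 5 can be sketched in a few lines. This is a Python model of the rewrite logic, not Lambda's actual eBPF program; the 24-bit VNI sits at bytes 4–6 of the 8-byte Geneve base header (RFC 8926), and `DUMMY_VNI` is an invented placeholder value:

```python
# Model of the dummy->real VNI header rewrite described in the post.
# In production this runs as eBPF in the kernel datapath; here it is plain
# Python over a bytearray standing in for the Geneve base header.

DUMMY_VNI = 0xABCDE  # placeholder VNI baked in when the tunnel is pre-created


def read_vni(pkt: bytearray) -> int:
    """Extract the 24-bit VNI from bytes 4-6 of a Geneve base header."""
    return (pkt[4] << 16) | (pkt[5] << 8) | pkt[6]


def write_vni(pkt: bytearray, vni: int) -> None:
    """Overwrite the 24-bit VNI field in place."""
    pkt[4] = (vni >> 16) & 0xFF
    pkt[5] = (vni >> 8) & 0xFF
    pkt[6] = vni & 0xFF


def on_egress(pkt: bytearray, real_vni: int) -> None:
    # Once function init has revealed the customer's VNI: dummy -> real.
    if read_vni(pkt) == DUMMY_VNI:
        write_vni(pkt, real_vni)


def on_ingress(pkt: bytearray, real_vni: int) -> None:
    # Reverse mapping so the pre-created tunnel device still matches.
    if read_vni(pkt) == real_vni:
        write_vni(pkt, DUMMY_VNI)
```

The point of the shape: the tunnel device keeps its immutable dummy VNI forever, and only the bytes on the wire are mutated, which is why tunnel creation can happen before the real VNI exists.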
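
The constant-work argument in takeaway 9's first bullet reduces to a simple cost model. Assuming (as the post implies) that creating the (n+1)-th device costs roughly proportional to the n devices already present, on-demand creation with constant VM churn pays the worst-case cost forever, while pre-creation pays the quadratic sum exactly once at boot. The linear cost function is a simplification for illustration:

```python
# Toy cost model for "device creation cost grows with N".
N = 4000  # network slots per worker


def create_cost(existing: int) -> int:
    # Simplified: traversing existing device lists costs ~1 unit per device.
    return existing


# Paid once, at worker initialization (~3 minutes of boot cost in the post).
boot_cost = sum(create_cost(i) for i in range(N))

# What on-demand creation would keep paying per VM at full density, forever.
per_create_at_capacity = create_cost(N)
```

The worker's work is now constant: it pays `boot_cost` whether it goes on to serve 3 requests or 3 million, and the invoke path does zero device creation.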
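
The stateless-NAT idea in takeaway 9's second bullet can be modeled as a precomputed bijection. The post says headers are rewritten "from predetermined mappings instead of tracking connection state"; this sketch invents an address scheme purely for illustration (the real rewrite happens in eBPF, and none of these addresses or names are Lambda's):

```python
# Stateless NAT sketch: all translation decided once at worker init, so the
# per-packet path is a read-only lookup with no conntrack table, no entry
# insertion, and no lock contention across thousands of VMs.

def build_mappings(n_slots: int) -> dict:
    """Computed once at init: slot -> (inside_ip, outside_ip). Invented scheme."""
    return {
        slot: (f"169.254.{slot >> 8}.{slot & 0xFF}", f"10.0.{slot >> 8}.{slot & 0xFF}")
        for slot in range(n_slots)
    }


MAPPINGS = build_mappings(4000)
REVERSE = {outside: inside for inside, outside in MAPPINGS.values()}


def snat(slot: int, src_ip: str) -> str:
    # Egress: deterministic rewrite, same answer for every packet of a flow.
    inside, outside = MAPPINGS[slot]
    return outside if src_ip == inside else src_ip


def unsnat(dst_ip: str) -> str:
    # Ingress: reverse lookup, also stateless.
    return REVERSE.get(dst_ip, dst_ip)
```

Because the mapping is fixed, there is no per-connection state to create or tear down, which is where the disclosed 100× NAT-setup-latency drop comes from.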
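
The iptables arithmetic in takeaway 9's third bullet is worth making explicit. Using the post's figures (~30 slot-specific rules, 4,000 slots, 144 static root rules; the post's ~125,000 total suggests the pre-move global rule count differed somewhat from 144):

```python
# Rule-traversal arithmetic behind the 125,000 -> 144 compression.
SLOTS = 4000
PER_SLOT_RULES = 30      # slot-specific rules, per the post
ROOT_STATIC_RULES = 144  # slot-agnostic rules left in the root namespace

# Before: every slot's rules sat in the root namespace and were evaluated
# sequentially, so the worst-placed slot's packets walked nearly all of them.
root_rules_before = PER_SLOT_RULES * SLOTS + ROOT_STATIC_RULES

# After: each slot's 30 rules moved into that slot's own network namespace,
# so every packet sees the same constant rule count regardless of slot.
worst_case_after = ROOT_STATIC_RULES + PER_SLOT_RULES
```

This is why the fix eliminates per-slot skew rather than just lowering the average: traversal cost stops being a function of slot index.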
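
The DSQL contract in takeaway 11 is described only as "requests a network when it needs one for a VM and releases it when done." A purely hypothetical Python shape of that request/release interface, with every name invented (the post does not disclose the service API):

```python
# Hypothetical sketch of the encapsulated networking service's contract.
# The consumer (e.g. DSQL) never creates devices itself; it borrows from a
# pool the service pre-created at worker init.
from dataclasses import dataclass, field


@dataclass
class NetworkSlot:
    slot_id: int
    # In reality: handles to the slot's tap/bridge/veth/tunnel devices.


@dataclass
class NetworkService:
    """Owns the pre-created pool of 4,000 slots; vended as a versioned service."""
    free: list = field(default_factory=lambda: [NetworkSlot(i) for i in range(4000)])

    def request(self) -> NetworkSlot:
        # Hand out a pre-created slot; no device creation on this path.
        return self.free.pop()

    def release(self, slot: NetworkSlot) -> None:
        # Per the post, security hygiene for network-slot reuse happens
        # inside the service before the slot is pooled again.
        self.free.append(slot)
```

The design point the post stresses: because the service owns the pool, every optimization Lambda ships flows to consumers like DSQL without any change on their side.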

Operational numbers disclosed

| Metric | Before | After |
|---|---|---|
| Cold-start overhead (Firecracker migration) | >10 s | <1 s |
| VPC cold-start overhead from Geneve tunnel + DHCP (pre-eBPF) | ~300 ms | Geneve portion reduced (next row); DHCP still open |
| Geneve tunnel creation latency | 150 ms | 200 μs (~750×) |
| NAT setup latency (iptables conntrack → stateless eBPF) | baseline | 100× lower |
| iptables rules in root network namespace | >125,000 | 144 |
| Per-packet iptables traversal cost (worst slot) | up to ~1 ms | constant (144 rules) |
| Snapshot networks per worker | 200 | 4,000 (20×) |
| Worker network pre-creation | on demand during invoke | ~3 minutes at boot |
| Platform CPU usage | baseline | −1% fleet-wide |

All numbers attributable to this post (2026-04-22) unless otherwise marked.

Caveats

  • Voice: Werner-narrated retrospective with first-person engineer quotes (Ravi Nagayach, Prashant Singh, Kshitij Gupta). Not a post-mortem — no specific incident is dissected; the framing is continuous optimization, not recovery from failure.
  • Mechanism detail is ~70% present. Named: eBPF, Geneve, VNI, DPDK, iptables, conntrack, RTNL lock, tap/veth/bridge/namespace/tunnel primitives. Not disclosed: the exact eBPF program type (TC / XDP / cgroup), the eBPF map topology used for VNI mapping, the batch-attach API used, the namespace-scoping rule set for iptables, or the service API contract the DSQL team consumes.
  • DHCP is open work. The 300 ms VPC cold-start overhead the Geneve fix targeted was a composite — Geneve is compressed, DHCP is "a multi-phase effort the team is currently working through." Don't read "VPC cold start solved" — read "Geneve portion of VPC cold start solved."
  • Time-to-ship not disclosed. "Better part of a decade" is the only stated duration for the whole arc (2019 Firecracker → 2022 SnapStart → unified topology).
  • 1% CPU saving is a fleet-level claim; per-function impact isn't quantified. "At Lambda's scale, each percent translates to significant infrastructure savings" is the stated justification; specific dollar or MW figures aren't disclosed.
  • DSQL-as-internal-customer is a single production instance of the encapsulation pattern; the post doesn't enumerate other internal AWS consumers. Generalizability is asserted via the "every optimization … flows to DSQL automatically" line, not via scale-out disclosure.
  • SnapStart topology split of 2,500 → 200 snapshot + 2,300 on-demand was intentional and temporary; the post frames the 200 cap as the launch-readiness knob, not a limit discovery.

Relationship to existing wiki content

  • Canonical primary source for eBPF-in-Lambda-networking. Prior wiki framing of eBPF (Datadog / GitHub / Cilium / Cloudflare DDoS) is observability / security / DDoS-side. This is the first data-plane packet-rewrite instance from AWS that the wiki has.
  • Extends sources/2024-11-15-allthingsdistributed-aws-lambda-prfaq-after-10-years: the PR/FAQ disclosed the single-tenant-EC2 → Firecracker density arc; this post discloses the density-unlock-requires-network-topology-rebuild follow-on arc that made 4,000-slot per-worker density actually operationally viable.
  • Extends sources/2025-05-27-allthingsdistributed-aurora-dsql-rust-journey: previous wiki framing treated DSQL's Firecracker dependency as "same family, not specifically invoked." This post provides the missing connection — DSQL consumes Lambda's networking substrate as a managed internal service, not as a copy of the stack.
  • Confirms Colm MacCárthaigh's constant-work principle as a named AWS design rule, with Lambda's boot-time network pre-creation as a canonical instance. Prior wiki references were analogical only.
  • Complements patterns/warm-pool-zero-create-path: Fly.io's warm-Machine pool (2026-01-14 Sprites post) and Lambda's 4,000-network pre-create are two instances of the same pattern applied to different substrate layers (VM creation vs. network-slot creation).
  • Does not contradict any existing claim. Strictly additive.
