NETFLIX 2026-02-28


Netflix — Mount Mayhem at Netflix: Scaling Containers on Modern CPUs

Summary

Netflix's migration from its legacy virtual-kubelet + docker container runtime to a modern kubelet + containerd runtime on Titus surfaced a startup-path hang: on r5.metal instances starting pods whose images had 50+ layers, the node's mount table grew so large that reading it alone took 30+ seconds, systemd and containerd both stalled processing mount events, and kubelet health checks timed out.

The root cause is the interaction of three things: (1) the new runtime gives each container its own host user range and uses the kernel's idmap mount feature instead of untar-time UID shifting, so containerd issues open_tree() + mount_setattr() + move_mount() per image layer, twice per container start; (2) those mount syscalls all take a global VFS mount lock in the Linux kernel; (3) on some modern server CPU architectures (Intel mesh-style interconnect with a central Table of Requests, 2-socket NUMA, hyperthreading enabled), contested atomics on that lock pile up behind the central queueing structure, stalling 95.5% of pipeline slots on contested accesses and 57% on false sharing.

Netflix shipped both fixes the ecosystem needed: a software fix (containerd PR #12092) that bind-mounts the common parent directory of all layers instead of each layer individually, collapsing the per-container mount count from O(n) to O(1) in the number of layers, and an operational mitigation that routes affected workloads to CPU architectures that scale better under this access pattern (AMD chiplet-based 7th-gen, or Intel with hyperthreading disabled). The microbenchmark is released at github.com/Netflix/global-lock-bench.

Key takeaways

  1. Per-container mount cost scales linearly with image layer count (doubled by containerd's two traversals), multiplied across all concurrent container starts. "For example, assume a node is starting 100 containers, each with 50 layers in its image. Each container will need 50 bind mounts to do the idmap for each layer. The container's overlayfs mount will be created using those bind mounts as the lower directories, and then all 50 bind mounts can be cleaned up via umount. Containerd actually goes through this process twice, once to determine some user information in the image and once to create the actual rootfs. This means the total number of mount operations on the start up path for our 100 containers is 100 * 2 * (1 + 50 + 50) = 20200 mounts, all of which require grabbing various global mount related locks!" (Source: this post.) Canonical concepts/container-layer-count datum — the global VFS mount lock makes image layer count a first-class performance variable once user-namespace idmap mounts are in play.

  2. The new runtime's security win created the contention. "Old Runtime: All containers shared a single host user range. UIDs in image layers were shifted at untar time, so file permissions matched when containers accessed files. This worked because all containers used the same host user. New Runtime: Each container gets a unique host user range, improving security — if a container escapes, it can only affect its own files. To avoid the costly process of untarring and shifting UIDs for every container, the new runtime uses the kernel's idmap feature." (Source: this post.) The per-container unique host user range is the security improvement; the idmap-mount substitute for untar-time UID-shift is what trades untar cost for mount-count cost. The team did not regress on security — they exposed a kernel/hardware bottleneck that was invisible to the old untar-time flow.

  3. The hot path is a spin loop on mount_lock in path_init(). Netflix's perf record + microbenchmarks showed "the hottest code path was in the Linux kernel's Virtual Filesystem (VFS) path lookup code — specifically, a tight spin loop waiting on a sequence lock in path_init(). The CPU spent most of its time executing the pause instruction, indicating many threads were spinning, waiting for the global lock." (Source: this post.) The x86 pause instruction in the disassembly is the visible signature. Intel TMA breakdown: "95.5% of pipeline slots were stalled on contested accesses (tma_contested_accesses). 57% of slots were due to false sharing (multiple cores accessing the same cache line). Cache line bouncing and lock contention were the primary culprits." (Source: this post.) Canonical wiki datum for concepts/false-sharing + TMA as the diagnostic method.

  4. Hardware architecture is the third variable. Netflix benchmarked three AWS instance families:

| Instance | Generation / vendor | Sockets / NUMA | Observation |
|---|---|---|---|
| r5.metal | 5th-gen Intel | 2-socket, multi-NUMA | fails around 100 concurrent containers |
| m7i.metal-24xl | 7th-gen Intel | 1-socket, 1 NUMA | lower launch latency + higher success; disabling HT gives another 20–30% |
| m7a.24xlarge | 7th-gen AMD | 1-socket, 1 NUMA, chiplet | most consistent scaling, lowest failure rates |

Three orthogonal hardware axes matter: NUMA hop count (cross-socket atomic coherence = latency spike), hyperthreading (two logical threads fighting for shared execution-unit + cache = worse lock-acquire latency), and mesh-interconnect Table-of-Requests vs chiplet-based distributed cache topology. The post documents centralized-cache TOR queueing as the Intel mesh bottleneck and contrasts it with AMD's chiplet CCX+fabric where contention is spread across domains.

  5. A custom microbenchmark isolates the hardware-level signal. Netflix wrote and open-sourced global-lock-bench (systems/netflix-global-lock-bench): spin up N threads contending on a single global lock, sweep thread count, measure latency. This cleanly reproduced the r5.metal cliff + the HT-off win on m7i + the m7a advantage — without needing containerd in the loop. Canonical wiki instance of contended-lock microbenchmark as a hardware-evaluation tool. Method matches the runq-monitor + patterns/dual-metric-disambiguation discipline: isolate the signal from the application.

  6. The software fix is a common-parent bind mount. Working with containerd upstream, two options were on the table: "1. Use the newer kernel mount API's fsconfig() lowerdir+ support to supply the idmap'ed lowerdirs as fd's instead of filesystem paths. This avoids the move_mount() syscall mentioned prior which requires global locks to mount each layer to the mount table. 2. Map the common parent directory of all the layers. This makes the number of mount operations go from O(n) to O(1) per container, where n is the number of layers in the image. Since using the newer API requires using a new kernel, we opted to make the latter change to benefit more of the community." (Source: this post.) Shipped in containerd PR #12092. Eleventh canonical instance of patterns/upstream-the-fix on the wiki and a tidy example of the choice criterion: when two upstream fixes both work, choose the one with the widest blast radius for the user population that needs it, even at the cost of being less elegant. The O(n) → O(1) reduction is canonicalised as patterns/common-parent-bind-mount.

  7. Immediate mitigation: route affected workloads to better-scaling hardware. "For an immediate mitigation we chose to route these workloads to CPU architectures that scaled better under these conditions." (Source: this post.) Until the software fix landed and propagated to every node, Netflix used the Titus scheduler to keep heavy-layer images off r5.metal. Canonical wiki instance of patterns/workload-to-architecture-affinity-routing — the operational complement to patterns/upstream-the-fix when you have to run containers today and patches take weeks to ship.

Systems, concepts, and patterns extracted

Systems

  • systems/netflix-titus (extended) — Netflix's container platform; the mount-mayhem story is its runtime-migration retrospective.
  • systems/containerd (extended) — the container runtime whose per-layer idmap-mount pattern surfaced the VFS lock bottleneck; PR #12092 is the upstream fix.
  • systems/linux-overlayfs — the overlayfs rootfs that each container's bind-mounted idmap'd layers feed as lowerdirs.
  • systems/intel-tma — Intel Topdown Microarchitecture Analysis, the CPU-performance methodology Netflix used to attribute 95.5% of pipeline-slot stalls to contested accesses.
  • systems/netflix-global-lock-bench — the open-sourced microbenchmark that reproduces the hardware signal cleanly.

Concepts

  • concepts/container-layer-count — image layer count as a first-class performance variable once per-layer idmap mounts contend on the global VFS mount lock.
  • concepts/false-sharing — multiple cores bouncing the same cache line; 57% of pipeline slots in Netflix's TMA breakdown.

Patterns

  • patterns/common-parent-bind-mount — when many sibling paths need identical treatment, bind-mount the parent and pick them up by relative path inside; reduces mount-count from O(n) to O(1).
  • patterns/workload-to-architecture-affinity-routing — operational mitigation by routing workloads to CPU architectures that handle their access pattern better, bridging to the software fix.
  • patterns/contended-lock-microbenchmark — synthetic N-threads-on-one-lock benchmark as a hardware-evaluation tool independent of the real workload; global-lock-bench is the canonical instance.
  • patterns/upstream-the-fix (extended, 11th instance) — Netflix's containerd PR #12092 as the canonical container-runtime variant; joins the language-runtime (Netflix JDK 2024-07-29), compiler (Cloudflare Go arm64), library (Fly.io parking_lot / rustls), framework (Meta FFmpeg / libwebrtc), and allocator (Meta jemalloc stewardship) instances.

Operational numbers

| Metric | Value | Source |
|---|---|---|
| Mount table read time (stalled r5.metal) | up to 30 s | body ("reading it alone could take upwards of 30 seconds") |
| Layer-count threshold for symptom | 50+ layers | body ("whose container image contained many layers (50+)") |
| Worked-example mount count | 20,200 mounts / 100 containers / 50 layers | body (formula 100 × 2 × (1 + 50 + 50)) |
| Concurrency cliff on r5.metal | ~100 containers | Figure 3 description |
| TMA pipeline-slot stall fraction | 95.5% contested accesses | body |
| Fraction of slots in false sharing | 57% | body |
| HT-off improvement on m7i.metal-24xl | 20–30% launch-latency reduction | Figure 6 description |
| m7a vs m7i-no-HT consistent gap | ~20% | Figure 9 description |
| Affected instance family | r5.metal (5th-gen Intel, 2-socket) | body |

Caveats

  • Scope is restricted to Titus's high-layer-count image shape. Fleets with shallow images (few layers) don't hit the cliff; Netflix names the 50+ layer threshold explicitly. Teams consuming this post should audit their own image-layer distribution before concluding they need the fix.
  • The root cause is a kernel/hardware interaction, not a bug in containerd. containerd's per-layer idmap-mount logic is correct; the per-layer cost becomes pathological only once concurrency × layer-count × hardware-architecture align. The PR is an optimisation, not a bug fix.
  • The newer fsconfig() API alternative is the cleaner long-term fix. Netflix explicitly flags this — "Since using the newer API requires using a new kernel, we opted to make the latter change to benefit more of the community." Kernels with fsconfig(FSCONFIG_CMD_CREATE_EXCL) + lowerdir+ support can skip move_mount() entirely; the post's fix is the less elegant option chosen for broader kernel compatibility.
  • No quantified post-fix operational numbers disclosed. Flamegraph in Figure 11 shows mount-ops dropped out of the hot path ("we had to highlight them in purple below to see them at all!"), but launch-latency / success-rate deltas after the fix + after architecture-routing mitigation are not published.
  • Specific CPU-vendor TOR + chiplet details are gestured at through generic diagrams. The post cites Intel's DDIO paper for the mesh TOR and ServeTheHome on AMD Genoa for chiplets — concrete Intel / AMD model families are implied (Ice Lake / Sapphire Rapids on mesh; EPYC Genoa for chiplet) but not disclosed as specific part numbers.
  • The new-runtime security improvement is qualitative in the post. "if a container escapes, it can only affect its own files" — true, but no CVE-class escape or quantified exposure reduction is cited to establish the security trade-off magnitude against the startup-latency cost.
  • Order-of-operations reveals containerd's double-traversal. "Containerd actually goes through this process twice, once to determine some user information in the image and once to create the actual rootfs." This doubles the per-container mount count; arguably a second optimisation opportunity the PR doesn't address.
  • Boot timing only — steady-state runtime behaviour is not under this regime. The bottleneck is a startup-path phenomenon; once the overlayfs is constructed and the idmap bind mounts are unmounted, there's no ongoing mount-lock pressure on the container.

Source
