
Pinterest — Finding zombies in our systems: A real-world story of CPU bottlenecks

Summary

Pinterest's PinCompute (Kubernetes) and ML Platform teams' three-month joint investigation into a production incident where Ray-based distributed ML training jobs on GPU EC2 instances were crashing intermittently with "loss of network connectivity" errors — surfaced in kernel logs as the AWS ENA (Elastic Network Adapter) Linux driver performing a device reset ("TX q 5 is paused for too long") whenever a Tx thread went unscheduled for 5 s.

The team peeled the onion: per-core mpstat revealed one CPU core pinned at 100% %sys for multiple seconds at a time (hidden by aggregate perf); temporal profiling via continuous 2-minute perf record snapshots, correlated with reset timestamps and visualised with Netflix's Flamescope tool, showed kubelet burning 6.5% of total CPU in mem_cgroup_nr_lru_pages just before each reset; /proc/cgroups reported ~68,680 tracked memory cgroups while /sys/fs/cgroup/memory/ held only 240 actually in use — ~70,000 zombie memcgs. Root cause: the AWS Deep Learning AMI ships ecs-agent as a default systemd unit; on Pinterest's Kubernetes hosts (which are not ECS cluster members) the agent crash-loops forever, each crash leaking a memory cgroup. One AZ was unaffected because an unrelated Kubernetes-binary-delivery bug caused the node bootstrap script to fail, which gated the ECS agent from starting — accidentally hiding the same latent misconfiguration. Fix: disable the ECS agent systemd unit in the base image and reboot to purge zombie memcgs. Success rate recovered; Ray training jobs stopped crashing.

Key takeaways

  1. ENA Tx-paused-5 s resets are a CPU-starvation symptom, not a network bug. Any time the ENA driver's Tx thread goes unscheduled for 5 s on any one CPU core, the driver self-heals with a <1 ms reset that may drop packets. The root cause lives with whatever starved the Tx thread of CPU, not in the network stack. See concepts/network-driver-reset + concepts/cpu-starvation-network-driver. (Source: sources/2026-04-15-pinterest-finding-zombies-in-our-systems-cpu-bottlenecks)
  2. Aggregate perf hides single-core CPU saturation. Pinterest's 96-vCPU GPU machines can have one core pinned at 100% %sys while the overall CPU sits at 20% — invisible to a whole-machine perf view but caught by per-second mpstat -P ALL. A single saturated core is enough to starve an unlucky network-driver thread scheduled onto it. See concepts/per-core-cpu-visibility.
  3. Temporal profiling is the right instrument for sporadic CPU spikes. A random perf sample has near-zero chance of catching a rare spike; running perf record -F 97 -g -a -o perf-HOST-TIMESTAMP-120s.data -- sleep 120 in a bash loop for 12 hours and then time-travelling to the flamegraph window around the ENA reset event was what finally exposed the kubelet as the culprit. Flamescope (Netflix) is the visualisation tool. See concepts/temporal-profiling + patterns/continuous-perf-record-for-time-travel.
  4. Zombie memory cgroups are a real and detectable pathology. cat /proc/cgroups | grep memory | awk '{print $3}' reports kernel-tracked memcgs including zombies; find /sys/fs/cgroup/memory/ -type d | wc -l reports actually in use. Orders-of-magnitude divergence (Pinterest: 68,680 vs 240, 286× ratio) means something is creating + deleting memcgs faster than the kernel reclaims state. See concepts/zombie-memory-cgroup.
  5. The host-reboot partial mitigation is the diagnostic signal for accumulated-state pathologies. Reboot-fixes-it-for-about-a-week is the fingerprint of leaked kernel state that resets at boot. The Pinterest team clocked the re-onset at ~1 week of uptime. Any "turn it off and on again" mitigation with that cadence points at the same class of problem.
  6. Base AMI default systemd units are a configuration-drift risk. The AWS Deep Learning AMI sets up ecs-agent as an active systemd unit because it's a reasonable default for ECS users. Pinterest runs Kubernetes; the agent fails to join an ECS cluster (no credentials), crashes, is restarted by systemd, crashes again. Each crash creates a short-lived container → new memory cgroup → leaked. See concepts/base-image-unused-systemd-unit-risk + patterns/disable-default-systemd-units-in-base-image.
  7. Differences between "identical" environments are load-bearing. The AZ disparity looked inexplicable because the Kubernetes team thought the clusters were configured identically. They weren't: an unrelated bug in Kubernetes binary delivery caused one AZ's bootstrap script to fail, which gated the ECS agent from starting, which accidentally prevented zombie-memcg accumulation in that AZ. The "healthy" AZ was masking the problem via a second bug. When two environments diverge in behaviour, look harder for configuration drift.
  8. Reproducible closed debugging environments are worth the overhead. Pinterest reserved a small set of K8s-tainted machines, kicked off hyper-parameter-tuning training as synthetic constant-footprint load, and ran the perf collection loop overnight. The closed environment is what made the 3-month mystery tractable in one overnight run once the right instrument was on. See patterns/reserved-host-repro-env.
  9. Invest in fleet-wide transient-metric collection. The 40% ENA-reset concentration in one AZ was only visible because the ML Platform team emitted per-reset metrics fleet-wide. Without that, every incident would have looked like a random anomaly and the AZ correlation would never have surfaced. This is the ops-prereq for the rest of the investigation.
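
The per-core check in takeaway 2 can be sketched as a parser over `mpstat -P ALL`-style output. The column layout and the `saturated_cores` helper here are illustrative (real sysstat output varies by version and locale); the point is to filter out the "all" aggregate row and flag individual cores:

```python
# Sketch: flag single-core %sys saturation that aggregate CPU metrics hide.
# Parses `mpstat -P ALL 1`-style lines; column layout is illustrative.

def saturated_cores(mpstat_lines, sys_threshold=90.0):
    """Return (cpu_id, %sys) pairs for cores above the threshold."""
    hits = []
    for line in mpstat_lines:
        fields = line.split()
        # Expected columns: time CPU %usr %nice %sys %iowait %irq %soft ...
        if len(fields) < 5 or not fields[1].isdigit():
            continue  # skip headers and the "all" aggregate row
        cpu, sys_pct = fields[1], float(fields[4])
        if sys_pct >= sys_threshold:
            hits.append((int(cpu), sys_pct))
    return hits

sample = [
    "12:00:01  CPU  %usr %nice  %sys %iowait ...",
    "12:00:01  all  18.0  0.0   1.9   0.1 ...",   # aggregate looks benign
    "12:00:01    5   0.0  0.0 100.0   0.0 ...",   # one core pinned in %sys
    "12:00:01    6  22.0  0.0   2.0   0.0 ...",
]
print(saturated_cores(sample))  # → [(5, 100.0)]
```

On a 96-vCPU machine, the aggregate row in the sample sits at ~20% while core 5 is fully saturated — exactly the pattern the post describes.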
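
The continuous-collection loop from takeaway 3 boils down to timestamped 2-minute `perf record` snapshots. A minimal sketch, with `perf_snapshot_cmd` as a hypothetical helper; only the command and filename construction is exercised here, since actually running perf requires Linux and root:

```python
# Sketch of the temporal-profiling loop: 2-minute perf snapshots at 97 Hz,
# timestamped so a later ENA-reset time can be mapped back to the covering
# perf.data file. Only command/filename construction is exercised here.
import datetime
import socket

def perf_snapshot_cmd(host=None, now=None, seconds=120, hz=97):
    host = host or socket.gethostname()
    now = now or datetime.datetime.now(datetime.timezone.utc)
    out = f"perf-{host}-{now:%Y%m%d-%H%M%S}-{seconds}s.data"
    return ["perf", "record", "-F", str(hz), "-g", "-a", "-o", out,
            "--", "sleep", str(seconds)], out

# The 12-hour collection would wrap this in a loop, roughly:
#   while True: subprocess.run(perf_snapshot_cmd()[0])
cmd, out = perf_snapshot_cmd(
    host="gpu-node-1",
    now=datetime.datetime(2025, 3, 1, 4, 30, 10,
                          tzinfo=datetime.timezone.utc))
print(out)  # → perf-gpu-node-1-20250301-043010-120s.data
```

The timestamp in the filename is what makes "time travel" possible: given an ENA reset at a known wall-clock time, the covering snapshot is found by name.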

Operational numbers

  • 96 vCPU cores per GPU machine — enough that aggregate CPU metrics hide single-core saturation.
  • 5 s hard-coded ENA driver Tx-paused threshold → reset.
  • <1 ms ENA reset duration (self-healing mechanism, expected packet loss during the window).
  • 68,680 memory cgroups tracked by kernel (/proc/cgroups) at incident peak.
  • 240 memory cgroups actually in use (find /sys/fs/cgroup/memory/).
  • 286× ratio — kernel-tracked : in-use.
  • 6.5% of total CPU consumed by kubelet a few seconds before each ENA reset (vs <1% baseline); localised to the mem_cgroup_nr_lru_pages kernel function.
  • >25% drop in training-job success rate pre-fix for some Ray workloads.
  • ~1 week uptime post-reboot before ENA resets returned on a rebooted machine.
  • ~3 months total investigation time end to end (early 2025 to mid-2025).
  • 12 hours continuous per-host perf profiling window to catch a reset event.
  • 2 minutes per perf record increment (tuned to keep individual perf.data file sizes manageable).
  • ~70 seconds into a given 2-minute window when the representative ENA reset fired in the repro session → zoomed in on a 5-second Flamescope sub-window.
  • 97 Hz perf sampling frequency (-F 97).
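
The tracked-vs-in-use divergence above can be checked with a short script. A sketch assuming the cgroup v1 layout from the incident, with `tracked_memcgs` as an illustrative helper parsing the `/proc/cgroups` table (columns: subsys_name, hierarchy, num_cgroups, enabled); sample data stands in for the live files:

```python
# Sketch: compare the kernel-tracked memcg count (/proc/cgroups) with the
# directories actually present under /sys/fs/cgroup/memory/ (cgroup v1).
# Sample text stands in for the live files; on a real host, read them.

def tracked_memcgs(proc_cgroups_text):
    # /proc/cgroups columns: subsys_name hierarchy num_cgroups enabled
    for line in proc_cgroups_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "memory":
            return int(fields[2])
    return 0

sample = """#subsys_name\thierarchy\tnum_cgroups\tenabled
cpu\t3\t240\t1
memory\t5\t68680\t1
"""
in_use = 240  # e.g. `find /sys/fs/cgroup/memory/ -type d | wc -l`
tracked = tracked_memcgs(sample)
print(tracked, tracked // in_use)  # → 68680 286
```

An orders-of-magnitude gap between the two counts is the zombie-memcg fingerprint; a healthy host shows a ratio near 1.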

Caveats

  • Incident retrospective voice — the post walks the debugging journey chronologically and so mixes hypotheses ("we speculated that our training jobs were leveraging inefficient memory allocators") with confirmed root causes (the ECS agent crash-loop). The takeaway for readers is that ~6 of the attempted mitigations (TransparentHugePages, jemalloc, taskset CPU affinity, interrupt pinning to other cores) did not fix the real problem — they're cataloged here as attempts, not as canonical mitigations for CPU-starvation-induced ENA resets.
  • No per-workload latency breakdowns disclosed. The post reports "success rate drop >25%" for some use cases but does not break down reset rate by training-job shape, GPU family, or topology.
  • No fleet-wide numbers post-fix disclosed — the post asserts "our Ray Training jobs were running with their expected high success rate again" but does not give a pre/post percentage table.
  • gProfiler mentioned but not Pinterest-wide at the time. Pinterest say they are "developing and rolling out gProfiler in close collaboration with Intel"; the 2025 incident relied on ad-hoc bash for loops around perf record instead. The gProfiler rollout is the ops-hardening outcome, not the incident instrument.
  • AZ disparity "we messed up a little." The confession is that Pinterest's PinCompute team had an unrelated Kubernetes-binary delivery bug in one AZ that accidentally masked the incident. The post does not elaborate on that delivery mechanism (which URL served the k8s binary, what "last step" emitted the gating metric, etc.) — the relevant point is the meta-lesson: two environments looked the same but diverged in behaviour, and "look closer" was the discipline that unstuck the investigation.
  • ENA driver internals are linked but not summarised. The CPU Starvation section of AWS's ENA Linux Best Practices is the load-bearing upstream reference for the 5 s threshold + Tx-thread scheduling expectation.
  • Zombie-memcg upstream fix. The linked Oracle blog (blogs.oracle.com/linux/zombie-memcg-issues) and several LKML threads track kernel-level work to reduce zombie-memcg accumulation. Pinterest's fix was at the cause layer (stop creating them) rather than the effect layer (make the kernel clean up faster) — both are valid; only the first was in scope for a Tier-2 user.

Systems

  • systems/pinterest-pincompute — Pinterest's Kubernetes-based general-purpose compute platform; runs on AWS EC2 with per-AZ zonal clusters; hosts Ray clusters for ML workloads.
  • systems/ray — distributed-compute substrate for ML training and inference at Pinterest; Control-Plane + Data-Plane gRPC traffic makes Ray highly latency-sensitive to network instability.
  • systems/kubernetes — EKS-adjacent container orchestrator whose kubelet agent was the CPU starver (iterating zombie memory cgroups in mem_cgroup_nr_lru_pages).
  • systems/aws-ena-driver — AWS Elastic Network Adapter Linux driver; self-resets on Tx thread CPU starvation; <1 ms per reset with expected packet loss.
  • systems/aws-ecs-agent — Amazon ECS container agent shipped as a default systemd unit in the AWS Deep Learning AMI; the crash-loop offender that leaked memcgs.
  • systems/aws-deep-learning-ami — AWS base image for GPU EC2 instances; the default-systemd-unit carrier in the Pinterest incident.
  • systems/aws-ec2 — underlying compute substrate.
  • systems/flamescope — Netflix-authored temporal-flamegraph visualisation tool; the instrument that localised the 5-second CPU spike to kubelet's mem_cgroup_nr_lru_pages.
  • systems/linux-perf — Linux sampling profiler; the raw instrument behind temporal profiling.
  • systems/mpstat — per-CPU utilisation tool from the sysstat package; revealed that one core was at 100% %sys while aggregate was benign.
  • systems/jemalloc — memory allocator Pinterest tried (unsuccessfully) as a page-faulting-reduction mitigation before reaching the real root cause.

Concepts

  • concepts/zombie-memory-cgroup — memory cgroups that have been destroyed by userspace but retained by the kernel (typically due to deferred page-cache reclamation); cause kubelet mem_cgroup_nr_lru_pages CPU spikes proportional to total zombie count.
  • concepts/cpu-starvation-network-driver — class of incident where a network driver's kernel thread doesn't get CPU time for multiple seconds, triggering driver self-heal (reset) with transient packet loss.
  • concepts/temporal-profiling — continuous profiling with wall-clock-timestamped records so rare events can be "time-travelled to" after the fact; contrast with one-shot / on-demand profiling that must coincide with the event.
  • concepts/per-core-cpu-visibility — triage axis often missed by aggregate CPU metrics; on large-vCPU machines a single core can saturate without moving whole-machine utilisation. mpstat -P ALL 1 is the canonical stock-tool answer.
  • concepts/base-image-unused-systemd-unit-risk — default systemd units in a base OS image doing work the host doesn't need; if they fail in a loop they can accumulate kernel / userspace state.
  • concepts/network-driver-reset — ENA driver self-healing mechanism triggered when hardware or kernel threads fall outside expected liveness windows.
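
The fleet-wide per-reset metrics from takeaway 9 start with spotting the reset in kernel logs. A sketch assuming the "TX q 5 is paused for too long" message quoted in the summary; `ena_reset_events` is an illustrative helper, and real ena driver messages may vary across driver versions:

```python
# Sketch: extract ENA Tx-pause reset events from kernel log lines so they
# can be emitted as fleet-wide metrics. Message text is based on the line
# quoted in the summary; actual wording may differ by driver version.
import re

ENA_PAUSE = re.compile(r"TX q (\d+) is paused for too long")

def ena_reset_events(dmesg_lines):
    """Yield (line_no, tx_queue) for each ENA Tx-pause message."""
    for i, line in enumerate(dmesg_lines):
        m = ENA_PAUSE.search(line)
        if m:
            yield i, int(m.group(1))

sample = [
    "[1234.5] ena 0000:00:05.0 eth0: TX q 5 is paused for too long",
    "[1234.6] ena 0000:00:05.0 eth0: Trigger reset is on",
]
print(list(ena_reset_events(sample)))  # → [(0, 5)]
```

Emitting one metric per matched event, tagged with host and AZ, is what made the 40%-in-one-AZ concentration visible in the first place.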

Patterns

  • patterns/continuous-perf-record-for-time-travel — run short, timestamped perf record snapshots in a loop so rare events can be mapped back to a covering profile window after the fact.
  • patterns/disable-default-systemd-units-in-base-image — audit base-AMI systemd units and disable those the fleet doesn't use, before they can fail in a loop and accumulate state.
  • patterns/reserved-host-repro-env — reserve a small set of tainted hosts with synthetic constant-footprint load to reproduce sporadic incidents under instrumentation.

Source

  • sources/2026-04-15-pinterest-finding-zombies-in-our-systems-cpu-bottlenecks