Zombie memory cgroup¶
A zombie memory cgroup ("zombie memcg") is a Linux memory cgroup
that userspace has destroyed (the container or process that owned
it is gone, and its directory under /sys/fs/cgroup/memory/ has been
removed) but that the kernel still counts in /proc/cgroups, because
kernel objects still hold references to it, typically page-cache
pages charged to the cgroup that have not yet been reclaimed.
Detection signature¶
Two stock Linux commands produce divergent counts when zombies are present:
# Kernel-tracked memcgs (including zombies)
$ cat /proc/cgroups | grep memory | awk '{print $3}'
68680
# Memcgs actually in use (visible in cgroupfs)
$ find /sys/fs/cgroup/memory/ -type d | wc -l
240
An orders-of-magnitude divergence (Pinterest saw 68,680 vs 240 — ~286× ratio) means something is creating + destroying memcgs faster than the kernel reclaims their state. (Source: sources/2026-04-15-pinterest-finding-zombies-in-our-systems-cpu-bottlenecks)
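The divergence check can be scripted. A minimal sketch, assuming the standard /proc/cgroups column layout (subsys_name, hierarchy, num_cgroups, enabled); the sample text below mimics the Pinterest numbers and is not real host output:

```python
def kernel_tracked_memcgs(proc_cgroups_text: str) -> int:
    """Return the num_cgroups column for the memory controller.

    /proc/cgroups columns: subsys_name  hierarchy  num_cgroups  enabled
    """
    for line in proc_cgroups_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "memory":
            return int(fields[2])
    raise ValueError("memory controller not listed in /proc/cgroups")

# Illustrative sample; on a live host read open("/proc/cgroups").read().
sample = "#subsys_name\thierarchy\tnum_cgroups\tenabled\nmemory\t5\t68680\t1\n"
tracked = kernel_tracked_memcgs(sample)
visible = 240  # from: find /sys/fs/cgroup/memory/ -type d | wc -l
print(tracked, round(tracked / visible))  # 68680 286
```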
Why it hurts: CPU cost scales with zombie count¶
Several kernel code paths, notably mem_cgroup_nr_lru_pages as
exercised by kubelet's memory accounting, iterate over all tracked
memcgs. At zombie-memcg counts in the tens of thousands, a single
invocation can pin one CPU core at 100% %sys for multiple seconds.
On Pinterest's 96-vCPU GPU hosts this was invisible in whole-machine
aggregate CPU metrics but was caught by mpstat -P ALL 1, which
showed core 39 saturated.
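To make the invisibility concrete, a toy calculation (the 2% baseline per core is my assumption; the 96-vCPU host and core 39 are from the incident) showing how one saturated core barely moves the machine-wide average:

```python
def machine_avg(per_core_sys: list[float]) -> float:
    """Whole-machine %sys is the mean across cores, which dilutes one hot core."""
    return sum(per_core_sys) / len(per_core_sys)

cores = [2.0] * 96   # assumed 2% baseline %sys per core (illustrative)
cores[39] = 100.0    # core 39 pinned at 100% %sys, as in the incident
print(round(machine_avg(cores), 2))  # 3.02 -- looks healthy in aggregate
```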
If an ENA network driver thread happens to be scheduled onto the saturated core during this window, it fails to run for more than 5 s and the driver self-heals with a device reset: a CPU-starvation-induced network reset, the classic blast-radius pathway for zombie memcgs.
Common causes¶
- Crash-looping short-lived containers. Each container creation allocates a fresh memcg; short lifetimes plus rapid turnover leak memcgs faster than the kernel's asynchronous reclamation keeps up. The canonical Pinterest example was ecs-agent on Deep Learning AMI hosts, crash-looping because the Kubernetes-hosted instance had no ECS-cluster credentials.
- Base-image default systemd units doing unintended work. See concepts/base-image-unused-systemd-unit-risk.
- Workloads with high churn of short-lived cgroup-owning processes can hit the same pathology on any orchestrator.
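Back-of-envelope from the incident numbers (~70,000 zombies accumulating over the roughly one-week re-onset window), sketched as a toy leak model; the hourly rate is derived from those two figures, not measured:

```python
def zombie_backlog(create_per_hour: float, reclaim_per_hour: float,
                   hours: float) -> float:
    """Net zombie accumulation when creation outpaces async reclamation."""
    return max(0.0, (create_per_hour - reclaim_per_hour) * hours)

# ~70,000 zombies over ~168 h implies a net leak of ~417 memcgs/hour.
net_rate = 70000 / 168
print(round(net_rate))                          # 417
print(round(zombie_backlog(net_rate, 0, 168)))  # 70000
```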
Partial mitigations¶
- Reboot. Clears the kernel-tracked memcg backlog in /proc/cgroups unconditionally. This is the canonical diagnostic tell: "rebooting fixes it for about a week" fingerprints an accumulated-kernel-state pathology (see Pinterest's ~1-week post-reboot re-onset clock).
- Cause-side fix. Stop the crash-loop, disable the offending systemd unit in the base image, and reboot once to purge the backlog. This was Pinterest's actual production fix.
- Effect-side fix (upstream). Kernel work, tracked in blogs.oracle.com/linux/zombie-memcg-issues and LKML threads, aims to make reclamation more aggressive; useful in aggregate but not a substitute for fixing the producer.
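A host-level check run between reboots could gate on the tracked/visible ratio. A sketch, where the 10x threshold is my assumption rather than anything from the source:

```python
def zombies_suspected(tracked: int, visible: int,
                      ratio_threshold: float = 10.0) -> bool:
    """Flag a host whose kernel-tracked memcg count dwarfs its cgroupfs dirs."""
    return visible > 0 and tracked / visible >= ratio_threshold

print(zombies_suspected(68680, 240))  # True: the ~286x Pinterest case
print(zombies_suspected(250, 240))    # False: normal bookkeeping slack
```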
Seen in¶
- sources/2026-04-15-pinterest-finding-zombies-in-our-systems-cpu-bottlenecks
— canonical production incident. Three-month joint PinCompute + ML
Platform investigation traced Ray training-job crashes on GPU hosts
back to ENA driver resets, then to single-core
%sys saturation, then (via temporal profiling with Netflix's FlameScope) to kubelet burning 6.5% of total CPU in mem_cgroup_nr_lru_pages, caused by ~70,000 zombie memcgs leaked from a crash-looping ECS agent. Fix: disable the ECS agent systemd unit in the base image + reboot.
Related¶
- concepts/linux-cgroup — parent mechanism
- concepts/cpu-starvation-network-driver — dominant blast-radius pathway
- concepts/base-image-unused-systemd-unit-risk — common upstream cause
- systems/aws-ecs-agent — the Pinterest-incident producer
- systems/kubernetes — the kubelet-as-consumer side