
Zombie memory cgroup

A zombie memory cgroup ("zombie memcg") is a Linux memory cgroup that userspace has destroyed: the container or process that owned it is gone and its directory under /sys/fs/cgroup/memory/ has been removed, yet the kernel keeps the cgroup's in-memory structure alive (and counts it in /proc/cgroups) because references to it remain, typically from page-cache pages charged to the cgroup that have not yet been reclaimed.

Detection signature

Two stock Linux commands produce divergent counts when zombies are present:

# Kernel-tracked memcgs (including zombies)
$ cat /proc/cgroups | grep memory | awk '{print $3}'
68680

# Memcgs actually in use (visible in cgroupfs)
$ find /sys/fs/cgroup/memory/ -type d | wc -l
240

An orders-of-magnitude divergence (Pinterest saw 68,680 vs 240 — ~286× ratio) means something is creating + destroying memcgs faster than the kernel reclaims their state. (Source: sources/2026-04-15-pinterest-finding-zombies-in-our-systems-cpu-bottlenecks)
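
The two commands above can be fused into one hedged health check that flags the divergence automatically; the awk field positions assume the cgroup v1 layout shown above, and the 10x threshold is an arbitrary choice.

# Minimal sketch of a zombie-memcg health check (cgroup v1 assumed)
$ tracked=$(awk '$1 == "memory" {print $3}' /proc/cgroups)
$ visible=$(find /sys/fs/cgroup/memory/ -type d | wc -l)
$ [ "$tracked" -gt $((visible * 10)) ] && echo "WARN: $tracked tracked vs $visible visible memcgs"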

Why it hurts: CPU cost scales with zombie count

Several kernel code paths, notably mem_cgroup_nr_lru_pages() as exercised by kubelet's memory accounting, iterate over every tracked memcg. At zombie-memcg counts in the tens of thousands, a single invocation can pin one CPU core at 100% %sys for multiple seconds. On Pinterest's 96-vCPU GPU hosts this was invisible in whole-machine aggregate CPU metrics but was caught by mpstat -P ALL 1 showing core 39 saturated.
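
Only a per-core view surfaces this; a minimal sketch, where core 39 is just the illustrative number from the incident:

# 1-second per-core samples; look for a single core pegged near 100 in the %sys column
$ mpstat -P ALL 1
# List the threads currently scheduled on the hot core (psr = processor number)
$ ps -eLo pid,tid,psr,pcpu,comm | awk '$3 == 39'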

If an ENA network driver thread happens to be scheduled onto the saturated core during this window, it fails to run for >5 s and the driver self-heals with a device reset: a CPU-starvation-induced network reset, a classic example of the zombie-memcg blast radius.
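
The network half of the blast radius usually leaves a trace in the kernel log; the exact message text varies by ena driver version, so the pattern below is deliberately loose.

# Look for ena driver resets / watchdog trips around the saturation window
$ dmesg -T | grep -iE 'ena.*(reset|watchdog|timeout)'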

Common causes

  • Crash-looping short-lived containers. Each container creation allocates a fresh memcg; short lifetimes plus rapid turnover leak memcgs faster than the kernel's asynchronous reclamation can keep up (see the sampling sketch after this list). The canonical Pinterest example was ecs-agent on Deep Learning AMI hosts, crash-looping because the Kubernetes-hosted instance had no ECS-cluster credentials.
  • Base-image default systemd units doing unintended work. See concepts/base-image-unused-systemd-unit-risk.
  • Workloads with high churn of short-lived cgroup-owning processes can hit the same pathology on any orchestrator.
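
When a crash loop is suspected, sampling the kernel's memcg counter twice over a fixed interval gives a rough leak rate; the 60 s window below is arbitrary.

# Rough memcg leak rate: sample the kernel counter twice, 60 s apart
$ before=$(awk '$1 == "memory" {print $3}' /proc/cgroups)
$ sleep 60
$ after=$(awk '$1 == "memory" {print $3}' /proc/cgroups)
$ echo "memcg growth in 60 s: $((after - before))"
# Sustained growth with no matching growth under /sys/fs/cgroup/memory/ points at cgroup churn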

Partial mitigations

  • Reboot. Resets the kernel-tracked memcg count, and with it the zombie backlog, unconditionally. This is the canonical diagnostic tell: "rebooting fixes it for about a week" fingerprints an accumulated-kernel-state pathology (see Pinterest's ~1-week post-reboot re-onset clock).
  • Cause-side fix. Stop the crash loop: disable the offending systemd unit in the base image and reboot once to purge the backlog (see the sketch after this list). This was Pinterest's actual production fix.
  • Effect-side fix (upstream). Kernel work tracked in blogs.oracle.com/linux/zombie-memcg-issues and in LKML threads aims to make reclamation more aggressive; useful in aggregate, but not a substitute for fixing the producer.
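
A minimal sketch of the cause-side fix referenced above; the unit name ecs is an assumption based on how the ECS agent is typically packaged, so verify it on the host before baking the change into the base image.

# Confirm the crash loop (repeated restarts, non-zero exit codes), then disable the unit
$ systemctl status ecs
$ sudo systemctl disable --now ecs
# One reboot afterwards purges the accumulated zombie memcg backlog
$ sudo reboot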

Seen in

  • sources/2026-04-15-pinterest-finding-zombies-in-our-systems-cpu-bottlenecks — canonical production incident. Three-month joint PinCompute + ML Platform investigation traced Ray training-job crashes on GPU hosts back to ENA driver resets, then to single-core %sys saturation, then (via temporal profiling with Netflix's Flamescope) to kubelet burning 6.5% of total CPU on mem_cgroup_nr_lru_pages — caused by ~70,000 zombie memcgs leaked from a crash-looping ECS agent. Fix: disable the ECS agent systemd unit in the base image + reboot.