Amazon ECS agent (ecs-agent)¶
The Amazon ECS agent (amazon/amazon-ecs-agent:latest) is the
per-host daemon that joins an EC2 instance to an Amazon
ECS cluster and runs containerised tasks on behalf of ECS. It is
shipped as the Docker image amazon/amazon-ecs-agent and typically
installed as a systemd unit on ECS-optimised base AMIs.
Why it shows up in non-ECS stacks¶
The AWS Deep Learning AMI
(Ubuntu 20.04 at the time of the Pinterest incident) installs
ecs-agent as an active systemd unit even though the DLAMI is
used by many customers on non-ECS orchestrators (Kubernetes, Ray,
raw EC2, SageMaker). If the host is not actually an ECS cluster
member, the agent has no credentials to join a cluster, so the
agent container exits within seconds of starting. systemd's restart
policy brings it back, and the result is a crash loop.
The zombie-memcg coupling¶
Each ecs-agent restart starts the agent container (docker run
under the hood), which creates a fresh memory cgroup for the
container. The container exits in seconds. The kernel does not
immediately reclaim the memcg if the page cache still holds
references to it. These stuck memcgs accumulate as "zombies":
still counted in /proc/cgroups but absent from /sys/fs/cgroup/memory/
(see concepts/zombie-memory-cgroup).
At Pinterest's GPU-fleet scale: ~68,680 tracked memory cgroups vs 240 actually in use after several days of uptime, entirely attributable to the ECS-agent restart loop.
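The gap between tracked and in-use memcgs can be estimated directly on a host. The following is a minimal sketch, assuming cgroup v1 with the memory controller mounted at /sys/fs/cgroup/memory; the function names (memcg_tracked, memcg_visible, memcg_zombies) are hypothetical helpers, not part of any tool named in the source.

```shell
#!/bin/sh
# Sketch only: assumes cgroup v1 and the default memory-controller mount point.

# Print the num_cgroups column for the "memory" subsystem.
# /proc/cgroups columns: subsys_name hierarchy num_cgroups enabled
memcg_tracked() {
    awk '$1 == "memory" { print $3 }' "${1:-/proc/cgroups}"
}

# Count the cgroup directories (root included) that are visible in cgroupfs.
memcg_visible() {
    find "${1:-/sys/fs/cgroup/memory}" -type d | wc -l | tr -d ' '
}

# Approximate zombie count: tracked by the kernel minus visible in cgroupfs.
memcg_zombies() {
    echo $(( $1 - $2 ))
}
```

Running `memcg_zombies "$(memcg_tracked)" "$(memcg_visible)"` on an affected host should report a large positive number; with the figures quoted above (68,680 tracked, 240 in use) the difference is roughly 68,440. On a healthy host the two counts should be close to each other.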
The pattern is diagnostic. docker ps -a consistently shows a
single ecs-agent container that was created only seconds ago:
$ docker ps -a
CONTAINER ID IMAGE ... CREATED STATUS
c6fdfc760921 amazon/amazon-ecs-agent:latest ... 11 seconds ago Up 10 seconds
No container orchestrator (Kubernetes, Ray, etc.) is creating this — it's the unconfigured systemd unit.
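One way to confirm the restart loop is to sample the agent container a few times and check that the container ID keeps changing while its age stays under a minute. The sketch below assumes the image tag shown in the docker ps output above; sample_agent, created_seconds_ago, and distinct_ids are hypothetical helper names.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative, not from the source.

# Print "ID age" for every container created from the agent image.
sample_agent() {
    docker ps -a --filter "ancestor=amazon/amazon-ecs-agent:latest" \
        --format '{{.ID}} {{.RunningFor}}'
}

# Return 0 if a docker age string (e.g. "11 seconds ago") is under a minute.
created_seconds_ago() {
    case "$1" in
        *second*) return 0 ;;
        *) return 1 ;;
    esac
}

# Count distinct container IDs in "ID age" lines read from stdin.
distinct_ids() {
    awk '{ print $1 }' | sort -u | wc -l | tr -d ' '
}
```

Collecting the output of sample_agent every 30 seconds or so into a file and piping it through distinct_ids should yield roughly one ID per sample if the unit is crash-looping. On systemd versions that expose it, `systemctl show ecs -p NRestarts` may give the same signal as a single counter.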
Fix (for non-ECS hosts)¶
Disable the systemd unit in the base image build and reboot to purge any accumulated zombie memcgs:
systemctl disable --now ecs
# rebuild base image or bake change into bootstrap
reboot # required to clear /proc/cgroups zombies
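An image build can guard against the unit being re-enabled by a later AMI update. The following is a minimal sketch, assuming the unit is named ecs as in the command above; unit_is_off is a hypothetical helper that interprets the state string printed by systemctl is-enabled.

```shell
#!/bin/sh
# Sketch only: unit_is_off is an illustrative helper, not from the source.

# Return 0 when a `systemctl is-enabled` state string means the unit
# will not start at boot.
unit_is_off() {
    case "$1" in
        disabled|masked) return 0 ;;
        *) return 1 ;;
    esac
}
```

A build step could then run `unit_is_off "$(systemctl is-enabled ecs 2>/dev/null)" || exit 1` to fail the bake whenever the unit is still enabled.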
See patterns/disable-default-systemd-units-in-base-image for the general pattern.
Seen in¶
- sources/2026-04-15-pinterest-finding-zombies-in-our-systems-cpu-bottlenecks
— canonical wiki disclosure of ecs-agent-as-zombie-memcg-source
on Pinterest's Kubernetes GPU fleet. The agent was an unintended
inheritance from the Deep Learning AMI's default systemd
configuration; the Kubernetes team didn't know it was running
until 3 months of debugging pointed at it. One AZ was
accidentally spared because an unrelated Kubernetes-bootstrap
bug gated the ecs-agent systemd unit on bootstrap-script success.