SYSTEM Cited by 1 source

Amazon ECS agent (ecs-agent)

The Amazon ECS agent (ecs-agent) is the per-host daemon that joins an EC2 instance to an Amazon ECS cluster and runs containerised tasks on behalf of ECS. It ships as the Docker image amazon/amazon-ecs-agent:latest and is typically run from a systemd unit on ECS-optimised base AMIs.

Why it shows up in non-ECS stacks

The AWS Deep Learning AMI (DLAMI; Ubuntu 20.04 at the time of the Pinterest incident) installs ecs-agent as an active systemd unit, even though many customers run the DLAMI on non-ECS setups (Kubernetes, Ray, raw EC2, SageMaker). On a host that is not an ECS cluster member, the agent has no credentials to join a cluster, so the agent container starts, fails, and exits within seconds. systemd's restart policy then brings it back, and the cycle repeats as a crash loop.

The zombie-memcg coupling

Each ecs-agent restart launches the agent container (docker run under the hood), which creates a fresh memory cgroup (memcg) for that container. The container exits within seconds, but the kernel does not reclaim the memcg immediately if the page cache still holds references to it. These stuck memcgs accumulate as "zombies": they are still counted in /proc/cgroups but no longer appear under /sys/fs/cgroup/memory/ (see concepts/zombie-memory-cgroup).

At Pinterest's GPU-fleet scale, the gap reached ~68,680 tracked memory cgroups versus 240 actually in use after several days of uptime, entirely attributable to the ECS-agent restart loop.
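
The tracked-versus-in-use gap can be measured directly on a cgroup v1 host. What follows is a minimal sketch, not the method Pinterest used. It assumes the memory line in /proc/cgroups carries the kernel's tracked count in its third column (num_cgroups) and that every directory under /sys/fs/cgroup/memory/ is one visible cgroup. The function name count_zombie_memcgs and its optional path arguments are hypothetical, added so the check can also be pointed at fixture files.

```shell
# Sketch: estimate zombie memory cgroups on a cgroup v1 host.
# count_zombie_memcgs and its arguments are illustrative names, not from the source.
count_zombie_memcgs() {
    local proc_cgroups="${1:-/proc/cgroups}"
    local memcg_root="${2:-/sys/fs/cgroup/memory}"
    local tracked visible

    # /proc/cgroups columns: subsys_name hierarchy num_cgroups enabled
    tracked=$(awk '$1 == "memory" { print $3 }' "$proc_cgroups")
    if [ -z "$tracked" ]; then
        echo "no memory controller line in $proc_cgroups" >&2
        return 1
    fi

    # Each directory in the hierarchy, root included, is one visible cgroup.
    visible=$(find "$memcg_root" -type d | wc -l | tr -d ' ')

    echo "tracked=$tracked visible=$visible zombies=$((tracked - visible))"
}
```

Calling count_zombie_memcgs with no arguments reads the live paths. A zombies value that keeps growing between runs while the visible count stays flat matches the pattern described above.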

The shape is diagnostic. docker ps -a consistently shows a single ecs-agent container that was created only seconds ago:

$ docker ps -a
CONTAINER ID   IMAGE                                ...  CREATED          STATUS
c6fdfc760921   amazon/amazon-ecs-agent:latest       ...  11 seconds ago   Up 10 seconds

No container orchestrator (Kubernetes, Ray, etc.) is creating this container; the unconfigured systemd unit is.
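
A single docker ps -a snapshot cannot distinguish a freshly started agent from a crash-looping one, so it can help to sample the container ID over time. The sketch below assumes the agent runs from the amazon/amazon-ecs-agent:latest image shown above and uses docker's ancestor filter to select it. The helper names sample_agent_ids and count_distinct_ids, and the default of five samples 15 seconds apart, are illustrative choices rather than anything from the Pinterest write-up.

```shell
# Sketch: detect a restart loop by watching the agent's container ID change.
# Helper names and default intervals are hypothetical.

# Print the agent container ID once per interval, one ID per line.
sample_agent_ids() {
    local samples="${1:-5}"
    local interval="${2:-15}"
    local i=0
    while [ "$i" -lt "$samples" ]; do
        docker ps -a --filter "ancestor=amazon/amazon-ecs-agent:latest" --format '{{.ID}}'
        sleep "$interval"
        i=$((i + 1))
    done
}

# Read IDs on stdin, ignore blank lines, print the number of distinct IDs.
count_distinct_ids() {
    grep -v '^[[:space:]]*$' | sort -u | wc -l | tr -d ' '
}
```

Usage: sample_agent_ids 5 15 | count_distinct_ids. A result greater than 1 means the container was recreated during the sampling window, which is consistent with the systemd restart loop; a result of 1 means the same container survived every sample.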

Fix (for non-ECS hosts)

Disable the systemd unit in the base image build and reboot to purge any accumulated zombie memcgs:

systemctl disable --now ecs
# rebuild base image or bake change into bootstrap
reboot  # required to clear /proc/cgroups zombies

See patterns/disable-default-systemd-units-in-base-image for the general pattern.
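
A base-image build or bootstrap script can check that the change actually took effect before the host goes into service. The following is a sketch under the assumption that the unit is named ecs, as in the command above. It relies only on the exit codes of systemctl is-enabled and systemctl is-active. The function name ecs_unit_is_off is hypothetical.

```shell
# Sketch: succeed only when the unit is neither enabled nor running.
# ecs_unit_is_off is an illustrative name; the default unit "ecs" matches the command above.
ecs_unit_is_off() {
    local unit="${1:-ecs}"

    # is-enabled exits 0 when the unit would start at boot.
    if systemctl is-enabled --quiet "$unit" 2>/dev/null; then
        echo "$unit is still enabled" >&2
        return 1
    fi

    # is-active exits 0 when the unit is currently running.
    if systemctl is-active --quiet "$unit" 2>/dev/null; then
        echo "$unit is still active" >&2
        return 1
    fi

    return 0
}
```

Usage: ecs_unit_is_off || exit 1 at the end of the image build. The check says nothing about zombie memcgs already accumulated on a running host; those still need the reboot.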

Seen in

  • sources/2026-04-15-pinterest-finding-zombies-in-our-systems-cpu-bottlenecks — canonical wiki disclosure of ecs-agent-as-zombie-memcg-source on Pinterest's Kubernetes GPU fleet. The agent was an unintended inheritance from the Deep Learning AMI's default systemd configuration; the Kubernetes team didn't know it was running until 3 months of debugging pointed at it. One AZ was accidentally spared because an unrelated Kubernetes-bootstrap bug gated the ecs-agent systemd unit on bootstrap-script success.