CONCEPT Cited by 1 source
Base-image unused-systemd-unit risk¶
Cloud base images (AMIs, Docker base images, distro golden images) typically ship with pre-configured systemd units active by default because the image vendor has to serve the most common use case. A unit that is a perfect default for one deployment shape can crash silently in a loop on another deployment shape — and each crash can leak kernel or userspace state that accumulates into a production-impacting pathology over days or weeks.
The anti-pattern¶
A vendor ships an AMI with systemd unit X.service enabled and
started on boot. The unit is correct for the vendor's reference
workload (ECS host, Datadog agent target, specific container
runtime). The customer boots that AMI in a different orchestrator
(Kubernetes, Nomad, raw EC2) where X.service has no valid
environment to operate in:
- No credentials / endpoints to register against.
- No control-plane it is intended to report to.
- No upstream dependency that would normally provide its input.
systemd starts it, it fails, systemd restarts it, it fails again. Each restart may allocate short-lived kernel, container, cgroup, or namespace resources. Even a well-behaved crash loop with a few seconds between restarts can leak enough of those resources over days of uptime to matter. (Source: sources/2026-04-15-pinterest-finding-zombies-in-our-systems-cpu-bottlenecks)
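A back-of-the-envelope sketch in Python shows how quickly a slow loop adds up. The restart interval and uptime below are hypothetical inputs, not figures from the source, and treating every restart as one leaked resource is a worst-case assumption.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def restarts_over_uptime(restart_interval_s: float, uptime_days: float) -> int:
    """Count the restart attempts a crash-looping unit makes over a given uptime."""
    if restart_interval_s <= 0:
        raise ValueError("restart_interval_s must be positive")
    return int(uptime_days * SECONDS_PER_DAY // restart_interval_s)


# Hypothetical inputs: one restart every 10 seconds, sustained for 7 days.
attempts = restarts_over_uptime(restart_interval_s=10, uptime_days=7)
print(attempts)  # 60480
```

If each attempt leaks even one object that the kernel reclaims lazily, a host can carry tens of thousands of stale objects within a week.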
The Pinterest instance — ecs-agent on Deep Learning AMI¶
The canonical wiki case:
- Base image: AWS Deep Learning AMI (Ubuntu 20.04).
- Default systemd unit: `ecs.service`, starting amazon/amazon-ecs-agent via Docker.
- Deployment context: Pinterest's Kubernetes cluster (PinCompute) on GPU EC2 hosts, which are not ECS cluster members.
- Failure: ecs-agent has no ECS credentials, fails to join a cluster, exits in seconds. systemd restarts it. Repeat for days.
- Leaked state: each ecs-agent container spawn allocated a new memory cgroup; deferred kernel reclamation accumulated zombie memcgs. 68,680 tracked vs 240 in use after several days of uptime.
- Blast radius: kubelet's `mem_cgroup_nr_lru_pages` iterated all 68,680 → one CPU core pinned at 100% `%sys` for seconds → AWS ENA driver Tx thread starved beyond 5 s → device reset → packet drops → Ray training job crashes.
- Fix: disable `ecs.service` in the base-image bootstrap and reboot to purge the accumulated zombies.
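The tracked-versus-in-use gap can be estimated on a host with a short Python sketch. It assumes cgroup v1, where `/proc/cgroups` reports a `num_cgroups` column per controller, and it assumes that this count includes memory cgroups that were removed but not yet reclaimed. The function names are hypothetical. The sample text reuses the two counts quoted above; the hierarchy IDs and the `cpu` row are made up for illustration.

```python
import os


def tracked_memcgs(proc_cgroups_text: str) -> int:
    """Return num_cgroups for the memory controller from /proc/cgroups content."""
    for line in proc_cgroups_text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip the header and blank lines
        fields = line.split()
        # Columns: subsys_name, hierarchy, num_cgroups, enabled
        if fields[0] == "memory":
            return int(fields[2])
    raise ValueError("memory controller not found")


def in_use_memcgs(root: str = "/sys/fs/cgroup/memory") -> int:
    """Count memory cgroup directories still visible in the cgroup v1 filesystem."""
    return sum(1 for _ in os.walk(root))


def zombie_memcgs(tracked: int, in_use: int) -> int:
    """Estimate cgroups the kernel still tracks that no longer have a directory."""
    return max(tracked - in_use, 0)


# Illustrative sample only; on a real host, read /proc/cgroups instead.
sample = "#subsys_name\thierarchy\tnum_cgroups\tenabled\ncpu\t3\t240\t1\nmemory\t9\t68680\t1\n"
print(zombie_memcgs(tracked_memcgs(sample), in_use=240))  # 68440
```

On a live host, `tracked_memcgs(open("/proc/cgroups").read())` and `in_use_memcgs()` would supply the two inputs. A large and growing difference between them is the signal worth alerting on.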
The AZ-disparity confession¶
Pinterest's investigation was complicated by one AZ not showing the bug. They eventually discovered that an unrelated Kubernetes-binary delivery bug in that AZ caused the node bootstrap script to fail, which kept the ecs-agent systemd unit from starting: two bugs accidentally cancelling each other out. Had the Kubernetes team fixed the binary-delivery bug first, the ENA-reset problem would have spread to the previously healthy AZ. The meta-lesson: when two environments that "look the same" behave differently, hunt for configuration drift before blaming a hardware lottery.
Mitigation discipline¶
See patterns/disable-default-systemd-units-in-base-image for the operational playbook. The short version:
- Audit default systemd units on every base image before promoting it to production.
- Disable units you know are wrong for your deployment shape in the bootstrap / image-bake step, not at runtime (runtime disables lose to `systemctl enable` during package upgrades).
- Monitor `systemctl --failed` + `journalctl -u <unit> | grep restart` as a fleet metric to catch silently crashing units.
- Treat `docker ps -a` outputs showing containers that are seconds-old as a signal of an invisible crash-loop producer.
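The audit step can be sketched in Python as an allowlist check. It assumes the output of `systemctl list-unit-files --type=service --state=enabled --no-legend`, where the first column is the unit name and the second is its state. The function names, the allowlist, and the sample output are all hypothetical.

```python
import subprocess
from typing import List, Set


def enabled_units(list_unit_files_output: str) -> Set[str]:
    """Parse unit names whose state column reads 'enabled'."""
    units = set()
    for line in list_unit_files_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1] == "enabled":
            units.add(fields[0])
    return units


def unexpected_units(enabled: Set[str], allowlist: Set[str]) -> List[str]:
    """Return enabled units that nobody has approved for this deployment shape."""
    return sorted(enabled - allowlist)


def read_enabled_units() -> Set[str]:
    """Query systemd on the current host (only works where systemctl exists)."""
    result = subprocess.run(
        ["systemctl", "list-unit-files", "--type=service",
         "--state=enabled", "--no-legend"],
        check=True, capture_output=True, text=True,
    )
    return enabled_units(result.stdout)


# Hypothetical sample output and allowlist, for illustration only.
SAMPLE = (
    "docker.service  enabled  enabled\n"
    "ecs.service     enabled  enabled\n"
    "ssh.service     enabled  enabled\n"
)
ALLOWLIST = {"docker.service", "ssh.service"}
print(unexpected_units(enabled_units(SAMPLE), ALLOWLIST))  # ['ecs.service']
```

Run in the image-bake pipeline, a non-empty result would fail the build until someone either approves the unit or disables it in the bootstrap step.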
Seen in¶
- sources/2026-04-15-pinterest-finding-zombies-in-our-systems-cpu-bottlenecks — canonical case study. The Deep Learning AMI's default ECS agent was the root cause of a 3-month Pinterest production incident; the Kubernetes team didn't know it was running until late in the debugging process because it wasn't orchestrator-managed.
Related¶
- concepts/zombie-memory-cgroup — the Pinterest-instance leak type
- systems/aws-ecs-agent — the offending unit
- systems/aws-deep-learning-ami — the base image
- patterns/disable-default-systemd-units-in-base-image — operational pattern