CONCEPT

Base-image unused-systemd-unit risk

Cloud base images (AMIs, Docker base images, distro golden images) typically ship with pre-configured systemd units active by default because the image vendor has to serve the most common use case. A unit that is a perfect default for one deployment shape can crash silently in a loop on another deployment shape — and each crash can leak kernel or userspace state that accumulates into a production-impacting pathology over days or weeks.

The anti-pattern

A vendor ships an AMI with systemd unit X.service enabled and started on boot. The unit is correct for the vendor's reference workload (ECS host, Datadog agent target, specific container runtime). The customer boots that AMI under a different orchestrator (Kubernetes, Nomad, raw EC2) where X.service has no valid environment to operate in:

  • No credentials / endpoints to register against.
  • No control-plane it is intended to report to.
  • No upstream dependency that would normally provide its input.

systemd starts it, it fails, systemd restarts it, it fails again. Each restart may allocate short-lived kernel / container / cgroup / namespace resources. A well-behaved crash-loop with a few seconds between restarts can accumulate resource leakage across days of uptime. (Source: sources/2026-04-15-pinterest-finding-zombies-in-our-systems-cpu-bottlenecks)
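A minimal sketch of the shape, as a hypothetical vendor-agent.service (the real ecs.service differs in detail): Restart=always, with nothing conditioning the unit on the environment it actually needs.

    # /etc/systemd/system/vendor-agent.service
    # Hypothetical illustration, not the unit any vendor actually ships.
    [Unit]
    Description=Vendor fleet agent (assumes the vendor's control plane exists)
    After=network-online.target docker.service

    [Service]
    # Exits within seconds when there are no credentials or control plane...
    ExecStart=/usr/bin/docker run --rm vendor/agent:latest
    # ...and systemd dutifully starts it again, forever.
    Restart=always
    RestartSec=10

    [Install]
    WantedBy=multi-user.target

The defensive variant adds a guard such as ConditionPathExists= on the credentials file the agent requires, so systemd skips the unit rather than looping on it.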

The Pinterest instance — ecs-agent on Deep Learning AMI

The canonical wiki case:

  • Base image: AWS Deep Learning AMI (Ubuntu 20.04).
  • Default systemd unit: ecs.service starting amazon/amazon-ecs-agent via Docker.
  • Deployment context: Pinterest's Kubernetes cluster (PinCompute) on GPU EC2 hosts — not ECS cluster members.
  • Failure: ecs-agent has no ECS credentials, fails to join a cluster, exits in seconds. systemd restarts it. Repeat for days.
  • Leaked state: each ecs-agent container spawn allocated a new memory cgroup; deferred kernel reclamation accumulated zombie memcgs: 68,680 tracked by the kernel vs 240 in use after several days of uptime (a userspace way to count these is sketched after this list).
  • Blast radius: kubelet's stats reads drove the kernel's mem_cgroup_nr_lru_pages() over all 68,680 zombies → one CPU core pinned at 100% %sys for seconds → AWS ENA driver Tx path starved beyond 5 s → device reset → packet drops → Ray training job crashes.
  • Fix: disable ecs.service in the base-image bootstrap and reboot to purge the accumulated zombies.
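The zombie accumulation is observable from userspace. Below is a sketch of counting it, assuming standard kernel interfaces: cgroup v2 exposes a dying-descendants counter directly, and on cgroup v1 hosts the usual rough proxy is the gap between the tracked count in /proc/cgroups and the number of live memory-cgroup directories. How Pinterest derived its exact figures is not specified here.

    #!/usr/bin/env bash
    # Count zombie (dying) memory cgroups. Paths are standard kernel
    # interfaces; interpretation thresholds are up to the operator.

    # cgroup v2: the root cgroup.stat reports dying descendants directly.
    if [ -f /sys/fs/cgroup/cgroup.stat ]; then
        awk '/nr_dying_descendants/ { print "dying cgroups (v2):", $2 }' \
            /sys/fs/cgroup/cgroup.stat
    fi

    # cgroup v1: compare the kernel's tracked count (third column of
    # /proc/cgroups) against the directories that still exist.
    if [ -d /sys/fs/cgroup/memory ]; then
        tracked=$(awk '$1 == "memory" { print $3 }' /proc/cgroups)
        live=$(find /sys/fs/cgroup/memory -type d | wc -l)
        echo "memory cgroups tracked: ${tracked}, live dirs: ${live}"
        # A tracked count orders of magnitude above the live count
        # (68,680 vs 240 in the Pinterest case) means zombie accumulation.
    fi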

The AZ-disparity confession

Pinterest's investigation was complicated by one AZ not showing the bug. They eventually discovered that an unrelated Kubernetes-binary delivery bug in that AZ caused the node bootstrap script to fail, which prevented the ecs-agent systemd unit from starting: two bugs accidentally cancelling each other out. The moment the Kubernetes team fixed the binary-delivery bug, the ENA-reset problem would have spread to the previously healthy AZ. The meta-lesson: when two environments that "look the same" behave differently, the investigation is a configuration-drift hunt, not a hardware-lottery hunt.

Mitigation discipline

See patterns/disable-default-systemd-units-in-base-image for the operational playbook. The short version:

  • Audit default systemd units on every base image before promoting it to production.
  • Disable units you know are wrong for your deployment shape in the bootstrap / image-bake step, not at runtime (a runtime systemctl disable loses to a systemctl enable re-run during a package upgrade); see the bake-time sketch after this list.
  • Monitor systemctl --failed + journalctl -u <unit> | grep restart as a fleet metric to catch silently crashing units; see the monitoring sketch after this list.
  • Treat docker ps -a output full of containers only seconds old as a signal of an invisible crash-loop producer.
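A bake-time sketch of the first two items, assuming an Ubuntu-style image; ecs.service is from the Pinterest case, and any other unit names in the deny-list are your own:

    #!/usr/bin/env bash
    # Image-bake step: audit enabled units, then disable and mask the
    # ones known to be wrong for this deployment shape.
    set -euo pipefail

    # 1. Audit: record every enabled service so image promotion can diff it.
    systemctl list-unit-files --type=service --state=enabled \
        > /var/log/image-bake-enabled-units.txt

    # 2. Mask, not just disable: a mask (symlink to /dev/null) survives a
    #    package postinstall re-running `systemctl enable`; a plain
    #    disable does not.
    for unit in ecs.service; do
        systemctl disable --now "$unit" || true   # --now may fail in a chroot
        systemctl mask "$unit"
    done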
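And a fleet-side monitoring sketch for the last two items. NRestarts is a standard systemd unit property (systemd >= 235); the restart threshold and the docker output filter are illustrative assumptions:

    #!/usr/bin/env bash
    # Fleet health probe: failed units, busy restart counters, and
    # containers whose whole life is measured in seconds.

    # Units systemd has given up on entirely.
    systemctl --failed --no-legend

    # Units that look "fine" but have been restarted many times.
    for unit in $(systemctl list-unit-files --type=service --state=enabled --no-legend | awk '{print $1}'); do
        n=$(systemctl show -p NRestarts --value "$unit")
        if [ "${n:-0}" -gt 100 ]; then
            echo "restart-loop suspect: $unit ($n restarts)"
        fi
    done

    # Seconds-old containers: a crash-loop producer keeps minting them.
    docker ps -a --format '{{.Names}}\t{{.Status}}' | grep 'seconds' || true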
