Disable default systemd units in base image
When you inherit a cloud base image or distro golden image, audit its default-enabled systemd units and disable anything that doesn't match your deployment shape — in the image-bake step, not at runtime. Unused-but-enabled units can crash-loop silently and leak kernel or userspace state on a cadence that builds up to production-impacting pathologies over days.
Pattern

- Enumerate. On a freshly booted base-image instance:

  ```shell
  # All enabled units + their state
  systemctl list-unit-files --state=enabled

  # Actively running units
  systemctl list-units --type=service --state=running

  # Recently restarted units (crash-loop signal)
  journalctl -u '*.service' --since '1 hour ago' \
    | grep -E 'Started|Stopped|Failed' | head -50
  ```
- Classify. For each enabled unit, answer:
  - Do we need it?
  - If we do, is it configured correctly for our deployment shape?
  - If we don't, is it silently failing in a way that leaks resources?
- Disable offenders permanently in the image-bake step. The place to do this is the Packer / image-builder / Dockerfile script that produces your AMI, not a post-boot runtime disable. Runtime disables lose to `systemctl enable` during package upgrades and to manual re-enables by operators who don't know the history.

  ```shell
  # In your base-image bake script:
  systemctl disable --now ecs.service
  # Mask it against re-enable, if appropriate:
  systemctl mask ecs.service
  ```

- Purge accumulated state. For leak accumulators (zombie memcgs in the Pinterest case), reboot the fleet after rolling out the image change: a runtime disable doesn't undo the already-leaked state; only a reboot does.
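The Classify and Disable steps above can be combined into a single bake-time pass. A minimal sketch, assuming an explicit allowlist of units you keep; the allowlist contents and the `classify_unit` helper are illustrative, not the source's actual tooling:

```shell
# Bake-time sketch: mask any enabled service not on an explicit allowlist.
# ALLOWED is hypothetical -- populate it from your own Classify step.
ALLOWED=" sshd.service systemd-journald.service "

classify_unit() {
  # Decide what the bake script should do with one enabled unit.
  case "$ALLOWED" in
    *" $1 "*) echo "keep $1" ;;
    *)        echo "mask $1" ;;  # i.e. systemctl disable "$1" && systemctl mask "$1"
  esac
}

# Wire it to the Enumerate output (guarded so the sketch runs anywhere):
if command -v systemctl >/dev/null 2>&1; then
  systemctl list-unit-files --state=enabled --type=service --no-legend \
    | awk '{print $1}' \
    | while read -r unit; do classify_unit "$unit"; done
fi
```

Printing the decision rather than executing it directly keeps the audit reviewable in the image-build log before you flip it to actually disabling units.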
The Pinterest-incident application
(Source: sources/2026-04-15-pinterest-finding-zombies-in-our-systems-cpu-bottlenecks)
- Culprit unit: `ecs.service` on the AWS Deep Learning AMI, launching `amazon/amazon-ecs-agent` on boot.
- Why it shipped enabled: the DLAMI is also used by ECS customers, so the AMI maintainers set the default reasonably for that constituency.
- Why it was wrong for Pinterest: Pinterest runs Kubernetes on the DLAMI, not ECS. No ECS credentials → agent crash-loops.
- Fix applied: disabled the `ecs` systemd unit in the base-image build and rebooted all hosts to purge accumulated zombie memcgs. ENA resets stopped; Ray training-job success rates returned to baseline.
Detection signals worth monitoring fleet-wide
- `systemctl --failed` count per host. Should be 0; sustained non-zero means an unattended crash.
- Unit restart rate: `systemctl show -p NRestarts <unit>` over time. A unit with hundreds of restarts per day is silently failing.
- `docker ps -a` showing containers that are seconds old. This is the invisible crash-loop producer: the container you see is never the same one you saw last time. Pinterest's diagnostic tell:

  ```
  $ docker ps -a
  CONTAINER ID   IMAGE                            ...   CREATED          STATUS
  c6fdfc760921   amazon/amazon-ecs-agent:latest   ...   11 seconds ago   Up 10 seconds
  ```
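The `docker ps` tell can be automated with a one-line filter. A sketch assuming docker's default table output; the `flag_young_containers` helper name is invented here:

```shell
# Flag containers whose CREATED column reads "seconds ago" -- on a healthy
# long-lived host that column should read hours or days, so a perpetually
# seconds-old container is the crash-loop signature.
flag_young_containers() {
  # Reads `docker ps -a` lines on stdin, prints the IDs of just-created ones.
  awk '/seconds ago/ { print $1 }'
}

# Guarded live invocation so the sketch is safe to run anywhere:
if command -v docker >/dev/null 2>&1; then
  docker ps -a | flag_young_containers
fi
```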
Trade-offs
- Image-specific hardening adds maintenance burden — you're diverging from the vendor default. But the alternative is running vendor-default units that are wrong for your orchestrator, which is what creates latent bugs like Pinterest's.
- Aggressive disabling risks removing units you actually do need. Lean on `systemctl mask` rather than manual removal so you can undo quickly.
- The fix requires a fleet reboot to purge the accumulated leak, so cadence matters. Rolling restarts via Kubernetes node drain + reboot are the default mechanism.
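The drain-and-reboot cadence might look like the following loop; the SSH hop, the settle time, and the timeouts are assumptions for illustration, not the source's tooling:

```shell
# Rolling-reboot sketch for a Kubernetes fleet -- one node at a time, so the
# leak purge never removes more capacity than a single drain.
node_host() {
  # kubectl returns names like "node/ip-10-0-0-1"; strip the resource prefix.
  echo "${1#node/}"
}

if command -v kubectl >/dev/null 2>&1; then
  for node in $(kubectl get nodes -o name); do
    kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
    ssh "$(node_host "$node")" sudo systemctl reboot || true  # ssh drops on reboot
    sleep 120  # crude settle time; in practice poll NotReady -> Ready instead
    kubectl wait --for=condition=Ready "$node" --timeout=15m
    kubectl uncordon "$node"
  done
fi
```

The `sleep` papers over a race: immediately after the reboot command the node may still report Ready, so a bare `kubectl wait` could return before the node has actually cycled.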
Seen in
- sources/2026-04-15-pinterest-finding-zombies-in-our-systems-cpu-bottlenecks — canonical production application. Disabling the ecs-agent systemd unit in Pinterest's GPU base image, plus a reboot, fixed 3 months of intermittent Ray training-job crashes caused by ENA network-driver resets, themselves induced by kubelet CPU starvation from iterating ~70,000 zombie memory cgroups created by the ECS agent's crash loop.
Related
- concepts/base-image-unused-systemd-unit-risk — the underlying risk framing
- concepts/zombie-memory-cgroup — the Pinterest-incident leak type the fix cleared
- systems/aws-ecs-agent — the offending default unit
- systems/aws-deep-learning-ami — the base image that shipped it