
PATTERN

Disable default systemd units in base image

When you inherit a cloud base image or distro golden image, audit its default-enabled systemd units and disable anything that doesn't match your deployment shape, in the image-bake step, not at runtime. Unused-but-enabled units can crash-loop silently, leaking kernel or userspace state that accumulates over days into production-impacting pathologies.

Pattern

  1. Enumerate. On a freshly booted base image instance:
# All enabled units + their state
systemctl list-unit-files --state=enabled

# Actively running units
systemctl list-units --type=service --state=running

# Recently-restarted units (crash-loop signal)
journalctl -u '*.service' --since '1 hour ago' \
  | grep -E 'Started|Stopped|Failed' | head -50
  2. Classify. For each enabled unit, answer the following (a triage sketch follows these questions):
  • Do we need it?
  • If we do, is it configured correctly for our deployment shape?
  • If we don't, is it silently failing in a way that leaks resources?
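
A minimal triage sketch, assuming a live host you can shell into; the output format is illustrative, and NRestarts needs systemd >= 235:

# Print active state and restart count for every enabled service so the
# silently-failing units stand out.
systemctl list-unit-files --type=service --state=enabled --no-legend \
  | awk '{print $1}' \
  | while read -r unit; do
      state=$(systemctl is-active "$unit" 2>/dev/null || true)
      restarts=$(systemctl show -p NRestarts --value "$unit")
      printf '%-40s %-10s restarts=%s\n' "$unit" "$state" "$restarts"
    done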

  3. Disable offenders permanently in the image-bake step. The place to do this is the Packer / image-builder / Dockerfile script that produces your AMI, not a post-boot runtime disable. Runtime disables lose to systemctl enable during package upgrades and to manual re-enables by operators who don't know the history. (A post-bake check follows the snippet below.)

# In your base-image bake script:
systemctl disable --now ecs.service
# Mask it so a package upgrade or an operator can't silently re-enable it:
systemctl mask ecs.service
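
A hedged post-bake assertion for image CI; the unit name follows the Pinterest example, adapt it to your own offenders:

# Fail the image build if the unit isn't masked in the produced image.
# (is-enabled exits non-zero for masked units but still prints the state.)
state=$(systemctl is-enabled ecs.service 2>/dev/null || true)
if [ "$state" != "masked" ]; then
  echo "expected ecs.service to be masked, got '${state:-absent}'" >&2
  exit 1
fi
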
  4. Purge accumulated state. For leak-accumulators (zombie memcgs are the Pinterest example), reboot the fleet after rolling out the image change; a runtime disable doesn't undo the leaked state, only a reboot does. A rolling-reboot sketch follows.
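
A sketch of the rolling reboot, assuming the fleet runs Kubernetes and nodes are reachable over SSH (flag spellings vary by kubectl version):

for node in $(kubectl get nodes -o name | cut -d/ -f2); do
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=15m
  ssh "$node" sudo systemctl reboot || true   # SSH drops as the host goes down
  sleep 60                                    # crude wait for shutdown to begin
  kubectl wait --for=condition=Ready "node/$node" --timeout=15m
  kubectl uncordon "$node"
done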

The Pinterest-incident application

(Source: sources/2026-04-15-pinterest-finding-zombies-in-our-systems-cpu-bottlenecks)

  • Culprit unit: ecs.service on the AWS Deep Learning AMI, launching amazon/amazon-ecs-agent on boot.
  • Why it shipped enabled: the DLAMI is also used by ECS customers, so the AMI maintainers set the default reasonably for that constituency.
  • Why it was wrong for Pinterest: Pinterest runs Kubernetes on the DLAMI, not ECS. No ECS credentials → agent crash-loops.
  • Fix applied: disabled the ecs systemd unit in the base-image build + rebooted all hosts to purge accumulated zombie memcgs. ENA resets stopped; Ray training job success rates returned to baseline.

Detection signals worth monitoring fleet-wide

  • systemctl --failed count per host. Should be 0; sustained non-zero means an unattended crash.
  • Unit restart rate. systemctl show -p NRestarts <unit> over time. A unit with hundreds of restarts per day is silently failing (see the emitter sketch after this list).
  • docker ps -a showing containers that are seconds-old. Invisible crash-loop producer — the container you see is never the same one you saw last time. Pinterest's diagnostic tell:
$ docker ps -a
CONTAINER ID   IMAGE                                ... CREATED         STATUS
c6fdfc760921   amazon/amazon-ecs-agent:latest      ... 11 seconds ago  Up 10 seconds
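
A node-local emitter sketch for the first two signals, assuming a textfile-style metrics pipeline; metric names are illustrative:

# Count of failed units; alert on sustained non-zero.
echo "systemd_failed_units $(systemctl --failed --no-legend | wc -l)"
# Per-unit restart counters; graph the rate to catch silent crash-loops.
systemctl list-units --type=service --no-legend --plain \
  | awk '{print $1}' \
  | while read -r unit; do
      n=$(systemctl show -p NRestarts --value "$unit")
      [ "${n:-0}" -gt 0 ] && echo "systemd_unit_restarts{unit=\"$unit\"} $n"
    done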

Trade-offs

  • Image-specific hardening adds maintenance burden — you're diverging from the vendor default. But the alternative is running vendor-default units that are wrong for your orchestrator, which is what creates latent bugs like Pinterest's.
  • Aggressive disabling risks removing units you actually do need. Lean on systemctl mask rather than manual removal so you can undo quickly.
  • The fix requires a fleet reboot to purge the accumulated leak, so rollout cadence matters: every host keeps leaking until it reboots. Rolling restarts via Kubernetes node drain + reboot are the default mechanism.

Seen in

  • sources/2026-04-15-pinterest-finding-zombies-in-our-systems-cpu-bottlenecks — canonical production application. Disabling the ecs-agent systemd unit in Pinterest's GPU base image + reboot fixed 3 months of intermittent Ray training-job crashes caused by ENA network driver resets induced by kubelet CPU-starvation from iterating ~70,000 zombie memory cgroups created by the ECS agent's crash-loop.