
PATTERN

Disable default systemd units in base image

When you inherit a cloud base image or distro golden image, audit its default-enabled systemd units and disable anything that doesn't match your deployment shape, in the image-bake step, not at runtime. Unused-but-enabled units can crash-loop silently, leaking kernel or userspace state that accumulates over days into production-impacting pathologies.

Pattern

  1. Enumerate. On a freshly booted base image instance:
# All enabled units + their state
systemctl list-unit-files --state=enabled

# Actively running units
systemctl list-units --type=service --state=running

# Recently-restarted units (crash-loop signal)
journalctl -u '*.service' --since '1 hour ago' \
  | grep -E 'Started|Stopped|Failed' | head -50
  2. Classify. For each enabled unit, answer the following (a triage sketch follows these questions):
  • Do we need it?
  • If we do, is it configured correctly for our deployment shape?
  • If we don't, is it silently failing in a way that leaks resources?
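
A minimal triage sketch, assuming a live host you can shell into; the output format is illustrative, and NRestarts needs systemd >= 235:

# Print active state and restart count for every enabled service so the
# silently-failing units stand out.
systemctl list-unit-files --type=service --state=enabled --no-legend \
  | awk '{print $1}' \
  | while read -r unit; do
      state=$(systemctl is-active "$unit" 2>/dev/null || true)
      restarts=$(systemctl show -p NRestarts --value "$unit")
      printf '%-40s %-10s restarts=%s\n' "$unit" "$state" "$restarts"
    done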

  3. Disable offenders permanently in the image-bake step. The place to do this is the Packer / image-builder / Dockerfile script that produces your AMI, not a post-boot runtime disable. Runtime disables lose to systemctl enable during package upgrades and to manual re-enables by operators who don't know the history. (A post-bake check follows the snippet below.)

# In your base-image bake script:
systemctl disable --now ecs.service
# Mask it so a package upgrade or an operator can't silently re-enable it:
systemctl mask ecs.service
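
A hedged post-bake assertion for image CI; the unit name follows the Pinterest example, adapt it to your own offenders:

# Fail the image build if the unit isn't masked in the produced image.
# (is-enabled exits non-zero for masked units but still prints the state.)
state=$(systemctl is-enabled ecs.service 2>/dev/null || true)
if [ "$state" != "masked" ]; then
  echo "expected ecs.service to be masked, got '${state:-absent}'" >&2
  exit 1
fi
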
  4. Purge accumulated state. For leak-accumulators (zombie memcgs are the Pinterest example), reboot the fleet after rolling out the image change; a runtime disable doesn't undo the leaked state, only a reboot does. A rolling-reboot sketch follows.
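
A sketch of the rolling reboot, assuming the fleet runs Kubernetes and nodes are reachable over SSH (flag spellings vary by kubectl version):

for node in $(kubectl get nodes -o name | cut -d/ -f2); do
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=15m
  ssh "$node" sudo systemctl reboot || true   # SSH drops as the host goes down
  sleep 60                                    # crude wait for shutdown to begin
  kubectl wait --for=condition=Ready "node/$node" --timeout=15m
  kubectl uncordon "$node"
done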

The Pinterest-incident application

(Source: sources/2026-04-15-pinterest-finding-zombies-in-our-systems-cpu-bottlenecks)

  • Culprit unit: ecs.service on the AWS Deep Learning AMI, launching amazon/amazon-ecs-agent on boot.
  • Why it shipped enabled: the DLAMI is also used by ECS customers, so the AMI maintainers set the default reasonably for that constituency.
  • Why it was wrong for Pinterest: Pinterest runs Kubernetes on the DLAMI, not ECS. No ECS credentials → agent crash-loops.
  • Fix applied: disabled the ecs systemd unit in the base-image build + rebooted all hosts to purge accumulated zombie memcgs. ENA resets stopped; Ray training job success rates returned to baseline.

Detection signals worth monitoring fleet-wide

  • systemctl --failed count per host. Should be 0; sustained non-zero means an unattended crash.
  • Unit restart rate. systemctl show -p NRestarts <unit> over time. A unit with hundreds of restarts per day is silently failing (see the emitter sketch after this list).
  • docker ps -a showing containers that are seconds-old. Invisible crash-loop producer — the container you see is never the same one you saw last time. Pinterest's diagnostic tell:
$ docker ps -a
CONTAINER ID   IMAGE                                ... CREATED         STATUS
c6fdfc760921   amazon/amazon-ecs-agent:latest      ... 11 seconds ago  Up 10 seconds
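
A node-local emitter sketch for the first two signals, assuming a textfile-style metrics pipeline; metric names are illustrative:

# Count of failed units; alert on sustained non-zero.
echo "systemd_failed_units $(systemctl --failed --no-legend | wc -l)"
# Per-unit restart counters; graph the rate to catch silent crash-loops.
systemctl list-units --type=service --no-legend --plain \
  | awk '{print $1}' \
  | while read -r unit; do
      n=$(systemctl show -p NRestarts --value "$unit")
      [ "${n:-0}" -gt 0 ] && echo "systemd_unit_restarts{unit=\"$unit\"} $n"
    done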

Trade-offs

  • Image-specific hardening adds maintenance burden — you're diverging from the vendor default. But the alternative is running vendor-default units that are wrong for your orchestrator, which is what creates latent bugs like Pinterest's.
  • Aggressive disabling risks removing units you actually do need. Lean on systemctl mask rather than manual removal so you can undo quickly.
  • The fix requires a fleet reboot to purge the accumulated leak, so rollout cadence matters: every host keeps leaking until it reboots. Rolling restarts via Kubernetes node drain + reboot are the default mechanism.

Seen in

  • sources/2026-04-15-pinterest-finding-zombies-in-our-systems-cpu-bottlenecks — canonical production application. Disabling the ecs-agent systemd unit in Pinterest's GPU base image + reboot fixed 3 months of intermittent Ray training-job crashes caused by ENA network driver resets induced by kubelet CPU-starvation from iterating ~70,000 zombie memory cgroups created by the ECS agent's crash-loop.