AWS Deep Learning AMI (DLAMI)¶
The AWS Deep Learning AMI is an AWS-maintained Amazon Machine Image preconfigured with GPU-oriented ML software (CUDA, cuDNN, NCCL, PyTorch / TensorFlow builds) for EC2 instances. It is available in multiple flavours (Ubuntu-based, Amazon-Linux-based) and updated on a rolling basis.
Default systemd configuration¶
The DLAMI (Ubuntu 20.04 variant at the time of the 2025 Pinterest
incident) sets up ecs-agent as a default systemd unit — a
reasonable default for customers using the AMI on ECS, but a
latent source of crash-loop-driven resource leaks for customers
using it under other orchestrators (Kubernetes, Ray direct-on-VM,
SageMaker's non-ECS paths, etc.).
See concepts/base-image-unused-systemd-unit-risk for the general pattern.
Lesson for DLAMI consumers on Kubernetes¶
If you run the DLAMI as a Kubernetes worker image:
- Audit default systemd units (`systemctl list-unit-files --state=enabled`) and disable the ones your orchestrator doesn't need (notably `ecs`).
- Rebake the base image, or at minimum run the disable step in the bootstrap script, gated to re-run after any DLAMI image update.
- Monitor the ratio of the memory controller's `num_cgroups` count in `/proc/cgroups` vs `find /sys/fs/cgroup/memory/ -type d | wc -l` as a health indicator: divergence > ~10× is an early sign of zombie-memcg accumulation from any crash-looping containerised workload (not just the ECS agent).
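The audit-and-disable step can be sketched as a filter over the enabled-unit list. The sample output and keep-list below are illustrative assumptions, not the DLAMI's actual unit set; on a real node you would feed `systemctl list-unit-files --state=enabled` into the same filter and `systemctl disable --now` each result.

```shell
# Illustrative enabled-unit listing (stand-in for `systemctl list-unit-files --state=enabled`).
sample='ecs.service enabled
docker.service enabled
kubelet.service enabled'

# Units the Kubernetes worker actually needs (assumption; adjust per fleet).
keep='docker|kubelet'

# Emit the enabled units NOT on the keep list -- candidates to disable.
echo "$sample" | awk '{print $1}' | grep -Ev "^($keep)\.service$"
# prints: ecs.service
```

Gating this in the bootstrap script (rather than only rebaking) means a rolling DLAMI update can't silently reintroduce the unit.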
See patterns/disable-default-systemd-units-in-base-image.
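A minimal sketch of the zombie-memcg ratio check, assuming cgroup-v1 paths. The helper names, the 10× threshold default, and the sample `/proc/cgroups` snippet are illustrative, not taken from the Pinterest writeup:

```python
import os

# Sample /proc/cgroups content (illustrative numbers, not real fleet data).
SAMPLE_PROC_CGROUPS = """\
#subsys_name\thierarchy\tnum_cgroups\tenabled
cpu\t2\t180\t1
memory\t4\t41230\t1
"""

def memory_num_cgroups(proc_cgroups_text: str) -> int:
    """Return the kernel's memory-cgroup count (num_cgroups column)."""
    for line in proc_cgroups_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "memory":
            return int(fields[2])
    raise ValueError("memory controller not listed in /proc/cgroups")

def visible_memory_cgroups(root: str = "/sys/fs/cgroup/memory") -> int:
    """Count memory-cgroup directories actually present in the filesystem."""
    # os.walk yields one tuple per directory, including the root itself,
    # so this mirrors `find /sys/fs/cgroup/memory/ -type d | wc -l`.
    return sum(1 for _ in os.walk(root))

def zombie_memcg_suspected(kernel_count: int, visible_count: int,
                           threshold: float = 10.0) -> bool:
    """Flag when the kernel tracks far more memcgs than the filesystem shows."""
    return kernel_count / max(visible_count, 1) > threshold

# With the sample numbers: 41230 kernel-side memcgs against, say, 400
# visible directories is a ~100x divergence, well past the ~10x warning line.
```

Kernel-side counts that dwarf the visible directory count indicate cgroups that were removed but are still pinned by charged memory, which is exactly the accumulation signature described above.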
Seen in¶
- sources/2026-04-15-pinterest-finding-zombies-in-our-systems-cpu-bottlenecks — the DLAMI's `ecs-agent` systemd default on Pinterest's Kubernetes GPU fleet created a three-month ENA-reset incident surfacing as intermittent Ray training-job crashes. One AZ was accidentally spared by an unrelated Kubernetes-binary-delivery bootstrap bug that kept the `ecs-agent` unit from starting — a reminder that base-image defaults are load-bearing at fleet scale.