

AWS Deep Learning AMI (DLAMI)

The AWS Deep Learning AMI is an AWS-maintained Amazon Machine Image preconfigured with GPU-oriented ML software (CUDA, cuDNN, NCCL, and prebuilt PyTorch / TensorFlow) for EC2 instances. It ships in multiple flavours (Ubuntu-based and Amazon-Linux-based) and is updated on a rolling basis.

Default systemd configuration

The DLAMI (the Ubuntu 20.04 variant, at the time of the 2025 Pinterest incident) enables ecs-agent as a systemd unit by default — a reasonable default for customers running the AMI on ECS, but a latent source of crash-loop-driven resource leaks for customers running it under other orchestrators (Kubernetes, Ray direct-on-VM, SageMaker's non-ECS paths, etc.).

See concepts/base-image-unused-systemd-unit-risk for the general pattern.
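A minimal sketch of checking for and disabling the unused unit on a given instance. The unit name ecs matches the Ubuntu DLAMI's ECS agent unit; the helper function name is hypothetical, and you would pass whatever unit list applies to your image:

```shell
#!/usr/bin/env bash
# Hypothetical helper: disable systemd units the orchestrator doesn't need.
disable_unused_units() {
  local unit
  for unit in "$@"; do
    if systemctl is-enabled --quiet "$unit" 2>/dev/null; then
      sudo systemctl disable --now "$unit"   # stop it now and prevent future starts
      echo "disabled: $unit"
    else
      echo "skipped: $unit"                  # not present or already disabled
    fi
  done
}

# Example: run from a node bootstrap script before the kubelet starts.
disable_unused_units ecs
```

Running this in the bootstrap path (rather than only at bake time) keeps the fix in place across DLAMI image updates.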

Lesson for DLAMI consumers on Kubernetes

If you run the DLAMI as a Kubernetes worker image:

  1. Audit default-enabled systemd units (systemctl list-unit-files --state=enabled) and disable the ones your orchestrator doesn't need (notably ecs, the ECS agent unit).
  2. Rebake the base image or at minimum run the disable step in the bootstrap script, gated to run after any DLAMI image update.
  3. Monitor the ratio of the kernel's live memory-cgroup count (the num_cgroups column for memory in /proc/cgroups) to the number of memory-cgroup directories actually on disk (find /sys/fs/cgroup/memory -type d | wc -l) as a health indicator — divergence beyond roughly 10× is an early sign of zombie-memcg accumulation from any crash-looping containerised workload (not just the ECS agent).
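The ratio check in step 3 can be sketched as below. The /proc/cgroups column layout is standard, but the 10× threshold is the heuristic from this note, not a kernel constant, and the cgroup-v1 path /sys/fs/cgroup/memory is assumed (cgroup v2 unifies the hierarchy under /sys/fs/cgroup):

```python
import os

def parse_num_cgroups(proc_cgroups_text: str, controller: str = "memory") -> int:
    """Return the num_cgroups column for a controller from /proc/cgroups contents."""
    for line in proc_cgroups_text.splitlines():
        if line.startswith("#"):
            continue  # header row
        fields = line.split()
        if fields and fields[0] == controller:
            return int(fields[2])  # columns: subsys_name, hierarchy, num_cgroups, enabled
    raise KeyError(f"controller {controller!r} not found in /proc/cgroups")

def count_memcg_dirs(root: str = "/sys/fs/cgroup/memory") -> int:
    """Count memory-cgroup directories, like `find <root> -type d | wc -l`."""
    return sum(1 for _ in os.walk(root))  # os.walk yields one tuple per directory

def zombie_ratio(num_cgroups: int, dir_count: int) -> float:
    """Kernel-tracked cgroups per on-disk directory; zombies inflate the numerator."""
    return num_cgroups / max(dir_count, 1)
```

On a live host you would feed it `open("/proc/cgroups").read()` and `count_memcg_dirs()`, and alert when `zombie_ratio(...)` exceeds ~10.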

See patterns/disable-default-systemd-units-in-base-image.

Seen in

  • sources/2026-04-15-pinterest-finding-zombies-in-our-systems-cpu-bottlenecks — the DLAMI's ecs-agent systemd default on Pinterest's Kubernetes GPU fleet created a 3-month-long ENA-reset incident surfacing as intermittent Ray training-job crashes. One AZ was accidentally spared by an unrelated Kubernetes-binary-delivery bootstrap bug that gated the ecs-agent unit from starting — a reminder that base-image defaults are load-bearing at fleet scale.