SYSTEM Cited by 1 source

Pinterest PinCompute¶

PinCompute is Pinterest's Kubernetes-backed general-purpose compute platform on AWS EC2. It hosts a large share of Pinterest's offline + online compute — notably the Ray clusters that run

50% of the company's offline ML workload ("tens of thousands of Ray clusters per month").

Shape (as disclosed)¶

Zonal clusters. One Kubernetes cluster per AWS Availability Zone — "The PinCompute team runs zonal clusters (one Kubernetes cluster per Availability zone)." Blast-radius containment + latency locality; the downside is that cross-AZ configuration drift has a real operational impact (see the 2025 ENA-reset incident below).
GPU instance family for ML training: 96 vCPU per host.
AWS Deep Learning AMI (Ubuntu 20.04) as the base machine image at the time of the 2025 incident. The DLAMI default-systemd-units are what introduced the latent bug.
Taints + tolerations used as the reservation mechanism for controlled debugging hosts: "Reserved a small number of machines (via Kubernetes taints) for analysis" — see patterns/reserved-host-repro-env.

Seen in¶

sources/2026-04-15-pinterest-finding-zombies-in-our-systems-cpu-bottlenecks — PinCompute + ML Platform joint 3-month investigation into intermittent Ray training-job crashes traced to AWS ENA driver resets triggered by kubelet CPU starvation on a single core, in turn caused by zombie memory-cgroup accumulation from a crash-looping ecs-agent systemd unit shipped by the Deep Learning AMI. First wiki-canonicalised disclosure of PinCompute's zonal K8s shape + its GPU-fleet operational discipline.

systems/kubernetes — substrate
systems/ray — top user workload
systems/aws-ec2 — underlying IaaS
systems/aws-deep-learning-ami — base image that shipped the zombie-memcg root cause
companies/pinterest

Pinterest PinCompute¶

Shape (as disclosed)¶

Seen in¶

Related¶