SYSTEM Cited by 1 source
Pinterest PinCompute¶
PinCompute is Pinterest's Kubernetes-backed general-purpose compute platform on AWS EC2. It hosts a large share of Pinterest's offline + online compute — notably the Ray clusters that run
50% of the company's offline ML workload ("tens of thousands of Ray clusters per month").
Shape (as disclosed)¶
- Zonal clusters. One Kubernetes cluster per AWS Availability Zone — "The PinCompute team runs zonal clusters (one Kubernetes cluster per Availability zone)." Blast-radius containment + latency locality; the downside is that cross-AZ configuration drift has a real operational impact (see the 2025 ENA-reset incident below).
- GPU instance family for ML training: 96 vCPU per host.
- AWS Deep Learning AMI (Ubuntu 20.04) as the base machine image at the time of the 2025 incident. The DLAMI default-systemd-units are what introduced the latent bug.
- Taints + tolerations used as the reservation mechanism for controlled debugging hosts: "Reserved a small number of machines (via Kubernetes taints) for analysis" — see patterns/reserved-host-repro-env.
Seen in¶
- sources/2026-04-15-pinterest-finding-zombies-in-our-systems-cpu-bottlenecks
— PinCompute + ML Platform joint 3-month investigation into
intermittent Ray training-job crashes traced to AWS ENA driver
resets triggered by
kubeletCPU starvation on a single core, in turn caused by zombie memory-cgroup accumulation from a crash-loopingecs-agentsystemd unit shipped by the Deep Learning AMI. First wiki-canonicalised disclosure of PinCompute's zonal K8s shape + its GPU-fleet operational discipline.
Related¶
- systems/kubernetes — substrate
- systems/ray — top user workload
- systems/aws-ec2 — underlying IaaS
- systems/aws-deep-learning-ami — base image that shipped the zombie-memcg root cause
- companies/pinterest