Skip to content

SYSTEM Cited by 1 source

Pinterest PinCompute

PinCompute is Pinterest's Kubernetes-backed general-purpose compute platform on AWS EC2. It hosts a large share of Pinterest's offline + online compute — notably the Ray clusters that run

50% of the company's offline ML workload ("tens of thousands of Ray clusters per month").

Shape (as disclosed)

  • Zonal clusters. One Kubernetes cluster per AWS Availability Zone — "The PinCompute team runs zonal clusters (one Kubernetes cluster per Availability zone)." Blast-radius containment + latency locality; the downside is that cross-AZ configuration drift has a real operational impact (see the 2025 ENA-reset incident below).
  • GPU instance family for ML training: 96 vCPU per host.
  • AWS Deep Learning AMI (Ubuntu 20.04) as the base machine image at the time of the 2025 incident. The DLAMI default-systemd-units are what introduced the latent bug.
  • Taints + tolerations used as the reservation mechanism for controlled debugging hosts: "Reserved a small number of machines (via Kubernetes taints) for analysis" — see patterns/reserved-host-repro-env.

Seen in

  • sources/2026-04-15-pinterest-finding-zombies-in-our-systems-cpu-bottlenecks — PinCompute + ML Platform joint 3-month investigation into intermittent Ray training-job crashes traced to AWS ENA driver resets triggered by kubelet CPU starvation on a single core, in turn caused by zombie memory-cgroup accumulation from a crash-looping ecs-agent systemd unit shipped by the Deep Learning AMI. First wiki-canonicalised disclosure of PinCompute's zonal K8s shape + its GPU-fleet operational discipline.
Last updated · 319 distilled / 1,201 read