# Pinterest ML Serving Root-Leaf Architecture

## Definition
Pinterest ML Serving Root-Leaf Architecture is the internal structure of Pinterest's online ML serving system: a two-tier split where root hosts handle feature fetching and preprocessing on CPU, and leaf partitions run model inference on GPU. The tier boundary is an fbthrift RPC, with the root fanning out per-candidate score requests to every leaf model that should score a given candidate. Canonicalised in the 2026-05-01 Pinterest Engineering post on Feature Trimmer. (Source: sources/2026-05-01-pinterest-optimizing-ml-workload-network-efficiency-part-i-feature-trimmer)
## Architecture
From the post's Background section:
```
Client Service ──── score request ───► Online ML Serving System
                                          │
                                          ├── Root (CPU)
                                          │     - feature fetch from feature store
                                          │     - preprocessing
                                          │     - per-candidate fan-out to leaf partitions
                                          │     - shared in-memory feature cache
                                          │     - AWS m6in (network-optimized), pre-Trimmer
                                          │
                                          └── Leaf partitions (GPU)
                                                - one partition per related group of models
                                                - each partition hosts:
                                                    * one production model
                                                    * several experimental variants
                                                - PyTorch feature converter + inference
```
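The post contains no code, but the request path above translates directly into a scatter-gather loop. A minimal asyncio sketch, assuming hypothetical partition names and stand-in functions for the feature-store fetch and the fbthrift leaf call (none of these identifiers are from the post):

```python
import asyncio
from typing import Any

# Hypothetical partition names; the post discloses no partition inventory.
LEAF_PARTITIONS = ["ads-engagement", "homefeed-ranker"]

async def fetch_features(candidate_ids: list[str]) -> dict[str, dict[str, Any]]:
    # Stand-in for the root's feature-store read, served through the
    # shared in-memory cache described under Benefits below.
    return {cid: {"f1": 0.0, "f2": 1.0} for cid in candidate_ids}

async def score_on_leaf(partition: str, payload: dict) -> dict[str, float]:
    # Stand-in for the fbthrift RPC to one GPU leaf partition; the bytes
    # in `payload` are the root-to-leaf traffic Part I optimizes.
    return {cid: 0.5 for cid in payload["features"]}

async def handle_score_request(candidate_ids: list[str]) -> dict[str, dict[str, float]]:
    # Root (CPU): fetch and preprocess once, up front.
    features = await fetch_features(candidate_ids)
    payload = {"features": features}  # preprocessing elided
    # Fan-out: one RPC per leaf partition that should score these candidates.
    results = await asyncio.gather(
        *(score_on_leaf(p, payload) for p in LEAF_PARTITIONS)
    )
    return dict(zip(LEAF_PARTITIONS, results))

print(asyncio.run(handle_score_request(["pin-1", "pin-2"])))
```

Everything before the gather runs on root CPU; everything behind `score_on_leaf` runs on a GPU leaf. The payload crossing that boundary is what the rest of this page is about.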
## Load-bearing properties

### Benefits (the reason for the split)
- Simplified model onboarding — a new ML model gets a new leaf partition, transparent to root and upstream clients. Decouples model shipping cadence from client / root cadence.
- Reduced feature-store QPS — "the system minimizes RPCs to the feature store for fetching ML features by having all leaf partitions share a large in-memory feature cache in the root." One root cache fronts every model's feature reads (a minimal cache sketch follows this list).
- Optimized resource utilization — "separating CPU (feature fetching, preprocessing) and GPU (model inference) workloads allows for optimized resource use, improving efficiency and reducing cost." Each tier uses its natural instance type.
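How that shared cache behaves is not specified beyond the quote above. A minimal LRU sketch, where the capacity, the eviction policy, and the `get_or_fetch` interface are all assumptions:

```python
from collections import OrderedDict
from typing import Callable

class RootFeatureCache:
    # Assumed LRU semantics; the post says only that the cache is large,
    # in-memory, on root, and shared by all leaf partitions.
    def __init__(self, capacity: int = 100_000):
        self.capacity = capacity
        self._data: OrderedDict[str, dict] = OrderedDict()

    def get_or_fetch(self, entity_id: str, fetch_fn: Callable[[str], dict]) -> dict:
        if entity_id in self._data:
            self._data.move_to_end(entity_id)  # refresh recency on a hit
            return self._data[entity_id]
        value = fetch_fn(entity_id)            # the only feature-store RPC
        self._data[entity_id] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)     # evict least recently used
        return value
```

Whatever the real policy, the load-bearing point is the singular cache: N leaf models reading the same entity cost the feature store one fetch, not N.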
### Costs (the consequence of the split)
- Feature fan-out over the network — features that were free in-memory transfers in the prior single-host architecture become RPC bytes on every fan-out. Pinterest's framing: "passing too many features from root to leaf created a network bottleneck."
- Network becomes the scaling axis, not compute — Pinterest had to "scale the system based on network usage rather than compute." On leaf, peak network usage was significantly higher than peak GPU SM activity; on root, Pinterest had to use network-optimized AWS m6in instances (~20% more expensive than standard m6i) to meet latency SLA. Canonical instance of concepts/network-bound-vs-compute-bound.
- The root fetches the union of features needed across all models — each leaf model then discards what it doesn't need. In the prior single-host architecture this was a memory-only cost; in root-leaf it is a network cost on every fan-out (back-of-envelope arithmetic below).
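The post discloses no payload sizes, so the numbers below are invented purely to show the shape of the cost: union width times candidate count times leaf fan-out.

```python
# Every number here is made up for illustration; Part I discloses none.
candidates_per_request = 1000   # candidates in one score request
union_features         = 500    # union of features across all models
bytes_per_feature      = 40     # serialized size of one feature value
leaf_partitions        = 8      # fan-out width

per_leaf   = candidates_per_request * union_features * bytes_per_feature
all_leaves = per_leaf * leaf_partitions  # the payload repeats per fan-out
print(f"{per_leaf / 1e6:.0f} MB per leaf, {all_leaves / 1e6:.0f} MB per request")
```

Under any numbers of that shape, egress grows with fan-out width and candidate count while GPU work does not, which is exactly the scale-on-network-not-compute dynamic in the previous bullet.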
## Levers Pinterest pulled to reclaim the network
- fbthrift lz4 compression (modest win): −20% root-leaf bandwidth, +5% CPU, +5 ms (~10%) p90 latency.
- Feature Trimmer (structural win): trim the fan-out payload down to each model version's feature allowlist (sketched after this list). Unlocked a 27% root-cluster downsize on Ads, 33% on Homefeed, and a 65–75% leaf-inbound reduction; shifted the bottleneck from network to CPU-on-root. Part II of the series will address the client→root payload.
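A minimal sketch of that trimming step, with hypothetical model names, versions, and allowlists; Part I does not disclose how allowlists are derived or what the wire format looks like:

```python
# Hypothetical per-(model, version) allowlists; the real Trimmer maintains
# these per model version, by means Part I does not disclose.
ALLOWLISTS: dict[tuple[str, str], set[str]] = {
    ("ads-engagement", "v42"): {"f1", "f3"},
    ("homefeed-ranker", "v7"): {"f2"},
}

def trim_payload(union_features: dict[str, dict[str, float]],
                 model: str, version: str) -> dict[str, dict[str, float]]:
    """Keep only the features this model version actually reads; everything
    else would otherwise be serialized, shipped, and discarded on the leaf."""
    allow = ALLOWLISTS[(model, version)]
    return {
        cid: {name: val for name, val in feats.items() if name in allow}
        for cid, feats in union_features.items()
    }

# The root still fetches the union; each leaf now receives only its slice.
union = {"pin-1": {"f1": 0.1, "f2": 0.2, "f3": 0.3}}
print(trim_payload(union, "ads-engagement", "v42"))
# -> {'pin-1': {'f1': 0.1, 'f3': 0.3}}: f2 never leaves the root
```

Note that the filtering itself burns root CPU on every fan-out, which is consistent with the post's observation that the win moved the bottleneck from network to CPU-on-root.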
## Seen in
- 2026-05-01 Pinterest — Optimizing ML Workload Network Efficiency (Part I): Feature Trimmer (sources/2026-05-01-pinterest-optimizing-ml-workload-network-efficiency-part-i-feature-trimmer) — canonical architecture disclosure; names the benefits + costs + both levers; quantifies the bottleneck-relocation win.
## Caveats
- Stub-to-system page — Pinterest's ML serving architecture is mentioned across many Pinterest engineering posts (the 2026-03-03 unified ads-engagement model post, the 2026-04-07 Home Feed MOO post, the 2026-04-13 DCAT post) but the root-leaf split itself is canonicalised here first. Expect this page to accumulate as more Pinterest ML-serving posts are ingested.
- Pre-root-leaf architecture not fully disclosed — the post references "our prior architecture, where the same GPU host handled both feature fetching/preprocessing and local model inference", implying a single-host-per-model design predecessor. No dates or migration-timing disclosed.
- Feature-store identity not named — referenced only as "the feature store".
- Leaf partition count, model density, and GPU instance types not disclosed.
## Related
- systems/pinterest-feature-trimmer — the optimisation module on root.
- systems/fbthrift — the RPC substrate.
- systems/pytorch — the leaf-side inference framework.
- systems/pinterest-ads-engagement-model — a canonical leaf-hosted production model.
- systems/pinterest-home-feed — another canonical consumer surface.
- concepts/root-leaf-ml-serving-architecture — the architectural primitive generalised.
- concepts/feature-fanout-network-bottleneck — the resulting failure mode.
- concepts/network-bound-vs-compute-bound — the scaling-bottleneck framing.
- companies/pinterest — the operator.