Root-leaf ML serving architecture¶
Definition¶
Root-leaf ML serving architecture is an online-inference system shape where a single root tier handles feature retrieval + preprocessing on CPU, and a fleet of leaf partitions handles model inference on GPU. Each inbound score request arrives at root, which fetches the union of features needed across relevant models from the feature store, caches them in memory, and fans out per-candidate score requests over RPC to each leaf partition that should score that candidate. Results are gathered back at root and returned to the client.
Canonicalised on the wiki from Pinterest's 2026-05-01 Feature Trimmer post. (Source: sources/2026-05-01-pinterest-optimizing-ml-workload-network-efficiency-part-i-feature-trimmer)
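A minimal sketch of the flow above, in Python, assuming hypothetical `Root` / `LeafPartition` classes and an in-process dict standing in for the feature store; names, RPC shapes, and the dummy scoring are illustrative, not Pinterest's actual APIs.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class LeafPartition:
    """GPU leaf hosting one model; scores candidates over RPC (stubbed here)."""
    model_name: str
    feature_allowlist: set[str]

    async def score_rpc(self, candidate_id: str, features: dict) -> float:
        # Stand-in for the real RPC + GPU inference; the model reads only the
        # subset of the feature union it actually needs.
        used = {k: v for k, v in features.items() if k in self.feature_allowlist}
        return float(len(used))  # dummy score


@dataclass
class Root:
    """CPU root tier: feature retrieval + preprocessing, fan-out, gather."""
    feature_store: dict[str, dict]                        # stand-in for the remote store
    cache: dict[str, dict] = field(default_factory=dict)  # shared in-memory feature cache

    def fetch_features(self, candidate_ids: list[str]) -> dict[str, dict]:
        # Fetch the union of features needed across all relevant models,
        # going to the feature store only on in-memory cache misses.
        for c in candidate_ids:
            if c not in self.cache:
                self.cache[c] = self.feature_store.get(c, {})
        return {c: self.cache[c] for c in candidate_ids}

    async def score(self, candidate_ids: list[str],
                    leaves: list[LeafPartition]) -> list[float]:
        features = self.fetch_features(candidate_ids)
        # Fan out per-candidate score requests to every relevant leaf; each
        # payload carries the full feature union, used or not.
        calls = [leaf.score_rpc(c, features[c])
                 for leaf in leaves for c in candidate_ids]
        return await asyncio.gather(*calls)
```

The fan-out in `score` is where the rest of this page concentrates: each RPC payload carries the full union, which is the bandwidth cost examined below.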
Why the split¶
Pinterest's three named benefits for splitting feature fetch/preprocessing from inference:
- Simplified model onboarding — new models get new leaf partitions, transparent to root and clients. Model release cadence decouples from client / root release cadence.
- Reduced feature-store QPS — the root holds a shared in-memory cache that fronts feature reads for all leaf partitions, so the feature store sees requests at root cardinality, not root × models cardinality (see the arithmetic sketch after this list).
- CPU / GPU resource separation — CPU-bound work (feature I/O + feature preprocessing) runs on CPU instance types; GPU-bound inference runs on GPU instance types. Each tier right-sizes independently.
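To make the cardinality claim in the second benefit concrete, a back-of-envelope sketch with invented numbers (the QPS and model count are hypothetical, not Pinterest's):

```python
# Invented numbers, for illustration of the cardinality argument only.
root_qps   = 10_000   # inbound score requests per second at root (hypothetical)
num_models = 8        # leaf model partitions consulted per request (hypothetical)

# Without a shared root cache, each leaf would fetch its own features:
feature_store_qps_per_leaf_fetch = root_qps * num_models   # 80_000

# With the root fetching the union once and caching it in memory:
feature_store_qps_root_fronted = root_qps                  # 10_000 (less with cache hits)
```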
Structural failure mode¶
The split moves feature data onto the network. Pre-split, features flowed within a single host's memory; post-split, they flow over RPC between root and leaf. Because the root fetches the union of features across relevant models and each leaf model uses only a subset, the fan-out payload contains features that will be discarded post-arrival. Over enough fan-out volume, this creates a feature-fan-out network bottleneck where serving scales on network, not compute.
Pinterest's framing: "the network bandwidth between root and leaf became a performance bottleneck on the online serving path; we had to scale the system based on network usage rather than compute" — network-bound, not compute-bound.
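A rough way to see the inflation; the numbers below are invented to show the structure of the calculation, not taken from the post:

```python
# Invented numbers; only the shape of the arithmetic matters.
candidates        = 500        # candidates scored per request
leaf_partitions   = 8          # leaf partitions fanned out to
union_bytes       = 40_000     # serialized feature union per candidate
avg_used_fraction = 0.35       # share of the union each leaf's model actually reads

sent_bytes = candidates * leaf_partitions * union_bytes    # crosses the root-leaf network
used_bytes = int(sent_bytes * avg_used_fraction)           # actually read by leaf models

print(f"sent per request: {sent_bytes / 1e6:.0f} MB")        # 160 MB
print(f"used per request: {used_bytes / 1e6:.0f} MB")        # 56 MB
print(f"discarded on arrival: {1 - avg_used_fraction:.0%}")  # 65%
```

Because the sent bytes grow with #candidates × #leaf-partitions × union size, serving capacity ends up tracking network bandwidth rather than CPU or GPU utilization.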
Remedies at the root-leaf boundary¶
The 2026-05-01 Pinterest post names two complementary levers:
| Lever | Mechanism | Typical impact |
|---|---|---|
| RPC-layer compression (fbthrift lz4) | squeeze bytes opaquely on the wire | −20% bandwidth, +5% CPU, +5 ms p90 |
| Send What You Use (Feature Trimmer) | trim fan-out payload to each model's exact allowlist | ~50% theoretical; 27–33% root downsize + 65–75% leaf inbound reduction in practice |
The compression lever is structurally modest because it accepts that unused features will be sent. The Send-What-You-Use lever attacks the problem at its root — don't send unused features at all.
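A minimal sketch of the Send-What-You-Use idea: the root trims each outbound payload to the destination model's allowlist before serialization. The allowlist registration, wire format, and names here are illustrative, not the Feature Trimmer implementation.

```python
def trim_payload(union_features: dict[str, object],
                 leaf_allowlist: set[str]) -> dict[str, object]:
    """Keep only the features this leaf's model will actually read."""
    return {name: value for name, value in union_features.items()
            if name in leaf_allowlist}


# At fan-out time the root trims once per (candidate, leaf) pair before the RPC.
union = {"user_embedding": [0.1] * 256, "pin_age_days": 12, "board_topic": "diy"}
ranker_allowlist = {"user_embedding", "pin_age_days"}
payload = trim_payload(union, ranker_allowlist)   # "board_topic" never hits the wire
```

Compression can still be layered on top of the trimmed payload; per the source, the two levers are complementary rather than competing.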
Structural siblings¶
- Scatter-gather query (concepts/scatter-gather-query) — same two-tier shape at the search / query altitude; the root/aggregator fans out to shards, gathers partial results, merges.
- Netflix's ML serving platform (systems/netflix-model-serving-platform) — a sibling ML-serving platform with a different split: Lightbulb does routing resolution (out of payload path) and Envoy forwards requests to the selected backend. Netflix's split is about routing (which backend handles this Objective) rather than feature fan-out (every leaf scores a subset of features). Both arise from pressures of multi-model, multi-surface production inference; they compose at different altitudes.
- Prior single-host ML serving — before root-leaf, Pinterest had "the same GPU host handled both feature fetching/preprocessing and local model inference." The root-leaf split's costs (network fan-out) are the price of the benefits (model-onboarding simplicity + shared feature cache + CPU/GPU tier right-sizing).
Seen in¶
- 2026-05-01 Pinterest — Optimizing ML Workload Network Efficiency (Part I): Feature Trimmer (sources/2026-05-01-pinterest-optimizing-ml-workload-network-efficiency-part-i-feature-trimmer) — canonical; names both benefits (3) and costs (network bottleneck); quantifies Pinterest's scaling-on-network pain and the Feature Trimmer + lz4 remedies.
Caveats¶
- Distinguished from scatter-gather query altitude — same shape, different workload: ML feature fan-out carries feature payloads per candidate; scatter-gather query carries query predicates + receives row results. Network cost characteristics differ substantially.
- Feature-fan-out payload shape makes this structurally costlier than most scatter-gather queries — features per candidate can be large (user embedding sequences, aggregated interaction histories), and the fan-out cardinality is #candidates × #leaf-partitions, not just #shards.
- First canonical wiki instance is Pinterest's; expect analogous shapes at Meta (Ads ranking), Google (recsys), TikTok (ByteDance recsys) if/when those posts disclose.
Related¶
- systems/pinterest-ml-serving-root-leaf — Pinterest's canonical instance.
- systems/pinterest-feature-trimmer — the payload-trimming remedy.
- concepts/feature-fanout-network-bottleneck — the failure mode this architecture introduces.
- concepts/send-what-you-use — the structural remedy.
- concepts/network-bound-vs-compute-bound — the scaling-bottleneck framing.
- concepts/scatter-gather-query — sibling shape at search / query altitude.
- systems/netflix-model-serving-platform — sibling ML-serving altitude, different decomposition axis.