CONCEPT Cited by 1 source

Root-leaf ML serving architecture

Definition

Root-leaf ML serving architecture is an online-inference system shape in which a single root tier handles feature retrieval and preprocessing on CPU, and a fleet of leaf partitions handles model inference on GPU. Each inbound score request arrives at the root, which fetches the union of features needed across the relevant models from the feature store, caches them in memory, and fans out per-candidate score requests over RPC to each leaf partition responsible for scoring that candidate. Results are gathered back at the root and returned to the client.
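The fetch-union / cache / fan-out / gather flow can be sketched in miniature. Everything here is an illustrative stand-in, not Pinterest's actual API: the feature store is a plain dict, `score_on_leaf` simulates the leaf RPC with a local dot-product, and the thread pool stands in for the RPC fan-out.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_features(feature_store, keys):
    """Fetch the union of features needed across all relevant models."""
    return {k: feature_store[k] for k in keys}

def score_on_leaf(leaf, candidate, features):
    """Stand-in for an RPC to one leaf partition: a local dot-product."""
    return sum(features.get(f, 0.0) * w for f, w in leaf["weights"].items())

def serve(request, feature_store, leaves):
    # 1. Root fetches the union of features once per request and caches it.
    union_keys = set().union(*(leaf["features"] for leaf in leaves))
    cache = fetch_features(feature_store, union_keys)

    # 2. Fan out per-candidate score requests to each leaf partition.
    with ThreadPoolExecutor() as pool:
        futures = {
            (leaf["name"], cand): pool.submit(score_on_leaf, leaf, cand, cache)
            for leaf in leaves
            for cand in request["candidates"]
        }
        # 3. Gather results back at the root and return them to the client.
        return {key: fut.result() for key, fut in futures.items()}
```

Note that step 2 ships the whole cached union to every leaf in this sketch, which is exactly the payload-waste failure mode discussed below.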

Canonicalised on the wiki from Pinterest's 2026-05-01 Feature Trimmer post. (Source: sources/2026-05-01-pinterest-optimizing-ml-workload-network-efficiency-part-i-feature-trimmer)

Why the split

Pinterest's three named benefits for splitting feature fetch/preprocessing from inference:

  1. Simplified model onboarding — new models get new leaf partitions, transparent to the root and to clients. Model release cadence decouples from client / root release cadence.
  2. Reduced feature-store QPS — the root holds a shared in-memory cache that fronts feature reads for all leaf partitions, so the feature store sees requests at root-cardinality, not root × models cardinality.
  3. CPU / GPU resource separation — CPU-bound work (feature I/O + feature preprocessing) runs on CPU instance types; GPU-bound inference runs on GPU instance types. Each tier right-sizes independently.

Structural failure mode

The split moves feature data onto the network. Pre-split, features flowed within a single host's memory; post-split, they flow over RPC between root and leaf. Because the root fetches the union of features across relevant models and each leaf model uses only a subset, the fan-out payload contains features that will be discarded post-arrival. Over enough fan-out volume, this creates a feature-fan-out network bottleneck where serving scales on network, not compute.
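The waste from shipping the union to every leaf can be estimated with back-of-the-envelope arithmetic. Feature counts and the per-feature byte size below are hypothetical:

```python
def fanout_waste(union_features, leaf_feature_sets, bytes_per_feature=64):
    """Estimate bytes sent vs. bytes used per request (illustrative sizes).

    The root sends the full feature union to every leaf, but each leaf
    model reads only its own subset; the difference is discarded on
    arrival and is pure network overhead.
    """
    sent = len(union_features) * bytes_per_feature * len(leaf_feature_sets)
    used = sum(len(leaf & union_features) for leaf in leaf_feature_sets) * bytes_per_feature
    return sent, used, 1 - used / sent
```

With a 10-feature union fanned out to two leaves that use 3 and 2 features respectively, 75% of the bytes on the wire are discarded on arrival.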

Pinterest's framing: "the network bandwidth between root and leaf became a performance bottleneck on the online serving path; we had to scale the system based on network usage rather than compute". The system became network-bound, not compute-bound.

Remedies at the root-leaf boundary

The 2026-05-01 Pinterest post names two complementary levers:

| Lever | Mechanism | Typical impact |
| --- | --- | --- |
| RPC-layer compression (fbthrift lz4) | squeeze bytes opaquely on the wire | −20% bandwidth, +5% CPU, +5 ms p90 |
| Send What You Use (Feature Trimmer) | trim fan-out payload to each model's exact allowlist | ~50% theoretical; 27–33% root downsize + 65–75% leaf inbound reduction in practice |

The compression lever is structurally modest because it accepts that unused features will be sent. The Send-What-You-Use lever attacks the problem at its root — don't send unused features at all.
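At its core, the Send-What-You-Use lever reduces to filtering the union payload against a per-model allowlist before serialization. A minimal sketch; the real Feature Trimmer operates inside Pinterest's serving stack, not as a dict filter:

```python
def trim_payload(union_payload, model_allowlist):
    """Send What You Use: keep only the features on this model's exact
    allowlist before the payload goes on the wire, so unused features
    are never sent at all (vs. compression, which sends them smaller).
    """
    return {k: v for k, v in union_payload.items() if k in model_allowlist}
```

Compression would shrink all three features in the example below by a constant factor; trimming removes the unused one entirely, which is why its impact compounds with fan-out volume.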

Structural siblings

  • Scatter-gather query (concepts/scatter-gather-query) — same two-tier shape at the search / query altitude; the root/aggregator fans out to shards, gathers partial results, merges.
  • Netflix's ML serving platform (systems/netflix-model-serving-platform) — a sibling ML-serving platform with a different split: Lightbulb does routing resolution (out of payload path) and Envoy forwards requests to the selected backend. Netflix's split is about routing (which backend handles this Objective) rather than feature fan-out (every leaf scores a subset of features). Both arise from pressures of multi-model, multi-surface production inference; they compose at different altitudes.
  • Prior single-host ML serving — before root-leaf, Pinterest had "the same GPU host handled both feature fetching/preprocessing and local model inference." The root-leaf split's costs (network fan-out) are the price of the benefits (model-onboarding simplicity + shared feature cache + CPU/GPU tier right-sizing).

Caveats

  • Distinguished from scatter-gather query altitude — same shape, different workload: ML feature fan-out carries feature payloads per candidate; scatter-gather query carries query predicates + receives row results. Network cost characteristics differ substantially.
  • Feature-fan-out payload shape makes this structurally costlier than most scatter-gather queries — features per candidate can be large (user embedding sequences, aggregated interaction histories), and the fan-out cardinality is #candidates × #leaf-partitions, not just #shards.
  • First canonical wiki instance is Pinterest's; expect analogous shapes at Meta (Ads ranking), Google (recsys), TikTok (ByteDance recsys) if/when those posts disclose.
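The cardinality difference in the second caveat, as arithmetic (all counts hypothetical):

```python
def ml_fanout_rpcs(num_candidates, num_leaf_partitions):
    """Root-leaf ML serving: one per-candidate score RPC to each relevant
    leaf partition, each carrying a feature payload."""
    return num_candidates * num_leaf_partitions

def scatter_gather_rpcs(num_shards):
    """Scatter-gather query: one sub-query per shard, carrying only
    query predicates, regardless of candidate count."""
    return num_shards
```

Scoring 500 candidates against 8 leaf partitions yields 4,000 feature-bearing RPCs per request, where a scatter-gather query over 8 shards issues 8.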
Last updated · 445 distilled / 1,275 read