Pinterest — Optimizing ML Workload Network Efficiency (Part I): Feature Trimmer¶
Summary¶
Guangtong Bai, Shantam Shorewala, Chi Zhang, Neha Upadhyay, and Haoyang Li (Pinterest Product ML Infrastructure + AI Platform) document Feature Trimmer, the system that unblocked fleet downsizing on Pinterest's online-ML-serving root-leaf architecture by eliminating unused features from the per-request fan-out payload between CPU-based root hosts and GPU-based leaf partitions. The post frames the architectural bet: the root-leaf decoupling enables simplified model onboarding + a shared in-memory feature cache + CPU/GPU resource separation, but it also moves ML-feature traffic onto the network, and Pinterest was "[scaling] the system based on network usage rather than compute" — network-bound, not compute-bound. Two remedies are described in sequence. (1) fbthrift-level lz4 compression cut root→leaf bandwidth by 20% at the cost of +5% CPU and +5 ms (~10%) p90 latency — a solid but modest win that didn't change the underlying "we're shipping too much unused data" shape. (2) Send What You Use via Feature Trimmer — the root trims each fan-out request down to exactly the feature allowlist that the destination leaf model actually needs, keyed on model name + version and sourced from the PyTorch model signature (module_info.json). Production wins: Ads root cluster downsized 27%, Homefeed root cluster downsized 33%, Ads leaf peak network dropped from 1000–1200 MBPS to <200 MBPS, Homefeed leaf saw 65–75% inbound-network reduction, Ads AdMixer client p90 dropped from >90 ms to <80 ms, Related Pins p99 dropped 25–30%, Search and Notification egress dropped 45% and 65% respectively (with ≥30% cost reduction from instance-type move), and $4M+/year in infrastructure savings overall (including $0.98M from Search/Notification alone) plus a +0.17% revenue lift from fewer timeout failures. The bottleneck was "effectively shifted from network to CPU cycles on the root cluster." Part I of a multi-part series; Part II will cover client→root feature compression.
Key takeaways¶
- The root-leaf architecture made network the scaling axis, not compute. Pinterest's explicit framing: "the network bandwidth between root and leaf became a performance bottleneck on the online serving path; we had to scale the system based on network usage rather than compute." On leaf partitions, peak network usage was significantly higher than peak GPU SM activity — "the network bottleneck prevented us from fully utilizing the available GPU compute power." On root, Pinterest had to use the network-optimized AWS m6in instance type (~20% more expensive than standard m6i) to meet latency SLA. Canonical instance of concepts/network-bound-vs-compute-bound at ML-serving altitude.
- The first remedy — fbthrift lz4 compression — is a modest 20% lever that doesn't change the underlying shape. "After a few quick tests, we enabled lz4 compression in fbthrift (the RPC framework used by root and leaf) for root-leaf traffic. That reduced 20% root-leaf network usage, at the cost of 5% CPU usage increase and 5ms (~10%) p90 latency increase." Explicit acknowledgement: "Compression was a solid early win, but it didn't change the underlying problem: we were still shipping too much unused data." Canonical concepts/compression-codec-tradeoff data point: lz4 chosen for low CPU and low latency over higher-ratio codecs.
- Send What You Use is the bigger lever — potentially ~50% network reduction (see the trimming sketch after this list). Motivation: "the root fetches the union of features needed across models (per candidate Pin), stores them in an efficient in-memory cache, and then fans out the full feature set to each leaf model. Each model converts and uses only the features it needs; the rest are effectively discarded before inference." In the prior single-host architecture this was acceptable (only a memory cost); in the root-leaf architecture the unused features become network cost. Explicit analogy: "similar to C++'s 'include what you use' header management tool removing unnecessary #include's." Canonicalises the primitive at the RPC-payload altitude.
- The model signature is the source of truth — module_info.json exported alongside the TorchScript .pt archive. "A crucial convention is that a model's signature remains unchanged across different versions. If a signature modification is necessary — for instance, to introduce a new input feature — a new model is forked from the original." Canonical concepts/model-signature-as-source-of-truth disclosure: the model signature is treated as an API contract, with a version-stability invariant and a fork-on-incompatibility rule. The same artefact is consumed by (a) the leaf's feature converter, which transforms internal-format features into PyTorch tensors, and (b) the root's Feature Trimmer allowlist.
- The signature artefact rides the existing model-deployment pipeline — no new control plane. Pinterest's stated approach: "treat the model signature as the source of truth … publish signatures as lightweight artifacts that can be consumed by deployment pipelines … aggregate per-model signatures into a per-bundle artifact that is deployed to the root alongside existing root configs … use the same staged delivery semantics as model rollout (canary, automated canary analysis, prod, rollback), so trimmer config changes ride the same operational rails as everything else." Canonical patterns/artifact-rides-model-deploy-pipeline instance — the trimmer config does not live in a separate config system with its own canary flow; it is welded onto the existing per-bundle model-deploy artefact. Concretely: root configs deploy to Canary first, then model configs deploy to Canary, then Automated Canary Analysis runs, then root configs go to Production, then model configs go to Production — root configs always lead so that when a new leaf model version arrives, a matching allowlist is already present on root.
- Allowlist beats blocklist because feature sets grow and shrink unpredictably. "This allowlist approach, compared to a blocklist where we keep features not in the list, does not carry the burden of tracking all the features that might be in development or deprecated. Given the evolving nature of ML models and volume of experiments at Pinterest, the blocklist is significantly larger for any given model and it is probable that it will grow faster than the allowlist in the future." Canonical patterns/feature-allowlist-over-blocklist articulation: ML-feature universes are monotonically growing and high-churn, so the blocklist is always the larger and staler set. The allowlist is defined by what a model trained on; the blocklist would have to track every experimental and deprecated feature ever introduced.
- The consolidated in-memory map uses model name + version as nested keys, with a file-watcher-driven atomic swap. "A feature trimmer module is initialized on each root host when it comes online. This module maintains a consolidated, in-memory mapping from models to their versioned feature allowlist." Refresh mechanism: each bundle's module_info.json has a file watcher; any content refresh reloads that bundle's map; the module then "scans and merges all independent maps, creates a new consolidated map, atomically replaces the current active consolidated map with the new one" under a read-write lock (shared lock for reads, unique lock for the swap). Canonical patterns/file-watcher-atomic-swap-consolidated-map instance — independent per-bundle maps + whole-map rebuild + atomic swap, so a corrupt bundle only affects its own slice.
- Versioned lookup with latest-version fallback — based on the invariant that signatures are version-stable. "Each scoring request sent to the root cluster must include the model name and optionally, the model version. If the version is omitted, it defaults to the latest version. The feature trimmer parses these fields to determine the version-specific feature allowlist for the requested model." Three branches: (a) no allowlist for the model → untrimmed passthrough (fail-safe); (b) model + version found → version-specific allowlist used; (c) model found but version omitted or unknown → fall back to the latest version. The fallback works because signatures are stable across versions (a forked model is treated as a new model). This avoids keeping multiple versions in memory during rolling deploys — a capacity-vs-correctness trade-off explicitly called out.
- Trimmer is on the critical failure path → three safeguards. (i) Init-failure guardrail: parsing failures alert on-call but don't block host launch — "this decision preserves our ability to respond to capacity-related incidents, especially if a deeper issue is affecting the Feature Trimmer module itself." (ii) Per-bundle isolation: because each bundle's signature lives in its own map, one corrupt bundle falls back to its previous in-memory version while the others keep updating. (iii) Backwards-compatible rollout: root configs ship allowlists for both current and pending versions during rollout, so versioned requests hitting between root-configs-deploy and leaf-model-deploy still match an allowlist. Canonical patterns/skip-on-missing-allowlist-for-safety articulation: "if a versioned request arrives without a matching allowlist, we skip trimming to avoid stale configs."
- Quantified wins and the shifted bottleneck. Ads root cluster: network dropped from a peak of 4 GBPS to <1.5 GBPS even after downsizing by 27%. Ads leaf partitions: peak usage dropped from 1000–1200 MBPS to <200 MBPS for all clusters; the reduction "allowed us to tune the cluster size and batch size config to improve the GPU utilization" (roughly 5% of total GPU capacity). Ads AdMixer client p90: peaked >90 ms pre-launch, <80 ms peak post-launch. Homefeed root outbound: dropped from ~1.2–2.1 GB/s to ~0.45–1.1 GB/s; root fleet downsized 33%. Homefeed leaf inbound: 65–75% reduction across GPU leaf clusters. Related Pins model p99: ~130–180 ms → ~95–125 ms (a 25–30% drop). Search and Notification egress: 45% and 65% drops respectively; because those clusters had been network-bound, the drop let them move to cheaper, non-network-optimized instance types for a ≥30% cost reduction — $0.98M annual from rightsizing. Overall: $4M+/year saved, headroom for bigger models, and a +0.17% revenue lift from fewer timeouts. Pinterest's framing of the end state: "It effectively shifted the bottleneck from network to CPU cycles on the root cluster." — a textbook bottleneck relocation.
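To make the Send What You Use primitive concrete, here is a minimal Python sketch of root-side trimming against a per-model allowlist. The payload shape, feature names, and function name are illustrative assumptions rather than Pinterest's actual schema or code; the passthrough on a missing allowlist mirrors the fail-safe behaviour described above.

```python
# Minimal sketch of the "Send What You Use" primitive: the root holds the
# union of features fetched for a candidate, but each fan-out request carries
# only the destination model's allowlist. Payload shape, feature names, and
# the function name are illustrative assumptions, not Pinterest's schema.

def trim_features(feature_union, allowlist):
    """Keep only the features the destination leaf model declared in its
    signature. A missing allowlist means 'skip trimming' (fail-safe passthrough)."""
    if allowlist is None:
        return feature_union
    return {name: value for name, value in feature_union.items() if name in allowlist}


# The root fetched the union of features needed across all models on the leaf...
feature_union = {"user_emb": [0.1, 0.2], "pin_ctr_7d": 0.03, "query_tokens": ["shoes"]}
# ...but this model's signature (module_info.json input_names) lists only two of them.
allowlist = {"user_emb", "pin_ctr_7d"}

payload = trim_features(feature_union, allowlist)
assert payload == {"user_emb": [0.1, 0.2], "pin_ctr_7d": 0.03}  # query_tokens never hits the wire
```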
Architecture¶
Root-leaf ML serving architecture¶
The substrate (Source: body, "Background"):
Client Service ───► Online ML Serving System ───► score responses
│
├── Root (CPU, m6in / m6i)
│ - fetch features from feature store
│ - preprocess
│ - fan out per-candidate score requests to leaves
│ - **Feature Trimmer module** (per-host, per-bundle maps)
│
└── Leaf partitions (GPU)
- one partition per related group of models
- each partition: production model + experimental variants
- feature converter (from model_signature)
- model inference
Named benefits of the root-leaf split:
- Simplified model onboarding — new models get new leaf partitions without touching root or clients.
- Reduced feature-store QPS — all leaf partitions share the root's in-memory feature cache.
- Optimized resource utilization — CPU (feature fetch / preprocessing) separated from GPU (inference).
Named cost:
- Network bandwidth between root and leaf is now the scaling axis — Pinterest had to scale on network, not compute, and had to use network-optimized instance types on root to meet SLA.
Feature Trimmer flow¶
[Client score request: model_name, model_version?]
│
▼
[Root] fetch feature union, preprocess, cache
│
├── consult Feature Trimmer
│ ├── lookup consolidated_map[model_name][model_version]
│ │ ├── hit → use version-specific allowlist
│ │ ├── model found, version missing → latest-version fallback
│ │ └── no allowlist → skip trimming (pass through untrimmed)
│ └── return feature allowlist
│
▼
[Root] trim fan-out payload to allowlist, compress (lz4), send over fbthrift
│
▼
[Leaf partition] decompress, decode, feature_converter(features) → tensor, infer
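A minimal Python sketch of the lookup branches in the flow above. The consolidated-map shape and the function name are assumptions for illustration, not Pinterest's implementation; note that all versions of a model carry the same allowlist, per the version-stability convention.

```python
# Sketch of the allowlist lookup in the flow above, covering the three
# branches: exact version hit, latest-version fallback, and untrimmed
# passthrough. Map shape and function name are assumptions.

def lookup_allowlist(consolidated_map, model_name, model_version=None):
    versions = consolidated_map.get(model_name)
    if not versions:
        return None  # (a) no allowlist for this model -> caller skips trimming

    if model_version is not None and model_version in versions:
        return versions[model_version]  # (b) version-specific allowlist

    # (c) version omitted or not (yet) known on this root host: fall back to
    # the latest version -- safe because signatures are version-stable and a
    # signature change forces a new model name.
    latest_version = max(versions, key=int)
    return versions[latest_version]


consolidated_map = {
    "model_A": {"1": {"user_emb", "pin_ctr_7d"}, "2": {"user_emb", "pin_ctr_7d"}},
}
assert lookup_allowlist(consolidated_map, "model_A", "2") == {"user_emb", "pin_ctr_7d"}
assert lookup_allowlist(consolidated_map, "model_A", "3") == {"user_emb", "pin_ctr_7d"}  # fallback
assert lookup_allowlist(consolidated_map, "model_Z") is None                             # passthrough
```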
Model-deploy pipeline integration¶
[Model training]
│
├── exports model.pt (TorchScript)
├── exports archive/extra/module_info.json ← signature artefact
│ { input_names:[...], output_names:[...] }
│
[Bundle build]
│
├── iterates over model versions in the bundle
├── for each: if module_info.json exists, parse + record
│ else: log warning, skip (don't fail the build)
└── produces per-bundle module_info mapping:
{
"model_A": [
{ "version":"1", "input_names":[...], "output_names":[...] },
{ "version":"2", "input_names":[...], "output_names":[...] }
],
"model_B": [ { "version":"7", ... } ]
}
│
[Deploy]
├── 1. root configs → Canary (leads so allowlist is present in advance)
├── 2. model configs → Canary
├── 3. Automated Canary Analysis (ACA)
├── 4. root configs → Production
└── 5. model configs → Production
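A hedged Python sketch of the bundle-build step in the pipeline above: iterate over the model versions in a bundle, parse each version's module_info.json where present, warn and skip where absent without failing the build, and emit the per-bundle mapping that deploys alongside the existing root configs. The directory layout (`<bundle>/<model>/<version>/archive/extra/module_info.json`) is assumed for illustration; only the artefact path suffix comes from the diagram.

```python
# Sketch of the bundle-build aggregation described above. Directory layout
# and function name are assumptions; the warn-and-skip behaviour and the
# per-bundle output shape follow the diagram.
import json
import logging
from pathlib import Path

log = logging.getLogger("bundle_build")

def build_bundle_signature_map(bundle_dir):
    bundle_map = {}
    for version_dir in sorted(Path(bundle_dir).glob("*/*")):  # e.g. model_A/1, model_A/2
        info_path = version_dir / "archive" / "extra" / "module_info.json"
        if not info_path.is_file():
            log.warning("no module_info.json under %s; skipping, build continues", version_dir)
            continue
        info = json.loads(info_path.read_text())
        model_name, version = version_dir.parent.name, version_dir.name
        bundle_map.setdefault(model_name, []).append({
            "version": version,
            "input_names": info["input_names"],
            "output_names": info["output_names"],
        })
    return bundle_map
```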
On-host trimmer internals¶
Feature Trimmer Module (per root host, in-process)
│
├── independent_maps[bundle] (loaded from each bundle's module_info.json)
│ ▲
│ └── file watcher per module_info.json → triggers reload for that bundle
│
├── consolidated_map { model_A: { version_N: allowlist, version_M: allowlist }, ... }
│ ▲
│ └── rebuilt from all independent_maps, atomically swapped in
│
└── RW lock
├── shared lock on reads of consolidated_map + independent_maps
└── unique lock on atomic swap of consolidated_map
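A Python approximation of the module above: independent per-bundle maps, a whole-map rebuild on any bundle refresh, and an atomic swap of the consolidated map. The post describes a shared/unique read-write lock; this sketch stands in a mutex for the swap and snapshot references for reads. File-watcher wiring and all names are assumptions, not Pinterest's code.

```python
# Sketch of the on-host trimmer module: per-bundle maps are updated
# independently, then the consolidated map is rebuilt and swapped atomically.
import threading

class FeatureTrimmerModule:
    def __init__(self):
        self._independent_maps = {}   # bundle -> {model: {version: allowlist}}
        self._consolidated_map = {}   # model -> {version: allowlist}
        self._swap_lock = threading.Lock()

    def on_bundle_refresh(self, bundle_name, bundle_map):
        """Invoked by the per-bundle file watcher when its module_info.json
        changes. A corrupt bundle never reaches this call, so its previous
        in-memory map stays in effect while other bundles keep updating."""
        self._independent_maps[bundle_name] = bundle_map
        self._rebuild_and_swap()

    def _rebuild_and_swap(self):
        # Rebuild the whole consolidated map off to the side...
        new_map = {}
        for bundle_map in self._independent_maps.values():
            for model, versions in bundle_map.items():
                new_map.setdefault(model, {}).update(versions)
        # ...then replace the active map in one step ("unique lock" on the swap).
        with self._swap_lock:
            self._consolidated_map = new_map

    def allowlist_for(self, model, version):
        # Readers take a snapshot reference (the "shared lock" path).
        snapshot = self._consolidated_map
        return snapshot.get(model, {}).get(version)
```

Rebuilding the whole map rather than patching it in place means readers only ever see a fully consistent map, which is what the atomic-swap pattern buys at the cost of a brief whole-map rebuild.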
Operational numbers¶
- fbthrift lz4 compression lever: −20% root-leaf bandwidth, +5% CPU, +5 ms (~10%) p90 latency.
- Ads root network: 4 GBPS peak → <1.5 GBPS peak; cluster fleet −27%.
- Ads leaf peak network: 1000–1200 MBPS → <200 MBPS across clusters.
- Ads leaf GPU: ~5% of total GPU capacity unlocked via batching-config tuning post-trim.
- Ads AdMixer client p90: >90 ms → <80 ms peak.
- Homefeed root outbound: ~1.2–2.1 GB/s → ~0.45–1.1 GB/s; cluster fleet −33%.
- Homefeed leaf inbound: −65–75% across GPU leaf clusters; rightsizing in progress at publication time.
- Related Pins model p99: ~130–180 ms (frequent >200 ms spikes) → ~95–125 ms (−25–30%).
- Search egress: −45%; Notification egress: −65%; both clusters moved to non-network-optimized instance types; ≥30% cost reduction on both.
- Infrastructure savings: $4M+ / year, including $0.98M / year from Search + Notification right-sizing alone.
- Revenue: +0.17% from fewer timeout-induced failures on Ads.
- Bottleneck shift: from network-bound to CPU-bound on root cluster.
- Next lever: client→root feature compression (subject of Part II).
Caveats¶
- Part I of a series — client→root feature traffic (likely a bigger absolute payload than root→leaf because it carries per-candidate features pre-fan-out) is explicitly deferred to Part II. The Feature Trimmer addresses root→leaf only.
- Model signature version-stability invariant is load-bearing. The whole fallback-to-latest-version design relies on "a model's signature remains unchanged across different versions" — enforced socially by the convention that a signature change forces a new model name, not by tooling disclosed in the post. If this convention breaks, the fallback path silently sends wrong features.
- Per-bundle failure-isolation story is partial. The post says "if a model bundle's file gets corrupted on disk during an update, the feature trimmer keeps using the old, in-memory version for that bundle" — this means the trimmer can run on a silently-stale allowlist when a bundle update fails in transit. Mitigated by the on-call alert at init time but not at runtime.
- The 27% / 33% root-cluster downsizing numbers are Ads and Homefeed only. Other use cases (Search, Notification) saved via instance-type migration rather than raw fleet shrinkage. The $4M aggregate is across all use cases.
- No disclosure of: (a) the actual feature-trim ratio distribution (how many features are trimmed per model — the "~50%" estimate in the motivation is theoretical, not measured), (b) the Feature Trimmer's own CPU/latency cost on root (the post focuses on net wins, not trimmer overhead), (c) any trimmer-specific observability dashboards beyond the init-time on-call alert.
- AWS instance types disclosed: m6in (network-optimized, pre-launch root) and m6i (standard, post-launch root on Ads). Search/Notification post-launch instance type unspecified.
- fbthrift is the Facebook/Meta fork of Apache Thrift — not explicitly branded as such in the post, but that is the canonical identity.
- GPU SM activity is named as the compute-side utilization metric the bottleneck was preventing Pinterest from fully consuming — not raw GPU occupancy, specifically Streaming-Multiprocessor activity.
Source¶
- Original: https://medium.com/pinterest-engineering/optimizing-ml-workload-network-efficiency-part-i-feature-trimmer-ae20beb08d69?source=rss----4c5a5f6279b6---4
- Raw markdown: raw/pinterest/2026-05-01-optimizing-ml-workload-network-efficiency-part-i-feature-tri-f7a26d34.md
- Referenced tooling: include what you use (IWYU)
Related¶
- systems/pinterest-feature-trimmer — the headline system canonicalised by this post.
- systems/pinterest-ml-serving-root-leaf — the substrate architecture.
- systems/fbthrift — the RPC framework carrying root-leaf traffic.
- systems/pytorch — substrate for the .pt archive + TorchScript + module_info.json.
- concepts/root-leaf-ml-serving-architecture — the architectural primitive.
- concepts/feature-fanout-network-bottleneck — the problem.
- concepts/send-what-you-use — the overarching principle.
- concepts/model-signature-as-source-of-truth — the design invariant enabling trimming.
- concepts/network-bound-vs-compute-bound — the scaling-bottleneck framing.
- concepts/compression-codec-tradeoff — the lz4 lever data point.
- patterns/feature-allowlist-over-blocklist — why allowlist beats blocklist for ML feature sets.
- patterns/artifact-rides-model-deploy-pipeline — reuse of staged model-deploy for trimmer config.
- patterns/file-watcher-atomic-swap-consolidated-map — per-bundle reload with atomic consolidated-map swap.
- patterns/skip-on-missing-allowlist-for-safety — untrimmed passthrough on lookup miss.
- companies/pinterest — the operator.
- sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces — sibling Ads-ML post; this post is the infrastructure-side complement to that model-side post.
- sources/2026-05-01-netflix-state-of-routing-in-model-serving — Netflix's same-day canonicalisation of its ML-serving routing layer; same architectural altitude (client-facing ML serving), different axis (routing decoupling vs payload trimming).