PINTEREST Tier 2

Pinterest — Unifying Ads Engagement Modeling Across Pinterest Surfaces

Duna Zhan, Qifei Shen, Matt Meng, Jiacheng Li, and Hongda Shen (Pinterest Ads ML) describe the consolidation of Pinterest's ads engagement CTR-prediction stack from three independent per-surface models (Home Feed, Search, Related Pins) into a single unified model with surface-specific calibration and tower trees. The post is an architectural retrospective + rollout playbook: why fragmentation hurt, how Pinterest unified step-by-step, and what serving-efficiency work was needed so the bigger unified model didn't blow the latency budget.

Summary

Before the project, Pinterest ran three production engagement models — one per surface (Home Feed / HF, Search / SR, Related Pins / RP). They were "initially derived from a similar design" but diverged over time in user sequence modeling, feature-crossing modules, feature representations, and training configurations. This fragmentation caused three concrete pains: low iteration velocity (platform-wide improvements required duplicating work across codepaths; hyperparameters tuned for one surface didn't transfer), redundant training cost (each idea had to be revalidated three times), and high maintenance burden (three materially different systems to operate, debug, and evolve).

Three unification principles guided the work: (1) start simple — merge the strongest existing components across surfaces as a baseline; (2) iterate incrementally — introduce surface-aware modeling (multi-task heads, surface-specific exports) only after the baseline demonstrates clear value; (3) maintain operational safety — safe rollout, monitoring, fast rollback at every step.

Because RP had a substantially higher compute cost profile, Pinterest sequenced the unification by cost: first unify HF+SR (similar CUDA throughput characteristics), then expand to RP after throughput and efficiency work stabilised. The baseline unified model — unioned features, merged modules, and combined training datasets across the three surfaces — delivered "promising offline improvements" but also materially increased training and serving cost. Architecture refinement followed.

The final unified HF+SR architecture combines MMoE (Multi-gate Mixture of Experts) and long user sequence Transformers — two elements that "did not produce consistent gains" when applied in isolation to a single surface, but delivered stronger improvements with a reasonable cost profile when integrated into a unified model trained on combined HF+SR features and multi-surface data. Structurally the final architecture is a single unified model that serves multiple surfaces, plus surface-specific tower trees (and surface-specific modules within those tower trees) that "handle only that surface's traffic, avoiding unnecessary compute cost from modules that don't benefit other surfaces." At the time of writing the unified model contains HF and SR tower trees; RP is the next milestone.
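The serving-time shape (shared experts, per-surface gates, and a tower that runs only for its own surface's traffic) can be sketched in numpy. All dimensions, weights, and the expert count here are hypothetical; the post discloses no topology:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the post discloses no dimensions or expert counts.
D, N_EXPERTS, HIDDEN = 16, 4, 8
experts = [rng.normal(size=(D, HIDDEN)) for _ in range(N_EXPERTS)]  # shared trunk
gates = {s: rng.normal(size=(D, N_EXPERTS)) for s in ("HF", "SR")}  # per-surface gates
towers = {s: rng.normal(size=(HIDDEN, 1)) for s in ("HF", "SR")}    # surface tower trees

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predict_ctr(x, surface):
    """Shared experts, a surface-specific gate, and only that surface's tower."""
    expert_out = np.stack([x @ w for w in experts], axis=1)  # (B, E, H)
    g = softmax(x @ gates[surface])                          # (B, E) gate weights
    mixed = np.einsum("be,beh->bh", g, expert_out)           # gated expert mixture
    logit = mixed @ towers[surface]                          # surface tower only
    return 1.0 / (1.0 + np.exp(-logit))                      # CTR in (0, 1)

print(predict_ctr(rng.normal(size=(3, D)), "HF").shape)  # (3, 1)
```

The structural point is in the routing: experts and gates are evaluated for every request, but the SR tower never runs on HF traffic, which is what keeps surface specialisation from taxing other surfaces.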

Two surface-aware refinements were added on top of the HF+SR baseline:

  1. Surface-specific calibration. A single global calibration layer is suboptimal because it "implicitly mixes traffic distributions across surfaces." Pinterest introduced a view-type-specific calibration layer — HF and SR calibrated separately — which improved online performance over shared calibration.
  2. Multi-task learning + surface-specific checkpoint exports. A single shared architecture limited flexibility for surface-specific features and modules. Pinterest added multi-task heads + exported separate surface-specific checkpoints so each surface can adopt the most appropriate architecture "while still benefiting from shared representation learning" — the foundation for continued per-surface iteration without forking codebases.
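A minimal sketch of the first refinement, assuming Platt scaling with hypothetical per-surface parameters (the post doesn't disclose the calibration layer's actual form); the point is only that each view type gets its own calibration parameters instead of one global layer fitted on mixed traffic:

```python
import math

# Hypothetical per-surface Platt parameters (a, b), fitted offline per view type.
CALIB = {"HF": (1.3, -0.2), "SR": (0.9, 0.1)}

def calibrate(raw_score, surface):
    """Apply the surface's own calibration instead of a single global layer."""
    a, b = CALIB[surface]
    logit = math.log(raw_score / (1.0 - raw_score))
    return 1.0 / (1.0 + math.exp(-(a * logit + b)))
```

The same raw score calibrates to different probabilities on HF and SR, which is exactly the behaviour a single shared layer cannot express.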

Three efficiency optimisations were required to make the unified model serve at cost:

  • DCNv2 projection layer before Transformer output crossing. Project the Transformer outputs into a smaller representation before downstream crossing and tower tree layers — "reduced serving latency while preserving signal."
  • Fused kernel embedding + TF32. Fused kernels improved inference latency; TF32 accelerated training.
  • Request-level user-embedding broadcasting. On serving, "instead of repeating heavy user embedding lookups for every candidate/request in a batch, we fetch embeddings once per unique user and then broadcast them back to the original request layout, keeping model inputs and outputs unchanged." Trade-off: an upper bound on the number of unique users per batch — if exceeded, the request can fail — so Pinterest uses a tested unique-user number to keep the system reliable.
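The broadcasting optimisation is structurally a dedupe-then-gather. A sketch with a hypothetical `max_unique` bound standing in for Pinterest's tested unique-user number:

```python
import numpy as np

def broadcast_user_embeddings(user_ids, lookup, max_unique=64):
    """Fetch once per unique user, then broadcast back to the request layout."""
    uniq, inverse = np.unique(user_ids, return_inverse=True)
    if len(uniq) > max_unique:  # tested upper bound: fail loudly, don't stall
        raise ValueError(f"{len(uniq)} unique users exceeds batch limit {max_unique}")
    table = np.stack([lookup(u) for u in uniq])  # one heavy fetch per unique user
    return table[inverse]  # same shape and order as the original batch
```

For the typical ads workload (one user scored against N candidate ads in a batch), this collapses N heavy lookups into one while leaving model inputs and outputs byte-for-byte unchanged.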

Evaluation: offline improvements on HF and SR were validated by online A/B experiments ("significant improvements on both online and offline metrics"), with a reference to Pinterest internal data (US, 2025) for the numbers themselves (no concrete deltas disclosed in the post).

Forward direction: unify the RP surface next, with additional efficiency work to meet the tighter RP performance targets.

Key takeaways

  1. Three surface-specific ads engagement models → one unified model with surface-specific tower trees. "Pinterest ads show up across multiple product surfaces, such as the Home Feed, Search, and Related Pins. Each surface has different user intent and different feature availability, but they all rely on the same core capability: predicting how likely a user is to engage with an ad. Before this project, the ads engagement stack relied on three independent production models, one per surface." Canonical wiki statement of the unified multi-surface model pattern — shared representation learning across surfaces, surface-specific divergence only where necessary. The fragmentation cost that motivated consolidation: low iteration velocity, redundant training cost, high maintenance burden — a concrete enumeration of the failure mode that justifies the pattern. (Source: sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces)

  2. Cost sequencing — unify cheap-throughput surfaces first, expensive surfaces later. "Since the cost of Related Pins (RP), Home Feed (HF), and Search (SR) differ substantially, we first unified Home Feed and Search (similar CUDA throughput characteristics) and expanded to Related Pins only after throughput and efficiency work stabilized." Canonical wiki statement of the staged model unification pattern — sequence the unification by shared CUDA throughput budget, not by surface priority or traffic volume. Pairing surfaces with similar cost profiles is what makes the "merge modules + combine datasets" baseline viable at production scale; mismatched cost profiles would force efficiency work upfront. (Source: sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces)

  3. The baseline union-and-merge failed on cost. "As a first step, we built a baseline unified model by: Unioning features across the three surface models, Merging existing modules into a single architecture, and Combining training datasets across surfaces. This baseline delivered promising offline improvements, but it also materially increased training and serving cost. As a result, additional iterations were required before the model was production-ready." Canonical wiki instance of the unified-model baseline latency-regression failure mode — merging strong per-surface components into one architecture wins on offline metrics but loses on serving cost until paired with efficiency work. A useful counter-example to "just use bigger models" — when every ad candidate pays the union-of-features cost, the throughput collapses. (Source: sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces)

  4. MMoE + long user sequences only worked as a combined unified system — neither worked standalone. "We incorporated key architectural elements from each surface such as MMoE [1] and long user sequences [2]. When applied in isolation (e.g., MMoE on HF alone, or long sequence Transformers on SR alone), these changes did not produce consistent gains, or the gain and cost trade-off was not favorable. However, when we integrated these components into a single unified model and expanded training to leverage combined HF+SR features and multi-surface training data, we observed stronger improvements with a more reasonable cost profile." The load-bearing architectural claim of the post. Composition of known-good components into a unified architecture is not a simple sum — the integration is where the win lives, because MMoE benefits from surface-mixed training data (expert routing learns surface-specific behaviour) and long-sequence Transformers benefit from broader feature coverage (more diverse sequences). Each component needs the other's data-distribution shift to clear its cost bar. Canonical wiki statement of the "composition beats isolation" behaviour in multi-surface recommender systems. (Source: sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces)

  5. Surface-specific tower trees inside a unified model. "The diagram below shows the final target architecture: a single unified model that serves three surfaces, while still supporting the development of surface-specific modules (for example, surface-specific tower trees and late fusion with surface-specific modules within those tower trees). During serving, each surface-specific tower tree and its associated modules will handle only that surface's traffic, avoiding unnecessary compute cost from modules that don't benefit other surfaces. As a first step, the unified model currently includes only the HF and SR tower trees." Canonical wiki instance of the surface-specific tower tree pattern — the architectural mechanism that keeps the unified model from degenerating into "every surface pays for every other surface's specialisation". Shared trunk (where generalisation lives) + surface-specific tower trees (where specialisation lives) + surface-aware routing at serving time. Structurally analogous to MMoE gates but applied at the tower granularity (whole subnetworks), not the expert granularity. (Source: sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces)

  6. Surface-specific calibration beats shared calibration. "Since the unified model serves both HF and SR traffic, calibration is critical for CTR prediction. We found that a single global calibration layer could be suboptimal because it implicitly mixes traffic distributions across surfaces. To address this, we introduced a view type specific calibration layer, which calibrates HF and SR traffic separately. Online experiments showed this approach improved performance compared to the original shared calibration." Canonical wiki instance of the surface-specific calibration concept — a view-type-specific calibration layer is a narrow architectural refinement with measurable online wins. Generalises beyond surfaces: any unified CTR model serving heterogeneous traffic distributions (surface, user segment, device type, country) should consider per-segment calibration rather than a single head. The failure mode being avoided: a calibration layer trained on mixed distributions systematically mis-calibrates each sub-distribution. (Source: sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces)

  7. Multi-task learning + surface-specific checkpoint exports preserve per-surface iteration velocity. "Using a single shared architecture for HF and SR CTR prediction limited flexibility and made it harder to iterate on surface-specific features and modules. To restore extensibility, we introduced a multi-task learning design within the unified model and enabled surface-specific checkpoint exports. We exported separate surface checkpoints so each surface could adopt the most appropriate architecture while still benefiting from shared representation learning. This enabled more flexible, surface-specific CTR prediction and established a foundation for continued surface-specific iteration." Canonical wiki instance of the surface-specific checkpoint export pattern — train one model, export N checkpoints, one per surface. Each surface deploys the version of the model that's been fine-tuned for its specific task head + architecture variant, while the underlying shared representation comes from the same joint training. This is the operational escape valve for unified-model rigidity — without it, the unified model becomes a coordination bottleneck that re-creates the fragmentation it was supposed to eliminate. (Source: sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces)

  8. Model unification doesn't automatically reduce infra spend. "Infrastructure cost is mainly driven by traffic and per-request compute, so unifying models does not automatically reduce infra spend. In our case, early unified versions actually increased latency because merging feature maps and modules made the model larger. To address this issue, we paired it with targeted efficiency work." Canonical wiki statement of the unified-model cost fallacy — model consolidation only reduces development/maintenance cost, not serving cost. Serving cost is controlled by per-request compute × traffic, and the unified baseline typically increases both (bigger model, same traffic). Efficiency engineering must ship alongside the unification, not afterward. (Source: sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces)

  9. DCNv2 projection layer shrinks Transformer outputs before downstream crossing. "We simplified the expensive compute paths by using DCNv2 to project the Transformer outputs into a smaller representation before downstream crossing and tower tree layers, which reduced serving latency while preserving signal." Canonical wiki statement of the projection-layer-for-latency concept — a dimensionality reduction layer inserted between a wide upstream encoder (Transformer) and downstream crossing/tower layers, using a learned projection (DCNv2 in this case, a deep cross network) to compress signal. Contrast with pure linear projection: DCNv2 learns feature crosses during the projection itself. The architectural move: accept the Transformer's expressive capacity at its output boundary, then deliberately narrow the representation before passing it through expensive downstream crossing. (Source: sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces)

  10. Request-level user-embedding broadcasting amortises heavy embedding lookups across a batch. "On the serving side, we reduced redundant embedding table look up work with request-level broadcasting. Instead of repeating heavy user embedding lookups for every candidate/request in a batch, we fetch embeddings once per unique user and then broadcast them back to the original request layout, keeping model inputs and outputs unchanged. The main trade-off is an upper bound on the number of unique users per batch; if exceeded, the request can fail, so we used the tested unique user number to keep the system reliable." Canonical wiki instance of the request-level user-embedding broadcast pattern — fetch once per unique user per batch, broadcast to original request layout. Structurally a deduplication + broadcast optimisation that assumes a workload where the same user's embedding is looked up multiple times within a batch (typical of multi-candidate scoring: one user × N candidate ads). The tested-unique-user-count upper bound is the operational safety mechanism — batches that exceed it fail rather than silently stall or corrupt. Generalises beyond ads: any scoring service where per-entity heavy lookups are amortised across candidates (content recommendation, search ranking, ads) can apply the pattern. (Source: sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces)
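The DCNv2 projection move from takeaway 9 can be sketched as one explicit cross layer followed by a learned down-projection; the widths below are hypothetical (the post discloses none), and a production DCNv2 would typically stack several cross layers and use low-rank weight factorisation:

```python
import numpy as np

rng = np.random.default_rng(1)
D_WIDE, D_SMALL = 256, 64  # hypothetical widths; the post discloses no dimensions

W_cross = rng.normal(size=(D_WIDE, D_WIDE)) * 0.01
b_cross = np.zeros(D_WIDE)
W_proj = rng.normal(size=(D_WIDE, D_SMALL)) * 0.01

def dcnv2_project(transformer_out):
    """One DCNv2-style cross layer, then project to a narrower width.

    Cross layer: x_1 = x_0 * (W x_0 + b) + x_0  (explicit feature crossing),
    followed by a dense projection so downstream crossing and tower tree
    layers operate on D_SMALL instead of the Transformer's full width.
    """
    x0 = transformer_out
    crossed = x0 * (x0 @ W_cross + b_cross) + x0  # elementwise cross + residual
    return crossed @ W_proj                        # learned dimensionality reduction

out = dcnv2_project(rng.normal(size=(8, D_WIDE)))
print(out.shape)  # (8, 64)
```

The contrast with a plain linear bottleneck is that the compression here happens after a learned feature-cross, so second-order interactions are captured before the representation narrows.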

Systems / components named

  • Pinterest Ads Engagement Model — the unified CTR-prediction model described in the post; serves Home Feed + Search with tower trees per surface; Related Pins tower tree is future work.
  • Pinterest Home Feed / Search / Related Pins — three ads surfaces with different user intent, feature availability, and cost profiles.
  • MMoE (Multi-gate Mixture of Experts) — architectural component for surface-aware expert routing; referenced as Pinterest's prior ads-engagement work via footnote [1].
  • Long user sequence Transformers — architectural component for user behaviour modeling; referenced via footnote [2].
  • DCNv2 — deep cross network v2; used as a projection layer between the Transformer output and downstream crossing + tower layers.
  • View type specific calibration layer — HF/SR-separated calibration head.
  • Fused kernel embedding — kernel fusion optimisation for embedding-lookup inference.
  • TF32 — NVIDIA Tensor-Float-32 training precision mode; used for training speedup.

Operational numbers

  • Three surfaces unified into one model — HF + SR first; RP is next milestone.
  • Three surface-specific tower trees in the final architecture (HF + SR at time of writing; RP planned).
  • "Significant improvements on both online and offline metrics" — headline win; specific percentages not disclosed in the post (reference to Pinterest internal data, US, 2025).
  • Upper bound on unique users per batch for request-level broadcasting — not numerically disclosed; tuned empirically.

Caveats

  • Ranking-architecture retrospective voice — few numbers. No A/B deltas, no latency percentiles, no per-request compute breakdown, no infra-cost breakdown, no model-size parameters, no training-compute comparison. Internal-data citation [3] without surface-level numbers. This limits the post to a shape + playbook retrospective rather than a quantitative case study.
  • No architecture diagram in the ingested text. The post references an embedded architecture diagram ("The diagram below shows the final target architecture") not included in the markdown. The description is sufficient to infer shape (shared trunk + surface-specific tower trees + late fusion + surface-specific modules) but not exact topology (expert count, head composition, tower depth, feature cross arity, calibration-head arch).
  • No MMoE-specific detail. MMoE is named as "[Multi-gate-Mixture-of-Experts (MMoE) model architecture and knowledge distillation]" (footnote [1]) but the specific variant (number of experts, gate structure, whether knowledge distillation from footnote [1] is used in the unified model) is not disclosed here.
  • No long-user-sequence Transformer topology. Footnote [2] references a prior Pinterest blog post on user-action-sequence modeling; this post doesn't re-state sequence length, attention heads, layer count, or feature tokenisation.
  • Baseline-regression recovery trajectory is described qualitatively. Pinterest says the baseline "materially increased training and serving cost" and describes efficiency work afterward, but not the time-to-parity or cost-recovery curve. Serving-latency deltas at each stage of the unification are not disclosed.
  • Surface-specific calibration improvement is qualitative. "Online experiments showed this approach improved performance compared to the original shared calibration" — directional claim only, no metric + delta.
  • RP integration open. The post ends mid-journey: RP is the next milestone; efficiency work required. This is a live-system retrospective, not a closed-project case study.
  • Tier-2 source self-assessment. The Pinterest ads-ranking post is in scope for sysdesign-wiki (real production architecture at scale, clear cost/iteration/maintenance rationale, explicit serving-optimisation work), but it is light on operational numbers: a borderline pass on the "numbers and internals" Tier-2 bar, though it clearly clears the "architectural density" bar.

Cross-source continuity

  • Extends MTML ranking: Pinterest's unified model with multi-task heads + surface-specific checkpoint exports is structurally a variant of MTML applied across surfaces rather than across task semantics. Meta's Friend Bubbles (2026-03-18) uses MTML within a single product surface (Reels) for multiple engagement tasks (watch / like / bubble-conditioned engagement); Pinterest uses MTML across multiple product surfaces (HF / SR / RP) for CTR prediction. Both treat task-specific heads as the extensibility escape valve from shared-trunk rigidity.
  • Parallels Meta Adaptive Ranking Model's model-system co-design (2026-03-31): both are LLM-era ads-ranking architectures that explicitly pair model-side changes (MMoE + long sequences at Pinterest; FP8 quantisation + grouped-GEMM at Meta) with serving-side efficiency work (request-level broadcast + DCNv2 projection at Pinterest; operator fusion + sequence scaling at Meta). Both posts establish the doctrine: unification or scale-up without efficiency work is a regression.
  • Complements sources/2024-05-14-pinterest-hbase-deprecation-at-pinterest: Pinterest's second ads/ML ingest on the wiki, after the 2024 HBase deprecation retrospective. Together they paint a picture of a company deliberately consolidating fragmented stacks — HBase deprecation collapsed a bolt-on NoSQL+Sparrow+Ixia+Manas stack back to a single NewSQL substrate; this post collapses three surface-specific ads models back to one unified model. Both share the "fragmentation tax exceeds consolidation risk" thesis, applied at different system layers.
  • Contrasts with Pinterest's workload-specific datastore migration: the HBase post championed decomposing by workload (OLAP → Druid, time-series → Goku, KV → KVStore) as the path to efficiency. This post does the opposite on the model layer: consolidating surface-specific models into one. The difference is telling — at the storage layer, workload specialisation was an efficiency win; at the model layer, surface specialisation was an iteration-velocity loss. Different scales of specialisation make sense at different layers.
  • No existing-claim contradictions — strictly additive. First canonical wiki instance of surface-specific tower trees, surface-specific calibration, surface-specific checkpoint export, request-level user-embedding broadcast, and staged model unification by CUDA throughput.

References

  • [1] Li, Jiacheng, et al. Multi-gate-Mixture-of-Experts (MMoE) model architecture and knowledge distillation in Ads Engagement modeling development. Pinterest Engineering Blog.
  • [2] Lei, Yulin, et al. User Action Sequence Modeling for Pinterest Ads Engagement Modeling. Pinterest Engineering Blog.
  • [3] Pinterest Internal Data, US, 2025.
