Pinterest — From Clicks to Conversions: Architecting Shopping Conversion Candidate Generation¶
Summary¶
Pinterest Ads ML (Richard Huang, Yu Liu, Ziwei Guo, Andy Mao, Supeng Ge) documents the two-generation journey of building and iterating the shopping conversion candidate generation model — a retrieval-stage two-tower model dedicated to optimising for offsite shopping conversions (checkout, add-to-cart) rather than onsite engagement. The first generation launched in 2023 with a multi-head architecture; the 2025 refresh reorganised the same loss components into a unified single-head multi-task architecture, added an advertiser-level loss, and adopted a parallel DCNv2 + MLP cross-layer architecture that proved load-bearing enough to be adopted by all of Pinterest's production engagement retrieval models. The post names four concrete production wins — +42% recall@100 for conversion tasks vs the 2023 model, +2.3% shopping conversion volume and +2.7% impression-to-conversion rate from the original 2023 launch, +3.1% RoAS for US shopping campaigns after the 2025 refresh, and +1.5% CTR / +2.2% CTR-over-30-seconds as byproducts of better conversion ranking — and a fifth architectural win: +11% offline recall@1000 from the parallel DCN + MLP cross-layer design in isolation.
Key takeaways¶
- Offsite conversions are structurally hostile to ranking-model training — sparse, noisy, delayed, advertiser-reported. Pinterest frames this as "significantly sparser and noisier than onsite engagement signals" and names it as the motivation for a dedicated model separate from the engagement-based shopping retrieval pipeline they inherited. Dedicated conversion sparsity treatment is the structural separator between shopping-conversion CG and generic engagement CG.
- Dual positive signal — conversion + engagement (click, repin) — with log-based re-weighting on click duration. The canonical framing: "We supplement primary conversion signals with onsite engagement data (clicks, repins). This broadens data coverage, improving model generalization and ad funnel survival rates." Click positives are reweighted by a log function of click duration (`t` seconds, capped at `t_max`) to suppress accidental / bounce clicks and emphasise dwell-time-confirmed engagement — a click-duration reweighting technique that converts a noisy binary label into a continuous-weight positive. The formula shape: `w = f(log(1 + t / t_max))`, where `t_max` is a tunable constant capping the reweight (see the sketch after this list).
- Ad impressions with no engagement serve as hard negatives on top of in-batch negatives. The contrastive training pool = in-batch negatives (cheap, abundant) + served-but-not-engaged ad impressions (semantically harder — they reflect "the real distribution of served ads, exposing the model to a more representative inventory and promoting robust contrastive learning"). Canonical to the two-tower retrieval discipline: in-batch negatives cover trivial separation; served-ads-with-no-engagement cover the boundary regions the deployed model actually sees.
- Multi-task training with engagement as auxiliary task — but balancing task weights is the real challenge. "Our multi-task approach uses engagement prediction as an auxiliary task to stabilize training and boost performance. The crucial challenge is balancing the two tasks, ensuring the high-value conversion signal is not diluted by the more frequent engagement data." This is the canonical auxiliary-task regularisation framing at Pinterest — the abundant-auxiliary / sparse-primary MTL configuration where loss weighting is a first-class tuning surface.
- Parallel DCNv2 + MLP cross layers beat sequential DCNv2 → MLP cross layers by +11% offline recall@1000. The sequential design "imposed a fundamental limit on the model's learning capacity" — "the MLP could only learn from features already processed by DCN v2, potentially losing valuable signals from the original input." The parallel design lets both networks learn directly from the same input features — cross network constructs "higher-order feature crosses without any information being lost or distorted by a preceding MLP transformation", 3-layer MLP "learns implicit abstract patterns in parallel", and their combined output is the final representation. Wide-applicability result: adopted by all Pinterest production engagement retrieval models after the conversion-CG validation — a retrieval-stage architectural primitive, not a shopping-specific trick.
- Evolution from multi-head to unified single-head multi-task architecture (2023 → 2025). The 2023 model used two separate heads (engagement head + conversion head) trained with distinct sampled softmax losses and weighted losses; at serving time only the conversion head's Pin and query embeddings were used. The 2025 refresh merged the heads into a single unified multi-task head so "the final embeddings directly benefit from the multi-task optimization during serving." Structural driver: "sparsity and noise in the conversion labels" made per-head conversion embeddings unstable in low-coverage regions; fusing task signals into a single embedding head stabilises query embeddings "in regions of low conversion coverage."
- Advertiser-level loss function as additional training objective to reduce Pin-level variance. "Conversion data at the Pin level exhibit high variance, making it challenging to reliably model purchase intent from Pin-level supervision alone. To address this, we introduce an advertiser-level loss function as an additional training objective, enabling the model to better capture conversion signals at a more stable and consistent granularity." The pattern: when per-item labels are too sparse/noisy to train on directly, group by a coarser-granularity entity (advertiser, seller, brand) and add a parallel loss at that granularity — stabilises training without abandoning per-item scoring.
- Multi-surface model across Home Feed, Related Pins, Search — with surface-specific features for contextual differences. "We train a single model across all shopping surfaces (Homefeed, Related Pins, Search) to avoid fragmenting sparse conversion labels." Same consolidation motivation as the unified ads engagement model (sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces): "At the same time, we incorporated surface-specific features to learn contextual differences between these surfaces." Surface-aware features inside the shared model — not separate surface models.
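For concreteness, a minimal sketch of the click-duration reweighting shape in Python. The post discloses only the shape `w = f(log(1 + t / t_max))`; the choice of `f`, the normalisation, and the `t_max = 30.0` default below are assumptions, not disclosed details:

```python
import math

def click_weight(duration_s: float, t_max: float = 30.0) -> float:
    """Continuous positive weight from click dwell time.

    Shape per the post: w = f(log(1 + t / t_max)), with t capped at
    t_max. Here f is identity plus normalisation, and t_max = 30.0 is
    a hypothetical placeholder; neither is disclosed by Pinterest.
    """
    t = min(duration_s, t_max)       # cap dwell time at t_max
    w = math.log1p(t / t_max)        # in [0, log(2)] after the cap
    return w / math.log(2)           # normalise so max-dwell clicks weigh 1.0
```

A bounce click (`t` near 0) gets weight near 0, so accidental clicks barely count as positives, while dwell-confirmed clicks approach full weight.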
Architecture¶
Two-tower shape¶
Pinterest uses the standard retrieval-stage two-tower model — "user and Pin features are encoded separately, as there are no explicit user-Pin interaction features at this retrieval stage" — with the two towers extended by a parallel cross-layer architecture on top of each tower:
User features ──► [ user tower: parallel DCNv2 + 3-layer MLP ──► concatenate ──► MLP head(s) ] ──► user embedding
Pin features  ──► [ pin tower:  parallel DCNv2 + 3-layer MLP ──► concatenate ──► MLP head(s) ] ──► pin embedding

                          score = dot(user embedding, pin embedding)
Applied to both the Pin and query towers; the parallel design is a tower-internal architectural primitive, not an inter-tower one.
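For concreteness, a minimal two-tower scoring sketch in PyTorch. The tower internals here are a plain-MLP stand-in for Pinterest's parallel DCNv2 + MLP block (sketched in the cross-layer section below); all dimensions are hypothetical:

```python
import torch
import torch.nn as nn

class Tower(nn.Module):
    """One tower of the two-tower model. A plain-MLP stand-in for
    Pinterest's parallel DCNv2 + MLP tower; dims are hypothetical."""

    def __init__(self, in_dim: int, emb_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, emb_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# No user-Pin interaction features at this retrieval stage: each side
# is encoded independently, and relevance is a plain dot product.
user_tower, pin_tower = Tower(in_dim=512), Tower(in_dim=384)
u = user_tower(torch.randn(8, 512))   # [batch, 64] user/query embeddings
p = pin_tower(torch.randn(8, 384))    # [batch, 64] Pin embeddings
scores = (u * p).sum(dim=-1)          # one score per (user, Pin) pair
```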
Feature engineering¶
Grouped into two categories:
User-side, split into two types:
- Context features — real-time intent, vital for Related Pins and Search. Examples: subject Pin's visual embedding, GraphSage² Pin-graph embedding.
- Preference + historical features — long-term interests for personalisation. Examples: demographics, aggregated historical actions, sequential data processed by a Transformer to create a user history embedding (sketched after the Pin-side list below). This is Pinterest's canonical long user sequence modeling mechanism; also used in the unified engagement model.
Pin-side:
- ID features.
- Multi-modal / content features for semantic understanding.
- Performance features tracking engagement.
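A sketch of the Transformer-based user-history encoding named in the user-side features. The post states only that sequential data is processed by a Transformer into a user history embedding; the vocabulary size, dimensions, depth, and mean-pooling below are assumptions:

```python
import torch
import torch.nn as nn

class UserHistoryEncoder(nn.Module):
    """Encode a user's action sequence into one history embedding.

    The post says only that sequential data is "processed by a
    Transformer"; everything concrete here is hypothetical.
    """

    def __init__(self, d_model: int = 64, n_actions: int = 10_000):
        super().__init__()
        self.action_emb = nn.Embedding(n_actions, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, action_ids: torch.Tensor) -> torch.Tensor:
        h = self.encoder(self.action_emb(action_ids))  # [batch, seq, d_model]
        return h.mean(dim=1)                           # pooled history embedding

hist = UserHistoryEncoder()(torch.randint(0, 10_000, (8, 50)))  # [8, 64]
```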
Loss design evolution¶
2023 (multi-head):
shared encoders
   ├──► [ engagement head ] ──► sampled softmax ─┐
   └──► [ conversion head ] ──► sampled softmax ─┴─► weighted loss combination (task weights)
At serving: only conversion head embeddings used.
Engagement head stabilises shared parameters (abundant data); conversion head preserves purchase-intent signal (sparse data). Task-weighting tuned to prevent engagement signal from diluting the conversion signal.
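A sketch of the 2023-style objective: per-head sampled softmax over in-batch negatives plus served-but-not-engaged hard negatives, combined with tunable task weights. Only the loss structure is from the post; shapes, the absence of a temperature, and the `weight` hook for click-duration reweighting are assumptions:

```python
import torch
import torch.nn.functional as F

def sampled_softmax_loss(q, pos, hard_neg, weight=None):
    """Softmax over in-batch + served-but-not-engaged hard negatives.

    q, pos: [B, D] query / positive-Pin embeddings (rows of `pos` other
    than row i act as i's in-batch negatives); hard_neg: [B, H, D]
    no-engagement ad impressions. `weight` is an optional per-example
    positive weight, e.g. the click-duration weight on the engagement task.
    """
    in_batch = q @ pos.T                            # [B, B], positives on the diagonal
    hard = torch.einsum("bd,bhd->bh", q, hard_neg)  # [B, H] hard-negative scores
    logits = torch.cat([in_batch, hard], dim=1)
    target = torch.arange(q.size(0), device=q.device)
    loss = F.cross_entropy(logits, target, reduction="none")
    return (loss * weight).mean() if weight is not None else loss.mean()

# 2023-style weighted combination across heads; w_conv / w_eng are the
# task weights the post calls out as the crucial tuning surface:
#   total = w_conv * sampled_softmax_loss(q_conv, pos_conv, hn_conv) \
#         + w_eng  * sampled_softmax_loss(q_eng,  pos_eng,  hn_eng, weight=click_w)
```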
2025 (unified multi-task):
 shared encoders + parallel DCNv2 + MLP
                   │
                   ▼
         single unified head
                   │
                   ▼
  multi-task optimisation (conversion + engagement)
   + advertiser-level loss as additional objective
                   │
                   ▼
          served embeddings
   (directly benefit from multi-task optimisation)
Merging the heads means the final embeddings are jointly optimised for both tasks at training time, and the single embedding set is served — no serving-time selection of which head's output to use.
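The post names the advertiser-level loss but not its functional form. One plausible reading, sketched below under that assumption: pool Pin embeddings by advertiser id into centroids and apply the same in-batch softmax at advertiser granularity, so supervision flows through a coarser, lower-variance entity:

```python
import torch
import torch.nn.functional as F

def advertiser_level_loss(q, pin_emb, advertiser_ids):
    """Parallel loss at advertiser granularity.

    The post states that this objective exists but not its form; this
    pooling-into-centroids construction is one plausible reading, not
    Pinterest's disclosed method. q, pin_emb: [B, D]; advertiser_ids:
    [B] long tensor of each positive Pin's advertiser.
    """
    uniq, inv = advertiser_ids.unique(return_inverse=True)
    centroids = pin_emb.new_zeros((uniq.size(0), pin_emb.size(1)))
    centroids.index_add_(0, inv, pin_emb)             # sum Pins per advertiser
    counts = torch.bincount(inv, minlength=uniq.size(0)).clamp(min=1)
    centroids = centroids / counts.unsqueeze(1).to(pin_emb.dtype)  # mean per advertiser
    logits = q @ centroids.T                          # [B, n_advertisers]
    return F.cross_entropy(logits, inv)               # positive = the Pin's own advertiser
```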
Parallel vs sequential cross layers¶
Sequential (old):

input ──► DCNv2 cross network ──► MLP (dimension reduction) ──► head

Parallel (new):

          input
        ┌───┴───┐
        ▼       ▼
      DCNv2   MLP (3 layers)
        │       │
        └───┬───┘
            ▼
       concatenate
            │
            ▼
          head
In the sequential design, the MLP can only learn from DCNv2's output — an information bottleneck. In the parallel design, both networks operate on the original input: the cross network produces higher-order feature crosses without loss, and the MLP learns implicit abstract patterns in parallel. The combined representation is strictly richer.
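A sketch of the parallel block in PyTorch. The DCNv2 cross layer follows the published recurrence `x_{l+1} = x0 * (W @ x_l + b) + x_l`; the 3-layer MLP and the parallel topology are from the post, while the cross-layer count and widths are hypothetical:

```python
import torch
import torch.nn as nn

class CrossLayer(nn.Module):
    """One DCN v2 cross layer: x_{l+1} = x0 * (W @ x_l + b) + x_l."""

    def __init__(self, dim: int):
        super().__init__()
        self.w = nn.Linear(dim, dim)

    def forward(self, x0: torch.Tensor, xl: torch.Tensor) -> torch.Tensor:
        return x0 * self.w(xl) + xl

class ParallelDCNMLP(nn.Module):
    """Parallel DCNv2 + MLP block used inside each tower.

    Cross-layer count and MLP widths are hypothetical; the post names
    only the 3-layer MLP and the parallel topology.
    """

    def __init__(self, dim: int, n_cross: int = 3):
        super().__init__()
        self.cross = nn.ModuleList([CrossLayer(dim) for _ in range(n_cross)])
        self.mlp = nn.Sequential(                    # 3-layer MLP branch
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        xc = x
        for layer in self.cross:
            xc = layer(x, xc)                        # explicit feature crosses of raw x
        return torch.cat([xc, self.mlp(x)], dim=-1)  # both branches see the same input
```

The final `torch.cat` is the load-bearing line: both branches read the raw input, so neither can bottleneck the other, which is exactly the property the sequential DCNv2 → MLP design could not guarantee.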
Operational numbers¶
Pinterest's production numbers (US, 2023-2025; citation "⁴", Pinterest Internal Data):
| Metric | Delta | When |
|---|---|---|
| Parallel DCN+MLP vs sequential — offline recall@1000 | +11% | 2025 refresh (validated on conversion task, then generalised) |
| 2025 conversion model vs 2023 conversion model — recall@100 (conversion tasks) | +42% | 2025 refresh |
| Shopping conversion volume | +2.3% | 2023 launch |
| Shopping impression-to-conversion rate | +2.7% | 2023 launch |
| CTR | +1.5% | 2023 launch (byproduct) |
| CTR over 30 seconds | +2.2% | 2023 launch (byproduct) |
| RoAS (US shopping campaigns) | +3.1% | 2025 refresh |
Scale context from the post intro: "deploying it to our 600+ million monthly active users at Pinterest."
Caveats¶
- No architecture diagrams in the ingested markdown — three referenced figures (click-duration reweighting formula, sequential vs parallel cross architecture, multi-head vs unified multi-task architecture) are "Press enter or click to view image in full size" placeholders in the scraped post. The shapes above are reconstructed from prose.
- No exact hyperparameters. No parallel-MLP hidden dims, no DCNv2 cross-layer count, no task-loss-weighting scheme, no `t_max` value, no batch size, no unique-users-per-batch, no advertiser-level loss weighting, no serving latency numbers. Every tunable is described at the shape level only.
- No embedding dimension. Pinterest doesn't disclose user/Pin embedding dimension or the serving ANN-index choice (though it almost certainly shares ads engagement model infrastructure — not stated in this post).
- No latency / infra-cost delta. The post is a model-architecture retrospective — production wins are all quality metrics (recall@K, conversion volume, RoAS). No p50/p99/p999, no per-request compute, no cost envelope numbers.
- No sequencing detail between conversion CG and engagement CG. Pinterest runs both; the post doesn't say how retrieved candidates from the two pipelines are merged/deduped, nor the relative recall share each contributes to the downstream funnel.
- No training-data scale. No impression count, conversion count, engagement count, training-set size, or window length disclosed.
- Auxiliary engagement task stability not quantified. The "crucial challenge is balancing the two tasks" framing is qualitative — no ablation of task-weight sweeps, no regularisation-strength ablation.
- Per-advertiser loss reduces variance but doesn't address the sparsity ceiling. Advertiser-level loss adds granularity for stability, but the underlying sparsity of offsite conversions (sparse-noisy-delayed) remains the fundamental floor; the architectural levers (dual positive signal, advertiser-level loss, parallel cross layers, multi-task optimisation) stack on top of that floor, they don't eliminate it.
- Prior-work references not ingested. Two cited Pinterest posts — Mudgal et al. 2024 Evolution of Ads Conversion Optimization Models at Pinterest, and GraphSage / DCNv2 papers — are referenced but not separately canonicalised here.
Source¶
- Original: https://medium.com/pinterest-engineering/from-clicks-to-conversions-architecting-shopping-conversion-candidate-generation-at-pinterest-04cae5e1455b?source=rss----4c5a5f6279b6---4
- Raw markdown: raw/pinterest/2026-04-27-from-clicks-to-conversions-architecting-shopping-conversion-cb7821cf.md
Related¶
- companies/pinterest
- systems/pinterest-shopping-conversion-cg
- systems/pinterest-ads-engagement-model — sibling unified-model work; both leverage multi-surface consolidation + shared trunk + surface-specific features; both name parallel DCNv2 as architectural primitive.
- systems/pinterest-home-feed · systems/pinterest-search · systems/pinterest-related-pins — three shopping surfaces unified by this model.
- systems/dcnv2 — architectural component, used in parallel configuration.
- systems/graphsage — subject-Pin embedding feature.
- systems/transformer — user-history sequence encoder.
- concepts/two-tower-architecture — retrieval architecture.
- concepts/multi-task-learning — training paradigm; this source is the unified-single-head MTL instance.
- concepts/auxiliary-task-regularization · concepts/offsite-conversion-sparsity · concepts/click-duration-reweighting · concepts/advertiser-level-loss · concepts/parallel-cross-and-deep-network · concepts/ad-impression-as-hard-negative · concepts/shopping-conversion-candidate-generation
- patterns/parallel-dcn-mlp-cross-layers · patterns/dual-positive-signal-for-sparse-labels · patterns/unified-multi-task-over-multi-head · patterns/auxiliary-engagement-task-for-conversion-retrieval
- sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces — engagement-model sibling; both use shared trunk + multi-task heads + surface awareness.
- sources/2026-02-27-pinterest-bridging-the-gap-online-offline-discrepancy-l1-cvr — L1 CVR diagnosis methodology sibling for conversion-focused retrieval at the next-stage-up boundary.