
Pinterest Contextual Sequential Two-Tower CG¶

Definition¶

Pinterest's Contextual Sequential Two-Tower Model is a retrieval-stage two-tower candidate generator (CG) for Pinterest ads that fuses real-time on-Pinterest context (the subject Pin the user is currently viewing) with cached offline historical user-behaviour state at ad-request time. It evolves from Pinterest's prior offsite-conversion-history Transformer-based CG (systems/pinterest-sequential-cg) — the precursor inferred user embeddings purely offline, lacking real-time intent. The contextual evolution adds a context layer inside the user tower, a synthetic pseudo-context training scheme, and a hybrid offline/online serving flow (sources/2026-05-08-pinterest-enhancing-ad-relevance-integrating-real-time-context-into-sequential-recommender-models).

Initially deployed on Related Pins; the post names Search as the next surface-expansion target.

Problem framing¶

The baseline Pinterest Sequential CG used a Transformer-based two-tower model where user embeddings were "inferred offline purely from historical offsite behavior." On contextual surfaces, namely Related Pins (where the subject Pin is a strong intent signal) and Search (where the query is), the absence of real-time context showed up as a collapse in candidate survival rate:

"Less than 1% of impressions on Related Pins were attributed to this CG, indicating its candidates struggled to survive the downstream ranking and auction stages."

The CG was retrieving candidates the funnel kept dropping — the diagnostic was survival rate, not recall. Real-time context was the structural fix.
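The distinction between survival rate and impression share can be made concrete with a small sketch. The log shapes, the function name `funnel_diagnostics`, and the CG names are hypothetical; the post does not describe how Pinterest computes these figures.

```python
from collections import Counter


def funnel_diagnostics(retrieved, delivered):
    """Per-CG survival rate and impression share.

    retrieved: list of (cg_name, ad_id) pairs emitted at the retrieval stage.
    delivered: list of (cg_name, ad_id) pairs that reached an impression.
    Both log shapes are hypothetical.
    """
    retrieved_per_cg = Counter(cg for cg, _ in retrieved)
    delivered_per_cg = Counter(cg for cg, _ in delivered)
    total_delivered = len(delivered)

    report = {}
    for cg, n_retrieved in retrieved_per_cg.items():
        n_delivered = delivered_per_cg[cg]  # Counter returns 0 for missing keys
        report[cg] = {
            # Fraction of this CG's candidates that survived ranking and auction.
            "survival_rate": n_delivered / n_retrieved,
            # Fraction of all delivered impressions attributed to this CG.
            "impression_share": n_delivered / total_delivered if total_delivered else 0.0,
        }
    return report


# Toy logs: the sequential CG retrieves as much as its sibling but almost nothing survives.
retrieved = [("sequential_cg", i) for i in range(200)] + [("other_cg", i) for i in range(200)]
delivered = [("sequential_cg", 0)] + [("other_cg", i) for i in range(199)]
report = funnel_diagnostics(retrieved, delivered)
```

In this toy data the sequential CG has an impression share under 1% even though it retrieves as many candidates as the other CG. A retrieval-only metric would not show that failure.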

Architecture¶

Two-tower with context-extended user tower¶

                              ┌──── Transformer encoder ────┐    ← cached offline daily
User offsite history  ──────► │  (offsite conversion seq)   │       in feature store
                              └─────────────┬───────────────┘
                                            │  last hidden state
                                            │
                                            ▼          (online at ad-request time)
                              ┌────── concatenate ─────┐
                              │                        │
Subject-Pin interest    ────► [ context layer ]        │
categories (weighted by       │ (high dropout in       │
 confidence)                  │  training)             │
+ user demographics           └────────────┬───────────┘
                                           │
                                           ▼
                                   [ final MLP head ]
                                           │
                                    user embedding
                                           │
                                           └──────────────┐
                                                          ▼
                                               dot product → score
                                                          ▲
                                          pin embedding ◄─┘   (item tower via ANN index)

Concatenation, not cross-attention. The combined representation = concat(transformer_output, context_layer_output). The post explicitly proposes cross-attention as future work but the shipped design uses concatenation.
Context layer input (Related Pins): aggregated embedding representations of the top interest categories of the subject Pin, weighted by their confidence scores.
User representation augmentation: demographic features (age, country, gender) added to the user tower for personalisation.
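The fusion described above can be sketched in plain Python. This is a minimal illustration, not Pinterest's implementation. It assumes a confidence-normalized weighted average for the category aggregation, demographics fed into the context layer, a single dense layer for each of the context layer and the head, and arbitrary dimensions, none of which the post specifies. All function and parameter names are hypothetical.

```python
import random


def weighted_category_embedding(category_ids, confidences, table):
    """Confidence-weighted average of interest-category embeddings (assumed form)."""
    dim = len(next(iter(table.values())))
    pooled = [0.0] * dim
    total = sum(confidences)
    if total == 0:
        return pooled
    for cid, conf in zip(category_ids, confidences):
        for i, value in enumerate(table[cid]):
            pooled[i] += (conf / total) * value
    return pooled


def init_layer(in_dim, out_dim, rng):
    """Random weights (out_dim rows of in_dim values) and a zero bias."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def dense(x, weights, bias, relu=True):
    """One fully connected layer, optionally followed by ReLU."""
    out = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]
    return [max(0.0, v) for v in out] if relu else out


def user_embedding(seq_state, context_features, demographics, params):
    # Context layer consumes the live context plus demographics (assumed placement).
    context_out = dense(context_features + demographics, *params["context"])
    # Fusion is plain concatenation of the cached Transformer state and context output.
    fused = seq_state + context_out
    return dense(fused, *params["head"], relu=False)


def score(user_vec, pin_vec):
    """Dot product between the user embedding and an item-tower pin embedding."""
    return sum(u * p for u, p in zip(user_vec, pin_vec))
```

Because the fusion is a concatenation, the cached sequence state and the context output only interact inside the head. The cross-attention variant the post proposes as future work would change that.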

Training with synthetic pseudo-context¶

Real-time context exists only at serving time; logged offsite-conversion training data has no associated "current Pin." Pinterest's training-serving-parity hack:

"During model training, we artificially inject pseudo-context information derived from the positive label (the conversion event) into the input sequence. For example, by projecting the interest category features from the positive item, we encourage the model to retrieve items that are semantically related to the context associated with that user session."

Training time:
  positive label → project interest categories → pseudo-context features → context layer

Serving time:
  subject Pin → interest categories × confidence → real context features → context layer

The shared input shape — interest-category embeddings — makes pseudo-context substitutable for real context. See patterns/synthetic-pseudo-context-from-label.
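A short sketch of why the two paths are interchangeable: both reduce to the same pooling over interest-category embeddings and differ only in where the categories and weights come from. The uniform weights on the training path, the dictionary field names, and the averaging form are assumptions, since the post does not specify the projection function.

```python
def pool_categories(category_ids, weights, table):
    """Weighted average of interest-category embeddings, shared by both paths."""
    dim = len(next(iter(table.values())))
    pooled = [0.0] * dim
    total = sum(weights)
    if total == 0:
        return pooled
    for cid, w in zip(category_ids, weights):
        for i, value in enumerate(table[cid]):
            pooled[i] += (w / total) * value
    return pooled


def training_context(positive_item, table):
    """Pseudo-context: categories taken from the positive label (the converted item)."""
    cats = positive_item["interest_categories"]
    # Assumption: uniform weights, because logged conversions carry no confidence scores here.
    return pool_categories(cats, [1.0] * len(cats), table)


def serving_context(subject_pin, table):
    """Real context: categories of the subject Pin, weighted by confidence."""
    return pool_categories(
        subject_pin["interest_categories"], subject_pin["confidences"], table
    )


table = {"shoes": [1.0, 0.0, 0.0], "running": [0.0, 1.0, 0.0]}
train_ctx = training_context({"interest_categories": ["shoes", "running"]}, table)
serve_ctx = serving_context(
    {"interest_categories": ["shoes", "running"], "confidences": [0.9, 0.1]}, table
)
```

Both calls return a vector of the same dimensionality in the same embedding space, so the context layer downstream does not need to know which path produced its input.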

High dropout in the context layer during training prevents the model from over-relying on the (label-leaked) pseudo-context and abandoning the historical-sequence signal. "A high dropout rate is used in the context layer during training to ensure the model still relies on the user's historical event sequence (the Transformer output)." See patterns/high-dropout-on-augmented-feature-layer.
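The mechanism can be illustrated with standard inverted dropout applied only to the context-layer output before concatenation. The rate of 0.8 is a placeholder, since the post does not disclose the actual value, and the function names are hypothetical.

```python
import random


def dropout(values, rate, training, rng):
    """Inverted dropout: zero each unit with probability `rate`, rescale the survivors."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def fuse(seq_state, context_out, training, rng, context_dropout=0.8):
    """Concatenate the sequence state with a heavily dropped-out context output.

    Only the context branch is dropped, so during training the head often sees the
    Transformer state with little or no context and cannot rely on context alone.
    """
    return seq_state + dropout(context_out, context_dropout, training, rng)


rng = random.Random(0)
train_fused = fuse([1.0, 2.0], [1.0] * 1000, training=True, rng=rng)
serve_fused = fuse([1.0], [2.0, 3.0], training=False, rng=rng)
```

At serving time the context output passes through unchanged, so the real subject-Pin signal is fully available even though the model was trained with most of it masked.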

Why not real on-Pinterest context training data?¶

Pinterest considered using real onsite context (subject Pins viewed during Related Pins ad impressions) merged with offsite history but rejected it:

"(1) Merging onsite data with offsite data presents significant technical difficulties. (2) We cannot guarantee that a user has viewed ad impressions on Related Pins between two sequential offsite events."

The real-context training data simply doesn't exist densely enough for the offsite-conversion-history sequence model. Synthetic context is the workaround.

Hybrid offline/online user tower inference¶

The Transformer encoder is too expensive to run per request, while the context layer must run online because its input is the live request. The user tower is therefore split at this boundary (patterns/hybrid-offline-online-user-tower-inference):

Offline (daily batch, per user with new offsite activity)
  ──────────────────────────────────────────────────────
  user offsite history → Transformer encoder → last hidden state → feature store

Online (per ad request)
  ──────────────────────────────
  subject Pin features ──┐
                         ├──► context layer ──┐
  user demographics  ────┘                    ├──► concatenate ──► final MLP head ──► user embedding
  feature store lookup ───────────────────────┘
  (subject Pin features come from the live request)

The cached offline state has a daily freshness ceiling — offsite activity within the last ~24 hours is invisible to the served embedding until the next batch.
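The split can be sketched with an in-memory dictionary standing in for the feature store. Everything here is illustrative. `encode_history` is a placeholder mean rather than a Transformer, the zero-vector fallback for users without cached state is an assumption, and the names and dimensions are hypothetical.

```python
from datetime import datetime, timedelta

STATE_DIM = 4  # hypothetical size of the cached hidden state


def encode_history(events):
    """Stand-in for the Transformer encoder's last hidden state.

    Assumption: a plain mean over per-event vectors, used only so the sketch runs.
    """
    if not events:
        return [0.0] * STATE_DIM
    return [sum(e[i] for e in events) / len(events) for i in range(STATE_DIM)]


def run_offline_batch(histories, store, now):
    """Daily job: encode each user's offsite history and cache the result."""
    for user_id, events in histories.items():
        store[user_id] = {"state": encode_history(events), "computed_at": now}


def online_tower_inputs(user_id, subject_pin_features, demographics, store, now):
    """Per-request path: one cache lookup plus the live context features."""
    cached = store.get(user_id)
    if cached is None:
        # Assumed fallback for users with no cached state.
        seq_state, staleness = [0.0] * STATE_DIM, None
    else:
        seq_state, staleness = cached["state"], now - cached["computed_at"]
    return {
        "seq_state": seq_state,
        "context_input": subject_pin_features + demographics,
        "staleness": staleness,
    }


store = {}
batch_time = datetime(2026, 5, 8)
run_offline_batch({"u1": [[1.0, 2.0, 3.0, 4.0], [3.0, 2.0, 1.0, 0.0]]}, store, batch_time)
request = online_tower_inputs(
    "u1", [0.75, 0.25], [1.0], store, batch_time + timedelta(hours=20)
)
```

The `staleness` field makes the freshness ceiling visible: any offsite event that arrives after `computed_at` is missing from `seq_state` until the next batch run.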

Production results¶

(Pinterest Internal Data, Related Pins surface; offline evaluation on logged real-traffic ad data.)

Offline evaluation¶

3x–10x increase in Recall@K vs the production model. "Here the candidates that survived the ranking funnel and delivered to the users were considered positive items."
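Under that definition of a positive, Recall@K can be computed as below. This is a generic sketch of the metric with hypothetical function names and log shapes. The post does not state the value of K or how results are aggregated across requests, so the per-request mean is an assumption.

```python
def recall_at_k(ranked_candidates, delivered, k):
    """Share of delivered (positive) items that appear in the CG's top-k candidates."""
    positives = set(delivered)
    if not positives:
        return 0.0
    return len(set(ranked_candidates[:k]) & positives) / len(positives)


def mean_recall_at_k(requests, k):
    """Average Recall@K over requests that have at least one delivered item.

    requests: list of (ranked_candidates, delivered) pairs.
    """
    scored = [recall_at_k(cands, pos, k) for cands, pos in requests if pos]
    return sum(scored) / len(scored) if scored else 0.0


single = recall_at_k(["a", "b", "c", "d"], ["b", "d", "z"], k=2)
averaged = mean_recall_at_k([(["a", "b"], ["a"]), (["a", "b"], ["c"])], k=2)
```

Because positives are items that survived the funnel, this Recall@K rewards retrieving what downstream stages will keep, which ties the offline metric to the survival-rate diagnostic.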

Survival rate & relevance¶

~275–300% lift in median candidate relevance.
+1.08% ads relevance metric on Related Pins overall.
2x more of the retrieved candidates are delivered to impression.

Topline business metrics¶

~0.7% measurable lift in ROAS (Return on Ad Spend).
~1.4% ROAS lift in top countries, which account for the majority of total revenue.

Relationship to Pinterest's other ads-ML systems¶

systems/pinterest-sequential-cg: direct predecessor. Same Transformer-encoder-over-offsite-conversion-history substrate; this system adds a context layer, synthetic pseudo-context augmentation, and hybrid inference. Both are CGs feeding into the same downstream ranking funnel.
systems/pinterest-shopping-conversion-cg: sibling Pinterest ads CG with a different lineage (shopping ads, parallel DCNv2+MLP cross layers, multi-task with engagement auxiliary, advertiser-level loss). Both are retrieval-stage two-tower models targeting offsite conversions; they coexist as separate candidate pools feeding the L1/L2 ranking funnel. They share some concepts (concepts/offsite-conversion-sparsity, concepts/two-tower-architecture) but use distinct architectural primitives.
systems/pinterest-ads-engagement-model: the ranking-stage unified multi-surface model that the CG candidates feed into. The engagement model decides which retrieved candidates survive to auction; the CG's job ends at retrieval.
systems/transformer: the offsite-conversion-history encoder. In this system the Transformer serves as a cached offline encoder rather than a live-inference component, which distinguishes it from the engagement-model use where Transformer outputs are projected and fed to surface-specific tower trees online.

Caveats¶

No published architecture diagrams — Pinterest names two figures (the contextual sequential two-tower architecture, the synthetic-augmented-data pipeline) but they're not in the ingested markdown.
Topology, hyperparameters undisclosed — Transformer layer/head/hidden-dim count, sequence length, context-layer dimensions, MLP head dimensions, dropout rate, embedding dimensions, batch size all unspecified.
Daily-refresh staleness impact unquantified — Pinterest doesn't disclose the candidate-quality cost of stale-by-up-to-24-hour offsite history.
Pseudo-context projection function unspecified — "projecting the interest category features" doesn't name sum/mean/attention/learned projection.
Latency / compute envelope undisclosed — no p50/p99 of the online tower portion, no comparison of cached-state size vs prior fully-offline embedding size, no per-request compute footprint.
Future cross-attention fusion not yet shipped — proposed but not validated in this post.
Survival rate ceiling. A 2x increase in candidate delivery alongside a 1.08% relevance lift and a ~0.7% ROAS lift suggests the CG started from a low absolute floor on Related Pins; the gains are real, but the absolute headroom is bounded by competing CGs in the funnel.
Search expansion is future work — the context layer's input shape changes (search query embedding instead of subject-Pin interest categories) but the architectural pattern transfers.

Seen in¶

2026-05-08 Pinterest — Enhancing Ad Relevance: Integrating Real-Time Context into Sequential Recommender Models (sources/2026-05-08-pinterest-enhancing-ad-relevance-integrating-real-time-context-into-sequential-recommender-models) — canonical wiki instance. Names the context-layer + synthetic-pseudo-context + hybrid-inference triplet, and the survival-rate-as-CG-diagnostic framing.