Pinterest — Bridging the Gap: Diagnosing Online-Offline Discrepancy in Pinterest's L1 Conversion Models¶
Pinterest's Ads ML team (Yao Cheng, Qingmengting Wang, Yuanlu Bai, Yuan Wang, Zhaohong Han, Jinfeng Zhuang) publish a production-ranking retrospective (2026-02-27) on why a new L1 CVR (conversion) model at Pinterest kept showing offline wins that did not translate to online A/B wins — a pattern the team names Online–Offline (O/O) discrepancy. The value of the post to the wiki is not the ML objective (log-loss + calibration on pCVR) but the full-stack debugging framework they used to diagnose O/O, and the two concrete production causes that turned out to matter: training-vs-serving feature discrepancy and two-tower embedding-version skew. Both are generic production-serving hazards, not Pinterest-specific.
Summary¶
Pinterest's ads funnel has L1 ranking sitting in the middle: retrieval → L1 → L2 ranking → auction. L1 must filter + prioritize candidates under tight latency, so that downstream L2 + the auction see a manageable set of ads. When Pinterest pushed new L1 CVR (conversion-rate) models, they saw a repeatable pattern:
- Offline: 20–45% LogMAE reduction vs production across multiple log sources, better calibration in every pCVR bucket.
- Online: neutral or slightly worse CPA (cost-per-acquisition — the primary business metric for oCPM advertisers) despite the offline gains, plus unexplained mix-shifts (more oCPM impressions than expected) that didn't match the offline story.
The team treated this as a full-stack diagnosis rather than a hunt for one bug, and organized hypotheses into three layers:
- Model & evaluation. Are the offline metrics themselves trustworthy? (Sampling bias, labels, outliers, eval-dataset construction.)
- Serving & features. Is production actually seeing the same features + model the offline eval did? (Feature coverage, embedding-build pipeline, model versioning, inference path.)
- Funnel & utility. Even if predictions are correct, can the funnel or the auction erase the gain? (Retrieval vs ranking recall, stage misalignment, offline-vs-online metric mismatch.)
For each layer they asked "could this alone explain the gap?" and used data to accept or reject. Three hypotheses were ruled out as sufficient explanations:
- Offline evaluation bugs. Re-computed loss + calibration across three log sources (auction-winner samples, full-request auction-candidate samples, partial-request auction-candidate samples); broken down by pCVR percentiles; re-evaluated production + treatment models on identical data with regenerated datasets. The experimental model still beat production on log-loss everywhere, across every percentile bucket, even after outlier handling. Offline eval bugs alone could not explain neutral online.
- Exposure bias. Ramped treatment traffic share from ~20% up to ~70% and monitored online calibration + loss before and after. If exposure bias drove the gap, metrics should improve as treatment owns more traffic. "We did not see that pattern; the over-calibration issue persisted even at higher treatment shares."
- Timeouts + serving failures. Compared success rate + p50/p90/p99 latency across control and treatment for both query + Pin towers. No materially worse timeout or tail-latency behavior for treatment. Consistent with prior L1 investigations on engagement models.
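The exposure-bias ramp test reduces to a simple monitoring computation: at each ramp stage, compute calibration and loss on the treatment slice and look for improvement as the treatment share grows. A minimal generic sketch (not Pinterest's tooling; the helper names and inputs are illustrative):

```python
import math

def calibration(preds, labels):
    """Calibration ratio: sum of predicted pCVR over sum of observed conversions.
    A ratio above 1.0 means the model is over-calibrated (predicting too high)."""
    return sum(preds) / max(sum(labels), 1e-9)

def log_loss(preds, labels, eps=1e-9):
    """Mean binary cross-entropy over a slice of logged (pCVR, conversion) pairs."""
    total = 0.0
    for p, y in zip(preds, labels):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(preds)

def exposure_bias_ramp_report(stages):
    """stages: list of (traffic_share, preds, labels) for the treatment arm.
    If exposure bias drove the O/O gap, calibration should drift toward 1.0
    and loss should drop as the treatment share grows; Pinterest did not see
    that pattern between ~20% and ~70%."""
    report = []
    for share, preds, labels in stages:
        report.append({
            "share": share,
            "calibration": calibration(preds, labels),
            "log_loss": log_loss(preds, labels),
        })
    return report
```

If the treatment's calibration ratio stays flat across ramp stages, as it did here, exposure bias is rejected as a sufficient explanation.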
The actual causes sat in layer 2 (serving + features):
1. Feature O/O discrepancy — features existed in training, absent at serving¶
L1 Pin embeddings at Pinterest are built from indexing snapshots and fed into an ANN index used for retrieval + L1 ranking. Critically, this pipeline is separate from the L2 Feature Store used downstream. So:
- Offline: the model was trained + evaluated on rich logged features including detailed advertiser and Pin-promotion signals.
- Online: the embedding builder only saw the subset of features that had been explicitly onboarded into the L1 embedding.
Putting offline feature-insertion tables next to online feature-coverage dashboards, the team found several high-impact Pin feature families had never made it into the L1 embedding path:
- Targeting-spec flags (interest targeting, search-term modes, auto-targeting).
- Offsite conversion visit counts at 1 / 7 / 30 / 90 days.
- Annotations + MediaSage image embeddings.
These signals existed in training logs, so the model (reasonably) learned to lean on them. But at serving time they were absent from the embeddings, meaning for many oCPM and performance-sensitive ads the online model was running on a much thinner feature set than the one it had been evaluated on offline.
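The audit that surfaced these gaps is essentially a set difference between what training logged and what serving onboarded. A hedged sketch with hypothetical inputs (Pinterest's actual feature-insertion tables and coverage dashboards are internal; the 0.5 coverage threshold is an illustrative assumption):

```python
def feature_parity_audit(training_features, serving_features, coverage=None):
    """Cross-reference features present in offline training logs against
    features actually onboarded into the serving artifact (here, the L1
    embedding build).

    training_features: set of feature names seen in training logs.
    serving_features:  set of feature names onboarded into the serving path.
    coverage:          optional dict of per-feature online coverage rates (0..1);
                       features onboarded but rarely populated are also flagged.
    Returns the features the model could learn to rely on but never (or
    rarely) sees online.
    """
    missing = training_features - serving_features
    low_coverage = set()
    if coverage:
        low_coverage = {f for f in serving_features & training_features
                        if coverage.get(f, 0.0) < 0.5}
    return {"missing_at_serving": missing, "low_online_coverage": low_coverage}
```

Run against the feature families named above, the audit would flag the targeting-spec flags and image embeddings as present in training but absent from the L1 embedding path.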
Fix: update UFR (Pinterest's feature registry) configs to onboard the missing features into L1 embedding usage, and watch online feature-coverage dashboards confirm the gap closing. Online loss moved in the right direction for both CVR and engagement models, especially on shopping traffic. Pinterest also changed default UFR behavior so that features onboarded for L2 are automatically considered for L1 embedding usage — closing a recurring class of silent O/O bugs at the tooling level.
The lesson Pinterest flags is domain-independent: "It's not enough for features to exist in training logs or the Feature Store — they also need to be present in the serving artifacts (like ANN indices) that L1 actually uses to serve traffic."
2. Embedding version skew — query tower vs Pin tower on different checkpoints¶
The second cause is specific to two-tower architectures (one tower encodes the user / query, a second encodes the item / Pin, and a dot-product gives a score). Even when features are correct, the two towers may be producing embeddings from different model checkpoints.
- Offline: clean single-checkpoint setup — one fixed model version for both towers, consistent features, deterministic batch inference.
- Online: realtime enrichment writes fresh Pin embeddings into hourly indexing snapshots, query models roll on their own schedule, and for large tiers the index build + deploy cycle can span days. Multiple embedding versions coexist in the same retrieval index.
The result is natural version skew: dot products computed between a query from version X and Pins whose embeddings come from X, X−1, X−2 and so on. To quantify it, Pinterest ran controlled sweeps:
- Fix the query-tower version.
- Vary the Pin embedding version across a realistic range.
- Measure loss + calibration across tiers + log sources.
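The three steps above can be sketched as a small harness. `score_fn` is a hypothetical hook standing in for Pinterest's internal replay tooling; it re-embeds an eval batch with the given checkpoints and scores it:

```python
import math

def _log_loss(preds, labels, eps=1e-9):
    """Mean binary cross-entropy on a (pCVR, conversion) slice."""
    total = 0.0
    for p, y in zip(preds, labels):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(preds)

def version_skew_sweep(score_fn, eval_batches, query_version, pin_versions):
    """Controlled skew sweep: fix the query-tower checkpoint, vary the
    Pin-tower checkpoint across a realistic range, and measure mean loss
    at each skew level.

    score_fn(query_version, pin_version, batch) -> (preds, labels)
    """
    results = {}
    for pv in pin_versions:
        losses = [_log_loss(*score_fn(query_version, pv, b)) for b in eval_batches]
        results[pv] = sum(losses) / len(losses)
    return results
```

A skew-sensitive model family shows loss rising monotonically as the Pin-embedding version falls further behind the fixed query version; a robust one stays roughly flat.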
Findings:
- For simpler, more stable model families: skew caused some degradation but not enough to fully explain online behavior.
- For more complex variants (such as DHEN): the same skew led to noticeably worse loss on some slices — large enough to materially drag down online performance vs the offline idealized case.
Pinterest's response is not to try to eliminate skew (hard in a live system) but to treat skew as a deployment constraint:
- For large tiers, prefer batch embedding inference so each ANN build uses a single consistent embedding version (patterns/batch-embedding-for-index-consistency).
- Require every new model family to pass explicit version-skew sensitivity checks as part of model readiness (patterns/version-skew-sensitivity-check).
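The readiness gate in the second pattern might look like the following sketch. The 2% relative-degradation threshold is an illustrative assumption, not a number Pinterest discloses:

```python
def passes_skew_readiness(sweep_results, current_version, max_rel_degradation=0.02):
    """Model-readiness gate: reject a model family whose loss under realistic
    version skew degrades more than max_rel_degradation relative to the
    zero-skew baseline.

    sweep_results:   {pin_version: mean_loss} from a fixed-query-tower sweep.
    current_version: the Pin-embedding version matching the query tower,
                     i.e. the zero-skew baseline entry.
    """
    baseline = sweep_results[current_version]
    worst = max(sweep_results.values())
    return (worst - baseline) / baseline <= max_rel_degradation
```

Under this gate, a skew-robust family ships as-is, while a skew-sensitive one (a DHEN-like variant, in Pinterest's telling) is held back or restricted to batch-built, version-consistent indices.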
"Offline numbers came from a cleaner world than the one the model actually lived in online."
3. Funnel + metric — even correct predictions can fail to move CPA¶
Fixing features + skew closed most of "what we thought we were serving vs what was running." The team then addressed the systemic question: if predictions are fine, can the rest of the system translate them into CPA wins?
Funnel alignment. The ads funnel has multiple stages under different constraints. L1 can be strictly better on its own metrics and still not move the overall system if the rest of the funnel is near its ceiling. Pinterest tracked:
- Retrieval recall: among final auction winners, how many came from the L1 output set?
- Ranking recall: among the top-K candidates by downstream utility, how many appeared in the L1 output set?
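Both funnel metrics are straightforward set overlaps; a generic sketch (identifiers and input shapes are assumptions, not Pinterest's schema):

```python
def retrieval_recall(auction_winners, l1_output):
    """Among final auction winners, the fraction that came from the L1 output set."""
    winners = set(auction_winners)
    return len(winners & set(l1_output)) / max(len(winners), 1)

def ranking_recall(candidates_by_utility, l1_output, k):
    """Among the top-K candidates ranked by downstream utility, the fraction
    that appeared in the L1 output set."""
    ranked = sorted(candidates_by_utility, key=candidates_by_utility.get, reverse=True)
    top_k = set(ranked[:k])
    return len(top_k & set(l1_output)) / max(len(top_k), 1)
```

When these ratios are already near 1.0 on a surface, a better L1 model cannot surface meaningfully better candidates, which is the recall-ceiling effect described below.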
Observed: cases where offline L1 metrics improved but retrieval/ranking recall did not improve end-to-end, particularly on surfaces already near their recall ceilings. Among several treatment arms with strong offline gains, only the ones where recall actually moved produced clear online wins. "Beyond a certain point, L1 model quality is not the bottleneck — the funnel and utility design are."
Metric mismatch. Offline and online metrics live in different regimes:
- Offline: LogMAE, KL-divergence, calibration, often with L2 predictions as teacher labels.
- Online: CPA (the primary conversion business metric), shaped by bids, budgets, pacing, auction logic.
Replay analyses showed it is possible to deliver more or better candidates (by downstream utility) and still not see the expected CPA movement once everything filters through real-world auction behavior. Offline metrics are "necessary, not sufficient" — they have to be interpreted through the funnel and utility context they will live in.
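Calibration, the one metric both regimes share, is typically reported per pCVR bucket, which is how the "better calibration in every pCVR bucket" claim above would be computed. A generic sketch of that breakdown (not Pinterest's implementation, which is undisclosed):

```python
def bucketed_calibration(preds, labels, n_buckets=10):
    """Per-pCVR-percentile calibration: sort by predicted pCVR, split into
    equal-size buckets, and report sum(pred)/sum(label) in each. A ratio
    above 1.0 is over-calibration, the issue Pinterest saw persist online."""
    pairs = sorted(zip(preds, labels))
    size = max(len(pairs) // n_buckets, 1)
    out = []
    for i in range(0, len(pairs), size):
        bucket = pairs[i:i + size]
        p_sum = sum(p for p, _ in bucket)
        y_sum = sum(y for _, y in bucket)
        out.append(p_sum / max(y_sum, 1e-9))
    return out
```

A model can win this breakdown in every bucket offline and still fail to move CPA, since CPA also depends on bids, budgets, pacing, and auction logic.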
The closing shift — O/O as a design constraint, not a bug¶
The team's final framing: "O/O discrepancy is not something you debug at the end; it's something you design for from the start." Three disciplines:
- Treat model + embeddings + feature pipelines as one system. Trust offline wins only after verifying the serving stack sees the same world the model was trained in.
- The funnel sets the ceiling. Once recall and utility are saturated or misaligned, better L1 predictions alone won't move CPA.
- Debuggability is part of the product. Coverage dashboards, embedding-skew tests, and parity harnesses are as important to model velocity as the architecture itself.
Key takeaways¶
- "Online–Offline (O/O) discrepancy" is the named production hazard. A new L1 CVR model showed 20–45% LogMAE reduction vs production on shared eval across three log sources + every pCVR bucket, but Budget-Split online A/Bs delivered "neutral or slightly worse CPA for key oCPM segments" plus unexplained mix-shifts. Pinterest gives this gap a name, then engineers around it — canonical wiki instance of online-offline discrepancy. (Source: sources/2026-02-27-pinterest-bridging-the-gap-online-offline-discrepancy-l1-cvr)
- Three-layer full-stack diagnosis framework — don't hunt for one bug. "Instead of trying to guess a single root cause, we treated this as a full-stack diagnosis and organized our hypotheses into three layers: (1) Model & evaluation … (2) Serving & features … (3) Funnel & utility. For each bucket of hypotheses we asked: 'Could this alone explain the O/O gap we see?' Then we used data to accept or reject it." Canonical wiki statement of the three-layer O/O diagnosis pattern — generalizes beyond ads ranking to any ML-serving O/O hunt. (Source: sources/2026-02-27-pinterest-bridging-the-gap-online-offline-discrepancy-l1-cvr)
- Exposure-bias + timeouts + offline-eval bugs were all ruled out as sufficient explanations — not dismissed but tested. Exposure bias: ramped treatment traffic share ~20% → ~70% and monitored online calibration / loss. "If exposure bias were the main issue, we would expect treatment metrics to improve as it owned more traffic. We did not see that pattern." Timeouts: p50/p90/p99 latency + success-rate comparisons for both query + Pin towers showed no materially worse tail for treatment. Offline eval: re-computed loss + calibration across three log-source mixes, pCVR-percentile breakdowns, regenerated datasets — experimental model still won everywhere. Template for how to cleanly eliminate common O/O explanations with data. (Source: sources/2026-02-27-pinterest-bridging-the-gap-online-offline-discrepancy-l1-cvr)
- Feature O/O discrepancy — training sees more features than serving. L1 Pin embeddings are built from indexing snapshots and fed into an ANN index — a pipeline separate from the L2 Feature Store. Offline training used rich logged features; online embeddings only included features explicitly onboarded into L1 embedding usage. Cross-referencing offline feature-insertion tables against online feature-coverage dashboards surfaced entire feature families missing from the online path: targeting-spec flags (interest targeting, search-term modes, auto-targeting), offsite conversion visit counts (1/7/30/90-day), annotations + MediaSage image embeddings. Canonical wiki example of feature-parity audit and a concrete failure mode of training / serving boundary crossings. (Source: sources/2026-02-27-pinterest-bridging-the-gap-online-offline-discrepancy-l1-cvr)
- Fix pattern: onboard missing features + change UFR default. Updated UFR (feature registry) configs to include the missing features in L1 embeddings; watched coverage recover on online dashboards + online loss move in the right direction for CVR + engagement models "especially on shopping traffic." Then closed the class-of-bug at the tooling level: changed UFR default so that features onboarded for L2 are automatically considered for L1 embedding usage. Canonical lesson: "It's not enough for features to exist in training logs or the Feature Store — they also need to be present in the serving artifacts (like ANN indices) that L1 actually uses to serve traffic." (Source: sources/2026-02-27-pinterest-bridging-the-gap-online-offline-discrepancy-l1-cvr)
- Embedding version skew — query tower and Pin tower on different checkpoints. Offline evaluation runs with one fixed model version for both towers; online, realtime enrichment writes fresh Pin embeddings into hourly indexing snapshots, query models roll on their own schedule, and index build + deploy cycles can span days on large tiers — so multiple embedding versions coexist in the same retrieval index. Dot products end up computed between a query from version X and Pins from X, X−1, X−2, … Canonical wiki statement of embedding version skew in two-tower retrieval / ranking systems. (Source: sources/2026-02-27-pinterest-bridging-the-gap-online-offline-discrepancy-l1-cvr)
- Skew sensitivity varies by model family — DHEN was worse than simpler variants. Controlled sweeps (fix query-tower version, vary Pin-embedding version across a realistic range, measure loss / calibration across tiers + log sources) found: "For simpler, more stable model families, this skew caused some degradation but not enough to fully explain the online behavior. For more complex variants (like DHEN), the same level of skew led to noticeably worse loss on some slices — large enough to materially drag down online performance." Model-architecture choice is also a skew-sensitivity choice. (Source: sources/2026-02-27-pinterest-bridging-the-gap-online-offline-discrepancy-l1-cvr)
- Don't try to eliminate skew — design around it. "Instead of trying to completely eliminate skew (which is hard in a live system), we started treating it as a deployment constraint: for large tiers we favor batch embedding inference so each ANN build uses a single, consistent embedding version, and we require every new model family to go through explicit version-skew sensitivity checks as part of model readiness." Two reusable patterns: patterns/batch-embedding-for-index-consistency (batch inference so each ANN build is version-consistent) and patterns/version-skew-sensitivity-check (skew-sweep as model-readiness gate). Explicit acknowledgment that "offline numbers came from a cleaner world than the one the model actually lived in online." (Source: sources/2026-02-27-pinterest-bridging-the-gap-online-offline-discrepancy-l1-cvr)
- The funnel sets the ceiling — retrieval + ranking recall are the real constraints. Pinterest tracked two retrieval-ranking funnel recall metrics: retrieval recall (among final auction winners, how many came from L1 output?) and ranking recall (among top-K candidates by downstream utility, how many appeared in L1 output?). Observed cases where offline L1 metrics improved but end-to-end recall did not improve, especially on surfaces already near their recall ceilings. Among several treatment arms with strong offline gains, "only one or two produced clear online wins, which matched where recall actually moved." Generalized lesson: "beyond a certain point, L1 model quality is not the bottleneck — the funnel and utility design are." (Source: sources/2026-02-27-pinterest-bridging-the-gap-online-offline-discrepancy-l1-cvr)
- Metric mismatch — LogMAE ≠ CPA. Offline metrics (LogMAE, KL, calibration, often with L2 predictions as teacher labels) and online metrics (CPA, shaped by bids / budgets / pacing / auction logic) live in different regimes. Replay analyses showed you can deliver more or better candidates by downstream utility and still not see the CPA movement you'd expect once everything is filtered through real-world auction behavior. "Offline metrics are necessary, not sufficient. You need to interpret them through the funnel and utility context they're going to live in." (Source: sources/2026-02-27-pinterest-bridging-the-gap-online-offline-discrepancy-l1-cvr)
- Closing shift: O/O is a design constraint, not a final-step bug. "O/O discrepancy is not something you debug at the end; it's something you design for from the start." Three disciplines: treat model + embeddings + feature pipelines as one system; accept that the funnel sets the ceiling; treat debuggability (coverage dashboards + skew tests + parity harnesses) as part of the product, not a side-investment. (Source: sources/2026-02-27-pinterest-bridging-the-gap-online-offline-discrepancy-l1-cvr)
Architectural numbers¶
| Datum | Value | Scope |
|---|---|---|
| Offline LogMAE reduction (new CVR model vs production) | 20–45% | Across multiple log sources |
| Treatment traffic ramp for exposure-bias test | ~20% → ~70% | Budget-Split experiment |
| Missing feature windows (offsite conversion visits) | 1 / 7 / 30 / 90 days | Per-Pin feature family |
| Index build + deploy cycle (large tiers) | "can span days" | Source of embedding version skew |
| Pin embedding snapshot cadence | hourly | Realtime enrichment → indexing snapshots |
| Publication date | 2026-02-27 | Pinterest Engineering (Medium) |
No absolute CPA deltas, no absolute feature-coverage percentages, no absolute recall numbers, no sweep ranges for version skew, no DHEN-vs-simple loss differential disclosed. The post is a methodology retrospective, not a quantitative impact report.
Systems introduced¶
- systems/pinterest-l1-ranking — Pinterest's L1 ranking stage in the ads funnel. Sits between retrieval and L2 ranking; runs under tight latency; filters + prioritizes candidates so downstream L2 + auction see a manageable set. Uses a two-tower model with Pin embeddings served from an ANN index built on indexing snapshots.
- systems/pinterest-ufr — Pinterest's Unified Feature Registry. Configures which features are onboarded into the L2 Feature Store and which feature families are available to the L1 embedding-build path. The post's fix included changing UFR's default so that features onboarded for L2 are automatically considered for L1 embedding usage.
Concepts introduced¶
- concepts/online-offline-discrepancy — the named gap between offline ML-metric wins and online A/B / business-metric wins; Pinterest's canonical treatment.
- concepts/two-tower-architecture — the query-tower + Pin-tower retrieval / ranking architecture whose embedding-version-independence assumption breaks at Pinterest scale.
- concepts/embedding-version-skew — the production failure mode where query + item towers run on different model checkpoints; materially degrades complex model families (DHEN) more than simple ones.
- concepts/ann-index — approximate-nearest-neighbor index; the serving artifact for L1 Pin embeddings; distinct from the L2 Feature Store and a common site of feature-parity gaps.
- concepts/exposure-bias-ml — the hypothesis (ruled out here) that control-dominant traffic biases metrics against small treatments; Pinterest's ramp-test methodology.
- concepts/feature-coverage-dashboard — the online operational primitive that made the feature O/O gap visible.
Patterns introduced¶
- patterns/three-layer-oo-diagnosis — structure O/O hypotheses into model-eval / serving-features / funnel-utility layers; test each for "could this alone explain the gap?"
- patterns/feature-parity-audit — cross-reference offline feature-insertion tables against online feature-coverage dashboards; find families missing from the serving path.
- patterns/version-skew-sensitivity-check — fix one tower version, vary the other across a realistic range, measure loss / calibration degradation; gate model-readiness on acceptable skew sensitivity.
- patterns/batch-embedding-for-index-consistency — for large tiers, prefer batch embedding inference so each ANN build uses a single consistent embedding version, instead of incremental multi-version snapshots.
Caveats¶
- Announcement-retrospective voice with many numbers elided: no absolute CPA deltas, no absolute feature-coverage percentages before/after fix, no named numerical skew thresholds, no named DHEN-vs-simple-model loss delta, no traffic-slice breakdowns beyond "especially on shopping traffic" for the feature fix.
- Pinterest-internal systems under-documented: UFR (feature registry) behavior, MediaSage image embeddings, DHEN architecture, L2 feature-store shape, and the specific L1 embedding-build pipeline internals are all referenced but not described. Where this source page uses bracket-labeled names like UFR, those are stub pages until Pinterest publishes deeper posts.
- DHEN architecture not explained. Named as a "more complex variant" with worse skew sensitivity; no citation or architectural detail.
- Single-company signal — the post is Pinterest's experience only; comparable data from Meta / Google / TikTok ads is not present.
- No decoder / auction detail — CPA shaping by bids / budgets / pacing / auction logic is asserted but not decomposed; the "metric mismatch" section argues that this matters without demonstrating how much.
- "Recall was the bottleneck" is stated qualitatively; no specific recall-ceiling numbers are disclosed.
- Forward-work implied, not named — "bake these ideas into our launch process" is the closing direction; the concrete checklist / dashboard / harness is not shared in this post.
Source¶
- Original: https://medium.com/pinterest-engineering/bridging-the-gap-diagnosing-online-offline-discrepancy-in-pinterests-l1-conversion-models-1320faaaeefe?source=rss----4c5a5f6279b6---4
- Raw markdown: raw/pinterest/2026-02-27-bridging-the-gap-diagnosing-onlineoffline-discrepancy-in-pin-599ceab5.md
Related¶
- Companies: companies/pinterest
- Systems: systems/pinterest-l1-ranking, systems/pinterest-ufr
- Concepts: concepts/online-offline-discrepancy, concepts/two-tower-architecture, concepts/embedding-version-skew, concepts/ann-index, concepts/exposure-bias-ml, concepts/feature-coverage-dashboard, concepts/training-serving-boundary, concepts/retrieval-ranking-funnel, concepts/feature-store
- Patterns: patterns/three-layer-oo-diagnosis, patterns/feature-parity-audit, patterns/version-skew-sensitivity-check, patterns/batch-embedding-for-index-consistency