CONCEPT Cited by 1 source
Online-offline discrepancy
Definition
Online-offline discrepancy (O/O discrepancy) is the named production hazard where a change to an ML model shows clear offline wins on standard evaluation metrics (loss, calibration, AUC, LogMAE, etc.) but those wins fail to carry over, coming out neutral or negative, when the same change is A/B-tested online against the business metric it was meant to move (CPA, conversion rate, engagement, revenue).
Pinterest's Ads ML team introduced the O/O terminology in their 2026-02-27 retrospective on L1 CVR models (sources/2026-02-27-pinterest-bridging-the-gap-online-offline-discrepancy-l1-cvr). They saw "20–45% LogMAE reduction vs the production model across multiple log sources" offline alongside "neutral or slightly worse CPA for key oCPM segments" online.
Why it happens
O/O is not a single bug but a class of bugs with multiple structural causes. Pinterest's framework (three-layer O/O diagnosis) organizes them into three layers:
Layer 1: model + evaluation
- Sampling bias in the offline eval dataset — evaluated only on easy segments, or on a log-source mix that doesn't match the production request shape.
- Label leakage — the offline label is partially predictable from the feature set in a way that won't replicate online.
- Outlier domination — gains driven by a small number of large-loss samples that don't correspond to business impact.
- Eval-dataset construction — regenerated datasets with different log-source mixes producing inconsistent results.
Pinterest ruled these out by re-computing loss + calibration on three log sources (auction-winner, full-request, partial-request), breaking down by pCVR percentile, and re-evaluating both models on identical data.
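The percentile breakdown described above can be sketched as follows. This is a minimal illustration, not Pinterest's tooling: the function name `calibration_by_bucket`, the equal-count bucketing, and the ratio definition (mean prediction divided by observed conversion rate) are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def calibration_by_bucket(
    preds: Sequence[float],
    labels: Sequence[int],
    n_buckets: int = 4,
) -> List[Tuple[float, float, float]]:
    """Return (mean_pred, observed_rate, ratio) per equal-count pCVR bucket.

    A ratio above 1 means the model over-predicts in that bucket.
    Hypothetical helper: bucketing scheme and ratio are assumptions.
    """
    if len(preds) != len(labels):
        raise ValueError("preds and labels must have the same length")
    # Sort by predicted CVR so each bucket covers one percentile range.
    ordered = sorted(zip(preds, labels), key=lambda pair: pair[0])
    size = len(ordered)
    rows = []
    for b in range(n_buckets):
        chunk = ordered[b * size // n_buckets:(b + 1) * size // n_buckets]
        if not chunk:
            continue
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        observed = sum(y for _, y in chunk) / len(chunk)
        # A bucket with no conversions has an undefined ratio; report inf.
        ratio = mean_pred / observed if observed > 0 else float("inf")
        rows.append((mean_pred, observed, ratio))
    return rows
```

Running the same function on the production model's and the candidate model's predictions over one identical set of labels, once per log source, shows whether an aggregate win is spread across buckets or concentrated in a few of them.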
Layer 2: serving + features (where Pinterest found the bugs)
- Feature parity gap — features present in training logs + L2 Feature Store but absent from a serving artifact like an ANN index used by retrieval / L1 ranking. The model learns to use them offline; online they're zeros.
- Embedding version skew — in two-tower systems, query + item towers produce embeddings from different model checkpoints because index build + deploy cycles take longer than model rollout cycles.
- Serving-path inference differences — quantization, batch size, FP16 vs FP32, different library versions between the training framework and the online serving stack.
- Model-versioning bugs — production actually serving a different checkpoint than the one evaluated.
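A feature parity gap of the kind listed above can be surfaced with a simple coverage check. The sketch below is illustrative only: it assumes serving-side feature values can be sampled as a list of dictionaries, and the names `feature_coverage` and `parity_gaps` and the 0.99 threshold are hypothetical.

```python
from typing import Dict, Iterable, List, Mapping, Optional


def feature_coverage(
    model_features: Iterable[str],
    serving_rows: List[Mapping[str, Optional[float]]],
) -> Dict[str, float]:
    """Fraction of sampled serving rows in which each model feature is present.

    A feature counts as present when its key exists and its value is not None.
    """
    total = len(serving_rows)
    coverage = {}
    for name in model_features:
        present = sum(1 for row in serving_rows if row.get(name) is not None)
        coverage[name] = present / total if total else 0.0
    return coverage


def parity_gaps(coverage: Mapping[str, float], min_coverage: float = 0.99) -> List[str]:
    """Names of features whose serving coverage falls below the threshold."""
    return sorted(name for name, frac in coverage.items() if frac < min_coverage)
```

A feature the model was trained on that reports zero coverage in the serving sample is the signature of the gap: the model relies on it offline and receives nothing online.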
Layer 3: funnel + utility
- Funnel-recall saturation — L1 improves its own metric, but the funnel's retrieval / ranking recall is already near its ceiling, so no additional good candidates reach L2 (retrieval → ranking funnel).
- Metric mismatch — offline metric (LogMAE, KL, calibration) and online metric (CPA, CTR, revenue) measure different things. Offline is necessary, not sufficient.
- Bid / pacing / auction filtering — the auction re-shapes what candidates get impressions; even materially better ranking can be erased by bid + budget interactions.
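Funnel-recall saturation can be checked by comparing recall of the control and treatment L1 outputs against the same ground truth. The sketch below assumes candidates are identified by string ids and that ground truth is the set of auction winners; `recall_at_k` is a hypothetical helper, not a function from Pinterest's stack.

```python
from typing import Iterable, Sequence


def recall_at_k(ranked_candidates: Sequence[str], ground_truth: Iterable[str], k: int) -> float:
    """Share of ground-truth items that appear in the top-k ranked candidates."""
    truth = set(ground_truth)
    if not truth:
        raise ValueError("recall is undefined for an empty ground-truth set")
    top_k = set(ranked_candidates[:k])
    return len(top_k & truth) / len(truth)
```

If the control's recall is already close to 1.0 at the k that L2 consumes, a better L1 model has little room to surface additional good candidates, whatever its offline loss says.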
Common ruled-out candidates
Hypotheses that are frequently raised but usually turn out not to be the main cause:
- Exposure bias — the suspicion that control-dominant traffic biases online metrics against small treatments. Testable by ramping treatment share (Pinterest went ~20% → ~70%) and watching for metric improvement as share grows. In Pinterest's case the over-calibration issue persisted at higher shares — exposure bias was not the cause.
- Timeouts + serving failures — worse tail latency could degrade treatment. Testable by comparing p50/p90/p99 + success rate across control + treatment per tower. Pinterest saw no materially worse behavior.
- Traffic share too small for significance. This is usually orthogonal to O/O: adding statistical power tightens the confidence interval around the online result, but it should not shrink the offline-online gap itself.
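The latency and success-rate comparison mentioned for the timeout hypothesis might look like the sketch below. It assumes per-request latencies and success flags have been collected separately for each arm (and, in practice, per tower); the function names are hypothetical, and percentiles use the standard library's `statistics.quantiles` with the inclusive method.

```python
import statistics
from typing import Dict, Sequence


def latency_summary(latencies_ms: Sequence[float], successes: Sequence[bool]) -> Dict[str, float]:
    """p50/p90/p99 latency and success rate for one experiment arm."""
    # quantiles(n=100) returns 99 cut points; index i-1 holds the i-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "p50": cuts[49],
        "p90": cuts[89],
        "p99": cuts[98],
        "success_rate": sum(successes) / len(successes),
    }


def arm_delta(control: Dict[str, float], treatment: Dict[str, float]) -> Dict[str, float]:
    """Treatment minus control for every summary statistic."""
    return {key: treatment[key] - control[key] for key in control}
```

Deltas near zero on all four statistics would support rejecting the timeout hypothesis, which matches what Pinterest reported.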
How to diagnose
Pinterest's three-layer framework: for each layer's hypotheses, ask "could this alone explain the gap?" and use data to accept or reject. Key disciplines:
- Don't stop at one cause. Even after ruling out layer-1 issues, layers 2 + 3 usually both contribute.
- Instrument serving-path coverage. Without feature coverage dashboards, layer-2 feature-parity gaps are not visible at all.
- Run controlled sweeps for layer-2 hazards. For embedding version skew: fix one tower, vary the other, measure loss / calibration degradation.
- Track funnel recall, not just the model metric. Retrieval recall and ranking recall of the new L1 output, measured against ground-truth auction winners and downstream utility, are the real ceiling.
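The controlled sweep for embedding version skew (fix one tower, vary the other) can be sketched as below. Everything here is an assumption made for illustration: the towers are modeled as callables taking an id and a checkpoint version, the score is a dot product passed through a sigmoid, and the metric is mean log loss. A real system would substitute its own embedding lookup and its own loss or calibration metric.

```python
import math
from typing import Callable, Dict, Iterable, Sequence, Tuple

Vector = Sequence[float]
Example = Tuple[str, str, int]  # (query_id, item_id, binary label)


def _log_loss(prob: float, label: int) -> float:
    prob = min(max(prob, 1e-7), 1 - 1e-7)  # clip to keep log() finite
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))


def skew_sweep(
    embed_query: Callable[[str, int], Vector],
    embed_item: Callable[[str, int], Vector],
    examples: Sequence[Example],
    query_version: int,
    item_versions: Iterable[int],
) -> Dict[int, float]:
    """Mean log loss per item-tower version with the query tower held fixed."""
    results = {}
    for version in item_versions:
        total = 0.0
        for query_id, item_id, label in examples:
            q = embed_query(query_id, query_version)
            v = embed_item(item_id, version)
            score = sum(a * b for a, b in zip(q, v))
            total += _log_loss(1.0 / (1.0 + math.exp(-score)), label)
        results[version] = total / len(examples)
    return results
```

Comparing each entry against the matched-version entry gives the degradation attributable to skew alone, since the evaluation data and the query tower are unchanged across the sweep.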
How to prevent
Pinterest's closing frame: "O/O discrepancy is not something you debug at the end; it's something you design for from the start."
- Treat model + embeddings + feature pipelines as one system. A feature that exists in training logs isn't a feature you have; you only have features that exist in the serving artifact.
- Make debuggability part of the product. Coverage dashboards, embedding-skew tests, and parity harnesses are first-class infrastructure, not side-investments.
- Gate model readiness on skew sensitivity. Run skew sweeps as a standard readiness check before a model is cleared for rollout.
- Change tooling defaults to close bug classes. Pinterest changed UFR's default so that "features onboarded for L2 are automatically considered for L1 embedding usage" — a single default change closed a recurring parity-gap bug.
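Gating readiness on skew sensitivity could be as small as the check below. It is a sketch under stated assumptions: the input maps a version lag (0 meaning both towers come from the same checkpoint) to an offline loss, and the 5% tolerance is a placeholder rather than a value from the source.

```python
from typing import Mapping


def passes_skew_gate(
    loss_by_version_lag: Mapping[int, float],
    max_relative_degradation: float = 0.05,
) -> bool:
    """True if the worst skewed loss stays within tolerance of the matched loss."""
    baseline = loss_by_version_lag[0]  # lag 0: towers from the same checkpoint
    if baseline <= 0:
        raise ValueError("baseline loss must be positive")
    worst = max(loss_by_version_lag.values())
    return (worst - baseline) / baseline <= max_relative_degradation
```

The point of making this a gate rather than a report is that a model which only performs well under perfectly matched versions is flagged before rollout instead of during an A/B test.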
Relation to other skew concepts
- Training / serving boundary — the organizational / infrastructure split. O/O discrepancy is often a symptom of a poorly-bridged training-serving boundary, specifically around features + model artifacts.
- Embedding version skew — a specific layer-2 cause of O/O in two-tower systems.
- Feature-store parity — the broader principle that the training-time feature view must equal the serving-time feature view.
Seen in
- sources/2026-02-27-pinterest-bridging-the-gap-online-offline-discrepancy-l1-cvr — canonical wiki instance. 20–45% offline LogMAE reduction vs neutral/negative CPA online; three-layer diagnosis; exposure bias + timeouts + offline-eval bugs ruled out; feature parity gap + embedding version skew found + fixed; closing "design for O/O from the start" shift.
Related
- systems/pinterest-l1-ranking
- concepts/training-serving-boundary
- concepts/two-tower-architecture
- concepts/embedding-version-skew
- concepts/feature-coverage-dashboard
- concepts/retrieval-ranking-funnel
- concepts/exposure-bias-ml
- patterns/three-layer-oo-diagnosis
- patterns/feature-parity-audit
- patterns/version-skew-sensitivity-check
- patterns/batch-embedding-for-index-consistency