CONCEPT

Online-offline discrepancy

Definition

Online-offline discrepancy (O/O discrepancy) is the production hazard in which a change to an ML model shows clear offline wins on standard evaluation metrics (loss, calibration, AUC, LogMAE, etc.) but fails to translate online: when the same change is A/B-tested against the business metric it was meant to move (CPA, conversion rate, engagement, revenue), the result is neutral or even negative.

Pinterest's Ads ML team introduced the O/O terminology in their 2026-02-27 retrospective on L1 CVR models (sources/2026-02-27-pinterest-bridging-the-gap-online-offline-discrepancy-l1-cvr). They saw "20–45% LogMAE reduction vs the production model across multiple log sources" offline alongside "neutral or slightly worse CPA for key oCPM segments" online.

Why it happens

O/O is not a single bug but a class of bugs with multiple structural causes. Pinterest's framework (three-layer O/O diagnosis) organizes them into three layers:

Layer 1 — model + evaluation

  • Sampling bias in the offline eval dataset — evaluated only on easy segments, or on a log-source mix that doesn't match the production request shape.
  • Label leakage — the offline label is partially predictable from the feature set in a way that won't replicate online.
  • Outlier domination — gains driven by a small number of large-loss samples that don't correspond to business impact.
  • Eval-dataset construction — regenerated datasets with different log-source mixes producing inconsistent results.

Pinterest ruled these out by re-computing loss + calibration on three log sources (auction-winner, full-request, partial-request), breaking down by pCVR percentile, and re-evaluating both models on identical data.
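A minimal sketch of that layer-1 check: re-evaluate both models on identical examples, sliced by log source or pCVR percentile bucket. The function names, the eps guard, and the exact LogMAE/calibration formulas here are assumptions for illustration, not Pinterest's actual code.

```python
import numpy as np

def logmae(y_true, y_pred, eps=1e-6):
    # mean absolute error in log space; eps guards against log(0)
    return float(np.mean(np.abs(np.log(y_pred + eps) - np.log(y_true + eps))))

def calibration(y_true, y_pred):
    # ratio of summed predictions to summed labels; 1.0 = perfectly calibrated
    return float(np.sum(y_pred) / np.sum(y_true))

def eval_on_identical_data(y_true, preds_by_model, slice_masks):
    """Re-evaluate every model on the same examples, broken down by slice
    (log source, pCVR percentile bucket, ...). Inconsistent slice-level
    results point at eval-dataset construction or sampling bias."""
    return {
        model: {
            name: {"logmae": logmae(y_true[m], y_pred[m]),
                   "calibration": calibration(y_true[m], y_pred[m])}
            for name, m in slice_masks.items()
        }
        for model, y_pred in preds_by_model.items()
    }
```

Slicing by pCVR percentile also surfaces outlier domination: a win concentrated in one extreme bucket is a warning sign, not a model improvement.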

Layer 2 — serving + features (where Pinterest found the bugs)

  • Feature parity gap — features present in training logs + L2 Feature Store but absent from a serving artifact like an ANN index used by retrieval / L1 ranking. The model learns to use them offline; online they're zeros.
  • Embedding version skew — in two-tower systems, query + item towers produce embeddings from different model checkpoints because index build + deploy cycles take longer than model rollout cycles.
  • Serving-path inference differences — quantization, batch size, FP16 vs FP32, different library versions between the training framework and the online serving stack.
  • Model-versioning bugs — production actually serving a different checkpoint than the one evaluated.
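The first bullet above, the feature parity gap, reduces to a set comparison between what the model trained on and what the serving artifact actually contains. A hedged sketch (all names are hypothetical, not Pinterest's tooling):

```python
def feature_parity_report(training_features, serving_features):
    """Compare the feature set the model was trained with against the
    features actually present in a serving artifact (e.g. an ANN index).
    Anything in `missing_online` is a parity gap: the model learned to
    rely on it offline, but at serving time it degrades to a default/zero."""
    trained, served = set(training_features), set(serving_features)
    return {
        "missing_online": sorted(trained - served),
        "coverage": len(trained & served) / max(len(trained), 1),
    }
```

Run against every serving artifact (ANN index, feature store snapshot, model bundle), this is the core of a feature-coverage dashboard: a nonempty `missing_online` means the offline eval flattered a model that will never see those signals in production.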

Layer 3 — funnel + utility

  • Funnel-recall saturation — L1 improves its own metric, but the funnel's retrieval / ranking recall is already near its ceiling, so no additional good candidates reach L2 (retrieval → ranking funnel).
  • Metric mismatch — offline metric (LogMAE, KL, calibration) and online metric (CPA, CTR, revenue) measure different things. Offline is necessary, not sufficient.
  • Bid / pacing / auction filtering — the auction re-shapes what candidates get impressions; even materially better ranking can be erased by bid + budget interactions.
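Funnel-recall saturation can be measured directly: what fraction of ground-truth auction winners survive the L1 cut? A minimal sketch (function and argument names are illustrative assumptions):

```python
def funnel_recall_at_k(l1_ranked_ids, auction_winner_ids, k):
    """Fraction of ground-truth auction winners that survive the L1 cut
    at depth k. If this is already near 1.0 for the production model,
    a 'better' L1 cannot push additional good candidates into L2 --
    the funnel's recall ceiling, not model quality, is the binding
    constraint."""
    survivors = set(l1_ranked_ids[:k])
    winners = set(auction_winner_ids)
    return len(survivors & winners) / max(len(winners), 1)
```

Comparing this number between the production and candidate L1 models answers the layer-3 question: even a large offline LogMAE win is inert if both models already pass essentially the same winners downstream.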

Common ruled-out candidates

Hypotheses that are commonly considered but rarely turn out to be the main cause:

  • Exposure bias — the suspicion that control-dominant traffic biases online metrics against small treatments. Testable by ramping treatment share (Pinterest went ~20% → ~70%) and watching for metric improvement as share grows. In Pinterest's case the over-calibration issue persisted at higher shares — exposure bias was not the cause.
  • Timeouts + serving failures — worse tail latency could degrade treatment. Testable by comparing p50/p90/p99 + success rate across control + treatment per tower. Pinterest saw no materially worse behavior.
  • Traffic share too small for significance — usually orthogonal to O/O; the amplitude of the offline-online gap shouldn't shrink as you add statistical power.
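The exposure-bias ramp test above has a simple shape: if control dominance were biasing the online metric, the gap's magnitude should shrink as treatment share grows. A sketch of that check (input format and function name are assumptions):

```python
def gap_shrinks_with_share(ramp_results):
    """ramp_results: [(treatment_share, online_metric_gap), ...] measured
    at successive ramp stages (e.g. 0.2 -> 0.7). Returns True only if the
    gap's magnitude strictly shrinks as treatment share grows -- the
    signature exposure bias would leave. A flat or growing gap, as in
    Pinterest's case, rules exposure bias out."""
    ordered = sorted(ramp_results)
    gaps = [abs(g) for _, g in ordered]
    return all(later < earlier for earlier, later in zip(gaps, gaps[1:]))
```

This is a coarse monotonicity check, not a statistical test; in practice each stage's gap would carry confidence intervals.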

How to diagnose

Pinterest's three-layer framework: for each layer's hypotheses, ask "could this alone explain the gap?" and use data to accept or reject. Key disciplines:

  • Don't stop at one cause. Even after ruling out layer-1 issues, layers 2 + 3 usually both contribute.
  • Instrument serving-path coverage. Feature-coverage dashboards are what make layer-2 feature-parity gaps visible in the first place.
  • Run controlled sweeps for layer-2 hazards. For embedding version skew: fix one tower, vary the other, measure loss / calibration degradation.
  • Track funnel recall, not just model metric. Retrieval recall + ranking recall of the new L1 output against ground-truth auction winners + downstream utility are the real ceiling.
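The controlled sweep for embedding version skew can be sketched as: hold the query tower at one checkpoint, vary the item-tower checkpoint, and measure how score quality degrades. Everything here (sigmoid-of-dot-product scoring, log loss, checkpoint naming) is an illustrative assumption about a generic two-tower setup, not Pinterest's implementation.

```python
import numpy as np

def embedding_skew_sweep(query_embs_by_ckpt, item_embs_by_ckpt, labels,
                         fixed_query_ckpt):
    """Fix the query tower at one checkpoint; vary the item-tower
    checkpoint; measure log loss of the two-tower score per pairing.
    The gap between the matched pairing and the worst mismatched one
    is the model's skew sensitivity."""
    q = query_embs_by_ckpt[fixed_query_ckpt]
    eps = 1e-9
    losses = {}
    for ckpt, items in item_embs_by_ckpt.items():
        scores = 1.0 / (1.0 + np.exp(-np.sum(q * items, axis=1)))  # sigmoid(dot)
        losses[ckpt] = float(-np.mean(labels * np.log(scores + eps)
                                      + (1 - labels) * np.log(1 - scores + eps)))
    return losses
```

Repeating the sweep with the item tower fixed and the query tower varying completes the picture, since index build + deploy cycles can lag either side.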

How to prevent

Pinterest's closing frame: "O/O discrepancy is not something you debug at the end; it's something you design for from the start."

  • Treat model + embeddings + feature pipelines as one system. A feature that exists in training logs isn't a feature you have; you only have features that exist in the serving artifact.
  • Make debuggability part of the product. Coverage dashboards, embedding-skew tests, parity harnesses are first-class infrastructure, not side-investments.
  • Gate model readiness on skew sensitivity. Skew sweeps as a standard readiness check.
  • Change tooling defaults to close bug classes. Pinterest changed UFR's default so that "features onboarded for L2 are automatically considered for L1 embedding usage" — a single default change closed a recurring parity-gap bug.
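Gating readiness on skew sensitivity can be as simple as a threshold over the sweep results. The 5% default and all names below are hypothetical, chosen only to make the gate concrete:

```python
def skew_readiness_gate(skew_losses, matched_ckpt, max_rel_degradation=0.05):
    """Standard readiness check: fail rollout if serving any mismatched
    item-tower checkpoint degrades loss by more than the allowed fraction
    relative to the matched-checkpoint baseline."""
    baseline = skew_losses[matched_ckpt]
    worst = max(skew_losses.values())
    return (worst - baseline) / baseline <= max_rel_degradation
```

Wired into CI alongside the parity harness, this turns "design for O/O from the start" from a slogan into a blocking check.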

Relation to other skew concepts

  • Training / serving boundary — the organizational / infrastructure split. O/O discrepancy is often a symptom of a poorly-bridged training-serving boundary, specifically around features + model artifacts.
  • Embedding version skew — a specific layer-2 cause of O/O in two-tower systems.
  • Feature-store parity — the broader principle that the training-time feature view must equal the serving-time feature view.
