CONCEPT

Online-offline discrepancy

Definition

Online-offline discrepancy (O/O discrepancy) is the production hazard in which a change to an ML model shows clear offline wins on standard evaluation metrics (loss, calibration, AUC, LogMAE, etc.) but fails to translate online: when the same change is A/B-tested against the business metric it was meant to move (CPA, conversion rate, engagement, revenue), the result is neutral or even negative.

Pinterest's Ads ML team introduced the O/O terminology in their 2026-02-27 retrospective on L1 CVR models (sources/2026-02-27-pinterest-bridging-the-gap-online-offline-discrepancy-l1-cvr). They saw "20–45% LogMAE reduction vs the production model across multiple log sources" offline alongside "neutral or slightly worse CPA for key oCPM segments" online.

Why it happens

O/O is not a single bug but a class of bugs with multiple structural causes. Pinterest's framework (three-layer O/O diagnosis) organizes them into three layers:

Layer 1 — model + evaluation

  • Sampling bias in the offline eval dataset — evaluated only on easy segments, or on a log-source mix that doesn't match the production request shape.
  • Label leakage — the offline label is partially predictable from the feature set in a way that won't replicate online.
  • Outlier domination — gains driven by a small number of large-loss samples that don't correspond to business impact.
  • Eval-dataset construction — regenerated datasets with different log-source mixes producing inconsistent results.

Pinterest ruled these out by re-computing loss + calibration on three log sources (auction-winner, full-request, partial-request), breaking down by pCVR percentile, and re-evaluating both models on identical data.
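A minimal sketch of that layer-1 check: re-evaluate both models on identical examples, sliced by log source or pCVR percentile bucket. The function names, the eps guard, and the exact LogMAE/calibration formulas here are assumptions for illustration, not Pinterest's actual code.

```python
import numpy as np

def logmae(y_true, y_pred, eps=1e-6):
    # mean absolute error in log space; eps guards against log(0)
    return float(np.mean(np.abs(np.log(y_pred + eps) - np.log(y_true + eps))))

def calibration(y_true, y_pred):
    # ratio of summed predictions to summed labels; 1.0 = perfectly calibrated
    return float(np.sum(y_pred) / np.sum(y_true))

def eval_on_identical_data(y_true, preds_by_model, slice_masks):
    """Re-evaluate every model on the same examples, broken down by slice
    (log source, pCVR percentile bucket, ...). Inconsistent slice-level
    results point at eval-dataset construction or sampling bias."""
    return {
        model: {
            name: {"logmae": logmae(y_true[m], y_pred[m]),
                   "calibration": calibration(y_true[m], y_pred[m])}
            for name, m in slice_masks.items()
        }
        for model, y_pred in preds_by_model.items()
    }
```

Slicing by pCVR percentile also surfaces outlier domination: a win concentrated in one extreme bucket is a warning sign, not a model improvement.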

Layer 2 — serving + features (where Pinterest found the bugs)

  • Feature parity gap — features present in training logs + L2 Feature Store but absent from a serving artifact like an ANN index used by retrieval / L1 ranking. The model learns to use them offline; online they're zeros.
  • Embedding version skew — in two-tower systems, query + item towers produce embeddings from different model checkpoints because index build + deploy cycles take longer than model rollout cycles.
  • Serving-path inference differences — quantization, batch size, FP16 vs FP32, different library versions between the training framework and the online serving stack.
  • Model-versioning bugs — production actually serving a different checkpoint than the one evaluated.
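The first bullet above, the feature parity gap, reduces to a set comparison between what the model trained on and what the serving artifact actually contains. A hedged sketch (all names are hypothetical, not Pinterest's tooling):

```python
def feature_parity_report(training_features, serving_features):
    """Compare the feature set the model was trained with against the
    features actually present in a serving artifact (e.g. an ANN index).
    Anything in `missing_online` is a parity gap: the model learned to
    rely on it offline, but at serving time it degrades to a default/zero."""
    trained, served = set(training_features), set(serving_features)
    return {
        "missing_online": sorted(trained - served),
        "coverage": len(trained & served) / max(len(trained), 1),
    }
```

Run against every serving artifact (ANN index, feature store snapshot, model bundle), this is the core of a feature-coverage dashboard: a nonempty `missing_online` means the offline eval flattered a model that will never see those signals in production.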

Layer 3 — funnel + utility

  • Funnel-recall saturation — L1 improves its own metric, but the funnel's retrieval / ranking recall is already near its ceiling, so no additional good candidates reach L2 (retrieval → ranking funnel).
  • Metric mismatch — offline metric (LogMAE, KL, calibration) and online metric (CPA, CTR, revenue) measure different things. Offline is necessary, not sufficient.
  • Bid / pacing / auction filtering — the auction re-shapes what candidates get impressions; even materially better ranking can be erased by bid + budget interactions.
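Funnel-recall saturation can be measured directly: what fraction of ground-truth auction winners survive the L1 cut? A minimal sketch (function and argument names are illustrative assumptions):

```python
def funnel_recall_at_k(l1_ranked_ids, auction_winner_ids, k):
    """Fraction of ground-truth auction winners that survive the L1 cut
    at depth k. If this is already near 1.0 for the production model,
    a 'better' L1 cannot push additional good candidates into L2 --
    the funnel's recall ceiling, not model quality, is the binding
    constraint."""
    survivors = set(l1_ranked_ids[:k])
    winners = set(auction_winner_ids)
    return len(survivors & winners) / max(len(winners), 1)
```

Comparing this number between the production and candidate L1 models answers the layer-3 question: even a large offline LogMAE win is inert if both models already pass essentially the same winners downstream.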

Common ruled-out candidates

Hypotheses that are commonly considered but rarely turn out to be the main cause:

  • Exposure bias — the suspicion that control-dominant traffic biases online metrics against small treatments. Testable by ramping treatment share (Pinterest went ~20% → ~70%) and watching for metric improvement as share grows. In Pinterest's case the over-calibration issue persisted at higher shares — exposure bias was not the cause.
  • Timeouts + serving failures — worse tail latency could degrade treatment. Testable by comparing p50/p90/p99 + success rate across control + treatment per tower. Pinterest saw no materially worse behavior.
  • Traffic share too small for significance — usually orthogonal to O/O; the amplitude of the offline-online gap shouldn't shrink as you add statistical power.
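The exposure-bias ramp test above has a simple shape: if control dominance were biasing the online metric, the gap's magnitude should shrink as treatment share grows. A sketch of that check (input format and function name are assumptions):

```python
def gap_shrinks_with_share(ramp_results):
    """ramp_results: [(treatment_share, online_metric_gap), ...] measured
    at successive ramp stages (e.g. 0.2 -> 0.7). Returns True only if the
    gap's magnitude strictly shrinks as treatment share grows -- the
    signature exposure bias would leave. A flat or growing gap, as in
    Pinterest's case, rules exposure bias out."""
    ordered = sorted(ramp_results)
    gaps = [abs(g) for _, g in ordered]
    return all(later < earlier for earlier, later in zip(gaps, gaps[1:]))
```

This is a coarse monotonicity check, not a statistical test; in practice each stage's gap would carry confidence intervals.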

How to diagnose

Pinterest's three-layer framework: for each layer's hypotheses, ask "could this alone explain the gap?" and use data to accept or reject. Key disciplines:

  • Don't stop at one cause. Even after ruling out layer-1 issues, layers 2 + 3 usually both contribute.
  • Instrument serving-path coverage. Feature-coverage dashboards are what make layer-2 feature-parity gaps visible in the first place.
  • Run controlled sweeps for layer-2 hazards. For embedding version skew: fix one tower, vary the other, measure loss / calibration degradation.
  • Track funnel recall, not just model metric. Retrieval recall + ranking recall of the new L1 output against ground-truth auction winners + downstream utility are the real ceiling.
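The controlled sweep for embedding version skew can be sketched as: hold the query tower at one checkpoint, vary the item-tower checkpoint, and measure how score quality degrades. Everything here (sigmoid-of-dot-product scoring, log loss, checkpoint naming) is an illustrative assumption about a generic two-tower setup, not Pinterest's implementation.

```python
import numpy as np

def embedding_skew_sweep(query_embs_by_ckpt, item_embs_by_ckpt, labels,
                         fixed_query_ckpt):
    """Fix the query tower at one checkpoint; vary the item-tower
    checkpoint; measure log loss of the two-tower score per pairing.
    The gap between the matched pairing and the worst mismatched one
    is the model's skew sensitivity."""
    q = query_embs_by_ckpt[fixed_query_ckpt]
    eps = 1e-9
    losses = {}
    for ckpt, items in item_embs_by_ckpt.items():
        scores = 1.0 / (1.0 + np.exp(-np.sum(q * items, axis=1)))  # sigmoid(dot)
        losses[ckpt] = float(-np.mean(labels * np.log(scores + eps)
                                      + (1 - labels) * np.log(1 - scores + eps)))
    return losses
```

Repeating the sweep with the item tower fixed and the query tower varying completes the picture, since index build + deploy cycles can lag either side.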

How to prevent

Pinterest's closing frame: "O/O discrepancy is not something you debug at the end; it's something you design for from the start."

  • Treat model + embeddings + feature pipelines as one system. A feature that exists in training logs isn't a feature you have; you only have features that exist in the serving artifact.
  • Make debuggability part of the product. Coverage dashboards, embedding-skew tests, parity harnesses are first-class infrastructure, not side-investments.
  • Gate model readiness on skew sensitivity. Skew sweeps as a standard readiness check.
  • Change tooling defaults to close bug classes. Pinterest changed UFR's default so that "features onboarded for L2 are automatically considered for L1 embedding usage" — a single default change closed a recurring parity-gap bug.
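Gating readiness on skew sensitivity can be as simple as a threshold over the sweep results. The 5% default and all names below are hypothetical, chosen only to make the gate concrete:

```python
def skew_readiness_gate(skew_losses, matched_ckpt, max_rel_degradation=0.05):
    """Standard readiness check: fail rollout if serving any mismatched
    item-tower checkpoint degrades loss by more than the allowed fraction
    relative to the matched-checkpoint baseline."""
    baseline = skew_losses[matched_ckpt]
    worst = max(skew_losses.values())
    return (worst - baseline) / baseline <= max_rel_degradation
```

Wired into CI alongside the parity harness, this turns "design for O/O from the start" from a slogan into a blocking check.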

Relation to other skew concepts

  • Training / serving boundary — the organizational / infrastructure split. O/O discrepancy is often a symptom of a poorly-bridged training-serving boundary, specifically around features + model artifacts.
  • Embedding version skew — a specific layer-2 cause of O/O in two-tower systems.
  • Feature-store parity — the broader principle that the training-time feature view must equal the serving-time feature view.
