CONCEPT

Hybrid back-testing with ML counterfactual

Definition

Hybrid back-testing is the pattern of replaying historical system state through alternative code paths and using ML models to predict the outcomes that would have occurred under the alternative treatment but never actually happened.

It sits between two extremes:

  • Pure back-testing — replay historical decisions on historical outcomes as-is. Appropriate when the alternative code path produces the same observable downstream effect (e.g. a refactor with identical behaviour). No ML needed.
  • Pure simulation — generate synthetic inputs + outcomes entirely from a model, no historical anchor. Appropriate when no relevant production data exists.

Hybrid back-testing re-uses historical inputs (which grounds the simulation in real-world distributions) but predicts outcomes for the alternative treatment (which lets you evaluate code paths that produce different actions than production actually took).

Why it exists

In control-feedback systems — ad auctions, pricing, routing, recommendations — a small change in the decision changes the downstream outcome, which changes the next decision. Pure back-testing can't evaluate these because the new decisions would have produced outcomes the historical record doesn't contain. But simulating from scratch throws away the distributional realism of actual production data.

Hybrid: keep real inputs (campaign features, user context, historical state); predict outcomes from a model that was trained on real (input, decision, outcome) triples; let the simulation iterate the decision loop with predicted outcomes feeding the next decision.
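The loop above can be sketched in a few lines. Everything here is illustrative: `OutcomePredictor`, `candidate_allocator`, and the feature shapes are hypothetical stand-ins for whatever a real system would use, and the toy linear response takes the place of a model trained on real (input, decision, outcome) triples.

```python
from dataclasses import dataclass


@dataclass
class OutcomePredictor:
    """Stand-in for a model trained on real (input, decision, outcome)
    triples; here a toy linear response to budget."""
    clicks_per_dollar: float = 0.5

    def predict_clicks(self, budget: float, features: dict) -> float:
        # A production system would call e.g. a gradient-boosted regressor.
        return self.clicks_per_dollar * budget * features.get("quality", 1.0)


def candidate_allocator(remaining_budget: float, days_left: int,
                        yesterday_clicks: float) -> float:
    """Hypothetical alternative code path under evaluation: pace the
    remaining budget evenly, nudged by yesterday's performance."""
    base = remaining_budget / max(days_left, 1)
    return base * (1.1 if yesterday_clicks > 10 else 0.9)


def hybrid_backtest(historical_features: list[dict], total_budget: float,
                    predictor: OutcomePredictor) -> list[float]:
    """Replay real inputs day by day: decisions come from the candidate
    code, outcomes from the predictor, and each predicted outcome feeds
    the next day's decision."""
    remaining = total_budget
    clicks_by_day: list[float] = []
    yesterday_clicks = 0.0
    for day, features in enumerate(historical_features):
        days_left = len(historical_features) - day
        budget = min(candidate_allocator(remaining, days_left,
                                         yesterday_clicks), remaining)
        # Counterfactual outcome: predicted, not read from history.
        clicks = predictor.predict_clicks(budget, features)
        remaining -= budget
        yesterday_clicks = clicks
        clicks_by_day.append(clicks)
    return clicks_by_day
```

The key structural point is the feedback edge: `yesterday_clicks` is a *predicted* quantity that influences the next day's decision, which is exactly what pure back-testing cannot provide once the candidate's decisions diverge from production's.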

Canonical instance — Yelp Ad Budget Allocation

Yelp's Back-Testing Engine (2026-02-02) is the canonical wiki example. Historical campaign data is real (pulled from Redshift at campaign × date grain); daily budget decisions are re-computed by the proposed algorithm's code path; daily outcomes (impressions, clicks, leads) are predicted by CatBoost regressors from the newly computed budget plus campaign features; and the predicted outcomes feed back into the next day's budget decision via the same production-code-as-submodule simulation loop.

Verbatim from the post: "The use of ML models to predict counterfactual outcomes means this is not a pure back-testing approach, but rather a hybrid that combines elements of both simulation and back-testing."

Properties

  • Fair cross-candidate comparison requires shared models. If every candidate used its own outcome predictor, differences between candidates could just reflect predictor noise rather than real policy differences. Yelp states this explicitly: "Using the same ML models for all candidates promotes fair comparisons."
  • Fidelity is bounded by the outcome predictor's accuracy. Named by Yelp as a caveat: "The accuracy of this methodology depends heavily on the quality and generalizability of the underlying ML models."
  • Overfitting risk is real — see concepts/overfitting-to-historical-data. The mitigation is to keep A/B tests and real-world monitoring downstream of the simulation in the loop.
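The shared-models property can be made concrete with a sketch. The predictor, candidate policies, and diminishing-returns response below are all hypothetical; the point is only that every candidate is scored through the *same* model instance, so score differences reflect the policies, not predictor-to-predictor noise.

```python
def predict_leads(budget: float) -> float:
    """One shared stand-in predictor for all candidates
    (toy diminishing-returns response to budget)."""
    return budget ** 0.5


def compare_candidates(policies: dict, daily_budget_cap: float) -> dict:
    """Score every candidate policy through the single shared model."""
    scores = {}
    for name, policy in policies.items():
        budget = min(policy(daily_budget_cap), daily_budget_cap)
        scores[name] = predict_leads(budget)  # same model for everyone
    return scores


# Hypothetical candidate allocation policies under comparison.
policies = {
    "aggressive": lambda cap: cap,
    "conservative": lambda cap: cap * 0.5,
}
scores = compare_candidates(policies, 100.0)
```

The anti-pattern this guards against would be giving each candidate its own separately trained predictor, at which point a "winning" candidate may just have the more optimistic model.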

Relation to pure replay

patterns/snapshot-replay-agent-evaluation (Databricks) is a sibling pattern at a different altitude: it replays historical agent inputs against new agent code, but the "outcome" measured is the agent's structured output at inference time (which the new code generates deterministically), not a counterfactual prediction. Databricks' pattern is closer to pure back-testing than hybrid — there's no ML layer predicting what would have happened.
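The distinction can be seen in a minimal sketch of pure replay. The router functions and queries are invented for illustration; what matters is that the measured quantity is the new code's deterministic output on historical inputs, with no ML layer predicting unobserved outcomes.

```python
def old_router(query: str) -> str:
    """Production code path whose historical behaviour we replay against."""
    return "search" if "find" in query else "chat"


def new_router(query: str) -> str:
    """Alternative code path under test; broader trigger vocabulary."""
    return "search" if any(w in query for w in ("find", "lookup")) else "chat"


def replay_diff(historical_queries: list[str]) -> float:
    """Fraction of replayed inputs where the new code path's output
    differs from the old one. Purely deterministic: no counterfactual
    outcome prediction is involved anywhere."""
    diffs = sum(old_router(q) != new_router(q) for q in historical_queries)
    return diffs / len(historical_queries)
```

If the downstream effect you care about were *beyond* the router's output (e.g. whether the user converted after a different routing decision), pure replay would no longer suffice and the hybrid pattern's outcome predictor becomes necessary.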
