PATTERN
Historical replay with ML outcome predictor¶
Problem¶
In a feedback-loop system (ad auctions, pricing, recommendation ranking, traffic routing), a small change to the decision function changes the downstream outcome, which changes the next decision. Pure back-testing fails: the historical record only contains outcomes under production decisions, not under the alternative treatment you want to evaluate.
Pure simulation also fails: generating synthetic traffic from scratch loses the distributional realism of real production state.
Solution¶
Replay historical inputs at natural system granularity (per-campaign-per-day, per-user-per-request, per-order-per-hour) through the alternative code path, but replace the "observe real outcome" step with a call to an ML outcome predictor trained on real (inputs, decision, outcome) data. Feed the predicted outcome back into the next tick of the simulation loop.
The pattern's three components are:
- Historical input data — pulled at the same grain production operates at (Yelp: campaign × date from Redshift).
- Alternative decision code — typically pulled in via patterns/production-code-as-submodule-for-simulation so simulation exercises the exact code under test.
- ML outcome predictor — non-parametric regressor (Yelp: CatBoost) that maps (features, decision) → expected outcome; stochasticity restored via concepts/poisson-sampling-for-integer-outcomes when the outcome is a count.
Same predictor for all candidates — this is the fairness requirement. If each candidate uses its own predictor, candidate deltas are confounded with predictor-choice deltas.
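The loop can be sketched abstractly as follows (a minimal illustration, not the engine's real interface; all function and field names here are hypothetical). The key move is that `predict_outcome` replaces the "observe real outcome" step, and the same predictor serves every candidate:

```python
def replay(candidate_decide, historical_inputs, predict_outcome):
    """Replay one candidate through the historical timeline.

    candidate_decide : the alternative decision code under test
    historical_inputs: real production state at natural grain
                       (e.g. one dict per campaign x day)
    predict_outcome  : the shared ML outcome predictor; every candidate
                       MUST use the same one (the fairness requirement)
    """
    state = {"cumulative_spend": 0.0}  # carried between ticks
    results = []
    for tick in historical_inputs:
        decision = candidate_decide(tick, state)    # e.g. a daily budget
        outcome = predict_outcome(tick, decision)   # replaces "observe real outcome"
        state["cumulative_spend"] += outcome["spend"]  # feeds the next tick
        results.append((decision, outcome))
    return results
```

Because `state` is threaded through the loop, a candidate's early decisions shape the inputs to its later ones, which is exactly the feedback property pure back-testing cannot reproduce.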
Canonical instance — Yelp Back-Testing Engine¶
Yelp's Back-Testing Engine (2026-02-02) simulates ad-budget-allocation algorithms using exactly this shape. Per candidate × campaign × day:
- Beginning of day — Budgeting submodule (with candidate parameters) computes daily budget.
- Throughout the day — CatBoost regressors predict impressions/clicks/leads from budget + campaign features; Poisson-sampled for integer counts.
- End of day — Billing submodule (with candidate parameters) computes billing from simulated outcomes.
Day N+1's budget decision depends on Day N's simulated outcomes — Yelp calls this cascading dependency "a fundamental property to take into account" for the simulation loop to be realistic.
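A per-campaign sketch of that day loop, with Poisson sampling implemented stdlib-only via Knuth's algorithm. The `budgeting` and `billing` callables stand in for the production submodules and `model` for the CatBoost regressor; all names are hypothetical, not Yelp's actual API:

```python
import math
import random

def poisson_sample(lam, rand=random.random):
    # Knuth's algorithm; adequate for the modest per-day counts simulated here
    L = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        k += 1
        p *= rand()
        if p <= L:
            return k - 1

def simulate_campaign(candidate_params, campaign_days, budgeting, model, billing):
    """One candidate x one campaign, day by day.

    budgeting / billing : production submodules, parameterised by the candidate
    model               : regressor mapping (features, budget) -> expected count
    """
    state = {"spend_to_date": 0.0}
    days = []
    for day_features in campaign_days:
        budget = budgeting(candidate_params, day_features, state)  # beginning of day
        clicks = poisson_sample(model(day_features, budget))       # throughout the day
        bill = billing(candidate_params, clicks)                   # end of day
        state["spend_to_date"] += bill  # Day N's simulated outcome feeds Day N+1
        days.append({"budget": budget, "clicks": clicks, "bill": bill})
    return days
```

Poisson sampling restores integer stochasticity: the regressor predicts an expected count (e.g. 3.2 clicks), but the billing logic downstream consumes whole events.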
Why all three components are load-bearing¶
Drop any one and the simulation loses a critical property:
- Without historical inputs — synthetic data may not cover long-tail cases or capture realistic distributions.
- Without production code as submodule — re-implemented simulation logic drifts from production; wins don't transfer.
- Without an ML outcome predictor — pure back-testing on historical outcomes assumes the decision doesn't change the outcome, which defeats the purpose when evaluating a new decision policy.
Fidelity vs cost¶
- Fidelity upper-bounded by predictor accuracy. If the CatBoost model is systematically biased (e.g. overestimates clicks at high budgets), the simulation amplifies that bias across all candidates. Mitigation: validate on an out-of-sample holdout, monitor for drift, recalibrate against realized A/B outcomes.
- Cost scales with (candidates × timesteps). Yelp's example is 25 candidates × 31 days per campaign × hundreds-of-thousands of campaigns. Parallelism across candidates and campaigns is how the wall-clock stays manageable; days within a single campaign cannot be parallelized because of the day-to-day dependency.
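Since candidate × campaign units are independent (only the days inside a unit are sequential), the fan-out is embarrassingly parallel at the unit grain. A minimal sketch with a thread pool (a production engine would fan out across processes or a cluster; all names hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def backtest_grid(candidates, campaigns, simulate, max_workers=8):
    """Run every candidate x campaign unit in parallel.

    Days inside a unit stay sequential (Day N+1's decision depends on
    Day N's simulated outcome), so the unit is the parallelism grain.
    """
    units = list(product(candidates, campaigns))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(lambda unit: simulate(*unit), units))
    return dict(zip(units, results))
```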
Workflow position¶
This pattern is the discovery phase upstream of A/B testing — canonical instance of concepts/filter-before-ab-test. Candidates that lose in the back-test are dropped; candidates that win in the back-test go to A/B for validation on fresh live data.
Applicability¶
Good fits:
- Ad auctions / budget allocation (Yelp's case).
- Dynamic pricing — historical transactions + demand model as outcome predictor.
- Bandit / contextual-bandit policies — historical logs + reward model as outcome predictor.
- Recommendation ranking — offline policy evaluation via inverse propensity scoring or doubly robust estimators (technically a different estimator class but same pattern).
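For the ranking case, the outcome-predictor slot is filled by a reweighting estimator rather than a regressor. A minimal inverse-propensity sketch over logged interactions (illustrative only; the tuple layout and names are assumptions, not a specific library's API):

```python
def ips_value(logs, target_prob):
    """Inverse-propensity estimate of a candidate policy's average reward.

    logs        : iterable of (context, action, reward, logging_prob) tuples,
                  where logging_prob is P(action | context) under production
    target_prob : P(action | context) under the candidate policy
    """
    total = 0.0
    n = 0
    for context, action, reward, logging_prob in logs:
        # Reweight each logged reward by how much more (or less) likely the
        # candidate policy was to take the logged action than production was.
        total += reward * target_prob(context, action) / logging_prob
        n += 1
    return total / n
```

The same fairness requirement applies: every candidate policy must be scored against the same logs with the same propensities.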
Poor fits:
- Systems with fast drift — if the predictor's training distribution is stale, predictions are uninformative. Shorter retraining cycles or different methodologies (e.g. off-policy evaluation with importance weights) may be needed.
- Systems where the counterfactual is far out-of-distribution — if candidate budgets differ from historical budgets by orders of magnitude, predictor extrapolation dominates.
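A cheap guard for the out-of-distribution failure mode is to measure how often a candidate's decisions fall outside the historical support before trusting its back-test score. A rough heuristic sketch (the 10% padding is an arbitrary assumption, not a recommended constant):

```python
def extrapolation_fraction(candidate_decisions, historical_decisions, margin=0.10):
    """Fraction of a candidate's decisions outside the padded historical range.

    A high fraction means the outcome predictor is being asked to
    extrapolate, so the candidate's back-test score is unreliable.
    """
    lo, hi = min(historical_decisions), max(historical_decisions)
    pad = margin * (hi - lo)
    lo, hi = lo - pad, hi + pad
    outside = sum(1 for d in candidate_decisions if not (lo <= d <= hi))
    return outside / len(candidate_decisions)
```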
Relation to other patterns¶
- patterns/snapshot-replay-agent-evaluation — sibling in agent engineering; replays historical agent inputs against new agent code. That pattern measures the agent's direct output (no ML outcome predictor), so it's closer to pure back-testing than this hybrid.
- patterns/production-code-as-submodule-for-simulation — the code-side fidelity mechanism this pattern relies on.
- patterns/yaml-declared-experiment-config — the configuration surface Yelp uses to declare what to replay.
- patterns/ab-test-rollout — the downstream validation step after this pattern filters candidates.
Seen in¶
- sources/2026-02-02-yelp-back-testing-engine-ad-budget-allocation — canonical wiki instance.