
PATTERN

Historical replay with ML outcome predictor

Problem

In a feedback-loop system (ad auctions, pricing, recommendation ranking, traffic routing), a small change to the decision function changes the downstream outcome, which changes the next decision. Pure back-testing fails: the historical record only contains outcomes under production decisions, not under the alternative treatment you want to evaluate.

Pure simulation also fails: generating synthetic traffic from scratch loses the distributional realism of real production state.

Solution

Replay historical inputs at natural system granularity (per-campaign-per-day, per-user-per-request, per-order-per-hour) through the alternative code path, but replace the "observe real outcome" step with a call to an ML outcome predictor trained on real (inputs, decision, outcome) data. Feed the predicted outcome back into the next tick of the simulation loop.

The pattern's three components are:

  1. Historical input data — pulled at the same grain production operates at (Yelp: campaign × date from Redshift).
  2. Alternative decision code — typically pulled in via patterns/production-code-as-submodule-for-simulation so simulation exercises the exact code under test.
  3. ML outcome predictor — non-parametric regressor (Yelp: CatBoost) that maps (features, decision) → expected outcome; stochasticity restored via concepts/poisson-sampling-for-integer-outcomes when the outcome is a count.

Same predictor for all candidates — this is the fairness requirement. If each candidate uses its own predictor, candidate deltas are confounded with predictor-choice deltas.
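The loop above can be sketched in a few lines. Everything here is a hypothetical stand-in, not Yelp's API: `decide` is the candidate code path, `predict_outcome` is the shared ML model, and a small Knuth sampler restores count stochasticity:

```python
import math
import random

def poisson_sample(mean, rnd):
    """Knuth's algorithm: draw an integer count with the given expected value."""
    threshold = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= rnd.random()
        if p <= threshold:
            return k
        k += 1

def replay(historical_inputs, decide, predict_outcome, rnd):
    """Replay historical inputs through an alternative decision path.

    decide(state, inputs)             -> decision (candidate code under test)
    predict_outcome(inputs, decision) -> expected outcome (the shared ML model)
    The same predict_outcome must serve every candidate, or candidate
    deltas get confounded with predictor-choice deltas.
    """
    state = {"last_outcome": None}  # carries simulated outcomes across ticks
    outcomes = []
    for inputs in historical_inputs:
        decision = decide(state, inputs)              # alternative code path
        expected = predict_outcome(inputs, decision)  # replaces "observe real outcome"
        outcome = poisson_sample(expected, rnd)       # restore stochasticity for counts
        state["last_outcome"] = outcome               # feed back into the next tick
        outcomes.append(outcome)
    return outcomes

# Toy run: constant decision, constant expected count of 3.
sim = replay([{"tick": t} for t in range(5)],
             decide=lambda state, inputs: 10.0,
             predict_outcome=lambda inputs, decision: 3.0,
             rnd=random.Random(0))
```

The only state that crosses ticks is the simulated outcome, which is exactly what makes the loop counterfactual rather than a pure replay of the log.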

Canonical instance — Yelp Back-Testing Engine

Yelp's Back-Testing Engine (2026-02-02) simulates ad-budget-allocation algorithms using exactly this shape. Per candidate × campaign × day:

  • Beginning of day — Budgeting submodule (with candidate parameters) computes daily budget.
  • Throughout the day — CatBoost regressors predict impressions/clicks/leads from budget + campaign features; Poisson-sampled for integer counts.
  • End of day — Billing submodule (with candidate parameters) computes billing from simulated outcomes.

Day N+1's budget decision depends on Day N's simulated outcomes — Yelp calls this cascading dependency "a fundamental property to take into account" for the simulation loop to be realistic.
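The per-campaign day loop can be sketched as follows. Function names and signatures are illustrative assumptions (a lambda stands in for the CatBoost regressor), not Yelp's actual interfaces:

```python
import math
import random

def poisson_sample(mean, rnd):
    """Knuth's algorithm: draw an integer count with the given expected value."""
    threshold = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= rnd.random()
        if p <= threshold:
            return k
        k += 1

def simulate_campaign(n_days, features, compute_budget, predict_clicks,
                      compute_billing, rnd):
    """One candidate x one campaign, simulated day by day.

    Day N+1's budget depends on Day N's *simulated* billing -- the
    cascading dependency the pattern calls fundamental.
    """
    spend_so_far = 0.0
    ledger = []
    for day in range(n_days):
        budget = compute_budget(features, spend_so_far)  # beginning of day
        expected = predict_clicks(features, budget)      # stand-in for the CatBoost regressor
        clicks = poisson_sample(expected, rnd)           # integer count via Poisson sampling
        billed = compute_billing(features, clicks)       # end of day
        spend_so_far += billed                           # feeds the next day's budget
        ledger.append({"day": day, "budget": budget,
                       "clicks": clicks, "billed": billed})
    return ledger

# Toy run: pace the remaining monthly budget evenly, bill a flat cost per click.
month = simulate_campaign(
    n_days=31,
    features={"monthly_budget": 310.0},
    compute_budget=lambda f, spent: max(f["monthly_budget"] - spent, 0.0) / 31,
    predict_clicks=lambda f, budget: budget * 0.5,
    compute_billing=lambda f, clicks: clicks * 0.8,
    rnd=random.Random(42))
```

Note that `spend_so_far` is the only channel between days: a noisy click draw on Day N shifts every budget decision after it, which is the cascade the pattern preserves.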

Why all three components are load-bearing

Drop any one and the simulation loses a critical property:

  • Without historical inputs — synthetic data may not cover long-tail cases or capture realistic distributions.
  • Without production code as submodule — re-implemented simulation logic drifts from production; wins don't transfer.
  • Without an ML outcome predictor — pure back-testing on historical outcomes assumes the decision doesn't change the outcome, which defeats the purpose when evaluating a new decision policy.

Fidelity vs cost

  • Fidelity upper-bounded by predictor accuracy. If the CatBoost model is systematically biased (e.g. overestimates clicks at high budgets), the simulation amplifies that bias across all candidates. Mitigation: train on out-of-sample holdout, monitor for drift, recalibrate against A/B outcomes.
  • Cost scales with (candidates × timesteps). Yelp's example is 25 candidates × 31 days per campaign × hundreds of thousands of campaigns. Wall-clock time stays manageable by parallelizing across candidates (and across campaigns); the day-to-day dependency means days within a candidate must run sequentially.
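Because candidates (and campaigns) are mutually independent, the outer loop parallelizes cleanly while days stay sequential inside each candidate. A rough sketch with a placeholder worker (names are hypothetical; a real engine would shard across processes or a cluster rather than threads):

```python
from concurrent.futures import ThreadPoolExecutor

def run_candidate(candidate_id, n_days=31):
    """Stand-in for one candidate's full simulation. Days run in order
    because each day consumes the previous day's simulated outcome."""
    total = 0
    for day in range(n_days):
        total += candidate_id + day  # placeholder for budget -> predict -> bill
    return candidate_id, total

# 25 candidates fan out in parallel; each runs its 31 days sequentially.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = dict(pool.map(run_candidate, range(25)))
```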

Workflow position

This pattern is the discovery phase upstream of A/B testing — canonical instance of concepts/filter-before-ab-test. Candidates that lose in the back-test are dropped; candidates that win in the back-test go to A/B for validation on fresh live data.

Applicability

Good fits:

  • Ad auctions / budget allocation (Yelp's case).
  • Dynamic pricing — historical transactions + demand model as outcome predictor.
  • Bandit / contextual-bandit policies — historical logs + reward model as outcome predictor.
  • Recommendation ranking — offline policy evaluation via inverse propensity scoring or doubly robust estimators (technically a different estimator class but same pattern).
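For the recommendation-ranking case, the textbook inverse-propensity-scoring estimator is small enough to show in full (generic names, not any particular library's API):

```python
def ips_value(rewards, logged_probs, target_probs):
    """Estimate a candidate policy's value from production logs.

    rewards[i]      -- reward observed for the action production took
    logged_probs[i] -- probability production assigned that action
    target_probs[i] -- probability the candidate assigns the same action
    The importance weight target/logged corrects for the log containing
    outcomes only under production decisions.
    """
    n = len(rewards)
    return sum(r * (t / l)
               for r, l, t in zip(rewards, logged_probs, target_probs)) / n

# Sanity check: when candidate == production, IPS reduces to the mean reward.
same = ips_value([1.0, 0.0, 1.0], [0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
```

Like predictor extrapolation in the replay pattern, IPS degrades when the candidate puts weight where production rarely acted — the importance weights blow up.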

Poor fits:

  • Systems with fast drift — if the predictor's training distribution is stale, predictions are uninformative. Shorter retraining cycles or different methodologies (e.g. off-policy evaluation with importance weights) may be needed.
  • Systems where the counterfactual is far out-of-distribution — if candidate budgets differ from historical budgets by orders of magnitude, predictor extrapolation dominates.
