

Overfitting to historical data

Definition

Overfitting to historical data is the failure mode of any back-testing / historical-replay methodology where optimisation against past data produces candidates that win on history but fail to generalise to the future. The history-optimal candidate may exploit spurious correlations, one-off events, or distributional features that won't persist.

This is distinct from ML overfitting in the narrow sense (model memorising training examples); here it's the experimentation framework itself that overfits — even if each ML model generalises well, the candidate selected by repeated back-testing against the same history is biased toward patterns in that history.

Why it matters in system design

Any simulation-driven search process — hyperparameter tuning, configuration tuning, algorithm-parameter optimisation — has this risk when the optimiser iterates enough times against a fixed historical dataset. The optimiser will find patterns that maximise the observed objective, some of which are real and some of which are noise.

The risk is proportional to:

  • Evaluation budget — the more candidates evaluated against the same history, the more opportunities for one of them to fit noise by chance.
  • Flexibility of the search space — richer parameterisations can fit more noise.
  • Narrowness of the historical window — short windows don't sample the distribution well.
  • Dependence on a single predictor — if a counterfactual-outcome model has systematic biases, the optimiser exploits them.
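The evaluation-budget risk can be made concrete with a small simulation. The sketch below (illustrative only; not from the Yelp source) gives every candidate a true future value of zero, so the winner's entire back-test score is noise. Selecting the maximum over more candidates inflates the apparent lift of the winner:

```python
import random

def selected_optimism(n_candidates: int, trials: int = 2000, seed: int = 0) -> float:
    """Average back-test score of the apparent-best candidate.

    Every candidate's true future value is 0, so the winner's score is
    pure noise: the more candidates the optimiser tries against the same
    fixed history, the luckier the selected winner looks.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # One noisy historical evaluation per candidate.
        scores = [rng.gauss(0.0, 1.0) for _ in range(n_candidates)]
        total += max(scores)  # the optimiser keeps the apparent best
    return total / trials

# Apparent lift of the winner grows with evaluation budget,
# even though no candidate is actually better than baseline.
for n in (1, 10, 100):
    print(f"budget={n:3d}  apparent lift of winner: {selected_optimism(n):.2f}")
```

This is the statistical core of the failure mode: the selection step, not any individual evaluation, introduces the optimistic bias.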

Yelp's named mitigation

Yelp's Back-Testing Engine (2026-02-02) does not claim to technically prevent overfitting; instead it explicitly keeps A/B tests and real-world monitoring in the loop downstream of back-testing:

Verbatim: "Risk of overfitting to history: Relying too heavily on historical simulations could bias development toward optimizations that perform well on past data, potentially limiting innovation. ... Being aware of these caveats helps us use back-testing more effectively, complementing it with A/B tests and real-world monitoring to ensure robust, reliable improvements."

The back-test-then-A/B workflow is the organisational guardrail: back-testing picks promising candidates, A/B tests validate them on fresh live data that wasn't part of the back-test corpus. This is a canonical instance of concepts/filter-before-ab-test where filtering is explicitly not a substitute for the downstream validation step.
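The two-stage structure of that workflow can be sketched as follows. All names here (`Candidate`, `backtest_filter`, `ab_validate`) are hypothetical illustrations, not Yelp's API; the point is only that the back-test stage ranks and filters while the A/B stage alone decides:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Candidate:
    name: str
    backtest_lift: float = 0.0

def backtest_filter(candidates: List[Candidate],
                    simulate: Callable[[Candidate], float],
                    top_k: int = 3) -> List[Candidate]:
    """Stage 1: rank candidates on historical replay and keep the top few.

    A good back-test score is a ticket to an A/B test, not a launch
    decision -- filtering is not a substitute for live validation.
    """
    for c in candidates:
        c.backtest_lift = simulate(c)
    return sorted(candidates, key=lambda c: c.backtest_lift, reverse=True)[:top_k]

def ab_validate(candidate: Candidate,
                run_experiment: Callable[[Candidate], float],
                min_lift: float = 0.0) -> bool:
    """Stage 2: the decisive check, on fresh live traffic that was
    never part of the back-test corpus."""
    return run_experiment(candidate) > min_lift
```

Because the A/B stage scores candidates on data the optimiser never saw, any history-only noise-fit from stage 1 gets no credit in stage 2.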

Why a "limits innovation" framing matters

Yelp frames overfitting not just as a statistical failure mode but as a strategic one: if every candidate must look good on recent history, radically new approaches that would take time to show value are systematically rejected. This is the same dynamic that makes optimising for existing market-fit metrics conservative in product development.

Mitigations (general)

  • Out-of-sample holdout — reserve a historical window for final validation; don't optimise against it.
  • Cross-validation across time windows — require candidates to win across multiple non-overlapping historical periods.
  • Evaluation-budget caps — limit the number of candidates (e.g. max_evals) evaluated against a fixed history, reducing the chance of fitting noise.
  • A/B test downstream — the decisive mitigation; validate on fresh data.
  • Diversity penalties — penalise candidates too close to already-explored history-fitting ones.
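The cross-validation-across-time-windows mitigation can be sketched in a few lines. This is a generic illustration (not from the Yelp source): a candidate is accepted only if it beats the baseline in every non-overlapping historical window, since a noise-fit candidate tends to win big in one window and lose in the others:

```python
from typing import Callable, List, Sequence

def time_window_folds(history: Sequence, n_folds: int) -> List[Sequence]:
    """Split an ordered history into non-overlapping evaluation windows."""
    size = len(history) // n_folds
    return [history[i * size:(i + 1) * size] for i in range(n_folds)]

def wins_every_window(candidate_score: Callable[[Sequence], float],
                      baseline_score: Callable[[Sequence], float],
                      history: Sequence,
                      n_folds: int = 3) -> bool:
    """Require the candidate to beat the baseline in *every* window,
    not merely in aggregate over the whole history."""
    return all(
        candidate_score(window) > baseline_score(window)
        for window in time_window_folds(history, n_folds)
    )
```

The design choice is the `all(...)`: aggregating first and comparing once would let a single lucky window carry an otherwise-losing candidate.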

Yelp names only the A/B-test-downstream mitigation explicitly; the others are general practice.
