CONCEPT Cited by 1 source
Overfitting to historical data¶
Definition¶
Overfitting to historical data is the failure mode of any back-testing / historical-replay methodology where optimisation against past data produces candidates that win on history but fail to generalise to the future. The history-optimal candidate may exploit spurious correlations, one-off events, or distributional features that won't persist.
This is distinct from ML overfitting in the narrow sense (model memorising training examples); here it's the experimentation framework itself that overfits — even if each ML model generalises well, the candidate selected by repeated back-testing against the same history is biased toward patterns in that history.
Why it matters in system design¶
Any simulation-driven search process — hyperparameter tuning, configuration tuning, algorithm-parameter optimisation — has this risk when the optimiser iterates enough times against a fixed historical dataset. The optimiser will find patterns that maximise the observed objective, some of which are real and some of which are noise.
The risk is proportional to:
- Evaluation budget — more candidates evaluated means more chances to fit noise by chance.
- Flexibility of the search space — richer parameterisations can fit more noise.
- Narrowness of the historical window — short windows don't sample the distribution well.
- Dependence on a single predictor — if a counterfactual-outcome model has systematic biases, the optimiser exploits them.
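The evaluation-budget factor can be made concrete with a small illustrative simulation (not from the source): even when every candidate has zero true effect, the best score observed on a fixed noisy history grows with the number of candidates tried, so the "winner" is increasingly a product of chance.

```python
import random

def best_observed_uplift(n_candidates, n_days=30, seed=0):
    """All candidates have ZERO true effect; each is scored as the mean of
    n_days of noisy daily deltas drawn from N(0, 1). The best observed
    score still grows with the evaluation budget -- pure chance-fitting."""
    rng = random.Random(seed)
    scores = [
        sum(rng.gauss(0, 1) for _ in range(n_days)) / n_days
        for _ in range(n_candidates)
    ]
    return max(scores)

for budget in (10, 100, 1000):
    print(budget, round(best_observed_uplift(budget), 3))
```

With a shared seed the larger budgets include the smaller runs' candidates, so the best observed "uplift" is monotonically non-decreasing in the budget even though every candidate is worthless.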
Yelp's named mitigation¶
Yelp's Back-Testing Engine (2026-02-02) does not technically prevent overfitting; instead it explicitly keeps A/B tests and real-world monitoring in the loop downstream of back-testing:
Verbatim: "Risk of overfitting to history: Relying too heavily on historical simulations could bias development toward optimizations that perform well on past data, potentially limiting innovation. ... Being aware of these caveats helps us use back-testing more effectively, complementing it with A/B tests and real-world monitoring to ensure robust, reliable improvements."
The back-test-then-A/B workflow is the organisational guardrail: back-testing picks promising candidates, A/B tests validate them on fresh live data that wasn't part of the back-test corpus. This is a canonical instance of concepts/filter-before-ab-test where filtering is explicitly not a substitute for the downstream validation step.
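That workflow can be sketched in a few lines. This is a hypothetical illustration, not Yelp's implementation: `backtest_score` and `ab_test` stand in for the historical-replay scorer and the live experiment, and the key property is that the back-test only shortlists while the A/B test on fresh data decides.

```python
def backtest_filter(candidates, backtest_score, top_k=3):
    """Rank candidates by their score on historical replay; keep the top_k.
    This step can only *filter* -- its winners may still be history-overfit."""
    return sorted(candidates, key=backtest_score, reverse=True)[:top_k]

def promote(candidates, backtest_score, ab_test):
    """Back-test to shortlist, then require a live A/B win to promote."""
    shortlist = backtest_filter(candidates, backtest_score)
    # The A/B test runs on fresh live data outside the back-test corpus,
    # so a candidate that merely chance-fit history tends to fail here.
    return [c for c in shortlist if ab_test(c)]
```

The design point is the asymmetry: a high back-test score is necessary to reach the shortlist but never sufficient for promotion.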
Why a "limits innovation" framing matters¶
Yelp frames overfitting not just as a statistical failure mode but as a strategic one: if every candidate must look good on recent history, radically new approaches that would take time to show value are systematically rejected. This is the same dynamic that makes market-fit metrics conservative in product development.
Mitigations (general)¶
- Out-of-sample holdout — reserve a historical window for final validation; don't optimise against it.
- Cross-validation across time windows — require candidates to win across multiple non-overlapping historical periods.
- Evaluation-budget caps — limit `max_evals` to reduce the risk of fitting noise.
- A/B test downstream — the decisive mitigation; validate on fresh data.
- Diversity penalties — penalise candidates too close to already-explored history-fitting ones.
Yelp names only the A/B-test-downstream mitigation explicitly; the others are general practice.
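The first two general mitigations combine naturally: select across several non-overlapping historical windows, then validate the winner exactly once on a reserved holdout window. A minimal sketch, assuming a hypothetical `score(candidate, window) -> float` back-test scorer:

```python
def cross_window_select(candidates, score, train_windows, holdout_window):
    """Pick the candidate with the best mean score across non-overlapping
    historical windows, then report its score on a holdout window that was
    never consulted during the search. `score` is a hypothetical scorer."""
    def mean_train(c):
        return sum(score(c, w) for w in train_windows) / len(train_windows)
    best = max(candidates, key=mean_train)
    # One-shot final check: never re-run the search against this window,
    # or it silently becomes part of the optimisation set.
    return best, score(best, holdout_window)
```

A large gap between a candidate's training-window mean and its holdout score is itself a useful overfitting signal before any A/B test is spent on it.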
Seen in¶
- sources/2026-02-02-yelp-back-testing-engine-ad-budget-allocation — named explicitly as a limitation of the Back-Testing Engine.
Related¶
- concepts/hybrid-backtesting-with-ml-counterfactual — the methodology that has this risk.
- concepts/counterfactual-outcome-prediction — the ML layer whose biases can amplify overfitting.
- concepts/filter-before-ab-test — the workflow position where A/B acts as the generalisation check.
- systems/yelp-back-testing-engine