

Overfitting to historical data

Definition

Overfitting to historical data is the failure mode of any back-testing / historical-replay methodology where optimisation against past data produces candidates that win on history but fail to generalise to the future. The history-optimal candidate may exploit spurious correlations, one-off events, or distributional features that won't persist.

This is distinct from ML overfitting in the narrow sense (model memorising training examples); here it's the experimentation framework itself that overfits — even if each ML model generalises well, the candidate selected by repeated back-testing against the same history is biased toward patterns in that history.

Why it matters in system design

Any simulation-driven search process — hyperparameter tuning, configuration tuning, algorithm-parameter optimisation — has this risk when the optimiser iterates enough times against a fixed historical dataset. The optimiser will find patterns that maximise the observed objective, some of which are real and some of which are noise.

The risk is proportional to:

  • Evaluation budget — the more candidates evaluated against the same history, the more opportunities for one of them to fit noise by chance.
  • Flexibility of the search space — richer parameterisations can fit more noise.
  • Narrowness of the historical window — short windows don't sample the distribution well.
  • Dependence on a single predictor — if a counterfactual-outcome model has systematic biases, the optimiser exploits them.
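The evaluation-budget risk can be made concrete with a small simulation. The sketch below (illustrative only; not from the Yelp source) gives every candidate a true future value of zero, so the winner's entire back-test score is noise. Selecting the maximum over more candidates inflates the apparent lift of the winner:

```python
import random

def selected_optimism(n_candidates: int, trials: int = 2000, seed: int = 0) -> float:
    """Average back-test score of the apparent-best candidate.

    Every candidate's true future value is 0, so the winner's score is
    pure noise: the more candidates the optimiser tries against the same
    fixed history, the luckier the selected winner looks.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # One noisy historical evaluation per candidate.
        scores = [rng.gauss(0.0, 1.0) for _ in range(n_candidates)]
        total += max(scores)  # the optimiser keeps the apparent best
    return total / trials

# Apparent lift of the winner grows with evaluation budget,
# even though no candidate is actually better than baseline.
for n in (1, 10, 100):
    print(f"budget={n:3d}  apparent lift of winner: {selected_optimism(n):.2f}")
```

This is the statistical core of the failure mode: the selection step, not any individual evaluation, introduces the optimistic bias.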

Yelp's named mitigation

Yelp's Back-Testing Engine (2026-02-02) does not claim to technically prevent overfitting; instead it explicitly keeps A/B tests and real-world monitoring in the loop downstream of back-testing:

Verbatim: "Risk of overfitting to history: Relying too heavily on historical simulations could bias development toward optimizations that perform well on past data, potentially limiting innovation. ... Being aware of these caveats helps us use back-testing more effectively, complementing it with A/B tests and real-world monitoring to ensure robust, reliable improvements."

The back-test-then-A/B workflow is the organisational guardrail: back-testing picks promising candidates, A/B tests validate them on fresh live data that wasn't part of the back-test corpus. This is a canonical instance of concepts/filter-before-ab-test where filtering is explicitly not a substitute for the downstream validation step.
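The two-stage structure of that workflow can be sketched as follows. All names here (`Candidate`, `backtest_filter`, `ab_validate`) are hypothetical illustrations, not Yelp's API; the point is only that the back-test stage ranks and filters while the A/B stage alone decides:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Candidate:
    name: str
    backtest_lift: float = 0.0

def backtest_filter(candidates: List[Candidate],
                    simulate: Callable[[Candidate], float],
                    top_k: int = 3) -> List[Candidate]:
    """Stage 1: rank candidates on historical replay and keep the top few.

    A good back-test score is a ticket to an A/B test, not a launch
    decision -- filtering is not a substitute for live validation.
    """
    for c in candidates:
        c.backtest_lift = simulate(c)
    return sorted(candidates, key=lambda c: c.backtest_lift, reverse=True)[:top_k]

def ab_validate(candidate: Candidate,
                run_experiment: Callable[[Candidate], float],
                min_lift: float = 0.0) -> bool:
    """Stage 2: the decisive check, on fresh live traffic that was
    never part of the back-test corpus."""
    return run_experiment(candidate) > min_lift
```

Because the A/B stage scores candidates on data the optimiser never saw, any history-only noise-fit from stage 1 gets no credit in stage 2.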

Why a "limits innovation" framing matters

Yelp frames overfitting not just as a statistical failure mode but as a strategic one: if every candidate must look good on recent history, radically new approaches that would take time to show value are systematically rejected. This is the same dynamic that makes optimising for existing market-fit metrics conservative in product development.

Mitigations (general)

  • Out-of-sample holdout — reserve a historical window for final validation; don't optimise against it.
  • Cross-validation across time windows — require candidates to win across multiple non-overlapping historical periods.
  • Evaluation-budget caps — limit the number of candidates (e.g. max_evals) evaluated against a fixed history, reducing the chance of fitting noise.
  • A/B test downstream — the decisive mitigation; validate on fresh data.
  • Diversity penalties — penalise candidates too close to already-explored history-fitting ones.
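The cross-validation-across-time-windows mitigation can be sketched in a few lines. This is a generic illustration (not from the Yelp source): a candidate is accepted only if it beats the baseline in every non-overlapping historical window, since a noise-fit candidate tends to win big in one window and lose in the others:

```python
from typing import Callable, List, Sequence

def time_window_folds(history: Sequence, n_folds: int) -> List[Sequence]:
    """Split an ordered history into non-overlapping evaluation windows."""
    size = len(history) // n_folds
    return [history[i * size:(i + 1) * size] for i in range(n_folds)]

def wins_every_window(candidate_score: Callable[[Sequence], float],
                      baseline_score: Callable[[Sequence], float],
                      history: Sequence,
                      n_folds: int = 3) -> bool:
    """Require the candidate to beat the baseline in *every* window,
    not merely in aggregate over the whole history."""
    return all(
        candidate_score(window) > baseline_score(window)
        for window in time_window_folds(history, n_folds)
    )
```

The design choice is the `all(...)`: aggregating first and comparing once would let a single lucky window carry an otherwise-losing candidate.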

Yelp names only the A/B-test-downstream mitigation explicitly; the others are general practice.
