# Computational backtest
## Definition
A computational backtest is an offline evaluation of a decision-making system (trading algorithm, replenishment engine, recommender) by replaying historical data through the system and comparing its recommendations against the actual decisions taken (or a baseline policy). Unlike A/B tests, which require a production traffic split, backtests use only historical data — no live experimentation — and measure a counterfactual uplift: what would have happened if we had followed the system's recommendations instead of the actual decisions?
In the replenishment context (a minimal code sketch follows this list):
- Replay: feed historical article × merchant × week inputs (prices, inbounds, inventory state, demand) into the engine.
- Generate: have the engine produce replenishment decisions for each historical period.
- Simulate: forward-simulate sales, returns, stockouts using the engine's decisions (with Monte Carlo / DES over realised demand).
- Compare: measure key metrics (GMV, margin, fill rate, availability) vs the historical baseline.
- Report: the delta is the counterfactual uplift.
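A minimal, self-contained sketch of this replay loop. The toy Poisson demand model, the one-line "engine" (order at the P75 of the forecast, echoing the paper's service-level target), the stand-in human baseline, and the single GMV metric are all illustrative assumptions, not the Zalando implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
PRICE, WEEKS, N_SAMPLES = 30.0, 52, 200

# Replay inputs: a year of weekly mean demands with a seasonal swing.
mean_demand = 40 + 15 * np.sin(np.arange(WEEKS) / WEEKS * 2 * np.pi)

def simulate(order_qty, mu):
    """Forward-simulate one period: sales are capped by stock on hand."""
    demand = rng.poisson(mu, size=N_SAMPLES)   # Monte Carlo over realised demand
    sales = np.minimum(order_qty, demand)      # a stockout truncates sales
    return PRICE * sales.mean()                # expected GMV under this decision

# Baseline policy: stand-in for historical human decisions (order last week's mean).
human_orders = np.roll(mean_demand, 1)
# Engine policy: order at the P75 of the (toy) probabilistic demand forecast.
engine_orders = [np.quantile(rng.poisson(mu, size=10_000), 0.75) for mu in mean_demand]

gmv_human = sum(simulate(q, mu) for q, mu in zip(human_orders, mean_demand))     # Compare
gmv_engine = sum(simulate(q, mu) for q, mu in zip(engine_orders, mean_demand))
print(f"counterfactual GMV uplift: {100 * (gmv_engine / gmv_human - 1):+.2f}%")  # Report
```

Note that the demand draws here are not shared between the two policies; a paired, variance-reduced version appears in the methodology section below.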
## Canonical instance (Zalando ZEOS paper)
The Nature Scientific Reports paper's backtest:
| Dimension | Value |
|---|---|
| Period | Oct 2023 – Sep 2024 (12 months) |
| Articles | ~2,000,000 |
| Merchants | ~800 |
| Baseline | Professional human replenishment decisions |
| Metrics | GMV, GMV after FC, availability, demand fill rate |
Results (verbatim table):
| Metric | Engine vs Human Uplift |
|---|---|
| Gross Merchandise Value (GMV) | +22.11% |
| Gross Margin (GMV after FC) | +21.95% |
| Weighted Weekly Availability | +33.63% |
| Weighted Demand Fill Rate | +23.63% |
Performance characteristics disclosed:
- Consistent seasonal performance. Positive uplifts "remained remarkably stable throughout the 12-month period, demonstrating the engine's ability to navigate high-variance seasonal peaks and troughs without performance degradation."
- Stable high service levels. The engine reaches an absolute 86.40% availability and 91.14% fill rate — the gains are not driven by aggressive overstocking.
- Broad generalisation. 70–80% of merchants saw positive financial uplifts, demonstrating that the probabilistic approach generalises across diverse article types.
## The 100%-adoption caveat
The published uplift numbers assume every merchant follows every recommendation. Verbatim:
"Note on backtest implications: It is important to clarify that the uplifts cited above represent a theoretical scenario of 100% user adoption. Because the tool serves as an AI decision-support assistant, the final authority remains with the merchants. Actual results will vary depending on how consistently merchants choose to implement the system's suggestions."
This is the decision-support vs automation caveat: the engine recommends decisions, merchants decide. Realised uplift ≤ theoretical uplift, scaled by average adoption rate × average per-decision quality.
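Restated as a back-of-envelope formula (the notation is mine, not the paper's: $\bar{a}$ is the average adoption rate and $\bar{q}$ the average per-decision quality, both in $[0, 1]$):

$$
\text{realised uplift} \approx \text{theoretical uplift} \times \bar{a} \times \bar{q} \le \text{theoretical uplift}
$$

With purely illustrative numbers, $\bar{a} = 0.75$ and $\bar{q} = 0.9$ would shrink the +22.11% GMV uplift to roughly +14.9%.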
## Methodological choices in the Zalando backtest
Several methodological choices shape the interpretation:
- Human baseline, not no-model baseline. Compared to real human replenishment decisions, not to "no replenishment at all" (which would be meaningless) or to "Tuned (s, S)" (which is in a separate comparison).
- Weighted metrics. "Weighted weekly availability" — weighted by sales volume or impressions, not a simple average. Load-bearing because a few high-volume articles dominate revenue.
- DES-driven counterfactual simulation. Replay uses the same DES as production — so the backtest is comparing `DES(engine_decisions)` to `DES(human_decisions)` over the same stochastic demand samples per period (a sketch of this paired setup, including the sales-volume weighting, follows this list).
- One-year window. Captures one full seasonal cycle — enough to avoid seasonal cherry-picking. Shorter windows (a quarter, say) might overstate or understate the uplift depending on the season.
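A sketch of that paired setup under toy assumptions. Both policies are scored against the same demand draws per period (common random numbers, so the delta reflects the decisions rather than sampling noise), and availability is aggregated with sales-volume weights; the policies and the demand model are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(1)
n_periods, n_samples = 52, 200

mu = rng.uniform(20, 80, size=n_periods)                        # per-period demand means
demand = rng.poisson(mu[:, None], size=(n_periods, n_samples))  # SHARED draws for both policies

engine_stock = np.quantile(demand, 0.75, axis=1)  # toy engine policy (P75 of forecast)
human_stock = mu * 0.9                            # toy stand-in for human decisions

def availability(stock):
    # Fraction of shared demand samples fully served, per period.
    return (demand <= stock[:, None]).mean(axis=1)

weights = mu / mu.sum()                           # sales-volume weights, not a simple average
avail_engine = (availability(engine_stock) * weights).sum()
avail_human = (availability(human_stock) * weights).sum()
print(f"weighted availability: engine {avail_engine:.2%} vs human {avail_human:.2%}")
```

Because both policies see identical demand realisations, any difference in the weighted metric is attributable to the decisions themselves.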
## What the backtest does NOT measure
Standard backtest caveats that apply here:
- Not a live test. No real-world execution risk (warehouse operational issues, supplier constraints, late inbounds) is tested.
- No causal inference. The 22% uplift is correlational — conditional on the simulation model being faithful. If the DES over-estimates stockout costs or mis-specifies lead-time distributions, the backtest's delta is biased.
- Observational distribution bias. Historical inventory states reflect past human decisions — the backtest replays engine decisions against those states, but the states themselves would be different if the engine had been in control earlier (the "off-policy evaluation" problem: if the engine would have ordered more in week 1, the inventory state it inherits in week 2 already differs from the historical one).
- No merchant-behaviour change model. If deploying the engine causes merchants to change their pricing, merchandising, or assortment strategies, that second-order effect isn't captured.
## Tradeoffs vs A/B testing
- Backtest advantages. Cheap, fast, no production risk, can run on the full historical catalogue, no need for a control-group merchant cohort.
- Backtest disadvantages. No real-world execution, assumes simulation model is faithful, off-policy evaluation bias, doesn't test adoption dynamics.
- Canonical combination. Run a backtest to shortlist candidate policies and validate the expected uplift magnitude; then run an A/B test on the top candidates to validate the realised uplift.
## Seen in
- sources/2026-01-14-zalando-paper-announcement-replenishment-optimization-extended-rsq — canonical first disclosure. Paper announcement frames the 12-month × 2M-article × 800-merchant computational backtest as the primary empirical validation of the Extended (R, s, Q) + DES + P75 approach. Theoretical-100%-adoption caveat explicitly called out.
## Related
- concepts/ablation-study-forecast-vs-objective — the ablation decomposition of which design choices contribute to the backtest uplift.
- concepts/discrete-event-simulation — the simulation machinery driving the backtest replay.
- concepts/monte-carlo-simulation-under-uncertainty — the outer loop over demand realisations.
- concepts/probabilistic-demand-forecast — the forecast whose uplift is measured.
- systems/zeos-replenishment-recommender
- companies/zalando