

Computational backtest

Definition

A computational backtest is an offline evaluation of a decision-making system (a trading algorithm, a replenishment engine, a recommender) that replays historical data through the system and compares its recommendations against the decisions actually taken (or a baseline policy). Unlike A/B tests, which require splitting production traffic, backtests use only historical data, with no live experimentation, and measure a counterfactual uplift: what would have happened if we had followed the system's recommendations instead of the actual decisions?

In a replenishment context, the loop is (a code sketch follows the list):

  • Replay: feed historical article × merchant × week inputs (prices, inbounds, inventory state, demand) into the engine.
  • Generate: have the engine produce replenishment decisions for each historical period.
  • Simulate: forward-simulate sales, returns, and stockouts under the engine's decisions (via Monte Carlo / discrete-event simulation (DES) over realised demand).
  • Compare: measure key metrics (GMV, margin, fill rate, availability) vs the historical baseline.
  • Report: the delta is the counterfactual uplift.
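
A minimal sketch of this replay-and-compare loop, assuming hypothetical engine_policy, baseline_decisions, and simulate helpers (none of these names come from the paper). Pairing both policies on the same demand samples mirrors the common-random-numbers design that the DES comparison below relies on:

```python
import numpy as np

def backtest_uplift(periods, engine_policy, baseline_decisions, simulate, n_samples=100):
    """Replay historical periods and compare engine vs. baseline GMV.

    periods: historical inputs per article x merchant x week (prices,
    inbounds, inventory state, demand). simulate(inputs, decision, seed)
    forward-simulates one demand path and returns GMV. All names here
    are illustrative, not the paper's API.
    """
    rng = np.random.default_rng(0)
    engine_gmv = baseline_gmv = 0.0
    for key, inputs in periods.items():
        engine_decision = engine_policy(inputs)          # Generate
        seeds = rng.integers(0, 2**32, size=n_samples)   # paired samples
        # Simulate both policies on the SAME demand paths, so the delta
        # isolates the effect of the decisions, not sampling noise.
        engine_gmv += np.mean([simulate(inputs, engine_decision, seed=s) for s in seeds])
        baseline_gmv += np.mean([simulate(inputs, baseline_decisions[key], seed=s) for s in seeds])
    # Report: the relative delta is the counterfactual uplift.
    return (engine_gmv - baseline_gmv) / baseline_gmv
```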

Canonical instance (Zalando ZEOS paper)

The backtest reported in the Nature Scientific Reports paper:

Dimension   Value
Period      Oct 2023 – Sep 2024 (12 months)
Articles    ~2,000,000
Merchants   ~800
Baseline    Professional human replenishment decisions
Metrics     GMV, GMV after FC, availability, demand fill rate

Results (verbatim table):

Metric                           Engine vs Human Uplift
Gross Merchandise Value (GMV)    +22.11%
Gross Margin (GMV after FC)      +21.95%
Weighted Weekly Availability     +33.63%
Weighted Demand Fill Rate        +23.63%

Performance characteristics disclosed:

  • Consistent seasonal performance. Positive uplifts "remained remarkably stable throughout the 12-month period, demonstrating the engine's ability to navigate high-variance seasonal peaks and troughs without performance degradation."
  • Stable high service levels. The engine reaches 86.40% absolute availability and a 91.14% fill rate; the gains do not come from aggressive overstocking.
  • Broad generalisation. 70–80% of merchants saw positive financial uplifts, demonstrating that the probabilistic approach generalises across diverse merchant and article profiles.

The 100%-adoption caveat

The published uplift numbers assume every merchant follows every recommendation. Verbatim:

"Note on backtest implications: It is important to clarify that the uplifts cited above represent a theoretical scenario of 100% user adoption. Because the tool serves as an AI decision-support assistant, the final authority remains with the merchants. Actual results will vary depending on how consistently merchants choose to implement the system's suggestions."

This is the decision-support vs automation caveat: the engine recommends, merchants decide. Realised uplift ≈ theoretical uplift × average adoption rate × average per-decision quality, so realised uplift ≤ theoretical uplift.
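
A back-of-envelope illustration of that scaling (the adoption and quality figures below are assumptions for illustration, not numbers from the paper):

```python
# Illustrative only: adoption and quality figures are invented assumptions.
theoretical_uplift = 0.2211   # +22.11% GMV from the backtest
adoption_rate = 0.6           # hypothetical: merchants accept 60% of recommendations
per_decision_quality = 0.9    # hypothetical: accepted decisions keep 90% of modelled value

realised_uplift = theoretical_uplift * adoption_rate * per_decision_quality
print(f"realised uplift ~ {realised_uplift:.2%}")  # ~11.94%
```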

Methodological choices in the Zalando backtest

Several methodological choices shape the interpretation:

  1. Human baseline, not a no-model baseline. The engine is compared to real human replenishment decisions, not to “no replenishment at all” (which would be meaningless) or to “Tuned (s, S)” (which appears in a separate comparison).
  2. Weighted metrics. “Weighted weekly availability” is weighted by sales volume or impressions, not a simple average. This weighting is load-bearing because a few high-volume articles dominate revenue (see the sketch after this list).
  3. DES-driven counterfactual simulation. Replay uses the same DES as production — so the backtest is comparing DES(engine_decisions) to DES(human_decisions) over the same stochastic demand samples per period.
  4. One-year window. Captures one full seasonal cycle, enough to avoid seasonal cherry-picking. Shorter windows (e.g. a single quarter) could overstate or understate the uplift depending on the season.
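
A minimal sketch of the weighted-metric computation, assuming per-article weekly availability flags and sales-volume weights (the names and figures are illustrative):

```python
import numpy as np

def weighted_weekly_availability(available, sales_volume):
    """Availability averaged with sales-volume weights.

    available: 0/1 flags, was the article in stock that week?
    sales_volume: the weights. High-volume articles dominate, which is
    the point: a simple mean would let long-tail articles mask stockouts
    on the articles that actually drive revenue.
    """
    available = np.asarray(available, dtype=float)
    weights = np.asarray(sales_volume, dtype=float)
    return float(np.average(available, weights=weights))

# Three articles: the high-volume one is out of stock.
print(weighted_weekly_availability([0, 1, 1], [1000, 10, 10]))  # ~0.02
print(np.mean([0, 1, 1]))  # ~0.67: the unweighted mean hides the stockout
```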

What the backtest does NOT measure

Standard backtest caveats that apply here:

  • Not a live test. No real-world execution risk (warehouse operational issues, supplier constraints, late inbounds) is tested.
  • No causal inference. The +22% uplift is not an estimate from a randomised experiment; it is conditional on the simulation model being faithful. If the DES over-estimates stockout costs or mis-specifies lead-time distributions, the backtest's delta is biased.
  • Observational distribution bias. Historical inventory states reflect past human decisions. The backtest replays engine decisions against those states, but the states themselves would have been different had the engine been in control earlier (the “off-policy evaluation” problem; see the sketch after this list).
  • No merchant-behaviour change model. If deploying the engine causes merchants to change their pricing, merchandising, or assortment strategies, that second-order effect isn't captured.
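
A toy illustration of that state drift, with invented single-item dynamics and a hypothetical order-up-to policy standing in for the engine: per-period replay resets inventory to the historical (human-shaped) state, whereas a true on-policy rollout lets the engine's decisions shape the states it later sees.

```python
def next_state(inventory, order, demand):
    # Toy inventory dynamics: orders arrive, demand is served, no backlog.
    return max(inventory + order - demand, 0)

def engine_order(inventory, target=50):
    # Hypothetical order-up-to policy standing in for the engine.
    return max(target - inventory, 0)

historical_states = [40, 10, 5, 30]   # states produced by past HUMAN decisions
demand = [30, 20, 25, 15]

# Per-period replay: engine decisions evaluated against human-made states.
replay = [engine_order(s) for s in historical_states]

# On-policy rollout: the engine's decisions shape the states it later sees.
state, rollout = historical_states[0], []
for d in demand:
    order = engine_order(state)
    rollout.append(order)
    state = next_state(state, order, d)

print(replay)   # [10, 40, 45, 20]
print(rollout)  # [10, 30, 20, 25] -> decisions diverge after period 1
```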

Tradeoffs vs A/B testing

  • Backtest advantages. Cheap, fast, no production risk, can run on full historical catalogue, no need for control-group merchant cohort.
  • Backtest disadvantages. No real-world execution, assumes simulation model is faithful, off-policy evaluation bias, doesn't test adoption dynamics.
  • Canonical combination. Run a backtest to shortlist candidate policies and validate the expected uplift magnitude; then run an A/B test on the top candidates to validate the realised uplift.
