

Computational backtest

Definition

A computational backtest is an offline evaluation of a decision-making system (a trading algorithm, a replenishment engine, a recommender) that replays historical data through the system and compares its recommendations against the decisions actually taken (or a baseline policy). Unlike A/B tests, which require splitting production traffic, backtests use only historical data, with no live experimentation, and measure a counterfactual uplift: what would have happened if we had followed the system's recommendations instead of the actual decisions?

In a replenishment context, the loop is (a code sketch follows the list):

  • Replay: feed historical article × merchant × week inputs (prices, inbounds, inventory state, demand) into the engine.
  • Generate: have the engine produce replenishment decisions for each historical period.
  • Simulate: forward-simulate sales, returns, and stockouts under the engine's decisions (via Monte Carlo / discrete-event simulation (DES) over realised demand).
  • Compare: measure key metrics (GMV, margin, fill rate, availability) vs the historical baseline.
  • Report: the delta is the counterfactual uplift.
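
A minimal sketch of this replay-and-compare loop, assuming hypothetical engine_policy, baseline_decisions, and simulate helpers (none of these names come from the paper). Pairing both policies on the same demand samples mirrors the common-random-numbers design that the DES comparison below relies on:

```python
import numpy as np

def backtest_uplift(periods, engine_policy, baseline_decisions, simulate, n_samples=100):
    """Replay historical periods and compare engine vs. baseline GMV.

    periods: historical inputs per article x merchant x week (prices,
    inbounds, inventory state, demand). simulate(inputs, decision, seed)
    forward-simulates one demand path and returns GMV. All names here
    are illustrative, not the paper's API.
    """
    rng = np.random.default_rng(0)
    engine_gmv = baseline_gmv = 0.0
    for key, inputs in periods.items():
        engine_decision = engine_policy(inputs)          # Generate
        seeds = rng.integers(0, 2**32, size=n_samples)   # paired samples
        # Simulate both policies on the SAME demand paths, so the delta
        # isolates the effect of the decisions, not sampling noise.
        engine_gmv += np.mean([simulate(inputs, engine_decision, seed=s) for s in seeds])
        baseline_gmv += np.mean([simulate(inputs, baseline_decisions[key], seed=s) for s in seeds])
    # Report: the relative delta is the counterfactual uplift.
    return (engine_gmv - baseline_gmv) / baseline_gmv
```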

Canonical instance (Zalando ZEOS paper)

The backtest reported in the Nature Scientific Reports paper:

Dimension   Value
Period      Oct 2023 – Sep 2024 (12 months)
Articles    ~2,000,000
Merchants   ~800
Baseline    Professional human replenishment decisions
Metrics     GMV, GMV after FC, availability, demand fill rate

Results (verbatim table):

Metric                           Engine vs Human Uplift
Gross Merchandise Value (GMV)    +22.11%
Gross Margin (GMV after FC)      +21.95%
Weighted Weekly Availability     +33.63%
Weighted Demand Fill Rate        +23.63%

Performance characteristics disclosed:

  • Consistent seasonal performance. Positive uplifts "remained remarkably stable throughout the 12-month period, demonstrating the engine's ability to navigate high-variance seasonal peaks and troughs without performance degradation."
  • Stable high service levels. The engine reaches 86.40% absolute availability and a 91.14% fill rate; the gains do not come from aggressive overstocking.
  • Broad generalisation. 70–80% of merchants saw positive financial uplifts, demonstrating that the probabilistic approach generalises across diverse merchant and article profiles.

The 100%-adoption caveat

The published uplift numbers assume every merchant follows every recommendation. Verbatim:

"Note on backtest implications: It is important to clarify that the uplifts cited above represent a theoretical scenario of 100% user adoption. Because the tool serves as an AI decision-support assistant, the final authority remains with the merchants. Actual results will vary depending on how consistently merchants choose to implement the system's suggestions."

This is the decision-support vs automation caveat: the engine recommends, merchants decide. Realised uplift ≈ theoretical uplift × average adoption rate × average per-decision quality, so realised uplift ≤ theoretical uplift.
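
A back-of-envelope illustration of that scaling (the adoption and quality figures below are assumptions for illustration, not numbers from the paper):

```python
# Illustrative only: adoption and quality figures are invented assumptions.
theoretical_uplift = 0.2211   # +22.11% GMV from the backtest
adoption_rate = 0.6           # hypothetical: merchants accept 60% of recommendations
per_decision_quality = 0.9    # hypothetical: accepted decisions keep 90% of modelled value

realised_uplift = theoretical_uplift * adoption_rate * per_decision_quality
print(f"realised uplift ~ {realised_uplift:.2%}")  # ~11.94%
```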

Methodological choices in the Zalando backtest

Several methodological choices shape the interpretation:

  1. Human baseline, not a no-model baseline. The engine is compared to real human replenishment decisions, not to “no replenishment at all” (which would be meaningless) or to “Tuned (s, S)” (which appears in a separate comparison).
  2. Weighted metrics. “Weighted weekly availability” is weighted by sales volume or impressions, not a simple average. This weighting is load-bearing because a few high-volume articles dominate revenue (see the sketch after this list).
  3. DES-driven counterfactual simulation. Replay uses the same DES as production — so the backtest is comparing DES(engine_decisions) to DES(human_decisions) over the same stochastic demand samples per period.
  4. One-year window. Captures one full seasonal cycle, enough to avoid seasonal cherry-picking. Shorter windows (e.g. a single quarter) could overstate or understate the uplift depending on the season.
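
A minimal sketch of the weighted-metric computation, assuming per-article weekly availability flags and sales-volume weights (the names and figures are illustrative):

```python
import numpy as np

def weighted_weekly_availability(available, sales_volume):
    """Availability averaged with sales-volume weights.

    available: 0/1 flags, was the article in stock that week?
    sales_volume: the weights. High-volume articles dominate, which is
    the point: a simple mean would let long-tail articles mask stockouts
    on the articles that actually drive revenue.
    """
    available = np.asarray(available, dtype=float)
    weights = np.asarray(sales_volume, dtype=float)
    return float(np.average(available, weights=weights))

# Three articles: the high-volume one is out of stock.
print(weighted_weekly_availability([0, 1, 1], [1000, 10, 10]))  # ~0.02
print(np.mean([0, 1, 1]))  # ~0.67: the unweighted mean hides the stockout
```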

What the backtest does NOT measure

Standard backtest caveats that apply here:

  • Not a live test. No real-world execution risk (warehouse operational issues, supplier constraints, late inbounds) is tested.
  • No causal inference. The +22% uplift is not an estimate from a randomised experiment; it is conditional on the simulation model being faithful. If the DES over-estimates stockout costs or mis-specifies lead-time distributions, the backtest's delta is biased.
  • Observational distribution bias. Historical inventory states reflect past human decisions. The backtest replays engine decisions against those states, but the states themselves would have been different had the engine been in control earlier (the “off-policy evaluation” problem; see the sketch after this list).
  • No merchant-behaviour change model. If deploying the engine causes merchants to change their pricing, merchandising, or assortment strategies, that second-order effect isn't captured.
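
A toy illustration of that state drift, with invented single-item dynamics and a hypothetical order-up-to policy standing in for the engine: per-period replay resets inventory to the historical (human-shaped) state, whereas a true on-policy rollout lets the engine's decisions shape the states it later sees.

```python
def next_state(inventory, order, demand):
    # Toy inventory dynamics: orders arrive, demand is served, no backlog.
    return max(inventory + order - demand, 0)

def engine_order(inventory, target=50):
    # Hypothetical order-up-to policy standing in for the engine.
    return max(target - inventory, 0)

historical_states = [40, 10, 5, 30]   # states produced by past HUMAN decisions
demand = [30, 20, 25, 15]

# Per-period replay: engine decisions evaluated against human-made states.
replay = [engine_order(s) for s in historical_states]

# On-policy rollout: the engine's decisions shape the states it later sees.
state, rollout = historical_states[0], []
for d in demand:
    order = engine_order(state)
    rollout.append(order)
    state = next_state(state, order, d)

print(replay)   # [10, 40, 45, 20]
print(rollout)  # [10, 30, 20, 25] -> decisions diverge after period 1
```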

Tradeoffs vs A/B testing

  • Backtest advantages. Cheap, fast, no production risk, can run on full historical catalogue, no need for control-group merchant cohort.
  • Backtest disadvantages. No real-world execution, assumes simulation model is faithful, off-policy evaluation bias, doesn't test adoption dynamics.
  • Canonical combination. Run a backtest to shortlist candidate policies and validate the expected uplift magnitude; then run an A/B test on the top candidates to validate the realised uplift.
