Winning-indicator t-test¶
Definition¶
In interleaving testing of two rankings A and B, each search (or user) produces a winning indicator — a per-unit signed value:
- **+1** if A won
- **−1** if B won
- **0** (or a tie-adjusted fraction) if the unit tied
The winning-indicator t-test tests whether the mean of the winning-indicator distribution is significantly different from zero — the null being "no user preference between A and B."
A significant deviation from 0 in either direction is evidence of user preference for A or B.
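A minimal sketch of the test on synthetic data, using `scipy.stats.ttest_1samp`; the outcome mix and sample size are invented for illustration, not Expedia's numbers:

```python
import numpy as np
from scipy import stats

# Hypothetical winning indicators from an interleaving experiment:
# +1 -> ranking A won the search, -1 -> B won, 0 -> tie.
rng = np.random.default_rng(0)
indicators = rng.choice([-1, 0, 1], size=10_000, p=[0.30, 0.38, 0.32])

# One-sample t-test against 0: the null is "no user preference
# between A and B"; a small two-sided p-value is evidence of a
# preference in either direction.
t_stat, p_value = stats.ttest_1samp(indicators, popmean=0.0)
print(f"mean={indicators.mean():+.4f}  t={t_stat:.2f}  p={p_value:.4f}")
```

The same statistic can be computed by hand as `mean / (std / sqrt(n))` with the sample standard deviation (`ddof=1`), which is all the distributed implementation needs to reduce.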
Why it matters¶
The standard non-parametric alternative — bootstrap percentile method on the lift metric — "works well as it doesn't make any assumptions on the underlying data, [but] it's slow in practice even if implemented in a distributed fashion" (Expedia, 2026-02-17).
The t-test substitutes a cheap parametric test: after computing the winning indicators, the mean and standard deviation take a single O(n) pass, not the O(resamples · n) work of the bootstrap. Expedia's claim: "this approach yields virtually the same results as the bootstrapping approach but is considerably faster."
Why it works at production scale¶
The winning-indicator distribution is discrete (values around {−1, 0, +1}, with possible tie fractions), but by the Central Limit Theorem the sampling distribution of its mean approaches a normal distribution at large n. Under CLT conditions:
- The sample mean's sampling distribution is approximately normal.
- The t-test's assumption (normally-distributed mean) holds approximately.
- The bootstrap and t-test converge to the same confidence interval.
At production query volumes (thousands to millions of searches per experiment), the CLT holds comfortably; at low traffic (early prototypes, narrow product segments), the discrete winning-indicator distribution may violate the t-test's assumptions. Bootstrap is safer there, and at small n its cost is affordable anyway.
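The convergence claim can be checked directly by computing both intervals on the same synthetic indicators; this is a sketch, and the sample size, outcome mix, and resample count are all assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical indicators at production-like n.
x = rng.choice([-1, 0, 1], size=20_000, p=[0.30, 0.38, 0.32])
n = x.size

# t-based 95% CI for the mean: one O(n) pass.
se = x.std(ddof=1) / np.sqrt(n)
t_crit = stats.t.ppf(0.975, df=n - 1)
ci_t = (x.mean() - t_crit * se, x.mean() + t_crit * se)

# Bootstrap percentile 95% CI: O(resamples * n).
means = np.array([rng.choice(x, size=n, replace=True).mean()
                  for _ in range(1_000)])
ci_boot = (np.percentile(means, 2.5), np.percentile(means, 97.5))

print(f"t-test CI:    [{ci_t[0]:+.4f}, {ci_t[1]:+.4f}]")
print(f"bootstrap CI: [{ci_boot[0]:+.4f}, {ci_boot[1]:+.4f}]")
```

At this n the two intervals agree to roughly three decimal places, while the bootstrap loop does a thousand times the work.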
Trade-offs¶
| | Bootstrap percentile method | Winning-indicator t-test |
|---|---|---|
| Assumptions | None (non-parametric) | Approximately-normal mean (CLT) |
| Speed | Slow — needs O(resamples · n) per CI | Fast — O(n) once |
| Distributed impl | Needs partition-reducible resampling | Trivial (mean + variance) |
| Behaviour at low n | Correct (slow but unbiased) | CLT may not hold — skewed |
| Production scale | Equivalent | "Virtually the same results" |
Caveats¶
- CLT assumption is approximate. For highly skewed winning-indicator distributions (e.g., most searches tie, few win decisively), the t-test's normality assumption is weaker; verify with bootstrap at least once per new product surface.
- Ties need a fractional encoding. If ties map strictly to 0, the variance is underestimated; if ties map to ±0.5 per side, the variance is better estimated.
- Cluster correlation. If per-user search sessions are correlated (one user's searches all tend to favour the same variant), the effective sample size is below the raw count, and both t-test and bootstrap need to account for it (e.g., user-level aggregation — Expedia's default).
- One-sided vs two-sided. Expedia tests against 0 in either direction (two-sided). One-sided tests are more powerful but require a prior hypothesis.
- Significance != launch-ready. A statistically significant win in interleaving is still only a directional signal; a launch decision needs A/B rollouts for CVR uplift and revenue impact.
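The cluster-correlation caveat can be sketched by aggregating to one mean indicator per user before testing, so the unit of analysis matches the unit of independence; the latent-preference model and every parameter below are invented for illustration, not Expedia's:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical clustered data: each user's searches share a latent
# lean toward A or B, so raw searches are not independent units.
n_users = 1_000
user_means = []
for _ in range(n_users):
    lean = rng.normal(0.02, 0.25)            # latent per-user preference
    p_a = np.clip(0.32 + lean, 0.0, 0.6)     # prob a search favours A
    n_s = rng.integers(1, 30)                # searches by this user
    searches = rng.choice([-1, 0, 1], size=n_s,
                          p=[0.62 - p_a, 0.38, p_a])
    user_means.append(searches.mean())       # aggregate to one unit/user

# t-test on user-level means: variance is no longer understated by
# correlated searches, because each user contributes one observation.
t_stat, p_value = stats.ttest_1samp(np.array(user_means), popmean=0.0)
```

Testing on the raw searches instead would treat every search as independent and overstate the effective sample size.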
Seen in¶
- sources/2026-02-17-expedia-interleaving-for-accelerated-testing — Expedia Group Tech's replacement for bootstrap-percentile CI on lift; "virtually the same results ... considerably faster."
Related¶
- concepts/interleaving-testing — the technique that produces the winning indicators.
- concepts/lift-metric — the metric whose mean the t-test is testing.
- concepts/bootstrap-percentile-method — the non-parametric baseline.
- patterns/t-test-over-bootstrap — the generalised pattern: substituting a CLT-backed t-test for bootstrap when performance matters and CLT applies.