Winning-indicator t-test

Definition

In interleaving testing of two rankings A and B, each search (or user) produces a winning indicator — a per-unit signed value:

  • +1 if A won
  • −1 if B won
  • 0 (or a tie-adjusted fraction) if the unit tied

The winning-indicator t-test tests whether the mean of the winning-indicator distribution is significantly different from zero — the null being "no user preference between A and B."

t = mean(winning_indicators) / (sample_std / √n)

A significant deviation from 0 in either direction is evidence of user preference for A or B.
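The statistic above can be sketched in a few lines; the function name and the sample indicators below are illustrative, not from the source:

```python
import math

def winning_indicator_t(indicators):
    """One-sample t-statistic against a null mean of 0.

    indicators: per-unit values, +1 (A won), -1 (B won), 0 (tie).
    Returns (t, mean) so the caller can read off direction and size.
    """
    n = len(indicators)
    mean = sum(indicators) / n
    # Sample variance with Bessel's correction (n - 1).
    var = sum((x - mean) ** 2 for x in indicators) / (n - 1)
    t = mean / math.sqrt(var / n)
    return t, mean

t, mean = winning_indicator_t([1, 1, -1, 0, 1, -1, 1, 0, 1, 1])
```

A positive t favours A, a negative t favours B; compare |t| against the critical value for the chosen significance level.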

Why it matters

The standard non-parametric alternative, the bootstrap percentile method on the lift metric, "works well as it doesn't make any assumptions on the underlying data, [but] it's slow in practice even if implemented in a distributed fashion" (Expedia, 2026-02-17).

The t-test substitutes a single-pass parametric test: after computing the winning indicators, the mean and standard deviation cost O(n) once, not the O(resamples · n) work of the bootstrap. Expedia's claim: "this approach yields virtually the same results as the bootstrapping approach but is considerably faster."
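A runnable sketch of the cost contrast, assuming synthetic indicator data and a deliberately small resample count (production bootstraps typically use 1,000+):

```python
import random
import statistics

random.seed(0)
# Synthetic winning indicators: A wins slightly more often than B.
data = random.choices([1, -1, 0], weights=[0.40, 0.35, 0.25], k=20_000)

# t-test route: one O(n) pass for mean and standard error.
m = statistics.fmean(data)
se = statistics.stdev(data) / len(data) ** 0.5
t_ci = (m - 1.96 * se, m + 1.96 * se)   # ~95% CI via the normal approximation

# Bootstrap percentile route: O(resamples * n).
means = sorted(
    statistics.fmean(random.choices(data, k=len(data)))
    for _ in range(200)                 # small here for illustration
)
boot_ci = (means[4], means[194])        # 2.5th / 97.5th percentiles of 200
```

At this n the two intervals land essentially on top of each other, which is the "virtually the same results" claim in miniature.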

Why it works at production scale

The winning-indicator distribution is discrete (values around {−1, 0, +1} with possible tie fractions), but by the Central Limit Theorem the sampling distribution of its mean approaches a normal distribution at large n. Under CLT conditions:

  • The sample mean's sampling distribution is approximately normal.
  • The t-test's assumption (normally-distributed mean) holds approximately.
  • The bootstrap and t-test converge to the same confidence interval.
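The CLT behaviour above can be checked numerically; the weights and experiment sizes below are arbitrary assumptions for illustration:

```python
import random
import statistics

random.seed(1)

def mean_of_draws(n):
    # One simulated experiment: n discrete winning indicators, then their mean.
    return statistics.fmean(random.choices([1, -1, 0], weights=[2, 2, 1], k=n))

# Sampling distribution of the mean at two experiment sizes.
small = [mean_of_draws(10) for _ in range(2_000)]
large = [mean_of_draws(1_000) for _ in range(2_000)]

# CLT: the spread of the mean shrinks like 1/sqrt(n) even though
# each individual draw is discrete, not normal.
ratio = statistics.stdev(small) / statistics.stdev(large)  # ~ sqrt(100) = 10
```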

At production query volumes (thousands to millions of searches per experiment), CLT holds comfortably; at low-traffic scale (early prototypes, narrow product segments), the discrete winning-indicator distribution may violate the t-test's assumptions. Bootstrap is safer there, and with so little data its cost is no longer prohibitive.

Trade-offs

                      Bootstrap percentile method           Winning-indicator t-test
Assumptions           None (non-parametric)                 Approximately-normal mean (CLT)
Speed                 Slow — O(resamples · n) per CI        Fast — O(n) once
Distributed impl      Needs partition-reducible resampling  Trivial (mean + variance)
Behaviour at low n    Correct (slow but unbiased)           CLT may not hold — skewed
Production scale      Equivalent                            "Virtually the same results"

Caveats

  • CLT assumption is approximate. For highly skewed winning-indicator distributions (e.g., most searches tie, few win decisively), the t-test's normality assumption is weaker; verify with bootstrap at least once per new product surface.
  • Ties need a fractional encoding. If ties map to 0 strictly, the variance is underestimated; if ties map to ±0.5 per side, variance is better estimated.
  • Cluster correlation. If per-user search sessions are correlated (one user's searches all tend to favour the same variant), the effective sample size is below the raw count, and both t-test and bootstrap need to account for it (e.g., user-level aggregation — Expedia's default).
  • One-sided vs two-sided. Expedia tests against 0 in either direction (two-sided). One-sided tests are more powerful but require a prior hypothesis.
  • Significance != launch-ready. A statistically significant win in interleaving is still only a directional signal; a launch decision needs A/B rollouts for CVR uplift and revenue impact.
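One way to apply the user-level aggregation mentioned in the cluster-correlation caveat; the records and user IDs below are illustrative:

```python
from collections import defaultdict
import statistics

# Per-search records: (user_id, indicator). A user's searches tend to be
# correlated, so aggregate to one value per user before testing.
searches = [("u1", 1), ("u1", 1), ("u1", 1), ("u2", -1), ("u2", 0), ("u3", 1)]

per_user = defaultdict(list)
for user, ind in searches:
    per_user[user].append(ind)

# One winning indicator per user: that user's mean preference.
units = [statistics.fmean(v) for v in per_user.values()]
# The t-test's n is now the number of users (3), not searches (6).
```

The same aggregated units feed either the t-test or the bootstrap, so the correction applies uniformly.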
