Winning-indicator t-test

Definition

In interleaving testing of two rankings A and B, each search (or user) produces a winning indicator — a per-unit signed value:

  • +1 if A won
  • −1 if B won
  • 0 (or a tie-adjusted fraction) if the unit tied

The winning-indicator t-test tests whether the mean of the winning-indicator distribution is significantly different from zero — the null being "no user preference between A and B."

t = mean(winning_indicators) / (sample_std / √n)

A significant deviation from 0 in either direction is evidence of user preference for A or B.
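The statistic above can be sketched in a few lines; the function name and the sample indicators below are illustrative, not from the source:

```python
import math

def winning_indicator_t(indicators):
    """One-sample t-statistic against a null mean of 0.

    indicators: per-unit values, +1 (A won), -1 (B won), 0 (tie).
    Returns (t, mean) so the caller can read off direction and size.
    """
    n = len(indicators)
    mean = sum(indicators) / n
    # Sample variance with Bessel's correction (n - 1).
    var = sum((x - mean) ** 2 for x in indicators) / (n - 1)
    t = mean / math.sqrt(var / n)
    return t, mean

t, mean = winning_indicator_t([1, 1, -1, 0, 1, -1, 1, 0, 1, 1])
```

A positive t favours A, a negative t favours B; compare |t| against the critical value for the chosen significance level.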

Why it matters

The standard non-parametric alternative, the bootstrap percentile method on the lift metric, "works well as it doesn't make any assumptions on the underlying data, [but] it's slow in practice even if implemented in a distributed fashion" (Expedia, 2026-02-17).

The t-test substitutes a single-pass parametric test: after computing the winning indicators, the mean and standard deviation cost O(n) once, not the O(resamples · n) work of the bootstrap. Expedia's claim: "this approach yields virtually the same results as the bootstrapping approach but is considerably faster."
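A runnable sketch of the cost contrast, assuming synthetic indicator data and a deliberately small resample count (production bootstraps typically use 1,000+):

```python
import random
import statistics

random.seed(0)
# Synthetic winning indicators: A wins slightly more often than B.
data = random.choices([1, -1, 0], weights=[0.40, 0.35, 0.25], k=20_000)

# t-test route: one O(n) pass for mean and standard error.
m = statistics.fmean(data)
se = statistics.stdev(data) / len(data) ** 0.5
t_ci = (m - 1.96 * se, m + 1.96 * se)   # ~95% CI via the normal approximation

# Bootstrap percentile route: O(resamples * n).
means = sorted(
    statistics.fmean(random.choices(data, k=len(data)))
    for _ in range(200)                 # small here for illustration
)
boot_ci = (means[4], means[194])        # 2.5th / 97.5th percentiles of 200
```

At this n the two intervals land essentially on top of each other, which is the "virtually the same results" claim in miniature.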

Why it works at production scale

The winning-indicator distribution is discrete (values around {−1, 0, +1} with possible tie fractions), but by the Central Limit Theorem the sampling distribution of its mean approaches a normal distribution at large n. Under CLT conditions:

  • The sample mean's sampling distribution is approximately normal.
  • The t-test's assumption (normally-distributed mean) holds approximately.
  • The bootstrap and t-test converge to the same confidence interval.
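The CLT behaviour above can be checked numerically; the weights and experiment sizes below are arbitrary assumptions for illustration:

```python
import random
import statistics

random.seed(1)

def mean_of_draws(n):
    # One simulated experiment: n discrete winning indicators, then their mean.
    return statistics.fmean(random.choices([1, -1, 0], weights=[2, 2, 1], k=n))

# Sampling distribution of the mean at two experiment sizes.
small = [mean_of_draws(10) for _ in range(2_000)]
large = [mean_of_draws(1_000) for _ in range(2_000)]

# CLT: the spread of the mean shrinks like 1/sqrt(n) even though
# each individual draw is discrete, not normal.
ratio = statistics.stdev(small) / statistics.stdev(large)  # ~ sqrt(100) = 10
```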

At production query volumes (thousands to millions of searches per experiment), CLT holds comfortably; at low-traffic scale (early prototypes, narrow product segments), the discrete winning-indicator distribution may violate the t-test's assumptions. Bootstrap is safer there, and with so little data its cost is no longer prohibitive.

Trade-offs

                      Bootstrap percentile method           Winning-indicator t-test
Assumptions           None (non-parametric)                 Approximately-normal mean (CLT)
Speed                 Slow — O(resamples · n) per CI        Fast — O(n) once
Distributed impl      Needs partition-reducible resampling  Trivial (mean + variance)
Behaviour at low n    Correct (slow but unbiased)           CLT may not hold — skewed
Production scale      Equivalent                            "Virtually the same results"

Caveats

  • CLT assumption is approximate. For highly skewed winning-indicator distributions (e.g., most searches tie, few win decisively), the t-test's normality assumption is weaker; verify with bootstrap at least once per new product surface.
  • Ties need a fractional encoding. If ties map to 0 strictly, the variance is underestimated; if ties map to ±0.5 per side, variance is better estimated.
  • Cluster correlation. If per-user search sessions are correlated (one user's searches all tend to favour the same variant), the effective sample size is below the raw count, and both t-test and bootstrap need to account for it (e.g., user-level aggregation — Expedia's default).
  • One-sided vs two-sided. Expedia tests against 0 in either direction (two-sided). One-sided tests are more powerful but require a prior hypothesis.
  • Significance != launch-ready. A statistically significant win in interleaving is still only a directional signal; a launch decision needs A/B rollouts for CVR uplift and revenue impact.
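One way to apply the user-level aggregation mentioned in the cluster-correlation caveat; the records and user IDs below are illustrative:

```python
from collections import defaultdict
import statistics

# Per-search records: (user_id, indicator). A user's searches tend to be
# correlated, so aggregate to one value per user before testing.
searches = [("u1", 1), ("u1", 1), ("u1", 1), ("u2", -1), ("u2", 0), ("u3", 1)]

per_user = defaultdict(list)
for user, ind in searches:
    per_user[user].append(ind)

# One winning indicator per user: that user's mean preference.
units = [statistics.fmean(v) for v in per_user.values()]
# The t-test's n is now the number of users (3), not searches (6).
```

The same aggregated units feed either the t-test or the bootstrap, so the correction applies uniformly.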
