PATTERN
Interleaved ranking evaluation¶
Intent¶
Evaluate a candidate ranking change faster and more sensitively than A/B testing on CVR uplift can. Use as a screening layer to winnow candidate changes before A/B validation, not as a replacement for launch-decision A/B testing.
Structure¶
- Produce two rankings A and B for the same query / user / context from the same candidate pool. Both must be populated enough to interleave sensibly.
- Interleave their slots into one displayed list per impression, tagging each slot's source ranking. Variant selection (balanced, team-draft, probabilistic) affects position-bias correction but not the pattern shape.
- Display the interleaved list to the user — one user = one list with contributions from both A and B.
- Attribute events to source ranking; track both event types independently:
    - Click events (e.g., property-detail-page view) — high frequency, fast detection.
    - Booking events (completed transactions) — rare, close to revenue.
- Compute per-search winning indicator: +1 if A's attributed events > B's, −1 if reverse, 0 (or fractional) for ties.
- (Optional) Aggregate to the per-user level by majority vote over each user's searches — Expedia's default, because it bounds the influence of heavy-searcher users.
- Aggregate into lift. Report separately for clicks and bookings.
- Test significance with winning-indicator t-test (fast, CLT-backed) or bootstrap percentile (non-parametric, slower).
- Promote or kill the candidate:
    - Significant positive lift on both clicks and bookings → run a full A/B test on CVR uplift before launch.
    - Significant negative lift on either → kill the candidate; no A/B needed.
    - Mixed signals (click-lift +, booking-lift −) → flag for review; the candidate may be exploiting click-bait without increasing bookings.
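The interleave-attribute-score loop above can be sketched in a few lines. This is an illustrative implementation of the classical team-draft variant and a per-search winning indicator, not Expedia's disclosed code; all function and variable names are made up for the example.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, k=10, rng=random):
    """Team-draft interleaving: each round a coin flip decides which ranking
    drafts first, then each ranking places its highest not-yet-shown item.
    Returns the displayed list and each slot's source ranking ('A' or 'B').
    Assumes both rankings draw on the same candidate pool and k <= pool size."""
    shown, sources, used = [], [], set()
    while len(shown) < k:
        first = rng.choice(['A', 'B'])
        for team in (first, 'B' if first == 'A' else 'A'):
            ranking = ranking_a if team == 'A' else ranking_b
            for item in ranking:                # highest-ranked unused item
                if item not in used:
                    used.add(item)
                    shown.append(item)
                    sources.append(team)
                    break
            if len(shown) == k:
                break
    return shown, sources

def winning_indicator(sources, engaged_slots):
    """Per-search winning indicator: +1 if A's attributed events exceed B's,
    -1 if the reverse, 0 on a tie (the zero tie-encoding variant)."""
    a = sum(1 for i in engaged_slots if sources[i] == 'A')
    b = sum(1 for i in engaged_slots if sources[i] == 'B')
    return (a > b) - (a < b)
```

Computed once per impression for clicks and once for bookings, these indicators are what the aggregation and significance-testing steps consume.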
Why this works (sensitivity mechanism)¶
Interleaving converts between-user variance (the variance that drowns subtle ranking-change signals in traditional A/B testing) into within-user paired variance. Each user's personal engagement tendency cancels between A and B because they see both on the same search. The statistical win is the same as paired-sample vs two-sample t-tests in classical statistics. See concepts/test-sensitivity.
Operational outcome (Expedia, 2026-02-17)¶
Expedia Group Tech's lodging-search team reports:
- On two deliberately-deteriorated treatments (pinning a random property to positions 5–10; reshuffling top slots), interleaving detects the regression "within a few days of data taking."
- A/B testing on CVR uplift "fails to detect the negative effect of random pinning even with the full sample size."
- Click events give a statistically-significant negative result "already after the first day of data taking" on the worst treatments.
Variants and tuning¶
- Interleaving variant — balanced, team-draft, probabilistic. Team-draft is the classical default and the safest with respect to position bias. Expedia does not disclose their choice.
- Aggregation level — per-search vs per-user. Per-user bounds heavy-searcher influence.
- Tie handling — zero, ±0.5, or other fractional encodings for the winning indicator.
- Significance test — t-test (fast) or bootstrap (non-parametric). See patterns/t-test-over-bootstrap for when each applies.
- Event types tracked — clicks only, bookings only, or both (recommended).
- Minimum lift threshold — not every statistically-significant lift is practically significant; set a floor.
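The two significance-test options can be sketched on the winning indicators directly. This is a generic illustration (with made-up preference proportions), not Expedia's implementation: a one-sample t statistic on the per-user indicators, compared against the CLT-backed 1.96 threshold, and a bootstrap percentile interval as the non-parametric baseline.

```python
import numpy as np

def winning_indicator_tstat(w):
    """One-sample t statistic of the winning indicators against 0.
    Under CLT it is approximately N(0,1) at large n; |t| > 1.96 ~ p < 0.05."""
    w = np.asarray(w, dtype=float)
    return w.mean() / (w.std(ddof=1) / np.sqrt(len(w)))

def bootstrap_percentile_ci(w, n_boot=2000, alpha=0.05, seed=0):
    """Bootstrap percentile CI for the mean indicator (the lift's direction
    signal). No normality assumption, but n_boot resamples make it slower."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w, dtype=float)
    means = rng.choice(w, size=(n_boot, len(w)), replace=True).mean(axis=1)
    return np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])

# Toy per-user indicators: 55% prefer A, 40% prefer B, 5% tie (encoded as 0).
rng = np.random.default_rng(1)
w = rng.choice([1, -1, 0], size=2000, p=[0.55, 0.40, 0.05])
t = winning_indicator_tstat(w)
lo, hi = bootstrap_percentile_ci(w)
# Both tests agree here: t clears 1.96 and the CI excludes 0,
# i.e., a significant positive lift toward A.
```

When the two disagree at small n, trust the bootstrap — that is exactly the low-traffic regime where the t-test's CLT assumption is shaky.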
Pre-conditions¶
- Ranking experiments only. Treatments must be slot-composable. Does not apply to UI / pricing / flow changes.
- Same candidate pool. A and B must rank the same underlying inventory; new-inventory A vs old-inventory B is not interleavable.
- Per-slot provenance channel. The serving stack must tag and log each slot's source ranking end-to-end.
- Sufficient traffic for CLT. At low QPS the t-test's normality assumption may not hold; use bootstrap at least until equivalence is verified for the surface.
Caveats¶
- Position bias. If A and B win different positions systematically, position bias confounds attribution. Use team-draft or balanced variants; don't compute naive slot-position-agnostic attribution.
- Blast radius. Every impressed user sees some of both rankings — a bad candidate exposes every user to some slots of it, unlike A/B where only the bad-arm cohort is affected.
- Direction, not magnitude. Interleaving wins are "A > B" signals, not "A converts 2.3 % more than B." Launch decisions still need A/B CVR uplift for revenue accounting.
- Click ≠ booking ≠ revenue. Divergent click-lift and booking-lift signals need human judgement; click-bait exploitation is a known failure mode.
- Not for very low-traffic surfaces. The discrete winning-indicator distribution violates the t-test's CLT assumption at small n; the bootstrap is safer, but at such volumes even it often cannot reach significance quickly.
Relation to other patterns¶
- patterns/ab-test-rollout — complementary downstream layer; used for launch decisions after interleaving screens pass.
- patterns/t-test-over-bootstrap — the specific significance-test substitution Expedia uses in this pattern.
- patterns/staged-rollout — orthogonal; blast-radius control is still needed even under interleaving's paired design.
Seen in¶
- sources/2026-02-17-expedia-interleaving-for-accelerated-testing — Expedia's lodging-search interleaving framework with click + booking attribution, user-level aggregation, t-test significance as the fast substitute for bootstrap; reported large sensitivity gain over A/B CVR uplift for subtle ranking regressions.
Related¶
- concepts/interleaving-testing — the core technique.
- concepts/lift-metric — the direction-of-preference metric.
- concepts/winning-indicator-t-test — the fast significance test.
- concepts/bootstrap-percentile-method — the non-parametric significance baseline.
- concepts/test-sensitivity — the axis on which this pattern wins.
- concepts/conversion-rate-uplift — the magnitude metric that A/B testing reports but interleaving doesn't.
- systems/expedia-lodging-ranker — the subject-of-measurement system.
- companies/expedia — the company page.