
PATTERN

Interleaved ranking evaluation

Intent

Evaluate a candidate ranking change faster and more sensitively than A/B testing on CVR uplift can. Use it as a screening layer to winnow candidate changes before A/B validation, not as a replacement for launch-decision A/B testing.

Structure

  1. Produce two rankings A and B for the same query / user / context from the same candidate pool. Both must be populated enough to interleave sensibly.
  2. Interleave their slots into one displayed list per impression, tagging each slot's source ranking. Variant selection (balanced, team-draft, probabilistic) affects position-bias correction but not the pattern shape.
  3. Display the interleaved list to the user — one user = one list with contributions from both A and B.
  4. Attribute events to source ranking:
     • Click events (e.g., property-detail-page view) — high frequency, fast detection.
     • Booking events (completed transactions) — rare, close to revenue. Track both independently.
  5. Compute the per-search winning indicator: +1 if A's attributed events > B's, −1 if the reverse, 0 (or fractional) for ties.
  6. (Optional) Aggregate to per-user by majority vote per user — Expedia's default, because it bounds the influence of heavy-searcher users.
  7. Aggregate into lift. Report separately for clicks and bookings.
  8. Test significance with a winning-indicator t-test (fast, CLT-backed) or bootstrap percentile (non-parametric, slower).
  9. Promote or kill the candidate:
     • Significant positive lift on both clicks and bookings → run a full A/B test on CVR uplift before launch.
     • Significant negative lift on either → kill the candidate; no A/B needed.
     • Mixed signals (click-lift +, booking-lift −) → flag for review; the candidate may be exploiting click-bait without increasing bookings.
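The interleave-attribute-score core of the structure above can be sketched as follows (team-draft variant; all function names and the toy data are illustrative, not Expedia's implementation):

```python
import random

def team_draft_interleave(ranking_a, ranking_b, k, rng=None):
    """Merge two rankings into one list of (item, source_team) slots.

    Team-draft: each round a coin flip decides which ranking drafts
    first; each ranking then contributes its best not-yet-used item.
    """
    rng = rng or random.Random()
    slots, used, ia, ib = [], set(), 0, 0
    while len(slots) < k and (ia < len(ranking_a) or ib < len(ranking_b)):
        order = ["A", "B"] if rng.random() < 0.5 else ["B", "A"]
        for team in order:
            ranking, i = (ranking_a, ia) if team == "A" else (ranking_b, ib)
            while i < len(ranking) and ranking[i] in used:
                i += 1                            # skip items the other team took
            if i < len(ranking):
                used.add(ranking[i])
                slots.append((ranking[i], team))  # tag slot provenance
                i += 1
            if team == "A":
                ia = i
            else:
                ib = i
            if len(slots) >= k:
                break
    return slots

def winning_indicator(slots, engaged_items):
    """Per-search indicator: +1 if A's attributed events > B's, -1 if
    fewer, 0 on a tie. engaged_items = clicked (or booked) item ids."""
    a = sum(1 for item, team in slots if team == "A" and item in engaged_items)
    b = sum(1 for item, team in slots if team == "B" and item in engaged_items)
    return (a > b) - (a < b)
```

A per-user aggregate (Expedia's default) then takes the majority vote over that user's per-search indicators before computing lift.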

Why this works (sensitivity mechanism)

Interleaving converts between-user variance (the variance that drowns subtle ranking-change signals in traditional A/B testing) into within-user paired variance. Each user's personal engagement tendency cancels between A and B because they see both on the same search. The statistical win is the same as paired-sample vs two-sample t-tests in classical statistics. See concepts/test-sensitivity.
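A synthetic illustration of the variance conversion (the numbers are invented; only the paired-vs-two-sample contrast matters):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
# Large between-user spread in engagement, small true A-over-B effect.
baseline = rng.normal(10.0, 5.0, n)                   # per-user engagement tendency
a_metric = baseline + 0.3 + rng.normal(0.0, 0.5, n)   # user's metric under A
b_metric = baseline + rng.normal(0.0, 0.5, n)         # same user's metric under B

def t_two_sample(x, y):
    """A/B-style statistic: between-user variance stays in the denominator."""
    se = np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    return (x.mean() - y.mean()) / se

def t_paired(x, y):
    """Interleaving-style statistic: the per-user baseline cancels in x - y."""
    d = x - y
    return d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
```

With these numbers the paired statistic comes out far larger than the two-sample one, because the wide baseline spread cancels in the per-user difference and only the narrow within-user noise remains.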

Operational outcome (Expedia, 2026-02-17)

Expedia Group Tech's lodging-search team reports:

  • On two deliberately-deteriorated treatments (pinning a random property to positions 5–10; reshuffling top slots), interleaving detects the regression "within a few days of data taking."
  • A/B testing on CVR uplift "fails to detect the negative effect of random pinning even with the full sample size."
  • Click events give a statistically-significant negative result "already after the first day of data taking" on the worst treatments.

Variants and tuning

  • Interleaving variant — balanced, team-draft, probabilistic. Team-draft is the classical default and the safest against position bias. Expedia does not disclose their choice.
  • Aggregation level — per-search vs per-user. Per-user bounds heavy-searcher influence.
  • Tie handling — zero, ±0.5, or other fractional encodings for the winning indicator.
  • Significance test — t-test (fast) or bootstrap (non-parametric). See patterns/t-test-over-bootstrap for when each applies.
  • Event types tracked — clicks only, bookings only, or both (recommended).
  • Minimum lift threshold — not every statistically-significant lift is practically significant; set a floor.
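The two significance options on per-user winning indicators can be sketched on synthetic data (the probabilities and sample sizes are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
# Per-user winning indicators: +1 (A won the user's majority vote),
# -1 (B won), 0 (tie). Synthetic slight A advantage -- not Expedia data.
w = rng.choice([-1, 0, 1], size=2000, p=[0.30, 0.30, 0.40])

# Option 1: one-sample t statistic against 0. Fast; leans on the CLT,
# so it needs enough traffic for the discrete indicator to normalize.
t_stat = w.mean() / (w.std(ddof=1) / np.sqrt(len(w)))

# Option 2: bootstrap percentile CI for the mean indicator.
# Non-parametric and slower; no normality assumption.
boot_means = np.array([rng.choice(w, size=len(w), replace=True).mean()
                       for _ in range(2000)])
ci_lo, ci_hi = np.percentile(boot_means, [2.5, 97.5])
# Declare a winner when |t_stat| clears the critical value, or when the
# bootstrap CI excludes 0 -- then apply the minimum-lift floor.
```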

Pre-conditions

  • Ranking experiments only. Treatments must be slot-composable. Does not apply to UI / pricing / flow changes.
  • Same candidate pool. A and B must rank the same underlying inventory; new-inventory A vs old-inventory B is not interleavable.
  • Per-slot provenance channel. The serving stack must tag and log each slot's source ranking end-to-end.
  • Sufficient traffic for CLT. At low QPS the t-test's normality assumption may not hold; use bootstrap at least until equivalence is verified for the surface.

Caveats

  • Position bias. If A and B win different positions systematically, position bias confounds attribution. Use team-draft or balanced variants; don't compute naive slot-position-agnostic attribution.
  • Blast radius. Every impressed user sees some of both rankings — a bad candidate exposes every user to some slots of it, unlike A/B where only the bad-arm cohort is affected.
  • Direction, not magnitude. Interleaving wins are "A > B" signals, not "A converts 2.3 % more than B." Launch decisions still need A/B CVR uplift for revenue accounting.
  • Click ≠ booking ≠ revenue. Divergent click-lift and booking-lift signals need human judgement; click-bait exploitation is a known failure mode.
  • Not for very low-traffic surfaces. The discrete winning-indicator distribution violates the t-test's CLT assumption at small n; bootstrap is safer, but at very low traffic even it may take impractically long to reach significance.

Relation to other patterns

Seen in

  • sources/2026-02-17-expedia-interleaving-for-accelerated-testing — Expedia's lodging-search interleaving framework with click + booking attribution, user-level aggregation, t-test significance as the fast substitute for bootstrap; reported large sensitivity gain over A/B CVR uplift for subtle ranking regressions.