Expedia — Interleaving for Accelerated Testing (2026-02-17)¶
Summary¶
Expedia Group's lodging search team uses interleaving — a ranking-experimentation technique that mixes results from two candidate rankings (A and B) into one list shown to a single user — as an accelerated alternative to A/B testing for search-ranking changes. Per-search user events (property-detail-page clicks and booking transactions) are attributed back to the source ranking that contributed each displayed slot; each search gets a winning variant (the ranking with more attributed events, with ties explicit). A lift metric aggregates wins across searches (or users); lift = 0 means no preference, positive lift prefers A, negative prefers B. Significance is established via a t-test on the distribution of winning indicators — yielding "virtually the same results as the bootstrapping [percentile method]" (concepts/bootstrap-percentile-method) but "considerably faster." Against deliberately-deteriorated ranking treatments (random-property pinning between positions 5–10; random reshuffling of top slots), interleaving "correctly detects the negative effects ... within a few days of data taking," while a conventional A/B test on CVR uplift "fails to detect the negative effect of random pinning even with the full sample size" — a large sensitivity gain. Click events were more sensitive than bookings: "statistically significant negative result already after the first day of data taking."
Key takeaways¶
- Interleaving ≠ A/B testing. A/B splits users; interleaving splits slots within a single list shown to one user, then attributes per-event preference. The metric measures direction of user preference, not the magnitude of CVR uplift a traditional A/B test would measure. (concepts/interleaving-testing)
- Two event types tracked separately: property-detail-page views (clicks) and booking transactions. Tracking both "improves our understanding of the impact of rankings to both conversion and click-through rates." Clicks are denser and more sensitive; bookings are rarer but closer to revenue.
- Lift metric — "lift metric equals 0 when A and B win an equal number of times, indicating no user preference between the two rankings" — normalised for ties; results "do not strongly depend on the normalization method." Default reporting is at the user level (comparing users who preferred A vs users who preferred B, with ties covering users who showed no preference).
- Fast significance via t-test on winning indicators. Bootstrapping the percentile method "works well as it doesn't make any assumptions on the underlying data, [but] it's slow in practice even if implemented in a distributed fashion." A t-test against zero of the winning-indicator distribution "yields virtually the same results as the bootstrapping approach but is considerably faster." (patterns/t-test-over-bootstrap)
- Sensitivity gain is large and production-visible. Against two deliberately-degraded treatments — pinning a random property to slots 5–10, and reshuffling the top slots — interleaving detects the regression "within a few days of data taking"; A/B testing on CVR "fails to detect the negative effect of random pinning even with the full sample size."
- Click events > booking events for early detection. Click is a denser, higher-frequency signal; for the deteriorated treatments click events "show a statistically significant negative result already after the first day." Bookings eventually catch up but take longer. The split is valuable because product decisions sometimes require the revenue-closest signal even if it's slower.
- Interleaving is for screening, not full rollout. The post is explicit that interleaving measures direction, not absolute CVR uplift. Production launches still measure revenue impact with conventional A/B tests — but the winnowing step (deciding which candidate rankings are worth a full A/B at all) is dramatically faster.
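The per-slot provenance that attribution relies on can be made concrete with team-draft interleaving, a standard variant from the information-retrieval literature. The post does not say which variant Expedia uses, so the following is an illustrative sketch, not their implementation:

```python
import random

def team_draft_interleave(ranking_a, ranking_b, k, rng=None):
    """Team-draft interleaving (one standard variant; not necessarily
    Expedia's): teams A and B alternately draft their highest-ranked
    not-yet-placed item, with ties in draft order broken by coin flip.
    Returns the interleaved list plus per-slot provenance ('A'/'B'),
    which is what per-event click/booking attribution needs."""
    rng = rng or random.Random()
    lists = {"A": list(ranking_a), "B": list(ranking_b)}
    picks = {"A": 0, "B": 0}
    interleaved, provenance, seen = [], [], set()
    while len(interleaved) < k:
        # The team that has drafted fewer items picks next.
        if picks["A"] != picks["B"]:
            team = "A" if picks["A"] < picks["B"] else "B"
        else:
            team = rng.choice(["A", "B"])
        # Take the team's highest-ranked item not already placed.
        candidates = [x for x in lists[team] if x not in seen]
        if not candidates:
            break  # simplified handling of an exhausted ranking
        item = candidates[0]
        seen.add(item)
        interleaved.append(item)
        provenance.append(team)
        picks[team] += 1
    return interleaved, provenance
```

Because the draft alternates, neither ranking can contribute more than one extra slot, which is what keeps position bias roughly balanced between A and B.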
Systems extracted¶
- systems/expedia-lodging-ranker — the subject-of-measurement: Expedia's lodging search-results ranking algorithm. The post doesn't disclose its architecture (features, model family, serving stack) — only the experimentation harness sitting around it. Canonical wiki reference for the interleaved ranking evaluation pattern at Expedia.
Concepts extracted¶
- concepts/interleaving-testing — the technique: mix two rankings' results into one displayed list per user, attribute per-event preference back to the source ranking, aggregate across searches/users. Standard variants in the information-retrieval literature (team-draft interleaving, balanced interleaving, probabilistic) are not explicitly named in the post; Expedia describes the technique at the abstract level. Canonical wiki reference.
- concepts/lift-metric — the normalised wins-minus-losses measurement with explicit tie accounting. Positive lift favours A, negative favours B, zero means no preference. Reportable per-search or per-user (user level is Expedia's default).
- concepts/winning-indicator-t-test — fast significance test: t-test against zero of the distribution of winning indicators across searches (or users). Trades strict non-parametric guarantees for speed; Expedia reports near-identical results to bootstrap.
- concepts/bootstrap-percentile-method — the non-parametric significance-testing baseline: resample with replacement, compute the metric per resample, take empirical 2.5/97.5 percentiles as the confidence interval. "Works well as it doesn't make any assumptions on the underlying data, [but] it's slow in practice even if implemented in a distributed fashion."
- concepts/test-sensitivity — statistical power to detect a real effect. The post contrasts interleaving ("significantly more sensitive than A/B testing") with A/B testing of CVR uplift ("fails to detect the negative effect of random pinning even with the full sample size"). Not to be confused with metric proximity to revenue — interleaving trades revenue-proximity for sensitivity.
- concepts/conversion-rate-uplift — what a standard A/B test measures: absolute change in CVR (conversion rate) between treatment and control. Expedia's critique: for subtle ranking changes, the CVR uplift signal is too small to reach significance before the experiment is abandoned.
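The winning indicator and lift metric admit a compact sketch. The exact normalisation is not disclosed (and the post says results do not strongly depend on it), so the tie handling below is one plausible choice:

```python
def winning_indicator(events_a, events_b):
    """Per-search (or per-user) winning indicator: +1 if ranking A's
    slots drew more attributed events, -1 if B's did, 0 for a tie."""
    return (events_a > events_b) - (events_a < events_b)

def lift(indicators):
    """Aggregate lift: net A-wins divided by the total count including
    ties (one plausible normalisation). 0 = no preference, positive
    prefers A, negative prefers B."""
    return sum(indicators) / len(indicators)
```

For example, four searches with attributed (A, B) event counts of (3, 1), (0, 2), (1, 1), (5, 0) give indicators [+1, -1, 0, +1] and a lift of 0.25, a mild preference for A.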
Patterns extracted¶
- patterns/interleaved-ranking-evaluation — the full loop: produce two candidate rankings, interleave their results into one list per user, attribute per-event preference to the source ranking, aggregate winning variants into a lift metric, t-test the distribution of winning indicators against zero. Use click events for early detection and booking events for closer-to-revenue signal; track both separately.
- patterns/t-test-over-bootstrap — the specific substitution Expedia highlights: replace bootstrap-percentile confidence intervals with a t-test on the winning-indicator distribution for "virtually the same results ... [but] considerably faster." Works here because the per-search winning indicator has a well-defined mean whose distribution approaches normal by CLT at the scale of a search platform.
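The speed-vs-assumptions trade can be sketched by computing both intervals on the same winning-indicator sample. This is a simulation with made-up preference probabilities, not Expedia's data, and the exact test statistic they use is not disclosed:

```python
import math
import random

def t_interval(x, z=1.96):
    """~95% CI for the mean via the one-sample t statistic (the normal
    critical value is fine at platform-scale n). O(n) — the
    'considerably faster' option."""
    n = len(x)
    m = sum(x) / n
    s2 = sum((v - m) ** 2 for v in x) / (n - 1)
    half = z * math.sqrt(s2 / n)
    return m - half, m + half

def bootstrap_percentile_interval(x, n_boot=1000, rng=None):
    """95% bootstrap percentile CI: resample with replacement, recompute
    the mean per resample, take the empirical 2.5/97.5 percentiles.
    O(n * n_boot) — the slow non-parametric baseline."""
    rng = rng or random.Random()
    n = len(x)
    means = sorted(sum(rng.choices(x, k=n)) / n for _ in range(n_boot))
    return means[int(0.025 * n_boot)], means[int(0.975 * n_boot)]
```

On a simulated sample of per-search winning indicators the two intervals all but coincide, mirroring the post's "virtually the same results" observation, while the t-interval needs a single pass over the data.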
Operational numbers¶
- Treatments tested: (i) pin a random property to a slot between positions 5–10; (ii) randomly reshuffle the top slots.
- Time-to-detect — interleaving: "within a few days of data taking" for both treatments. For clicks: "statistically significant negative result already after the first day."
- Time-to-detect — A/B on CVR: "fails to detect the negative effect of random pinning even with the full sample size."
- Sample-size scaling: Figure 5 in the post plots confidence intervals for lift (interleaving) vs CVR uplift (A/B) as a function of sample size — interleaving's CI shrinks much faster than A/B's, for the same exposure.
- Normalisation sensitivity: "the results do not strongly depend on the normalization method" for the lift metric's tie-handling.
- No disclosure of: QPS / daily search volume, CVR baselines, absolute t-statistic values, confidence levels used, lodging-ranker model family / features / serving latency.
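The two deliberately-degraded treatments are simple to reproduce. These are hypothetical reconstructions of the post's one-line descriptions — it does not specify, for instance, whether the pinned property is drawn from the result list itself:

```python
import random

def pin_random_property(ranking, lo=5, hi=10, rng=None):
    """Treatment (i): take a random property and pin it to a random slot
    between positions 5 and 10 (1-indexed). Assumes the pinned property
    comes from the ranking itself, which the post does not spell out."""
    rng = rng or random.Random()
    r = list(ranking)
    item = r.pop(rng.randrange(len(r)))
    r.insert(rng.randint(lo, hi) - 1, item)
    return r

def reshuffle_top(ranking, k=10, rng=None):
    """Treatment (ii): randomly reshuffle the top-k slots, leaving the
    tail of the ranking untouched."""
    rng = rng or random.Random()
    r = list(ranking)
    top = r[:k]
    rng.shuffle(top)
    return top + r[k:]
```

Both treatments preserve the result set and only perturb order, which is exactly the kind of subtle ranking regression the post argues a CVR-based A/B test struggles to detect.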
Caveats¶
- Technique restricted to ranking experiments. Interleaving requires two candidate rankings whose results can be mixed into a single list per user. It doesn't apply to UI / flow / pricing experiments where the treatment is not slot-composable.
- Direction, not magnitude. Interleaving tells you which ranking users prefer; it doesn't measure the CVR uplift that makes the business case for a launch. Production launches still need A/B tests for revenue accounting.
- Tracking bias. Attributing clicks / bookings to the source ranking requires a per-slot provenance channel. If slots from A and B differ in position (which they always do under interleaving), position-bias confounds the attribution — standard interleaving literature addresses this with team-draft or balanced-interleaving variants, but Expedia's post does not specify which variant they use.
- T-test assumes approximately-normal means. At small scale the winning indicator distribution is highly discrete (−1 / 0 / +1 per search with tie cases in between). The t-test's near-identical results to bootstrap are reported at production scale; the claim may not hold for early-stage or low-traffic product surfaces.
- Click ≠ booking ≠ revenue. Click and booking results are reported separately, but Expedia does not disclose whether click-based interleaving wins are trusted enough to launch from alone, or whether bookings must agree before launch.
- No disclosure of the interleaving variant, the exact t-test statistic formulation, or the sample sizes at which the t-test-vs-bootstrap equivalence was verified.
Source¶
- Original: https://medium.com/expedia-group-tech/interleaving-for-accelerated-testing-75adc644027b?source=rss----38998a53046f---4
- Raw markdown:
raw/expedia/2026-02-17-interleaving-for-accelerated-testing-e4435b36.md
Related¶
- companies/expedia — Expedia Group Tech company page.
- concepts/interleaving-testing — the core technique.
- concepts/lift-metric / concepts/winning-indicator-t-test / concepts/bootstrap-percentile-method — the statistical machinery.
- concepts/test-sensitivity / concepts/conversion-rate-uplift — why interleaving wins: more sensitive than A/B on subtle ranking changes.
- patterns/interleaved-ranking-evaluation — the end-to-end experimentation loop.
- patterns/t-test-over-bootstrap — the speed-vs-assumptions trade in the significance-testing step.
- patterns/ab-test-rollout — still required for the final launch decision after an interleaving screen passes.
- systems/expedia-lodging-ranker — the subject system.