
EXPEDIA 2026-02-17 Tier 3


Expedia — Interleaving for Accelerated Testing (2026-02-17)

Summary

Expedia Group's lodging search team uses interleaving — a ranking-experimentation technique that mixes results from two candidate rankings (A and B) into one list shown to a single user — as an accelerated alternative to A/B testing for search-ranking changes. Per-search user events (property-detail-page clicks and booking transactions) are attributed back to the source ranking that contributed each displayed slot; each search gets a winning variant (the ranking whose attributed events were higher, with ties explicit). A lift metric aggregates wins across searches (or users); lift = 0 means no preference, positive prefers A, negative prefers B. Significance is established via a t-test on the distribution of winning indicators — yielding "virtually the same results as the bootstrapping [percentile method]" (concepts/bootstrap-percentile-method) but "considerably faster." Against deliberately-deteriorated ranking treatments (random-property pinning between positions 5–10; random reshuffling of top slots), interleaving "correctly detects the negative effects ... within a few days of data taking," while a conventional A/B test on CVR uplift "fails to detect the negative effect of random pinning even with the full sample size" — a large sensitivity gain. Click events were more sensitive than bookings: "statistically significant negative result already after the first day of data taking."
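The mixing-and-attribution step described above can be sketched with team-draft interleaving, one standard variant from the IR literature. The post does not name which variant Expedia uses, so this is an illustrative sketch: the two rankings take turns drafting their highest not-yet-shown property, and each slot remembers which ranking contributed it so later clicks and bookings can be attributed back.

```python
import random

def team_draft_interleave(rank_a, rank_b, rng=None):
    """Mix rankings A and B into one displayed list, tagging each slot
    with the ranking that contributed it. Team-draft is one standard
    variant; the post describes interleaving only at the abstract level."""
    rng = rng or random.Random(0)
    merged, used = [], set()
    picks = {"A": 0, "B": 0}
    universe = set(rank_a) | set(rank_b)
    while len(used) < len(universe):
        # The ranking with fewer slots so far drafts next (coin flip on ties).
        if picks["A"] < picks["B"]:
            label = "A"
        elif picks["B"] < picks["A"]:
            label = "B"
        else:
            label = "A" if rng.random() < 0.5 else "B"
        source = rank_a if label == "A" else rank_b
        item = next((x for x in source if x not in used), None)
        if item is None:  # this ranking is exhausted: draft from the other
            label = "B" if label == "A" else "A"
            source = rank_a if label == "A" else rank_b
            item = next(x for x in source if x not in used)
        merged.append((item, label))
        used.add(item)
        picks[label] += 1
    return merged
```

A click on a slot tagged "A" then counts as an event for ranking A; per search, the ranking with more attributed events is the winning variant.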

Key takeaways

  1. Interleaving ≠ A/B testing. A/B splits users; interleaving splits slots within a single list shown to one user, then attributes per-event preference. The metric measures direction of user preference, not the magnitude of CVR uplift a traditional A/B test would measure. (concepts/interleaving-testing)
  2. Two event types tracked separately: property-detail-page views (clicks) and booking transactions. Tracking both "improves our understanding of the impact of rankings to both conversion and click-through rates." Clicks are denser and more sensitive; bookings are rarer but closer to revenue.
  3. Lift metric. The "lift metric equals 0 when A and B win an equal number of times, indicating no user preference between the two rankings" — normalised for ties; results "do not strongly depend on the normalization method." Default reporting is at the user level: users-who-preferred-A vs users-who-preferred-B, with users showing no preference counted as ties.
  4. Fast significance via t-test on winning indicators. Bootstrapping the percentile method "works well as it doesn't make any assumptions on the underlying data, [but] it's slow in practice even if implemented in a distributed fashion." A t-test against zero of the winning-indicator distribution "yields virtually the same results as the bootstrapping approach but is considerably faster." (patterns/t-test-over-bootstrap)
  5. Sensitivity gain is large and production-visible. Against two deliberately-degraded treatments — pinning a random property to slots 5–10, and reshuffling the top slots — interleaving detects the regression "within a few days of data taking"; A/B testing on CVR "fails to detect the negative effect of random pinning even with the full sample size."
  6. Click events > booking events for early detection. Click is a denser, higher-frequency signal; for the deteriorated treatments click events "show a statistically significant negative result already after the first day." Bookings eventually catch up but take longer. The split is valuable because product decisions sometimes require the revenue-closest signal even when it is slower.
  7. Interleaving is for screening, not full rollout. The post is explicit that interleaving measures direction, not absolute CVR uplift. Production rollouts still measure revenue impact with A/B rollouts — but the winnowing step (which candidate rankings are worth running a full A/B on at all) is dramatically faster.
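The lift metric from takeaway 3 can be sketched as wins-minus-losses over all preference outcomes. Putting ties only in the denominator is one plausible normalisation (the post says the results "do not strongly depend on the normalization method"):

```python
def lift(wins_a, wins_b, ties):
    """Lift metric sketch: 0 when A and B win equally often, positive
    prefers A, negative prefers B. Ties dilute the denominator here;
    this is one of several normalisations the post says give similar
    results."""
    total = wins_a + wins_b + ties
    if total == 0:
        return 0.0
    return (wins_a - wins_b) / total
```

At the user level, `wins_a` counts users who preferred A, `wins_b` users who preferred B, and `ties` users without a preference.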

Systems extracted

  • systems/expedia-lodging-ranker — the subject-of-measurement: Expedia's lodging search-results ranking algorithm. The post doesn't disclose its architecture (features, model family, serving stack) — only the experimentation harness sitting around it. Canonical wiki reference for the interleaved ranking evaluation pattern at Expedia.

Concepts extracted

  • concepts/interleaving-testing — the technique: mix two rankings' results into one displayed list per user, attribute per-event preference back to the source ranking, aggregate across searches/users. Standard variants in the information-retrieval literature (team-draft interleaving, balanced interleaving, probabilistic) are not explicitly named in the post; Expedia describes the technique at the abstract level. Canonical wiki reference.
  • concepts/lift-metric — the normalised wins-minus-losses measurement with explicit tie accounting. Positive lift favours A, negative favours B, zero means no preference. Reportable per-search or per-user (user level is Expedia's default).
  • concepts/winning-indicator-t-test — fast significance test: t-test against zero of the distribution of winning indicators across searches (or users). Trades strict non-parametric guarantees for speed; Expedia reports near-identical results to bootstrap.
  • concepts/bootstrap-percentile-method — the non-parametric significance-testing baseline: resample with replacement, compute the metric per resample, take empirical 2.5/97.5 percentiles as the confidence interval. "Works well as it doesn't make any assumptions on the underlying data, [but] it's slow in practice even if implemented in a distributed fashion."
  • concepts/test-sensitivity — statistical power to detect a real effect. The post contrasts interleaving ("significantly more sensitive than A/B testing") with A/B testing of CVR uplift ("fails to detect the negative effect of random pinning even with the full sample size"). Not to be confused with metric proximity to revenue — interleaving trades revenue-proximity for sensitivity.
  • concepts/conversion-rate-uplift — what a standard A/B test measures: absolute change in CVR (conversion rate) between treatment and control. Expedia's critique: for subtle ranking changes, the CVR uplift signal is too small to reach significance before the experiment is abandoned.
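The bootstrap-percentile baseline above is easy to state but expensive at platform scale, which motivates the t-test substitution. A minimal sketch, assuming the metric being bootstrapped is the mean of per-search winning indicators (function name and resample count are illustrative):

```python
import random

def bootstrap_percentile_ci(values, n_resamples=2000, alpha=0.05, rng=None):
    """Bootstrap percentile method: resample with replacement, recompute
    the metric (here the mean) per resample, and take the empirical
    2.5/97.5 percentiles as the confidence interval. Non-parametric but
    slow: cost scales with n_resamples * len(values)."""
    rng = rng or random.Random(0)
    n = len(values)
    means = sorted(
        sum(rng.choice(values) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

The interval excludes zero when the preference is significant; this is the check the t-test replaces at a fraction of the cost.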

Patterns extracted

  • patterns/interleaved-ranking-evaluation — the full loop: produce two candidate rankings, interleave their results into one list per user, attribute per-event preference to the source ranking, aggregate winning variants into a lift metric, t-test the distribution of winning indicators against zero. Use click events for early detection and booking events for closer-to-revenue signal; track both separately.
  • patterns/t-test-over-bootstrap — the specific substitution Expedia highlights: replace bootstrap-percentile confidence intervals with a t-test on the winning-indicator distribution for "virtually the same results ... [but] considerably faster." Works here because the per-search winning indicator has a well-defined mean whose distribution approaches normal by CLT at the scale of a search platform.
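The t-test substitution can be sketched as a one-sample t statistic of the winning-indicator distribution against zero. The post does not give Expedia's exact formulation; this assumes the textbook one-sample statistic over indicators coded +1 (A wins), -1 (B wins), 0 (tie):

```python
import math

def winning_indicator_t(indicators):
    """One-sample t statistic against zero for per-search winning
    indicators (+1 = A wins, -1 = B wins, 0 = tie). At search-platform
    sample sizes the CLT makes this track the bootstrap CI closely
    while avoiding thousands of resampling passes."""
    n = len(indicators)
    mean = sum(indicators) / n
    var = sum((x - mean) ** 2 for x in indicators) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)
```

A |t| above the critical value (≈1.96 at the 95% level, for large n) plays the same role as the bootstrap interval excluding zero.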

Operational numbers

  • Treatments tested: (i) pin a random property to a slot between positions 5–10; (ii) randomly reshuffle the top slots.
  • Time-to-detect — interleaving: "within a few days of data taking" for both treatments. For clicks: "statistically significant negative result already after the first day."
  • Time-to-detect — A/B on CVR: "fails to detect the negative effect of random pinning even with the full sample size."
  • Sample-size scaling: Figure 5 in the post plots confidence intervals for lift (interleaving) vs CVR uplift (A/B) as a function of sample size — interleaving's CI shrinks much faster than A/B's, for the same exposure.
  • Normalisation sensitivity: "the results do not strongly depend on the normalization method" for the lift metric's tie-handling.
  • No disclosure of: QPS / daily search volume, CVR baselines, absolute t-statistic values, confidence levels used, lodging-ranker model family / features / serving latency.
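Degraded treatment (i) can be sketched as follows, assuming "positions 5–10" means 1-based display slots and that the remaining order is preserved (the post does not specify either detail):

```python
import random

def pin_random_property(ranking, rng=None):
    """Deliberately-degraded treatment sketch: move one randomly chosen
    property to a random slot between display positions 5 and 10
    (1-based, i.e. 0-based indices 4..9), keeping relative order of the
    rest. Used as a known-negative to measure test sensitivity."""
    rng = rng or random.Random(0)
    out = list(ranking)
    prop = out.pop(rng.randrange(len(out)))
    out.insert(rng.randrange(4, 10), prop)
    return out
```

Treatment (ii), reshuffling the top slots, would similarly apply `rng.shuffle` to a leading slice; a sensitive test should flag both as losers quickly.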

Caveats

  • Technique restricted to ranking experiments. Interleaving requires two candidate rankings whose results can be mixed into a single list per user. It doesn't apply to UI / flow / pricing experiments where the treatment is not slot-composable.
  • Direction, not magnitude. Interleaving tells you which ranking users prefer; it doesn't measure the CVR uplift business case for a launch. Production rollouts still need A/B rollouts for revenue accounting.
  • Tracking bias. Attributing clicks / bookings to the source ranking requires a per-slot provenance channel. If slots from A and B differ in position (which they always do under interleaving), position-bias confounds the attribution — standard interleaving literature addresses this with team-draft or balanced-interleaving variants, but Expedia's post does not specify which variant they use.
  • T-test assumes approximately-normal means. At small scale the winning indicator distribution is highly discrete (−1 / 0 / +1 per search with tie cases in between). The t-test's near-identical results to bootstrap are reported at production scale; the claim may not hold for early-stage or low-traffic product surfaces.
  • Click ≠ booking ≠ revenue. Click and booking results are tracked and reported separately, but Expedia does not disclose whether it trusts click-based interleaving wins enough to launch from them, or always requires bookings to agree before launch.
  • No disclosure of the interleaving variant, of the exact t-test statistic formulation, nor of the sample sizes at which the t-test-vs-bootstrap equivalence was verified.
