
CONCEPT

Interleaving testing

Definition

Interleaving testing is an online-evaluation technique for ranking experiments. Instead of splitting users between two rankings (A/B testing), interleaving splits slots within a single ranked list shown to one user: some slots come from ranking A and the rest from ranking B. Per-event user feedback (clicks, bookings, add-to-cart, dwell) is then attributed back to the ranking that contributed each slot, and per-search or per-user preference is aggregated across the experiment population.

The output is not an absolute metric like CVR uplift; it's a direction-of-preference signal (lift). In exchange for losing magnitude, interleaving gains dramatic statistical sensitivity — smaller sample sizes detect real ranking differences, because every user contributes a paired comparison rather than an independent draw from one condition.

Mechanism

  1. Compute two candidate rankings A and B for the same query / user / context.
  2. Interleave slots into one displayed list, tagging each slot with its source (A or B). Common variants from the IR literature:
     • Balanced interleaving — alternate A and B top-down, resolving duplicates in favour of the higher-ranked list.
     • Team-draft interleaving — a coin flip per slot decides which list "drafts" its next unused result; duplicates are resolved randomly.
     • Probabilistic / optimised interleaving — slot-level probabilities weighted by expected information gain. Expedia's post does not disclose which variant they use.
  3. Display the interleaved list to the user.
  4. Attribute events (click, booking, view) to the source ranking of the slot the user interacted with.
  5. Declare a per-search winner: the ranking with more attributed events wins; equal counts are a tie.
  6. Aggregate wins across searches (or users) into a lift metric.
  7. Test significance via bootstrap or a t-test on the distribution of per-search winning indicators.
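The interleave and attribution steps above can be sketched as follows. This is a minimal illustration of the team-draft variant, not a disclosed Expedia implementation; `team_draft_interleave`, `per_search_winner`, and `lift` are hypothetical names, and the lift formula shown is one common convention:

```python
import random

def team_draft_interleave(ranking_a, ranking_b, k=10, rng=random):
    """Team-draft interleaving: a coin flip per slot decides which
    ranking "drafts" its highest-ranked not-yet-shown result."""
    interleaved, team, seen = [], [], set()

    def next_unseen(ranking):
        # Highest-ranked item of `ranking` not already displayed.
        for item in ranking:
            if item not in seen:
                return item
        return None

    while len(interleaved) < k:
        a_next, b_next = next_unseen(ranking_a), next_unseen(ranking_b)
        if a_next is None and b_next is None:
            break  # both candidate lists exhausted
        # Coin flip when both sides have a candidate; otherwise take
        # whichever side still has one.
        take_a = b_next is None or (a_next is not None and rng.random() < 0.5)
        item, src = (a_next, "A") if take_a else (b_next, "B")
        interleaved.append(item)
        team.append(src)  # per-slot source tag used for attribution
        seen.add(item)
    return interleaved, team

def per_search_winner(team, clicked_slots):
    # Attribute each event to the source ranking of its slot, then
    # declare the per-search winner (more attributed events wins).
    a = sum(team[i] == "A" for i in clicked_slots)
    b = sum(team[i] == "B" for i in clicked_slots)
    return "A" if a > b else ("B" if b > a else "tie")

def lift(winners):
    # Direction-of-preference lift over decided (non-tie) searches;
    # positive favours A, negative favours B.
    wins_a, wins_b = winners.count("A"), winners.count("B")
    decided = wins_a + wins_b
    return (wins_a - wins_b) / decided if decided else 0.0
```

A per-search winner indicator like this is also what the bootstrap or t-test in the final step operates on.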

Comparison to A/B testing

| | A/B testing | Interleaving |
| --- | --- | --- |
| Unit of randomisation | User (or session) | Slot within one list |
| Variance control | Between-user variance | Within-user (paired) |
| Metric reported | Absolute uplift (CVR, revenue, engagement) | Direction of preference (lift) |
| Sensitivity | Low on subtle ranking tweaks | High ("significantly more sensitive"*) |
| Applicability | Any treatment | Ranking treatments only |
| Launch decision | Usable directly | Screen only — needs follow-up A/B for revenue case |
| Sample needed | Large — often weeks | Small — often days |

*Quote from Expedia's 2026-02-17 post (sources/2026-02-17-expedia-interleaving-for-accelerated-testing).

Why the sensitivity gain

Interleaving eliminates between-user variance by making every user a paired comparison. The same user sees both candidates on the same search with the same intent; the user's overall engagement tendency (heavy clicker vs light clicker, repeat customer vs browser) cancels between A and B. In A/B testing, users in the A cohort and users in the B cohort are different people — their personal engagement variance goes into the denominator of the effect-size calculation and drowns small ranking differences.

This is the same reason paired-sample t-tests are more powerful than two-sample t-tests in classical statistics; interleaving applies the paired-design idea to the ranking-experimentation setting.
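A minimal simulation makes the variance argument concrete. All numbers below are hypothetical (between-user engagement sd 0.10, per-observation noise sd 0.05, true effect 0.02); the point is only that the paired estimator's standard error excludes the between-user spread, so it comes out several times smaller:

```python
import math
import random
import statistics

def standard_errors(n_users=2000, effect=0.02, seed=7):
    """Compare the standard error of a paired (interleaving-like)
    estimator against a two-sample (A/B-like) estimator."""
    rng = random.Random(seed)
    # Each user has an idiosyncratic baseline engagement level.
    base = [rng.gauss(0.30, 0.10) for _ in range(n_users)]
    noise = lambda: rng.gauss(0, 0.05)  # per-observation measurement noise

    # Paired design: the same user yields one observation under each
    # ranking; the user's baseline cancels in the difference.
    diffs = [(b + effect + noise()) - (b + noise()) for b in base]
    se_paired = statistics.stdev(diffs) / math.sqrt(n_users)

    # A/B design: disjoint cohorts; between-user baseline variance
    # stays in each arm and inflates the standard error.
    half = n_users // 2
    arm_a = [b + noise() for b in base[:half]]
    arm_b = [b + effect + noise() for b in base[half:]]
    se_ab = math.sqrt(statistics.variance(arm_a) / half
                      + statistics.variance(arm_b) / half)
    return se_paired, se_ab
```

With these assumed noise levels the paired standard error is roughly a third of the two-sample one, which is exactly the sensitivity gain the section describes.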

When to use

  • Ranking screens. You have many candidate ranking changes; you can't afford a multi-week A/B per candidate. Use interleaving to winnow quickly, then A/B the survivors for the launch decision.
  • Subtle ranking changes. Small tweaks (one-position shifts, minor feature-weight changes) that A/B can't detect with reasonable sample size are often decisively detectable by interleaving.
  • Deteriorating-treatment detection. Pinning a random property, reshuffling top slots — treatments whose negative effect is too small for A/B CVR to catch but large enough to harm users.

When not to use

  • Non-ranking experiments. UI / pricing / flow / notification experiments don't have slot-composable treatments.
  • Launch decisions. You need CVR uplift / revenue numbers to justify a launch to the business; interleaving tells you the direction only.
  • Very low-traffic surfaces. The winning-indicator distribution is discrete, and t-test assumptions may not hold until enough searches are collected; at very low scale the bootstrap is safer but often still underpowered.
  • Treatments that differ in item set, not just ranking. Interleaving assumes both A and B draw from the same candidate pool; if A can show items B can't (e.g., new inventory source), the comparison isn't apples-to-apples.

Caveats

  • Position bias. Slot position strongly affects click probability. If A and B systematically win different positions in the interleaved list, position bias confounds attribution. Team-draft and balanced variants are specifically designed to mitigate this; probabilistic variants explicitly model the click model. The post does not specify the variant.
  • No magnitude. You learn A > B but not by how much in CVR.
  • Click ≠ booking ≠ revenue. Different events have different sensitivities and different revenue proximity. Expedia tracks both clicks and bookings separately; clicks are denser and detect faster, bookings are closer to revenue.
  • Interleaving exposes users to two rankings on one search. If one candidate is dangerously bad, every impressed user sees some of it — unlike A/B, where only the bad-arm cohort is affected. (The patterns/interleaved-ranking-evaluation note discusses this blast-radius trade-off.)
