
CONCEPT

Interleaving testing

Definition

Interleaving testing is an online-evaluation technique for ranking experiments. Instead of splitting users between two rankings (A/B testing), interleaving splits slots within a single ranked list shown to one user: some slots come from ranking A and the rest from ranking B. Per-event user feedback (clicks, bookings, add-to-cart, dwell) is then attributed back to the ranking that contributed each slot, and per-search or per-user preference is aggregated across the experiment population.

The output is not an absolute metric like CVR uplift; it's a direction-of-preference signal (lift). In exchange for losing magnitude, interleaving gains dramatic statistical sensitivity — smaller sample sizes detect real ranking differences, because every user contributes a paired comparison rather than an independent draw from one condition.

Mechanism

  1. Compute two candidate rankings A and B for the same query / user / context.
  2. Interleave slots into one displayed list, tagging each slot with its source (A or B). Common variants from the IR literature:
     • Balanced interleaving — alternate A and B top-down, resolving duplicates in favour of the higher-ranked list.
     • Team-draft interleaving — a coin flip per slot decides which list "drafts" its next unused result; duplicates are resolved randomly.
     • Probabilistic / optimised interleaving — slot-level probabilities weighted by expected information gain. Expedia's post does not disclose which variant they use.
  3. Display the interleaved list to the user.
  4. Attribute events (click, booking, view) to the source ranking of the slot the user interacted with.
  5. Declare a per-search winner: the ranking with more attributed events wins; equal counts are a tie.
  6. Aggregate wins across searches (or users) into a lift metric.
  7. Test significance via bootstrap or a t-test on the distribution of per-search winning indicators.
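The interleave and attribution steps above can be sketched as follows. This is a minimal illustration of the team-draft variant, not a disclosed Expedia implementation; `team_draft_interleave`, `per_search_winner`, and `lift` are hypothetical names, and the lift formula shown is one common convention:

```python
import random

def team_draft_interleave(ranking_a, ranking_b, k=10, rng=random):
    """Team-draft interleaving: a coin flip per slot decides which
    ranking "drafts" its highest-ranked not-yet-shown result."""
    interleaved, team, seen = [], [], set()

    def next_unseen(ranking):
        # Highest-ranked item of `ranking` not already displayed.
        for item in ranking:
            if item not in seen:
                return item
        return None

    while len(interleaved) < k:
        a_next, b_next = next_unseen(ranking_a), next_unseen(ranking_b)
        if a_next is None and b_next is None:
            break  # both candidate lists exhausted
        # Coin flip when both sides have a candidate; otherwise take
        # whichever side still has one.
        take_a = b_next is None or (a_next is not None and rng.random() < 0.5)
        item, src = (a_next, "A") if take_a else (b_next, "B")
        interleaved.append(item)
        team.append(src)  # per-slot source tag used for attribution
        seen.add(item)
    return interleaved, team

def per_search_winner(team, clicked_slots):
    # Attribute each event to the source ranking of its slot, then
    # declare the per-search winner (more attributed events wins).
    a = sum(team[i] == "A" for i in clicked_slots)
    b = sum(team[i] == "B" for i in clicked_slots)
    return "A" if a > b else ("B" if b > a else "tie")

def lift(winners):
    # Direction-of-preference lift over decided (non-tie) searches;
    # positive favours A, negative favours B.
    wins_a, wins_b = winners.count("A"), winners.count("B")
    decided = wins_a + wins_b
    return (wins_a - wins_b) / decided if decided else 0.0
```

A per-search winner indicator like this is also what the bootstrap or t-test in the final step operates on.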

Comparison to A/B testing

| | A/B testing | Interleaving |
| --- | --- | --- |
| Unit of randomisation | User (or session) | Slot within one list |
| Variance control | Between-user variance | Within-user (paired) |
| Metric reported | Absolute uplift (CVR, revenue, engagement) | Direction of preference (lift) |
| Sensitivity | Low on subtle ranking tweaks | High ("significantly more sensitive"*) |
| Applicability | Any treatment | Ranking treatments only |
| Launch decision | Usable directly | Screen only — needs follow-up A/B for revenue case |
| Sample needed | Large — often weeks | Small — often days |

*Quote from Expedia's 2026-02-17 post (sources/2026-02-17-expedia-interleaving-for-accelerated-testing).

Why the sensitivity gain

Interleaving eliminates between-user variance by making every user a paired comparison. The same user sees both candidates on the same search with the same intent; the user's overall engagement tendency (heavy clicker vs light clicker, repeat customer vs browser) cancels between A and B. In A/B testing, users in the A cohort and users in the B cohort are different people — their personal engagement variance goes into the denominator of the effect-size calculation and drowns small ranking differences.

This is the same reason paired-sample t-tests are more powerful than two-sample t-tests in classical statistics; interleaving applies the paired-design idea to the ranking-experimentation setting.
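A minimal simulation makes the variance argument concrete. All numbers below are hypothetical (between-user engagement sd 0.10, per-observation noise sd 0.05, true effect 0.02); the point is only that the paired estimator's standard error excludes the between-user spread, so it comes out several times smaller:

```python
import math
import random
import statistics

def standard_errors(n_users=2000, effect=0.02, seed=7):
    """Compare the standard error of a paired (interleaving-like)
    estimator against a two-sample (A/B-like) estimator."""
    rng = random.Random(seed)
    # Each user has an idiosyncratic baseline engagement level.
    base = [rng.gauss(0.30, 0.10) for _ in range(n_users)]
    noise = lambda: rng.gauss(0, 0.05)  # per-observation measurement noise

    # Paired design: the same user yields one observation under each
    # ranking; the user's baseline cancels in the difference.
    diffs = [(b + effect + noise()) - (b + noise()) for b in base]
    se_paired = statistics.stdev(diffs) / math.sqrt(n_users)

    # A/B design: disjoint cohorts; between-user baseline variance
    # stays in each arm and inflates the standard error.
    half = n_users // 2
    arm_a = [b + noise() for b in base[:half]]
    arm_b = [b + effect + noise() for b in base[half:]]
    se_ab = math.sqrt(statistics.variance(arm_a) / half
                      + statistics.variance(arm_b) / half)
    return se_paired, se_ab
```

With these assumed noise levels the paired standard error is roughly a third of the two-sample one, which is exactly the sensitivity gain the section describes.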

When to use

  • Ranking screens. You have many candidate ranking changes; you can't afford a multi-week A/B per candidate. Use interleaving to winnow quickly, then A/B the survivors for the launch decision.
  • Subtle ranking changes. Small tweaks (one-position shifts, minor feature-weight changes) that A/B can't detect with reasonable sample size are often decisively detectable by interleaving.
  • Deteriorating-treatment detection. Pinning a random property, reshuffling top slots — treatments whose negative effect is too small for A/B CVR to catch but large enough to harm users.

When not to use

  • Non-ranking experiments. UI / pricing / flow / notification experiments don't have slot-composable treatments.
  • Launch decisions. You need CVR uplift / revenue numbers to justify a launch to the business; interleaving tells you the direction only.
  • Very low-traffic surfaces. The winning-indicator distribution is discrete, and t-test assumptions may not hold until enough searches are collected; at very low scale the bootstrap is safer but often still underpowered.
  • Treatments that differ in item set, not just ranking. Interleaving assumes both A and B draw from the same candidate pool; if A can show items B can't (e.g., new inventory source), the comparison isn't apples-to-apples.

Caveats

  • Position bias. Slot position strongly affects click probability. If A and B systematically win different positions in the interleaved list, position bias confounds attribution. Team-draft and balanced variants are specifically designed to mitigate this; probabilistic variants explicitly model the click model. The post does not specify the variant.
  • No magnitude. You learn A > B but not by how much in CVR.
  • Click ≠ booking ≠ revenue. Different events have different sensitivities and different revenue proximity. Expedia tracks both clicks and bookings separately; clicks are denser and detect faster, bookings are closer to revenue.
  • Interleaving exposes users to two rankings on one search. If one candidate is dangerously bad, every impressed user sees some of it — unlike A/B, where only the bad-arm cohort is affected. (The patterns/interleaved-ranking-evaluation note discusses this blast-radius trade-off.)
