# Interleaving testing

## Definition
Interleaving testing is an online-evaluation technique for ranking experiments. Instead of splitting users between two rankings (A/B testing), interleaving splits slots within a single ranked list shown to one user: a subset of slots come from ranking A, others from ranking B. Per-event user feedback (clicks, bookings, add-to-cart, dwell) is then attributed back to the ranking that contributed each slot, and per-search or per-user preference is aggregated across the experiment population.
The output is not an absolute metric like CVR uplift; it's a direction-of-preference signal (lift). In exchange for losing magnitude, interleaving gains dramatic statistical sensitivity — smaller sample sizes detect real ranking differences, because every user contributes a paired comparison rather than an independent draw from one condition.
## Mechanism
- Two candidate rankings A and B are computed for the same query / user / context.
- Interleave slots into one displayed list, tagging each slot with its source (A or B). Common variants from the IR literature:
  - Balanced interleaving — alternate A and B top-down, resolving duplicates in favour of the higher-ranked list.
  - Team-draft interleaving — coin flip per slot decides which list "drafts" the next unused result; duplicates resolved randomly.
  - Probabilistic / optimised interleaving — slot-level probability weighted by expected information gain. Expedia's post does not disclose which variant they use.
- Display the interleaved list to the user.
- Attribute events (click, booking, view) to the source ranking of the slot the user interacted with.
- Declare a per-search winner: ranking with more attributed events wins; equal → tie.
- Aggregate wins across searches (or users) into a lift metric.
- Test significance via bootstrap or t-test on the distribution of per-search winning indicators.
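The post does not disclose which variant Expedia uses; as a concrete illustration of the loop above, here is a minimal team-draft sketch in Python. Function names and the click-based winner rule are illustrative, not taken from the source.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, k, rng=random):
    """Team-draft interleaving: the team with fewer drafted slots
    (coin flip on ties) places its highest-ranked unused result."""
    shown, source = [], []        # source[i] says which ranking filled slot i
    picks = {"A": 0, "B": 0}
    placed = set()
    while len(shown) < k:
        next_a = next((x for x in ranking_a if x not in placed), None)
        next_b = next((x for x in ranking_b if x not in placed), None)
        if next_a is None and next_b is None:
            break                 # both lists exhausted
        a_drafts = (picks["A"] < picks["B"]) or (
            picks["A"] == picks["B"] and rng.random() < 0.5)
        if next_b is None or (a_drafts and next_a is not None):
            shown.append(next_a); source.append("A"); picks["A"] += 1
        else:
            shown.append(next_b); source.append("B"); picks["B"] += 1
        placed.add(shown[-1])
    return shown, source

def search_winner(source, clicked_slots):
    """Per-search winning indicator: +1 if A got more attributed
    clicks, -1 if B did, 0 on a tie."""
    a = sum(1 for i in clicked_slots if source[i] == "A")
    b = sum(1 for i in clicked_slots if source[i] == "B")
    return (a > b) - (a < b)
```

Aggregating the winning indicators across searches gives the lift; a one-sample significance test of their mean against zero is the final step.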
## Comparison to A/B testing
| | A/B testing | Interleaving |
|---|---|---|
| Unit of randomisation | User (or session) | Slot within one list |
| Variance control | Between-user variance | Within-user (paired) |
| Metric reported | Absolute uplift (CVR, revenue, engagement) | Direction of preference (lift) |
| Sensitivity | Low on subtle ranking tweaks | High ("significantly more sensitive"*) |
| Applicability | Any treatment | Ranking treatments only |
| Launch decision | Usable directly | Screen only — needs follow-up A/B for revenue case |
| Sample needed | Large — often weeks | Small — often days |
*Quote from Expedia's 2026-02-17 post (sources/2026-02-17-expedia-interleaving-for-accelerated-testing).
## Why the sensitivity gain
Interleaving eliminates between-user variance by making every user a paired comparison. The same user sees both candidates on the same search with the same intent; the user's overall engagement tendency (heavy clicker vs light clicker, repeat customer vs browser) cancels between A and B. In A/B testing, users in the A cohort and users in the B cohort are different people — their personal engagement variance goes into the denominator of the effect-size calculation and drowns small ranking differences.
This is the same reason paired-sample t-tests are more powerful than two-sample t-tests in classical statistics; interleaving applies the paired-design idea to the ranking-experimentation setting.
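A quick simulation makes the variance argument concrete (all numbers here are illustrative, not from the post): give each simulated user their own engagement level, then compare the t-statistic a paired design produces against a two-sample split on the same data.

```python
import math, random, statistics

def paired_vs_two_sample(n_users=400, delta=0.05, seed=1):
    rng = random.Random(seed)
    base = [rng.gauss(1.0, 0.5) for _ in range(n_users)]   # per-user engagement level
    a = [u + rng.gauss(0, 0.1) for u in base]              # metric under ranking A
    b = [u + delta + rng.gauss(0, 0.1) for u in base]      # metric under ranking B

    # Paired (interleaving-like): differencing cancels the per-user base level.
    diffs = [bi - ai for ai, bi in zip(a, b)]
    t_paired = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n_users))

    # Two-sample (A/B-like): different people in each arm, so the
    # between-user variance stays in the denominator.
    half = n_users // 2
    arm_a, arm_b = a[:half], b[half:]
    se = math.sqrt(statistics.variance(arm_a) / half + statistics.variance(arm_b) / half)
    t_two_sample = (statistics.mean(arm_b) - statistics.mean(arm_a)) / se
    return t_paired, t_two_sample
```

With these parameters the paired t-statistic comes out far larger than the two-sample one: the same 0.05 effect is decisive under pairing and lost in between-user noise under a split.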
## When to use
- Ranking screens. You have many candidate ranking changes; you can't afford a multi-week A/B per candidate. Use interleaving to winnow quickly, then A/B the survivors for the launch decision.
- Subtle ranking changes. Small tweaks (one-position shifts, minor feature-weight changes) that A/B can't detect with reasonable sample size are often decisively detectable by interleaving.
- Deteriorating-treatment detection. Pinning a random property, reshuffling top slots — treatments whose negative effect is too small for A/B CVR to catch but large enough to harm users.
## When not to use
- Non-ranking experiments. UI / pricing / flow / notification experiments don't have slot-composable treatments.
- Launch decisions. You need CVR uplift / revenue numbers to justify a launch to the business; interleaving tells you the direction only.
- Very low-traffic surfaces. The winning-indicator distribution is discrete, and the t-test assumptions may not hold until enough searches are collected; at very low scale the bootstrap is safer, but significance is often unreachable either way.
- Treatments that differ in item set, not just ranking. Interleaving assumes both A and B draw from the same candidate pool; if A can show items B can't (e.g., new inventory source), the comparison isn't apples-to-apples.
## Caveats
- Position bias. Slot position strongly affects click probability. If A and B systematically win different positions in the interleaved list, position bias confounds attribution. Team-draft and balanced variants are specifically designed to mitigate this; probabilistic variants explicitly model the click model. The post does not specify the variant.
- No magnitude. You learn A > B but not by how much in CVR.
- Click ≠ booking ≠ revenue. Different events have different sensitivities and different revenue proximity. Expedia tracks both clicks and bookings separately; clicks are denser and detect faster, bookings are closer to revenue.
- Blast radius. Interleaving exposes users to two rankings on one search. If one candidate is dangerously bad, every impressed user sees some of it — unlike A/B, where only the bad-arm cohort is affected. (patterns/interleaved-ranking-evaluation notes this blast-radius trade-off.)
## Seen in
- sources/2026-02-17-expedia-interleaving-for-accelerated-testing — Expedia Group Tech's lodging-search application: clicks + bookings attributed separately, lift reported at the user level, significance via winning-indicator t-test as a faster substitute for bootstrap percentile method. Against deliberately-deteriorated rankings (random-property pinning, top-slot reshuffling), interleaving detects the regression within days; A/B on CVR fails to detect pinning at all.
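The significance step described above can be sketched as follows: a one-sample t-statistic on the winning indicators next to a bootstrap percentile interval. This is a minimal illustration, not Expedia's actual code.

```python
import math, random, statistics

def lift_t_stat(winners):
    """One-sample t-statistic for the mean winning indicator against 0.
    winners: one entry per search, +1 (A won), -1 (B won), 0 (tie)."""
    n = len(winners)
    return statistics.mean(winners) / (statistics.stdev(winners) / math.sqrt(n))

def lift_bootstrap_ci(winners, n_boot=2000, alpha=0.05, seed=0):
    """Bootstrap percentile confidence interval for the mean winning
    indicator; far slower than the closed-form t-statistic."""
    rng = random.Random(seed)
    n = len(winners)
    means = sorted(statistics.mean(rng.choices(winners, k=n)) for _ in range(n_boot))
    return means[int(n_boot * alpha / 2)], means[int(n_boot * (1 - alpha / 2)) - 1]
```

If the t-statistic clears the usual threshold (|t| > 1.96 at the 5% level), or the bootstrap interval excludes zero, the preference is significant — the t-test gets there in one pass instead of thousands of resamples.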
## Related
- patterns/interleaved-ranking-evaluation — the full experimentation loop built on this concept.
- patterns/t-test-over-bootstrap — the significance-step speedup specific to the winning-indicator distribution.
- patterns/ab-test-rollout — complementary, not replaced by interleaving; launch decisions still go through A/B for revenue numbers.
- concepts/lift-metric — the aggregated preference measurement.
- concepts/test-sensitivity — the axis on which interleaving wins.
- concepts/conversion-rate-uplift — the axis on which A/B wins.
- concepts/customer-driven-metrics — the broader question of which metric to measure; interleaving and A/B answer different sub-questions.