CONCEPT
Test sensitivity¶
Definition¶
Test sensitivity (closely related to statistical power) is the ability of an experimentation design to detect a real effect of a given size, at a given significance level, with a given sample size. A highly sensitive test detects small effects with less data; a low-sensitivity test needs more data (or misses the effect entirely).
For online product experiments:
- High sensitivity → detect subtle ranking / UI / algorithm changes in days with modest traffic.
- Low sensitivity → even large experiments at full traffic may fail to reach significance.
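To make "at a given significance level, with a given sample size" concrete, here is a minimal normal-approximation sample-size sketch for a two-proportion test. The baseline CVR and lift values are illustrative assumptions, not figures from any source:

```python
from statistics import NormalDist

def n_per_arm(p_base, lift_rel, alpha=0.05, power=0.8):
    """Approximate users per arm for a two-proportion z-test
    to detect a relative lift at the given alpha and power."""
    z = NormalDist().inv_cdf
    p1, p2 = p_base, p_base * (1 + lift_rel)
    z_a = z(1 - alpha / 2)   # two-sided significance threshold
    z_b = z(power)           # power requirement
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_a + z_b) ** 2 * var / (p1 - p2) ** 2

# A subtle ranking change: 3% baseline CVR, +0.5% relative lift.
subtle = n_per_arm(0.03, 0.005)
# A large change: same baseline, +5% relative lift.
large = n_per_arm(0.03, 0.05)
print(f"{subtle:,.0f} vs {large:,.0f} users per arm")
```

A 10× smaller effect needs roughly 100× the traffic — which is why subtle ranking changes are where low-sensitivity designs run out of road.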
Why it matters in practice¶
Experimentation is a bottleneck in product velocity. Sensitivity determines:
- Throughput — how many candidate changes can be evaluated per unit time.
- Floor of detectable change — some changes may be real but not economically detectable under low-sensitivity designs.
- Cost — long experiments tie up exposure that could be testing other candidates.
The interleaving sensitivity gain¶
Interleaving is dramatically more sensitive than classical A/B testing on CVR uplift for ranking experiments because it converts between-user variance into within-user (paired) variance — the same statistical win that makes paired-sample t-tests more powerful than two-sample t-tests in classical statistics.
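The between-user vs within-user variance point can be seen in a tiny simulation on synthetic data (the effect size, noise scales, and user-baseline model below are all illustrative assumptions):

```python
import random
import statistics

random.seed(0)
N, effect = 2000, 0.05

def outcome(user_baseline, treated):
    """A user's metric: a strong personal baseline plus a small treatment shift."""
    return user_baseline + (effect if treated else 0.0) + random.gauss(0, 0.2)

users = [random.gauss(0, 1) for _ in range(2 * N)]

# Unpaired (A/B): different users see each variant, so per-pair differences
# carry two users' baselines worth of variance.
a = [outcome(u, True) for u in users[:N]]
b = [outcome(u, False) for u in users[N:]]
unpaired_sd = statistics.stdev([x - y for x, y in zip(a, b)])

# Paired (interleaving-style): the same user sees both variants,
# so the personal baseline cancels out of the difference.
paired_sd = statistics.stdev(outcome(u, True) - outcome(u, False) for u in users[:N])

print(f"unpaired diff sd ≈ {unpaired_sd:.2f}, paired diff sd ≈ {paired_sd:.2f}")
```

With these parameters the paired differences are several times less noisy, which is exactly the variance that pairing removes; the same 0.05 effect that drowns in the unpaired spread stands out against the paired one.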
Expedia's 2026-02-17 post reports two deliberately-deteriorated ranking treatments as worked examples:
- Pinning a random property to a slot between positions 5 and 10. Expected impact: minimal but real regression.
- Randomly reshuffling a number of top slots. Expected impact: moderate regression.
Results:
- Interleaving "correctly detects the negative effects ... within a few days of data taking" for both treatments.
- A/B testing on CVR "fails to detect the negative effect of random pinning even with the full sample size."
This isn't a small factor — it's the difference between detectable and undetectable at realistic sample sizes.
The sensitivity / magnitude trade¶
Higher sensitivity usually comes with a cost — the metric measured is often further from revenue:
| Metric | Sensitivity | Revenue proximity |
|---|---|---|
| Click-based lift (interleaving) | Highest | Lowest |
| Booking-based lift (interleaving) | High | Medium |
| CVR uplift (A/B) | Low on subtle changes | High |
| Revenue uplift (A/B) | Lowest | Highest |
Expedia tracks click and booking lift separately specifically so the fast-detection and revenue-closest signals are both visible. Early experiments can be killed on click-lift regression; launch decisions go through the revenue-closest signal.
Factors that determine sensitivity¶
- Paired vs unpaired design. Paired designs (interleaving, within-subject A/B) gain 2–10× sensitivity by removing between-unit variance.
- Sample size. Sensitivity scales ~√n; doubling traffic shrinks the minimum detectable effect by a factor of ~1/√2 (about 29 % smaller).
- Metric choice. High-frequency events (clicks) yield more sensitive metrics than rare discrete events (bookings).
- Baseline variance. Low-variance metrics are more sensitive.
- Variance-reduction techniques — CUPED, stratification, covariate adjustment — trade design-complexity for sensitivity.
- Effect size. No design detects a truly zero effect; the true effect size sets the ceiling on what any test can find.
Caveats¶
- Sensitivity isn't everything. A sensitive metric that's disconnected from the business outcome (e.g., dwell-time lift when revenue falls) can mislead.
- More sensitive tests also detect more false signals if the significance threshold isn't adjusted for multiple comparisons — running hundreds of interleaving experiments per week without Bonferroni / FDR control inflates false-positive rates.
- Sensitivity ≠ repeatability. Two sensitive tests on the same change can disagree if traffic composition or inventory has shifted.
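The multiple-comparisons caveat has a standard remedy; a minimal Benjamini–Hochberg FDR sketch (the p-values below are made up for illustration):

```python
def benjamini_hochberg(pvals, q=0.05):
    """Return indices of hypotheses rejected under BH FDR control at level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0  # largest rank whose p-value clears the BH step-up threshold
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            k = rank
    return sorted(order[:k])

# Ten weekly experiments: a couple of small p-values among mostly nulls.
pvals = [0.001, 0.008, 0.04, 0.20, 0.35, 0.50, 0.62, 0.75, 0.88, 0.95]
print(benjamini_hochberg(pvals))  # → [0, 1]
```

BH is less conservative than Bonferroni (which at q=0.05 over ten tests would demand p ≤ 0.005 and reject only the first), making it the usual choice when running many sensitive experiments per week.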
Seen in¶
- sources/2026-02-17-expedia-interleaving-for-accelerated-testing — sensitivity reported as the headline reason to prefer interleaving over A/B for subtle ranking changes; Figure 5 of the post shows CI shrinkage vs sample size for lift (interleaving) vs CVR uplift (A/B), with interleaving converging substantially faster.
Related¶
- concepts/interleaving-testing — the technique that achieves the sensitivity gain for ranking experiments.
- concepts/lift-metric — the direction-only metric whose sensitivity is high.
- concepts/conversion-rate-uplift — the magnitude metric whose sensitivity is low for subtle ranking changes.
- patterns/interleaved-ranking-evaluation — the end-to-end application of interleaving for sensitivity.
- patterns/ab-test-rollout — the classical baseline whose sensitivity interleaving improves on (for ranking experiments).
- concepts/customer-driven-metrics — the broader question of which metric to measure; sensitivity is one axis, revenue-proximity is another.