CONCEPT
Test sensitivity¶
Definition¶
Test sensitivity (closely related to statistical power) is the ability of an experimentation design to detect a real effect of a given size, at a given significance level, with a given sample size. A highly sensitive test detects small effects with less data; a low-sensitivity test needs more data (or misses the effect entirely).
For online product experiments:
- High sensitivity → detect subtle ranking / UI / algorithm changes in days with modest traffic.
- Low sensitivity → even large experiments at full traffic may fail to reach significance.
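To make "at a given significance level, with a given sample size" concrete, here is a minimal normal-approximation sample-size sketch for a two-proportion test. The baseline CVR and lift values are illustrative assumptions, not figures from any source:

```python
from statistics import NormalDist

def n_per_arm(p_base, lift_rel, alpha=0.05, power=0.8):
    """Approximate users per arm for a two-proportion z-test
    to detect a relative lift at the given alpha and power."""
    z = NormalDist().inv_cdf
    p1, p2 = p_base, p_base * (1 + lift_rel)
    z_a = z(1 - alpha / 2)   # two-sided significance threshold
    z_b = z(power)           # power requirement
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_a + z_b) ** 2 * var / (p1 - p2) ** 2

# A subtle ranking change: 3% baseline CVR, +0.5% relative lift.
subtle = n_per_arm(0.03, 0.005)
# A large change: same baseline, +5% relative lift.
large = n_per_arm(0.03, 0.05)
print(f"{subtle:,.0f} vs {large:,.0f} users per arm")
```

A 10× smaller effect needs roughly 100× the traffic — which is why subtle ranking changes are where low-sensitivity designs run out of road.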
Why it matters in practice¶
Experimentation is a bottleneck in product velocity. Sensitivity determines:
- Throughput — how many candidate changes can be evaluated per unit time.
- Floor of detectable change — some changes may be real but not economically detectable under low-sensitivity designs.
- Cost — long experiments tie up exposure that could be testing other candidates.
The interleaving sensitivity gain¶
Interleaving is dramatically more sensitive than classical A/B testing on CVR uplift for ranking experiments because it converts between-user variance into within-user (paired) variance — the same statistical win that makes paired-sample t-tests more powerful than two-sample t-tests in classical statistics.
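The between-user vs within-user variance point can be seen in a tiny simulation on synthetic data (the effect size, noise scales, and user-baseline model below are all illustrative assumptions):

```python
import random
import statistics

random.seed(0)
N, effect = 2000, 0.05

def outcome(user_baseline, treated):
    """A user's metric: a strong personal baseline plus a small treatment shift."""
    return user_baseline + (effect if treated else 0.0) + random.gauss(0, 0.2)

users = [random.gauss(0, 1) for _ in range(2 * N)]

# Unpaired (A/B): different users see each variant, so per-pair differences
# carry two users' baselines worth of variance.
a = [outcome(u, True) for u in users[:N]]
b = [outcome(u, False) for u in users[N:]]
unpaired_sd = statistics.stdev([x - y for x, y in zip(a, b)])

# Paired (interleaving-style): the same user sees both variants,
# so the personal baseline cancels out of the difference.
paired_sd = statistics.stdev(outcome(u, True) - outcome(u, False) for u in users[:N])

print(f"unpaired diff sd ≈ {unpaired_sd:.2f}, paired diff sd ≈ {paired_sd:.2f}")
```

With these parameters the paired differences are several times less noisy, which is exactly the variance that pairing removes; the same 0.05 effect that drowns in the unpaired spread stands out against the paired one.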
Expedia's 2026-02-17 post reports two deliberately-deteriorated ranking treatments as worked examples:
- Pinning a random property to a slot between positions 5 and 10. Expected impact: minimal but real regression.
- Randomly reshuffling a number of top slots. Expected impact: moderate regression.
Results:
- Interleaving "correctly detects the negative effects ... within a few days of data taking" for both treatments.
- A/B testing on CVR "fails to detect the negative effect of random pinning even with the full sample size."
This isn't a small factor — it's the difference between detectable and undetectable at realistic sample sizes.
The sensitivity / magnitude trade¶
Higher sensitivity usually comes with a cost — the metric measured is often further from revenue:
| Metric | Sensitivity | Revenue proximity |
|---|---|---|
| Click-based lift (interleaving) | Highest | Lowest |
| Booking-based lift (interleaving) | High | Medium |
| CVR uplift (A/B) | Low on subtle changes | High |
| Revenue uplift (A/B) | Lowest | Highest |
Expedia tracks click and booking lift separately specifically so the fast-detection and revenue-closest signals are both visible. Early experiments can be killed on click-lift regression; launch decisions go through the revenue-closest signal.
Factors that determine sensitivity¶
- Paired vs unpaired design. Paired designs (interleaving, within-subject A/B) gain 2–10× sensitivity by removing between-unit variance.
- Sample size. Sensitivity scales ~√n; doubling traffic shrinks the minimum detectable effect by a factor of ~1/√2 (about 29 % smaller).
- Metric choice. High-frequency events (clicks) yield more sensitive metrics than rare discrete events (bookings).
- Baseline variance. Low-variance metrics are more sensitive.
- Variance-reduction techniques — CUPED, stratification, covariate adjustment — trade design-complexity for sensitivity.
- Effect size. No design detects a truly zero effect; the true effect size sets the ceiling on what any test can find.
Caveats¶
- Sensitivity isn't everything. A sensitive metric that's disconnected from the business outcome (e.g., dwell-time lift when revenue falls) can mislead.
- More sensitive tests also detect more false signals if the significance threshold isn't adjusted for multiple comparisons — running hundreds of interleaving experiments per week without Bonferroni / FDR control inflates false-positive rates.
- Sensitivity ≠ repeatability. Two sensitive tests on the same change can disagree if traffic composition or inventory has shifted.
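The multiple-comparisons caveat has a standard remedy; a minimal Benjamini–Hochberg FDR sketch (the p-values below are made up for illustration):

```python
def benjamini_hochberg(pvals, q=0.05):
    """Return indices of hypotheses rejected under BH FDR control at level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0  # largest rank whose p-value clears the BH step-up threshold
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            k = rank
    return sorted(order[:k])

# Ten weekly experiments: a couple of small p-values among mostly nulls.
pvals = [0.001, 0.008, 0.04, 0.20, 0.35, 0.50, 0.62, 0.75, 0.88, 0.95]
print(benjamini_hochberg(pvals))  # → [0, 1]
```

BH is less conservative than Bonferroni (which at q=0.05 over ten tests would demand p ≤ 0.005 and reject only the first), making it the usual choice when running many sensitive experiments per week.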
Seen in¶
- sources/2026-02-17-expedia-interleaving-for-accelerated-testing — sensitivity reported as the headline reason to prefer interleaving over A/B for subtle ranking changes; Figure 5 of the post shows CI shrinkage vs sample size for lift (interleaving) vs CVR uplift (A/B), with interleaving converging substantially faster.
Related¶
- concepts/interleaving-testing — the technique that achieves the sensitivity gain for ranking experiments.
- concepts/lift-metric — the direction-only metric whose sensitivity is high.
- concepts/conversion-rate-uplift — the magnitude metric whose sensitivity is low for subtle ranking changes.
- patterns/interleaved-ranking-evaluation — the end-to-end application of interleaving for sensitivity.
- patterns/ab-test-rollout — the classical baseline whose sensitivity interleaving improves on (for ranking experiments).
- concepts/customer-driven-metrics — the broader question of which metric to measure; sensitivity is one axis, revenue-proximity is another.