CONCEPT

Test sensitivity

Definition

Test sensitivity (closely related to statistical power) is the ability of an experimentation design to detect a real effect of a given size, at a given significance level, with a given sample size. A highly sensitive test detects small effects with less data; a low-sensitivity test needs more data (or misses the effect entirely).

For online product experiments:

  • High sensitivity → detect subtle ranking / UI / algorithm changes in days with modest traffic.
  • Low sensitivity → even large experiments at full traffic may fail to reach significance.

Why it matters in practice

Experimentation is a bottleneck in product velocity. Sensitivity determines:

  • Throughput — how many candidate changes can be evaluated per unit time.
  • Floor of detectable change — some changes may be real but not economically detectable under low-sensitivity designs.
  • Cost — long experiments tie up exposure that could be testing other candidates.

The interleaving sensitivity gain

Interleaving is dramatically more sensitive than classical A/B testing on CVR uplift for ranking experiments because it converts between-user variance into within-user (paired) variance — the same statistical win that makes paired-sample t-tests more powerful than two-sample t-tests in classical statistics.
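The variance argument can be sketched in a small simulation (the user-baseline spread, noise scale, sample size, and +1 pp effect below are illustrative assumptions, not Expedia's figures):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000        # users per arm (assumed for illustration)
effect = 0.01    # true treatment lift: +1 pp click probability (assumed)

# Between-user heterogeneity: each user has their own baseline propensity.
baseline = rng.normal(0.30, 0.10, size=n)

def noise():
    return rng.normal(0.0, 0.05, size=n)

# Unpaired (classic A/B): control and treatment are DIFFERENT users,
# so the between-user baseline spread lands in the standard error.
control = baseline + noise()
treatment = rng.normal(0.30, 0.10, size=n) + effect + noise()
se_unpaired = np.sqrt(control.var(ddof=1) / n + treatment.var(ddof=1) / n)

# Paired (interleaving-style): the SAME user yields both measurements,
# so the baseline cancels when we take per-user differences.
a = baseline + noise()
b = baseline + effect + noise()
se_paired = (b - a).std(ddof=1) / np.sqrt(n)

print(f"unpaired SE: {se_unpaired:.4f}, paired SE: {se_paired:.4f}")
print(f"sensitivity gain: {se_unpaired / se_paired:.1f}x")
```

With these assumed parameters the paired standard error is roughly half the unpaired one; the gain grows as between-user variance dominates within-user noise.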

Expedia's 2026-02-17 post reports two deliberately-deteriorated ranking treatments as worked examples:

  1. Pinning a random property to a slot between positions 5 and 10. Expected impact: minimal but real regression.
  2. Randomly reshuffling a number of top slots. Expected impact: moderate regression.

Results:

  • Interleaving "correctly detects the negative effects ... within a few days of data taking" for both treatments.
  • A/B testing on CVR "fails to detect the negative effect of random pinning even with the full sample size."

This isn't a small factor — it's the difference between detectable and undetectable at realistic sample sizes.

The sensitivity / magnitude trade

Higher sensitivity usually comes with a cost — the metric measured is often further from revenue:

| Metric                            | Sensitivity           | Revenue proximity |
|-----------------------------------|-----------------------|-------------------|
| Click-based lift (interleaving)   | Highest               | Lowest            |
| Booking-based lift (interleaving) | High                  | Medium            |
| CVR uplift (A/B)                  | Low on subtle changes | Highest           |
| Revenue uplift (A/B)              | Lowest                | Highest           |

Expedia tracks click lift and booking lift separately so that both the fast-detection signal and the revenue-closer signal stay visible. Early experiments can be killed on a click-lift regression; launch decisions go through the revenue-closer signal.

Factors that determine sensitivity

  • Paired vs unpaired design. Paired designs (interleaving, within-subject A/B) gain 2–10× sensitivity by removing between-unit variance.
  • Sample size. Sensitivity scales ~√n; doubling traffic shrinks the minimum detectable effect by a factor of √2 (≈ 29 % smaller).
  • Metric choice. High-frequency outcomes (clicks) are more sensitive than rare events (bookings).
  • Baseline variance. Low-variance metrics are more sensitive.
  • Variance-reduction techniques — CUPED, stratification, covariate adjustment — trade design-complexity for sensitivity.
  • Effect size. No design detects a zero effect; the true effect being tested sets the ceiling.
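
The √n scaling can be made concrete with the standard minimum-detectable-effect formula for a two-sided, two-sample test (the σ and n values below are arbitrary placeholders):

```python
import math
from statistics import NormalDist

def mde(sigma: float, n: int, alpha: float = 0.05, power: float = 0.80) -> float:
    """Minimum detectable effect for a two-sample comparison, n units per arm."""
    z = NormalDist().inv_cdf
    return (z(1 - alpha / 2) + z(power)) * math.sqrt(2 * sigma**2 / n)

base = mde(sigma=0.1, n=10_000)
doubled = mde(sigma=0.1, n=20_000)
print(f"MDE shrinks by {1 - doubled / base:.0%} when traffic doubles")  # ~29%
```

Since MDE ∝ 1/√n, quadrupling traffic is needed to halve the detectable effect, which is why variance-reduction techniques are often cheaper than buying more traffic.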

Caveats

  • Sensitivity isn't everything. A sensitive metric that's disconnected from the business outcome (e.g., dwell-time lift when revenue falls) can mislead.
  • More sensitive tests also detect more false signals if the significance threshold isn't adjusted for multiple comparisons — running hundreds of interleaving experiments per week without Bonferroni / FDR control inflates false-positive rates.
  • Sensitivity ≠ repeatability. Two sensitive tests on the same change can disagree if traffic composition or inventory has shifted.
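
For the multiple-comparisons caveat, the Benjamini–Hochberg step-up procedure is one standard FDR control; a minimal sketch (the p-values in the demo are made up for illustration):

```python
def benjamini_hochberg(p_values, q=0.05):
    """Indices of hypotheses rejected at FDR level q (BH step-up procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending p-value
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            k_max = rank             # largest rank passing its own threshold
    return sorted(order[:k_max])     # reject everything up to that rank

# Naive p < 0.05 would call three of these four significant;
# BH tightens each threshold by rank and keeps only the strongest.
print(benjamini_hochberg([0.001, 0.040, 0.045, 0.900]))  # [0]
```

Bonferroni (reject at p < q/m) is stricter still; BH is the usual choice when many interleaving experiments run concurrently and some false discoveries are tolerable.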

Seen in

  • sources/2026-02-17-expedia-interleaving-for-accelerated-testing — sensitivity reported as the headline reason to prefer interleaving over A/B for subtle ranking changes; Figure 5 of the post shows CI shrinkage vs sample size for lift (interleaving) vs CVR uplift (A/B), with interleaving converging substantially faster.