Bootstrap percentile method¶
Definition¶
The bootstrap percentile method is a non-parametric technique for
computing confidence intervals (and thus significance) for a point
estimate. Given a sample of size n:
- Resample the data with replacement to produce a bootstrap sample of size n.
- Compute the metric (mean, lift, median, whatever) on the bootstrap sample.
- Repeat B times (typically 1,000 to 10,000) to build an empirical distribution of the metric.
- Take the 2.5th and 97.5th percentiles of the empirical distribution as the 95% confidence interval.
If the null value (commonly 0) falls outside the confidence interval, reject the null at the chosen level.
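The steps above can be sketched in a few lines of Python. This is a minimal illustration, assuming NumPy; the function name and defaults are illustrative, not from the source:

```python
import numpy as np

def bootstrap_percentile_ci(data, metric=np.mean, n_resamples=10_000,
                            alpha=0.05, seed=None):
    """Percentile-method CI: resample with replacement, recompute the
    metric on each resample, take the alpha/2 and 1 - alpha/2
    percentiles of the empirical distribution."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    stats = np.empty(n_resamples)
    for b in range(n_resamples):
        resample = rng.choice(data, size=data.size, replace=True)
        stats[b] = metric(resample)
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Decision rule: reject the null at the 5% level if 0 lies outside [lo, hi].
lifts = np.random.default_rng(0).normal(0.3, 1.0, size=500)  # synthetic lifts
lo, hi = bootstrap_percentile_ci(lifts, seed=0)
significant = not (lo <= 0.0 <= hi)
```

Note the cost structure: the loop performs `n_resamples` full passes over the data, which is exactly the B × O(metric computation) term discussed under Properties.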
Properties¶
- Non-parametric: makes no distributional assumptions about the underlying data.
- General-purpose: works for any metric whose sampling distribution can be approximated by resampling.
- Slow: requires B × O(metric computation) work. At production scale (millions of samples, thousands of resamples), this is a significant cost even when distributed.
- Distributable but not cheap: partition-reducible resampling schemes exist, but the cost is still orders of magnitude above a parametric t-test.
Use in interleaving testing¶
In interleaving experiments, bootstrap of the lift metric is the classical way to decide whether an observed lift is significantly different from zero (Expedia, 2026-02-17). Expedia reports that bootstrap "works well as it doesn't make any assumptions on the underlying data, [but] it's slow in practice even if implemented in a distributed fashion."
For the specific winning-indicator aggregation, they substitute a t-test that yields "virtually the same results ... [but] considerably faster" — motivating the generalised patterns/t-test-over-bootstrap pattern.
When bootstrap remains the right choice¶
- Small sample size. The CLT doesn't yet hold; t-test assumptions break.
- Skewed metrics. Medians, quantiles, ratios where the sampling distribution isn't approximately normal.
- Novel metric definitions where no closed-form standard error is known.
- High-stakes launches where the compute cost of bootstrap is a rounding error compared to the business cost of a bad launch decision.
- Validating t-test equivalence. Run bootstrap once on a new experiment surface to confirm CLT applies, then switch to t-test for production.
When to prefer parametric alternatives¶
- Large sample size. CLT kicks in; t-test or z-test approximates bootstrap to within noise.
- Standard metric definitions (means, differences of means) with well-known standard errors.
- High-throughput experimentation platforms where every saved compute-second compounds over thousands of experiments.
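To make the cost contrast concrete, here is a sketch of the parametric alternative for a mean lift, using only the standard library. Under the CLT, one pass over the data replaces the B passes that bootstrap needs; the data and variable names are hypothetical:

```python
import random
import statistics

# Synthetic lift samples; at large n the sampling distribution of the
# mean is approximately normal, so a z/t interval matches the bootstrap
# percentile interval to within noise.
random.seed(42)
lifts = [random.gauss(0.1, 1.0) for _ in range(50_000)]

n = len(lifts)
mean = statistics.fmean(lifts)
sem = statistics.stdev(lifts) / n ** 0.5          # standard error of the mean
z = statistics.NormalDist().inv_cdf(0.975)        # ~1.96 for a 95% interval
lo, hi = mean - z * sem, mean + z * sem

# Same decision rule as the bootstrap: reject H0 if 0 is outside [lo, hi].
significant = not (lo <= 0.0 <= hi)
```

This is the shape of the substitution Expedia describes: one pass and a closed-form standard error instead of thousands of resamples.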
Caveats¶
- Bootstrap isn't assumption-free — it assumes the observed sample is representative of the population and that the metric is asymptotically pivotal. Neither is guaranteed.
- Number of resamples matters — too few and the percentile estimates are noisy; 10,000 is a common production floor.
- Cluster correlation — naive resampling of rows breaks if rows are grouped (e.g., searches from one user); cluster-level resampling is required.
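The cluster-correlation caveat can be addressed by resampling whole clusters rather than rows. A minimal sketch, assuming NumPy and hypothetical `user_ids`/`values` parallel arrays:

```python
import numpy as np

def cluster_bootstrap_ci(user_ids, values, metric=np.mean,
                         n_resamples=2_000, alpha=0.05, seed=0):
    """Cluster-level percentile bootstrap: draw clusters (e.g. users)
    with replacement, then compute the metric over all rows belonging
    to the drawn clusters."""
    rng = np.random.default_rng(seed)
    user_ids = np.asarray(user_ids)
    values = np.asarray(values)
    clusters = np.unique(user_ids)
    # Pre-index rows by cluster so each resample is a cheap lookup.
    rows_by_cluster = {u: values[user_ids == u] for u in clusters}
    stats = np.empty(n_resamples)
    for b in range(n_resamples):
        drawn = rng.choice(clusters, size=clusters.size, replace=True)
        stats[b] = metric(np.concatenate([rows_by_cluster[u] for u in drawn]))
    return np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])

# e.g. six searches from three users:
lo, hi = cluster_bootstrap_ci([1, 1, 2, 2, 3, 3],
                              [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
```

Resampling at the cluster level keeps all rows from one user together, so the within-user correlation is preserved in every bootstrap sample.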
Related¶
- concepts/winning-indicator-t-test — the fast parametric substitute in the interleaving-significance use case.
- concepts/interleaving-testing — the experimentation technique whose output is tested for significance.
- concepts/lift-metric — the metric whose confidence interval bootstrap (or t-test) computes.
- patterns/t-test-over-bootstrap — the generalised pattern of substituting a parametric test for bootstrap when performance matters and CLT applies.
Seen in¶
- sources/2026-02-17-expedia-interleaving-for-accelerated-testing — Expedia's baseline significance-testing approach for interleaving, replaced by the faster winning-indicator t-test at production scale.