Bootstrap percentile method¶
Definition¶
The bootstrap percentile method is a non-parametric technique for
computing confidence intervals (and thus significance) for a point
estimate. Given a sample of size n:
- Resample the data with replacement to produce a bootstrap sample of size n.
- Compute the metric (mean, lift, median, whatever) on the bootstrap sample.
- Repeat B times (typically 1,000 to 10,000) to build an empirical distribution of the metric.
- Take the 2.5th and 97.5th percentiles of the empirical distribution as the 95% confidence interval.
If the null value (commonly 0) falls outside the confidence interval, reject the null at the chosen level.
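The steps above can be sketched in a few lines of Python. This is a minimal illustration, assuming NumPy; the function name and defaults are illustrative, not from the source:

```python
import numpy as np

def bootstrap_percentile_ci(data, metric=np.mean, n_resamples=10_000,
                            alpha=0.05, seed=None):
    """Percentile-method CI: resample with replacement, recompute the
    metric on each resample, take the alpha/2 and 1 - alpha/2
    percentiles of the empirical distribution."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    stats = np.empty(n_resamples)
    for b in range(n_resamples):
        resample = rng.choice(data, size=data.size, replace=True)
        stats[b] = metric(resample)
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Decision rule: reject the null at the 5% level if 0 lies outside [lo, hi].
lifts = np.random.default_rng(0).normal(0.3, 1.0, size=500)  # synthetic lifts
lo, hi = bootstrap_percentile_ci(lifts, seed=0)
significant = not (lo <= 0.0 <= hi)
```

Note the cost structure: the loop performs `n_resamples` full passes over the data, which is exactly the B × O(metric computation) term discussed under Properties.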
Properties¶
- Non-parametric: makes no distributional assumptions about the underlying data.
- General-purpose: works for any metric whose sampling distribution can be approximated by resampling.
- Slow: requires B × O(metric computation) work. At production scale (millions of samples, thousands of resamples), this is a significant cost even when distributed.
- Distributable but not cheap: partition-reducible resampling schemes exist, but the cost is still orders of magnitude above a parametric t-test.
Use in interleaving testing¶
In interleaving experiments, bootstrap of the lift metric is the classical way to decide whether an observed lift is significantly different from zero (Expedia, 2026-02-17). Expedia reports that bootstrap "works well as it doesn't make any assumptions on the underlying data, [but] it's slow in practice even if implemented in a distributed fashion."
For the specific winning-indicator aggregation, they substitute a t-test that yields "virtually the same results ... [but] considerably faster" — motivating the generalised patterns/t-test-over-bootstrap pattern.
When bootstrap remains the right choice¶
- Small sample size. The CLT doesn't yet hold; t-test assumptions break.
- Skewed metrics. Medians, quantiles, ratios where the sampling distribution isn't approximately normal.
- Novel metric definitions where no closed-form standard error is known.
- High-stakes launches where the compute cost of bootstrap is a rounding error compared to the business cost of a bad launch decision.
- Validating t-test equivalence. Run bootstrap once on a new experiment surface to confirm CLT applies, then switch to t-test for production.
When to prefer parametric alternatives¶
- Large sample size. CLT kicks in; t-test or z-test approximates bootstrap to within noise.
- Standard metric definitions (means, differences of means) with well-known standard errors.
- High-throughput experimentation platforms where every saved compute-second compounds over thousands of experiments.
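To make the cost contrast concrete, here is a sketch of the parametric alternative for a mean lift, using only the standard library. Under the CLT, one pass over the data replaces the B passes that bootstrap needs; the data and variable names are hypothetical:

```python
import random
import statistics

# Synthetic lift samples; at large n the sampling distribution of the
# mean is approximately normal, so a z/t interval matches the bootstrap
# percentile interval to within noise.
random.seed(42)
lifts = [random.gauss(0.1, 1.0) for _ in range(50_000)]

n = len(lifts)
mean = statistics.fmean(lifts)
sem = statistics.stdev(lifts) / n ** 0.5          # standard error of the mean
z = statistics.NormalDist().inv_cdf(0.975)        # ~1.96 for a 95% interval
lo, hi = mean - z * sem, mean + z * sem

# Same decision rule as the bootstrap: reject H0 if 0 is outside [lo, hi].
significant = not (lo <= 0.0 <= hi)
```

This is the shape of the substitution Expedia describes: one pass and a closed-form standard error instead of thousands of resamples.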
Caveats¶
- Bootstrap isn't assumption-free — it assumes the observed sample is representative of the population and that the metric is asymptotically pivotal. Neither is guaranteed.
- Number of resamples matters — too few and the percentile estimates are noisy; 10,000 is a common production floor.
- Cluster correlation — naive resampling of rows breaks if rows are grouped (e.g., searches from one user); cluster-level resampling is required.
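The cluster-correlation caveat can be addressed by resampling whole clusters rather than rows. A minimal sketch, assuming NumPy and hypothetical `user_ids`/`values` parallel arrays:

```python
import numpy as np

def cluster_bootstrap_ci(user_ids, values, metric=np.mean,
                         n_resamples=2_000, alpha=0.05, seed=0):
    """Cluster-level percentile bootstrap: draw clusters (e.g. users)
    with replacement, then compute the metric over all rows belonging
    to the drawn clusters."""
    rng = np.random.default_rng(seed)
    user_ids = np.asarray(user_ids)
    values = np.asarray(values)
    clusters = np.unique(user_ids)
    # Pre-index rows by cluster so each resample is a cheap lookup.
    rows_by_cluster = {u: values[user_ids == u] for u in clusters}
    stats = np.empty(n_resamples)
    for b in range(n_resamples):
        drawn = rng.choice(clusters, size=clusters.size, replace=True)
        stats[b] = metric(np.concatenate([rows_by_cluster[u] for u in drawn]))
    return np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])

# e.g. six searches from three users:
lo, hi = cluster_bootstrap_ci([1, 1, 2, 2, 3, 3],
                              [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
```

Resampling at the cluster level keeps all rows from one user together, so the within-user correlation is preserved in every bootstrap sample.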
Related¶
- concepts/winning-indicator-t-test — the fast parametric substitute in the interleaving-significance use case.
- concepts/interleaving-testing — the experimentation technique whose output is tested for significance.
- concepts/lift-metric — the metric whose confidence interval bootstrap (or t-test) computes.
- patterns/t-test-over-bootstrap — the generalised pattern of substituting a parametric test for bootstrap when performance matters and CLT applies.
Seen in¶
- sources/2026-02-17-expedia-interleaving-for-accelerated-testing — Expedia's baseline significance-testing approach for interleaving, replaced by the faster winning-indicator t-test at production scale.