
Bootstrap percentile method

Definition

The bootstrap percentile method is a non-parametric technique for computing confidence intervals (and thus significance) for a point estimate. Given a sample of size n:

  1. Resample the data with replacement to produce a bootstrap sample of size n.
  2. Compute the metric of interest (mean, lift, median, etc.) on the bootstrap sample.
  3. Repeat B times (typically 1,000 to 10,000) to build an empirical distribution of the metric.
  4. Take the 2.5th and 97.5th percentiles of the empirical distribution as the 95% confidence interval.

If the null value (commonly 0) falls outside the confidence interval, reject the null at the chosen level.
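The four steps above can be sketched in a few lines of stdlib Python (function and parameter names are illustrative, not from the source):

```python
import random
import statistics

def bootstrap_percentile_ci(data, metric=statistics.fmean,
                            n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-method confidence interval for `metric` on `data`."""
    rng = random.Random(seed)
    n = len(data)
    # Steps 1-3: resample with replacement, recompute the metric, repeat B times.
    stats = sorted(metric(rng.choices(data, k=n)) for _ in range(n_resamples))
    # Step 4: read off the alpha/2 and 1 - alpha/2 empirical percentiles.
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

If the null value (e.g. 0 for a lift) lies outside the returned `(lo, hi)` interval, the result is significant at the chosen level.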

Properties

  • Non-parametric: makes no distributional assumptions about the underlying data.
  • General-purpose: works for any metric whose sampling distribution can be approximated by resampling.
  • Slow: requires B × O(metric computation) work. At production scale (millions of samples, thousands of resamples), this is a significant cost even when distributed.
  • Distributable but not cheap: partition-reducible resampling schemes exist, but the cost is still orders of magnitude above a parametric t-test.

Use in interleaving testing

In interleaving experiments, bootstrap of the lift metric is the classical way to decide whether an observed lift is significantly different from zero (Expedia, 2026-02-17). Expedia reports that bootstrap "works well as it doesn't make any assumptions on the underlying data, [but] it's slow in practice even if implemented in a distributed fashion."

For the specific winning-indicator aggregation, they substitute a t-test that yields "virtually the same results ... [but] considerably faster" — motivating the general t-test-over-bootstrap pattern.
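As a sketch of the substitution, a large-sample test on per-comparison winning indicators needs only a mean and a standard error. The ±1 encoding and the function name below are hypothetical (the source does not specify Expedia's exact aggregation), and the standard-normal CDF is used since at large n the t- and z-tests coincide:

```python
import math
import statistics

def lift_z_test(indicators):
    """Two-sided large-sample test that the mean indicator differs from 0.

    `indicators`: +1 when the candidate ranker wins an interleaved
    comparison, -1 when production wins (hypothetical encoding).
    """
    n = len(indicators)
    mean = statistics.fmean(indicators)
    se = statistics.stdev(indicators) / math.sqrt(n)
    z = mean / se
    # Two-sided p-value from the standard normal CDF.
    p = 2 * (1 - statistics.NormalDist().cdf(abs(z)))
    return z, p
```

One pass over the data replaces thousands of resampling passes, which is the source of the quoted speedup.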

When bootstrap remains the right choice

  • Low sample size. CLT doesn't hold; t-test assumptions break.
  • Skewed metrics. Medians, quantiles, ratios where the sampling distribution isn't approximately normal.
  • Novel metric definitions where no closed-form standard error is known.
  • High-stakes launches where the compute cost of bootstrap is a rounding error compared to the business cost of a bad launch decision.
  • Validating t-test equivalence. Run bootstrap once on a new experiment surface to confirm CLT applies, then switch to t-test for production.

When to prefer parametric alternatives

  • Large sample size. CLT kicks in; t-test or z-test approximates bootstrap to within noise.
  • Standard metric definitions (means, differences of means) with well-known standard errors.
  • High-throughput experimentation platforms where every saved compute-second compounds over thousands of experiments.

Caveats

  • Bootstrap isn't assumption-free — it assumes the observed sample is representative of the population and that the metric is asymptotically pivotal. Neither is guaranteed.
  • Number of resamples matters — too few and the percentile estimates are noisy; 10,000 is a common production floor.
  • Cluster correlation — naive resampling of rows breaks if rows are grouped (e.g., searches from one user); cluster-level resampling is required.
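The cluster-correlation caveat can be handled by resampling whole clusters rather than rows. A minimal sketch, assuming rows arrive as `(cluster_id, value)` pairs (the data layout and function name are illustrative):

```python
import random
import statistics
from collections import defaultdict

def cluster_bootstrap_means(rows, n_resamples=1000, seed=0):
    """Bootstrap distribution of the mean, resampling whole clusters
    (e.g. all searches from one user) so within-cluster correlation
    is preserved instead of broken by row-level resampling.
    """
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for cid, value in rows:
        by_cluster[cid].append(value)
    clusters = list(by_cluster.values())
    k = len(clusters)
    means = []
    for _ in range(n_resamples):
        # Draw k clusters with replacement; keep each cluster's rows intact.
        sample = [v for c in rng.choices(clusters, k=k) for v in c]
        means.append(statistics.fmean(sample))
    return sorted(means)
```

Percentiles of the returned list give the interval, exactly as in the row-level method.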
