
PATTERN

T-test over bootstrap for production significance testing

Intent

Replace the general-purpose but computationally expensive bootstrap percentile method for confidence intervals and significance testing with a CLT-backed t-test when:

  • The sample size is large enough for the Central Limit Theorem to apply.
  • The metric's mean is the object of inference.
  • The compute cost of bootstrap is a recurring bottleneck in the experimentation platform.

Structure

  1. Benchmark the proposed t-test against the bootstrap on a representative experiment at the target production scale.
  2. Verify equivalence — t-test and bootstrap should agree on both significance decisions and confidence-interval width to within measurement noise. (Expedia's term: "virtually the same results.")
  3. Document the regime — sample size thresholds, metric distributions, per-surface verification dates.
  4. Deploy the t-test as the production significance test; keep bootstrap as a fallback / validation tool.
  5. Re-verify periodically — each new product surface, traffic segment, or metric definition may have different distributional properties.
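The verification step (2) can be sketched as a side-by-side run of both intervals on the same per-unit deltas. This is an illustrative harness, not Expedia's code; the function names, sample data, and tolerances are assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def t_ci(diffs, alpha=0.05):
    """CLT-based confidence interval for the mean of per-unit deltas."""
    n = len(diffs)
    mean = diffs.mean()
    se = diffs.std(ddof=1) / np.sqrt(n)
    t = stats.t.ppf(1 - alpha / 2, df=n - 1)
    return mean - t * se, mean + t * se

def bootstrap_ci(diffs, alpha=0.05, resamples=2_000):
    """Percentile-bootstrap confidence interval for the same mean."""
    means = np.array([
        rng.choice(diffs, size=len(diffs), replace=True).mean()
        for _ in range(resamples)
    ])
    return np.quantile(means, alpha / 2), np.quantile(means, 1 - alpha / 2)

# Skewed per-unit deltas at a large-ish n: before deploying the t-test,
# both intervals should agree to within Monte Carlo noise.
diffs = rng.lognormal(0, 1, 20_000) - rng.lognormal(0, 1, 20_000)
t_lo, t_hi = t_ci(diffs)
b_lo, b_hi = bootstrap_ci(diffs)
```

In a real platform the comparison would run against archived experiment readouts rather than synthetic draws, and the agreement tolerance would be set relative to the bootstrap's own Monte Carlo noise.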

Why it works

The t-test's assumption is that the sampling distribution of the mean is approximately normal — not that the underlying data is normal. For a metric computed as an average across many independent (or approximately-independent) units, the Central Limit Theorem guarantees approximate normality of the mean at large n.

The bootstrap performs the same inference non-parametrically, by resampling; at large n, the two approaches converge to the same answer. The bootstrap's cost (O(resamples · n) per interval) can dominate an experimentation platform's compute budget, while the t-test's cost (O(n), computed once) is negligible.
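A quick illustration of the first point: the per-unit metric below is heavily right-skewed, yet the sampling distribution of its mean loses that skew as n grows (skewness of the mean shrinks roughly as 1/√n). The data and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def skewness(x):
    """Sample skewness: third central moment over std cubed."""
    x = np.asarray(x, dtype=float)
    return ((x - x.mean()) ** 3).mean() / x.std() ** 3

# Heavily skewed underlying metric (log-normal, skewness ~ 6).
# The CLT acts on the *mean*: its skewness decays like 1 / sqrt(n).
skew_by_n = {}
for n in (10, 1_000):
    means = rng.lognormal(0, 1, size=(5_000, n)).mean(axis=1)
    skew_by_n[n] = skewness(means)
```

At n = 10 the mean is still visibly skewed; at n = 1,000 it is close to symmetric, which is exactly the regime where the t-test's normality assumption on the mean holds.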

Why Expedia highlighted this

In their interleaving experimentation harness, Expedia's significance test runs on every experiment readout, across many candidate rankings per week, at production search volume. Bootstrap's O(B · n) cost — even distributed — scaled poorly. Substituting a t-test on the winning-indicator distribution "yields virtually the same results as the bootstrapping approach but is considerably faster."

This pattern generalises: any experimentation platform running bootstrap-based significance at production scale is a candidate for the substitution, provided CLT applies.

Pre-conditions

  • Sample size is large. Hundreds to thousands of units at minimum; millions comfortably. The n required for the CLT approximation grows with the metric's skewness.
  • Metric is a mean (or a function approximately linear in means). Differences of means, ratios of means (with delta method), regression-adjusted means — all fit. Medians, quantiles, and max-order statistics don't.
  • Units are approximately independent, or cluster-adjusted. If per-user correlation is high, the effective sample size is lower and CLT applies to the lower count.
  • Distributional shape is not pathological. Heavy-tailed metrics (revenue with whales, log-normal response times) need larger samples before CLT kicks in.
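The independence pre-condition in practice: when sessions share a per-user component, the honest unit is the user, not the session. A sketch on synthetic data (names and effect sizes are assumptions) of aggregating to user level before computing the standard error:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic session-level data: sessions from the same user share a
# per-user component, so they are correlated, not independent.
n_users = 2_000
user_effect = rng.normal(0, 1, n_users)
sessions = rng.integers(1, 20, n_users)          # sessions per user
user_ids = np.repeat(np.arange(n_users), sessions)
values = user_effect[user_ids] + rng.normal(0, 0.3, len(user_ids))

# Naive SE treats every session as independent: n looks ~10x too big.
se_naive = values.std(ddof=1) / np.sqrt(len(values))

# Cluster-adjusted: one mean per user, then the SE over users --
# the approximately-independent unit.
per_user = np.bincount(user_ids, weights=values) / np.bincount(user_ids)
se_user = per_user.std(ddof=1) / np.sqrt(n_users)
```

With strong per-user correlation the user-level SE is several times the naive session-level SE; running either the t-test or the bootstrap on session rows would overstate significance.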

Anti-patterns

  • Skipping verification. Never deploy the t-test in place of bootstrap without running both at target scale and confirming agreement.
  • Extending the substitution to all metrics. Medians, percentiles, and functions of ranks still need bootstrap or order-statistic CIs — don't apply t-test to them by habit.
  • Ignoring cluster correlation. Treating correlated sessions from the same user as independent units inflates the apparent sample size and shrinks standard errors; both bootstrap and t-test need cluster-level aggregation.
  • One-time verification. Product surfaces evolve; re-verify when a new metric or segment is introduced.

Operational guidance

  • Cost impact at scale. Bootstrap at 10,000 resamples on millions of rows burns thousands of CPU-seconds per readout; t-test is microseconds. For a platform running hundreds of readouts a day, the substitution is a meaningful compute win.
  • Keep bootstrap in the toolkit. For novel metric definitions, new product surfaces, or low-traffic segments, bootstrap is the safer first-pass.
  • Use bootstrap to validate t-test deployment decisions. Once per year, re-run bootstrap on a sample of experiments to confirm the t-test is still equivalent; distribution shape may have drifted.
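The periodic re-validation can be a batch job: re-run the bootstrap on a sample of archived readouts and count significance-decision disagreements. A minimal sketch, where the function names, thresholds, and simulated readouts are assumptions rather than a platform API:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def decisions_agree(diffs, alpha=0.05, resamples=2_000):
    """True if the t-test and percentile bootstrap reach the same verdict."""
    _, p = stats.ttest_1samp(diffs, 0.0)
    boot_means = np.array([
        rng.choice(diffs, size=len(diffs), replace=True).mean()
        for _ in range(resamples)
    ])
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return (p < alpha) == (not lo <= 0.0 <= hi)

# Audit a small batch of readouts (simulated here): null, moderate,
# and clearly-positive effects.
readouts = [rng.normal(mu, 1, 5_000) for mu in (0.0, 0.05, 0.10)]
agreement = sum(decisions_agree(d) for d in readouts) / len(readouts)
```

If the agreement rate drops, the distributional regime has likely drifted and the t-test substitution should be re-verified per the Structure steps above.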
