
PATTERN

T-test over bootstrap for production significance testing

Intent

Replace the general-purpose but computationally expensive bootstrap percentile method for confidence intervals and significance testing with a CLT-backed t-test when:

  • The sample size is large enough for the Central Limit Theorem to apply.
  • The metric's mean is the object of inference.
  • The compute cost of bootstrap is a recurring bottleneck in the experimentation platform.

Structure

  1. Benchmark the proposed t-test against the bootstrap on a representative experiment at the target production scale.
  2. Verify equivalence — t-test and bootstrap should agree on both significance decisions and confidence-interval width to within measurement noise. (Expedia's term: "virtually the same results.")
  3. Document the regime — sample size thresholds, metric distributions, per-surface verification dates.
  4. Deploy the t-test as the production significance test; keep bootstrap as a fallback / validation tool.
  5. Re-verify periodically — each new product surface, traffic segment, or metric definition may have different distributional properties.
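The verification step (2) can be sketched as a side-by-side run of both intervals on the same per-unit deltas. This is an illustrative harness, not Expedia's code; the function names, sample data, and tolerances are assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def t_ci(diffs, alpha=0.05):
    """CLT-based confidence interval for the mean of per-unit deltas."""
    n = len(diffs)
    mean = diffs.mean()
    se = diffs.std(ddof=1) / np.sqrt(n)
    t = stats.t.ppf(1 - alpha / 2, df=n - 1)
    return mean - t * se, mean + t * se

def bootstrap_ci(diffs, alpha=0.05, resamples=2_000):
    """Percentile-bootstrap confidence interval for the same mean."""
    means = np.array([
        rng.choice(diffs, size=len(diffs), replace=True).mean()
        for _ in range(resamples)
    ])
    return np.quantile(means, alpha / 2), np.quantile(means, 1 - alpha / 2)

# Skewed per-unit deltas at a large-ish n: before deploying the t-test,
# both intervals should agree to within Monte Carlo noise.
diffs = rng.lognormal(0, 1, 20_000) - rng.lognormal(0, 1, 20_000)
t_lo, t_hi = t_ci(diffs)
b_lo, b_hi = bootstrap_ci(diffs)
```

In a real platform the comparison would run against archived experiment readouts rather than synthetic draws, and the agreement tolerance would be set relative to the bootstrap's own Monte Carlo noise.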

Why it works

The t-test's assumption is that the sampling distribution of the mean is approximately normal — not that the underlying data is normal. For a metric computed as an average across many independent (or approximately-independent) units, the Central Limit Theorem guarantees approximate normality of the mean at large n.

The bootstrap performs the same inference non-parametrically, by resampling; at large n, the two approaches converge to the same answer. The bootstrap's cost (O(resamples · n) per interval) can dominate an experimentation platform's compute budget, while the t-test's cost (O(n), computed once) is negligible.
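A quick illustration of the first point: the per-unit metric below is heavily right-skewed, yet the sampling distribution of its mean loses that skew as n grows (skewness of the mean shrinks roughly as 1/√n). The data and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def skewness(x):
    """Sample skewness: third central moment over std cubed."""
    x = np.asarray(x, dtype=float)
    return ((x - x.mean()) ** 3).mean() / x.std() ** 3

# Heavily skewed underlying metric (log-normal, skewness ~ 6).
# The CLT acts on the *mean*: its skewness decays like 1 / sqrt(n).
skew_by_n = {}
for n in (10, 1_000):
    means = rng.lognormal(0, 1, size=(5_000, n)).mean(axis=1)
    skew_by_n[n] = skewness(means)
```

At n = 10 the mean is still visibly skewed; at n = 1,000 it is close to symmetric, which is exactly the regime where the t-test's normality assumption on the mean holds.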

Why Expedia highlighted this

In their interleaving experimentation harness, Expedia's significance test runs on every experiment readout, across many candidate rankings per week, at production search volume. Bootstrap's O(B · n) cost — even distributed — scaled poorly. Substituting a t-test on the winning-indicator distribution "yields virtually the same results as the bootstrapping approach but is considerably faster."

This pattern generalises: any experimentation platform running bootstrap-based significance at production scale is a candidate for the substitution, provided CLT applies.

Pre-conditions

  • Sample size is large. Hundreds to thousands of units at minimum; millions comfortably. The n required for the CLT approximation grows with the metric's skewness.
  • Metric is a mean (or a function approximately linear in means). Differences of means, ratios of means (with delta method), regression-adjusted means — all fit. Medians, quantiles, and max-order statistics don't.
  • Units are approximately independent, or cluster-adjusted. If per-user correlation is high, the effective sample size is lower and CLT applies to the lower count.
  • Distributional shape is not pathological. Heavy-tailed metrics (revenue with whales, log-normal response times) need larger samples before CLT kicks in.
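The independence pre-condition in practice: when sessions share a per-user component, the honest unit is the user, not the session. A sketch on synthetic data (names and effect sizes are assumptions) of aggregating to user level before computing the standard error:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic session-level data: sessions from the same user share a
# per-user component, so they are correlated, not independent.
n_users = 2_000
user_effect = rng.normal(0, 1, n_users)
sessions = rng.integers(1, 20, n_users)          # sessions per user
user_ids = np.repeat(np.arange(n_users), sessions)
values = user_effect[user_ids] + rng.normal(0, 0.3, len(user_ids))

# Naive SE treats every session as independent: n looks ~10x too big.
se_naive = values.std(ddof=1) / np.sqrt(len(values))

# Cluster-adjusted: one mean per user, then the SE over users --
# the approximately-independent unit.
per_user = np.bincount(user_ids, weights=values) / np.bincount(user_ids)
se_user = per_user.std(ddof=1) / np.sqrt(n_users)
```

With strong per-user correlation the user-level SE is several times the naive session-level SE; running either the t-test or the bootstrap on session rows would overstate significance.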

Anti-patterns

  • Skipping verification. Never deploy the t-test in place of bootstrap without running both at target scale and confirming agreement.
  • Extending the substitution to all metrics. Medians, percentiles, and functions of ranks still need bootstrap or order-statistic CIs — don't apply t-test to them by habit.
  • Ignoring cluster correlation. Treating correlated sessions from the same user as independent units inflates the apparent sample size and shrinks standard errors; both bootstrap and t-test need cluster-level aggregation.
  • One-time verification. Product surfaces evolve; re-verify when a new metric or segment is introduced.

Operational guidance

  • Cost impact at scale. Bootstrap at 10,000 resamples on millions of rows burns thousands of CPU-seconds per readout; t-test is microseconds. For a platform running hundreds of readouts a day, the substitution is a meaningful compute win.
  • Keep bootstrap in the toolkit. For novel metric definitions, new product surfaces, or low-traffic segments, bootstrap is the safer first-pass.
  • Use bootstrap to validate t-test deployment decisions. Once per year, re-run bootstrap on a sample of experiments to confirm the t-test is still equivalent; distribution shape may have drifted.
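The periodic re-validation can be a batch job: re-run the bootstrap on a sample of archived readouts and count significance-decision disagreements. A minimal sketch, where the function names, thresholds, and simulated readouts are assumptions rather than a platform API:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def decisions_agree(diffs, alpha=0.05, resamples=2_000):
    """True if the t-test and percentile bootstrap reach the same verdict."""
    _, p = stats.ttest_1samp(diffs, 0.0)
    boot_means = np.array([
        rng.choice(diffs, size=len(diffs), replace=True).mean()
        for _ in range(resamples)
    ])
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return (p < alpha) == (not lo <= 0.0 <= hi)

# Audit a small batch of readouts (simulated here): null, moderate,
# and clearly-positive effects.
readouts = [rng.normal(mu, 1, 5_000) for mu in (0.0, 0.05, 0.10)]
agreement = sum(decisions_agree(d) for d in readouts) / len(readouts)
```

If the agreement rate drops, the distributional regime has likely drifted and the t-test substitution should be re-verified per the Structure steps above.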
