CONCEPT
Run-to-run variance¶
Definition¶
Run-to-run variance is the variability in measured latency between repeated invocations of the same benchmark on the same binary. It has structural causes (syscall scheduling, page-cache warmth, disk I/O queuing, NUMA placement, CPU frequency scaling) and ambient-system causes (other processes, timer interrupts, memory pressure, background network).
This variance sets the noise floor below which it becomes impossible to distinguish a real couple-percent performance win from a lucky run.
Why it matters more as code gets faster¶
As the code under test gets faster, the absolute size of real wins shrinks while the absolute size of noise stays roughly constant. A couple-millisecond real win on a 100 ms benchmark is a 2 % improvement and likely distinguishable from noise. The same couple-millisecond win on a 10 ms benchmark is a 20 % improvement but — against the same absolute noise — may now be indistinguishable because the benchmark itself is now shorter than the noise envelope.
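This can be made concrete with a rough sample-size estimate: against per-run noise σ, a true mean difference δ only becomes statistically resolvable once enough runs are averaged. A minimal sketch using the standard two-sample normal approximation (the σ values in the example are hypothetical, chosen only for illustration):

```python
import math

def runs_needed(delta_ms: float, sigma_ms: float, z: float = 1.96) -> int:
    """Rough two-sample estimate: runs per variant needed before a true
    mean difference delta_ms stands out against per-run noise sigma_ms
    at ~95% confidence, via n >= 2 * (z * sigma / delta)**2."""
    return math.ceil(2 * (z * sigma_ms / delta_ms) ** 2)

# A 2 ms real win against 5 ms of per-run noise needs ~49 runs per
# variant to resolve; the same 2 ms against 1 ms of noise needs ~2.
print(runs_needed(2, 5), runs_needed(2, 1))
```

The point of the sketch: as σ approaches or exceeds δ, the number of runs required grows quadratically, which is exactly the regime the quote below describes.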
Canonical verbatim framing from Anthony Shew's 2026-04-21 Turborepo post:
"As the code gets faster, system noise matters more. Syscalls, memory, and disk I/O all have their variance. The profiles were noisy too. I had gotten the codebase to a point where the individual functions were fast enough that background activity on my laptop was drowning out any good signal. Was the change I made really 2% faster, or did I just get lucky with a quiet run? I couldn't confidently distinguish real improvements from noise."
Observable in two places¶
- End-to-end wall-clock (`hyperfine`/`time`). Each full run is a draw from a noisy distribution.
- Profile self-times. Profiles are themselves sampled; run-to-run profile variance can make the same hot function look different between adjacent runs.
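The second point follows from sampling alone, which a toy simulation of a sampling profiler makes visible: with a fixed binary and a fixed true self-time share, each profile is still a finite sample, so adjacent runs disagree. (The 30 % share and 500-sample count below are arbitrary illustration values, not from the source.)

```python
import random
import statistics

def observed_share(true_share: float, samples: int, rng: random.Random) -> float:
    """One simulated profile: each stack sample independently lands in
    the hot function with probability true_share."""
    hits = sum(rng.random() < true_share for _ in range(samples))
    return hits / samples

rng = random.Random(42)
profiles = [observed_share(0.30, 500, rng) for _ in range(20)]
# With 500 samples, adjacent profiles of the *same* binary scatter by
# roughly sqrt(p * (1 - p) / N), about 2 percentage points here.
print(min(profiles), max(profiles))
```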
Structural reduction techniques¶
- Warmup runs. First-run costs (disk reads, code loading, JIT warmup, page-cache fills) contaminate early samples; `hyperfine --warmup N` discards the first N runs before timing begins.
- Many timed runs + statistical reporting. Mean + standard deviation + confidence interval (`hyperfine --runs M`) rather than single-run numbers.
- Sandbox benchmarking. Eliminate the ambient-system variance sources (Slack, Spotlight, browser tabs, cron jobs) by running in an ephemeral minimal-dependency container. See patterns/ephemeral-sandbox-benchmark-pair.
- Avoid cross-host comparison. Different machines have different sustained clock speeds, different thermal behaviour, different neighbour loads; cross-host A/B introduces variance that swamps small real wins.
- Pin frequency / disable turbo. For very sensitive measurements, explicitly pin CPU frequency to avoid turbo-boost-induced variance across runs.
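The first two techniques can be sketched as a hyperfine-style harness. This is a simplified Python analogue of what `hyperfine` does, not its actual implementation:

```python
import statistics
import time

def benchmark(fn, warmup: int = 3, runs: int = 30):
    """Discard `warmup` untimed invocations (page-cache fills, code
    loading), then report mean and standard deviation over `runs`
    timed invocations instead of a single-run number."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)

mean_s, sigma_s = benchmark(lambda: sum(i * i for i in range(50_000)))
```

Reporting mean ± σ over many runs is what makes a claimed 2 % win falsifiable: if two variants' means differ by less than a couple of standard errors, the win is not yet distinguishable from noise.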
Canonical variance-reduction datum¶
The Turborepo post's PR #11984 (stack-allocated OidHash) documents a striking variance-reduction win alongside the absolute-speed win:
| Repo size | Before (mean ± σ) | After (mean ± σ) | Variance reduction |
|---|---|---|---|
| ~1,000 pkg | 1.463 s ± 52 ms | 1.466 s ± 27 ms | 48 % |
| ~125 pkg | 659 ms ± 145 ms | 592 ms ± 63 ms | 57 % |
| 6 pkg | 97 ms ± 47 ms | 75 ms ± 18 ms | 61 % |
The 1,000-package row is particularly striking: no mean-time improvement (1.463 s ↔ 1.466 s) but 48 % less variance. Moving SHA-1 hashes from heap-allocated `String` to stack-allocated `[u8; 40]` removed 10,000+ heap allocations per run; the resulting reduction in allocator pressure made the wall-clock more predictable, even when the mean stayed the same.
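Assuming the post's "variance reduction" refers to the drop in reported σ, the table's percentages can be reproduced directly as 1 − σ_after/σ_before (the smallest-repo row computes to ~62 %, close to the tabulated 61 %):

```python
# σ before/after in ms, taken from the PR #11984 table above.
sigmas = {"~1,000 pkg": (52, 27), "~125 pkg": (145, 63), "6 pkg": (47, 18)}

# reduction = 1 - σ_after / σ_before
reductions = {repo: 1 - after / before for repo, (before, after) in sigmas.items()}
# ~1,000 pkg: ~48%, ~125 pkg: ~57%, 6 pkg: ~62%
```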
The post frames this verbatim: "The most notable improvement across all three sizes was the reduction in run-to-run variance, which agrees with our theory of less allocator pressure and more predictable performance."
Variance reduction is a real performance win in its own right — predictable p99 behaviour matters more than median in production systems — even when the mean doesn't move.
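To see why, model the 1,000-package row's before/after latencies as normal distributions with the tabulated mean and σ (a simplifying assumption for illustration, not a claim from the source): the mean barely moves, but the tail quantile improves noticeably.

```python
from statistics import NormalDist

# Normal models built from the PR #11984 table (values in seconds).
before = NormalDist(mu=1.463, sigma=0.052)  # pre-change
after = NormalDist(mu=1.466, sigma=0.027)   # post-change

p99_before = before.inv_cdf(0.99)  # ~1.584 s
p99_after = after.inv_cdf(0.99)    # ~1.529 s
```

Under this model the p99 drops by roughly 55 ms despite the marginally higher mean, purely from the variance reduction.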
Related¶
- concepts/sandbox-benchmarking-for-signal-isolation — the canonical noise-floor-reduction technique when real wins approach laptop noise.
- concepts/benchmark-methodology-bias — adjacent concept covering benchmark construction failures; run-to-run variance is about measurement noise.
- concepts/microbenchmark-vs-end-to-end-gap — orthogonal concern about microbench vs real workload; variance-handling applies to both.
- systems/hyperfine — tool that handles warmup + N timed runs + statistical reporting out of the box.
- patterns/ephemeral-sandbox-benchmark-pair — the clean-room pattern for minimum-noise measurement.
Seen in¶
- sources/2026-04-21-vercel-making-turborepo-96-faster-with-agents-sandboxes-and-humans — definitional source; canonicalises the variance-reduction-as-performance-win framing with the 48 % / 57 % / 61 % variance-reduction datum across three repo sizes, and the "is my 2 % win real or just a quiet run?" observation as the motivating measurement problem.