CONCEPT

Run-to-run variance

Definition

Run-to-run variance is the variability in measured latency between repeated invocations of the same benchmark on the same binary. It has structural causes (syscall scheduling, page-cache warmth, disk I/O queuing, NUMA placement, CPU frequency scaling) and ambient-system causes (other processes, timer interrupts, memory pressure, background network).

It defines the noise floor: below it, a real couple-percent performance win cannot be distinguished from a lucky run.

Why it matters more as code gets faster

As the code under test gets faster, the absolute size of real wins shrinks while the absolute size of noise stays roughly constant. A 2 % win on a 100 ms benchmark is a 2 ms gap — comfortably above a sub-millisecond noise envelope and likely distinguishable. The same 2 % win on a 10 ms benchmark is only a 0.2 ms gap; against the same absolute noise, it is indistinguishable from a lucky run.
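The arithmetic above can be sketched as a rough two-sigma test: a win only clears the noise floor when the gap between mean timings exceeds twice the combined standard error. A minimal sketch — the function names and all numbers below are illustrative, not from the post:

```rust
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

fn std_dev(xs: &[f64]) -> f64 {
    let m = mean(xs);
    let var = xs.iter().map(|x| (x - m).powi(2)).sum::<f64>() / (xs.len() - 1) as f64;
    var.sqrt()
}

/// A win is "distinguishable" here when the gap between means exceeds
/// twice the combined standard error (a rough two-sigma rule of thumb).
fn distinguishable(before_ms: &[f64], after_ms: &[f64]) -> bool {
    let gap = mean(before_ms) - mean(after_ms);
    let se = (std_dev(before_ms).powi(2) / before_ms.len() as f64
        + std_dev(after_ms).powi(2) / after_ms.len() as f64)
        .sqrt();
    gap > 2.0 * se
}

fn main() {
    // 100 ms benchmark, ~0.8 ms ambient noise, 2 ms real win: clears the floor.
    let slow = distinguishable(
        &[100.0, 101.0, 99.0, 100.5, 99.5],
        &[98.0, 99.0, 97.0, 98.5, 97.5],
    );
    // 10 ms benchmark, same absolute noise, same 2 % win (now 0.2 ms): lost in it.
    let fast = distinguishable(
        &[10.0, 11.0, 9.0, 10.5, 9.5],
        &[9.8, 10.8, 8.8, 10.3, 9.3],
    );
    println!("100 ms case: {slow}, 10 ms case: {fast}");
}
```

The same absolute noise swallows the faster benchmark's win because the win shrank with the runtime while the noise did not.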

Canonical verbatim framing from Anthony Shew's 2026-04-21 Turborepo post:

"As the code gets faster, system noise matters more. Syscalls, memory, and disk I/O all have their variance. The profiles were noisy too. I had gotten the codebase to a point where the individual functions were fast enough that background activity on my laptop was drowning out any good signal. Was the change I made really 2% faster, or did I just get lucky with a quiet run? I couldn't confidently distinguish real improvements from noise."

Observable in two places

  • End-to-end wall-clock (hyperfine / time). Each full run is a draw from a noisy distribution.
  • Profile self-times. Profiles are themselves sampled; run-to-run profile variance can make the same hot function report noticeably different self-times in adjacent runs.

Structural reduction techniques

  • Warmup runs. First-run costs (disk reads, code loading, JIT warmup, page-cache fills) contaminate early samples; hyperfine --warmup N performs N untimed warmup runs before measurement begins.
  • Many timed runs + statistical reporting. Mean + standard deviation + confidence interval (hyperfine --runs M) rather than single-run numbers.
  • Sandbox benchmarking. Eliminate the ambient-system variance sources (Slack, Spotlight, browser tabs, cron jobs) by running in an ephemeral minimal-dependency container. See patterns/ephemeral-sandbox-benchmark-pair.
  • Avoid cross-host comparison. Different machines have different sustained clock speeds, different thermal behaviour, different neighbour loads; cross-host A/B introduces variance that swamps small real wins.
  • Pin frequency / disable turbo. For very sensitive measurements, explicitly pin CPU frequency to avoid turbo-boost-induced variance across runs.
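The warmup and many-runs items above can be sketched in a single harness. This is an illustrative stand-in for what hyperfine does internally, not its implementation; `workload`, the constants, and the normal-approximation confidence interval are all assumptions:

```rust
use std::time::Instant;

const WARMUP: usize = 3; // analogous to hyperfine --warmup 3
const RUNS: usize = 20;  // analogous to hyperfine --runs 20

/// Stand-in for the binary under test.
fn workload() -> u64 {
    std::hint::black_box((0..1_000_000u64).sum())
}

/// Mean, standard deviation, and rough 95 % confidence-interval half-width
/// (normal approximation) for a set of timing samples.
fn summarize(samples: &[f64]) -> (f64, f64, f64) {
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    let var = samples.iter().map(|s| (s - mean).powi(2)).sum::<f64>() / (n - 1.0);
    let sd = var.sqrt();
    (mean, sd, 1.96 * sd / n.sqrt())
}

fn main() {
    // Discard warmup runs: they pay disk-read, code-load, and cache-fill costs.
    for _ in 0..WARMUP {
        workload();
    }

    let mut samples = Vec::with_capacity(RUNS);
    for _ in 0..RUNS {
        let t = Instant::now();
        workload();
        samples.push(t.elapsed().as_secs_f64() * 1e3); // milliseconds
    }

    let (mean, sd, ci) = summarize(&samples);
    println!("{mean:.3} ms ± {sd:.3} ms (95 % CI ±{ci:.3} ms, n = {RUNS})");
}
```

Reporting mean ± σ over many runs, rather than a single number, is what makes a later before/after comparison meaningful.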

Canonical variance-reduction datum

The Turborepo post's PR #11984 (stack-allocated OidHash) documents a striking variance-reduction win alongside the absolute-speed win:

Repo size     Before (mean ± σ)    After (mean ± σ)     Variance reduction
~1,000 pkg    1.463 s ± 52 ms      1.466 s ± 27 ms      48 %
~125 pkg      659 ms ± 145 ms      592 ms ± 63 ms       57 %
6 pkg         97 ms ± 47 ms        75 ms ± 18 ms        61 %

The 1,000-package row is particularly striking: no mean-time improvement (1.463 s ↔ 1.466 s) but 48 % less variance. Moving SHA-1 hashes from heap-allocated String to stack-allocated [u8; 40] removed 10,000+ heap allocations per run; the resulting reduction in allocator pressure made the wall-clock more predictable, even when the mean stayed the same.
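The technique described above can be sketched as follows. The card doesn't show OidHash's actual definition, so `OidHex` and its methods are hypothetical names illustrating the stack-allocation idea, not the real type:

```rust
// Keep the 40-char hex form of a 20-byte SHA-1 digest on the stack
// instead of in a heap-allocated String. `OidHex` is a hypothetical
// stand-in; the real OidHash type in PR #11984 may differ.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct OidHex([u8; 40]);

impl OidHex {
    /// Hex-encode a 20-byte digest with zero heap allocations.
    fn from_digest(digest: &[u8; 20]) -> Self {
        const HEX: &[u8; 16] = b"0123456789abcdef";
        let mut out = [0u8; 40];
        for (i, b) in digest.iter().enumerate() {
            out[2 * i] = HEX[(b >> 4) as usize];
            out[2 * i + 1] = HEX[(b & 0x0f) as usize];
        }
        OidHex(out)
    }

    fn as_str(&self) -> &str {
        // Always valid ASCII hex, so this cannot fail.
        std::str::from_utf8(&self.0).unwrap()
    }
}

fn main() {
    let digest = [0xdeu8; 20];
    // Heap version: one String allocation per id (the pattern the PR removed).
    let heap: String = digest.iter().map(|b| format!("{b:02x}")).collect();
    // Stack version: Copy, hashable, no allocator traffic.
    let stack = OidHex::from_digest(&digest);
    assert_eq!(heap, stack.as_str());
}
```

Because the value is `Copy` and fixed-size, hashing and comparing ids never touches the allocator — which is the mechanism the post credits for the steadier wall-clock.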

The post frames this verbatim: "The most notable improvement across all three sizes was the reduction in run-to-run variance, which agrees with our theory of less allocator pressure and more predictable performance."

Variance reduction is a real performance win in its own right — predictable p99 behaviour matters more than median in production systems — even when the mean doesn't move.
