
CONCEPT

Sandbox benchmarking for signal isolation

Definition

Sandbox benchmarking for signal isolation is the practice of running A/B performance benchmarks inside an ephemeral minimal-dependency container — a sandbox with only the binaries being compared and the benchmark driver — specifically to eliminate the ambient-system noise floor that makes couple-percent wall-clock wins indistinguishable from variance on a developer laptop or shared CI runner.

Why it matters

As a codebase gets faster, the absolute size of real wins shrinks while the absolute size of system noise stays constant. Once functions are in the hundreds-of-microseconds range, ambient noise on a laptop (Slack notifications, Spotlight indexing, cron jobs, browser tabs, antivirus scans) produces run-to-run variance comparable to the real wins. Canonical verbatim framing from Anthony Shew's 2026-04-21 Turborepo post:

"The problem became measurement. I had been running all benchmarks on my MacBook, and the hyperfine reports were getting increasingly noisy. As the code gets faster, system noise matters more. Syscalls, memory, and disk I/O all have their variance. The profiles were noisy too. I had gotten the codebase to a point where the individual functions were fast enough that background activity on my laptop was drowning out any good signal. Was the change I made really 2% faster, or did I just get lucky with a quiet run? I couldn't confidently distinguish real improvements from noise. I needed a quieter lab for my science."

What the sandbox changes

Ephemeral minimal-dependency sandboxes (e.g. Vercel Sandbox) have:

  • No background daemons (no fsevents, no mds, no Slack helper, no Dropbox, no antivirus).
  • No Slack notifications / other interactive surfaces.
  • No browser tabs / indexers contending for CPU, disk, or memory.
  • Minimal container image — only what was explicitly copied in.
  • Fresh network state — no background DNS resolution, no metrics-agent pushes.

The result is a much lower noise floor; 2% real wins re-emerge as detectable signal above the residual variance.
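To make the noise-floor argument concrete, here is a minimal sketch — not from the post; the sample timings are fabricated for illustration — that applies a standard Welch-style z-check to two sets of wall-clock samples and asks whether the mean difference clears the combined run-to-run variance:

```python
import math
import statistics

def distinguishable(times_a, times_b, z_threshold=2.0):
    """Return (z_score, verdict): is the mean difference between two
    benchmark sample sets larger than their combined run-to-run noise?"""
    mean_a, mean_b = statistics.mean(times_a), statistics.mean(times_b)
    # Standard error of the difference of the two means, Welch-style.
    se = math.sqrt(
        statistics.variance(times_a) / len(times_a)
        + statistics.variance(times_b) / len(times_b)
    )
    z = abs(mean_a - mean_b) / se
    return z, z >= z_threshold

# Illustrative numbers: a real ~2% win (≈490 µs vs ≈500 µs mean)...
# ...on a noisy laptop: per-run jitter of tens of µs swamps the ~10 µs gap.
laptop_main   = [500 + n for n in (-40, 25, -10, 55, -30, 20, -15, 35, -25, 10)]
laptop_branch = [490 + n for n in (30, -45, 15, -20, 50, -10, 25, -35, 5, -20)]

# ...in a quiet sandbox: per-run jitter of a few µs, same ~10 µs gap.
sandbox_main   = [500 + n for n in (-3, 2, -1, 4, -2, 1, -4, 3, -1, 2)]
sandbox_branch = [490 + n for n in (2, -3, 1, -2, 4, -1, 3, -4, 1, -2)]

print(distinguishable(laptop_main, laptop_branch))   # low z: noise hides the win
print(distinguishable(sandbox_main, sandbox_branch)) # high z: the win is clear signal
```

The same 10 µs gap is invisible in the first comparison and unambiguous in the second — the only thing that changed is the noise floor.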

Critical caveat: within-sandbox A/B only

Ephemeral sandboxes typically don't guarantee dedicated hardware — they share physical hosts with other tenants. Canonical verbatim caveat from the post:

"Vercel Sandboxes don't guarantee dedicated hardware today. Comparing reports from different Sandbox instances might not be useful. All comparisons should come from a single instance where both binaries run under identical conditions."

So the isolation guarantee is within a sandbox instance: both binaries run on the same physical host at the same time, eliminating cross-host, cross-region, and noisy-neighbour variance as confounds for that specific A/B comparison. Cross-sandbox comparisons re-introduce the noisy-neighbour confound.

Canonical workflow (hyperfine + sandbox pair)

The post's full gist:

  1. Cross-compile both main and branch binaries locally (e.g. zig cc -target x86_64-linux-gnu + cargo build --release).
  2. Create a snapshot sandbox from a prebuilt image.
  3. Copy both binaries into the sandbox.
  4. Run hyperfine inside the sandbox: hyperfine --warmup 2 --runs 15 'turbo-main run build --dry' 'turbo-branch run build --dry'.
  5. Optionally also run turbo-main --profile and turbo-branch --profile to collect profiles.
  6. Copy reports + profiles back to laptop for the agent / human to analyse.
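Step 6 hands the exported report to an agent or human. A minimal sketch of that analysis step, assuming the report was produced with hyperfine's --export-json flag (the inline sample below is fabricated; only the "results", "command", "mean", and "stddev" fields of hyperfine's export shape are relied on):

```python
import json

def summarise(report: dict) -> str:
    """Summarise a hyperfine --export-json report: per-command mean ± stddev,
    plus the relative win of the fastest command over the slowest."""
    results = sorted(report["results"], key=lambda r: r["mean"])
    lines = [
        f"{r['command']}: {r['mean'] * 1000:.2f} ms ± {r['stddev'] * 1000:.2f} ms"
        for r in results
    ]
    fastest, slowest = results[0], results[-1]
    win = (slowest["mean"] - fastest["mean"]) / slowest["mean"] * 100
    lines.append(f"fastest: {fastest['command']} ({win:.1f}% faster)")
    return "\n".join(lines)

# Fabricated sample in hyperfine's export shape (times in seconds).
sample = {"results": [
    {"command": "turbo-main run build --dry",   "mean": 0.512, "stddev": 0.004},
    {"command": "turbo-branch run build --dry", "mean": 0.498, "stddev": 0.003},
]}
print(summarise(sample))
```

In practice the same script would also compare the mean gap against the reported stddevs before trusting a small percentage win.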

See patterns/ephemeral-sandbox-benchmark-pair for the canonical pattern.

When it matters

  • Code has been optimised to the point where laptop noise is larger than real wins (the inflection point Shew hit).
  • Regression detection — catching 2-3% regressions in CI requires much tighter signal than feature-development benchmarks.
  • A/B choice between near-equivalent implementations — distinguishing "this one is 3% faster" from "this one was a lucky run" requires reliable measurement.
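The run count needed to separate a real win from a lucky run scales with the square of the noise floor. A back-of-the-envelope sketch (standard sampling arithmetic, not from the post; all numbers illustrative) under the same Welch-style, equal-run-count assumptions as hyperfine-style A/B runs:

```python
import math

def runs_needed(mean_s, win_fraction, noise_stddev_s, z=2.0):
    """Closed-form estimate of per-command runs needed before a real
    relative win clears the noise at ~z standard errors (equal run
    counts, equal per-run stddev for both commands)."""
    delta = mean_s * win_fraction  # absolute size of the real win
    # se of the difference of two means over n runs each: sqrt(2) * sigma / sqrt(n)
    return math.ceil((z * math.sqrt(2) * noise_stddev_s / delta) ** 2)

# A 3% win on a 500 µs operation (a 15 µs gap):
print(runs_needed(500e-6, 0.03, 30e-6))  # noisy laptop (σ ≈ 30 µs): tens of runs
print(runs_needed(500e-6, 0.03, 3e-6))   # quiet sandbox (σ ≈ 3 µs): a single run pair
```

Cutting the noise stddev by 10x cuts the required run count by roughly 100x, which is why a quieter lab beats simply adding more --runs on a noisy laptop.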

When it doesn't

  • Large absolute wins (>20%), whether from initial optimisation or an algorithmic improvement, are detectable even with a high noise floor.
  • Single-machine benchmarking for coarse-grained hypothesis testing is fine — engineers don't need sandbox clean-rooms to identify obvious algorithmic wins.
  • Benchmarks that need a production-representative workload shape — there, workload fidelity may matter more than pure noise-floor minimisation.
