CONCEPT
Sandbox benchmarking for signal isolation¶
Definition¶
Sandbox benchmarking for signal isolation is the practice of running A/B performance benchmarks inside an ephemeral minimal-dependency container — a sandbox with only the binaries being compared and the benchmark driver — specifically to eliminate the ambient-system noise floor that makes couple-percent wall-clock wins indistinguishable from variance on a developer laptop or shared CI runner.
Why it matters¶
As a codebase gets faster, the absolute size of real wins shrinks while the absolute size of system noise stays constant. Once functions are in the hundreds-of-microseconds range, ambient noise on a laptop (Slack notifications, Spotlight indexing, cron jobs, browser tabs, antivirus scans) produces run-to-run variance comparable to real wins. Canonical verbatim framing from Anthony Shew's 2026-04-21 Turborepo post:
"The problem became measurement. I had been running all benchmarks on my MacBook, and the `hyperfine` reports were getting increasingly noisy. As the code gets faster, system noise matters more. Syscalls, memory, and disk I/O all have their variance. The profiles were noisy too. I had gotten the codebase to a point where the individual functions were fast enough that background activity on my laptop was drowning out any good signal. Was the change I made really 2% faster, or did I just get lucky with a quiet run? I couldn't confidently distinguish real improvements from noise. I needed a quieter lab for my science."
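The "real win or lucky run" question is a two-sample significance test. A minimal sketch of a Welch's t-statistic at laptop-level versus sandbox-level noise (all means, stddevs, and noise figures here are invented for illustration, not taken from the post):

```python
import math

def welch_t(mean_a: float, sd_a: float, mean_b: float, sd_b: float, n: int) -> float:
    """Welch's t-statistic for two benchmark samples of n runs each."""
    se = math.sqrt(sd_a ** 2 / n + sd_b ** 2 / n)  # standard error of the difference
    return abs(mean_a - mean_b) / se

RUNS = 15  # matches hyperfine --runs 15

# Hypothetical 2% win: 100 ms vs 98 ms mean wall-clock time.
# Laptop: ~5 ms run-to-run stddev; sandbox: ~0.5 ms.
t_laptop = welch_t(100.0, 5.0, 98.0, 5.0, RUNS)   # ~1.1: within noise
t_sandbox = welch_t(100.0, 0.5, 98.0, 0.5, RUNS)  # ~11: unambiguous

print(f"laptop t = {t_laptop:.2f}, sandbox t = {t_sandbox:.2f}")
```

With a t-statistic near 1, the 2% difference is indistinguishable from variance; above roughly 2 (for ~28 degrees of freedom) it is a real effect. Lowering the noise floor by 10x is what moves the same 2% win from the first regime to the second.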
What the sandbox changes¶
Ephemeral minimal-dependency sandboxes (e.g. Vercel Sandbox) have:
- No background daemons (no `fsevents`, no `mds`, no Slack helper, no Dropbox, no antivirus).
- No Slack notifications / other interactive surfaces.
- No browser tabs / indexers contending for CPU, disk, or memory.
- Minimal container image — only what was explicitly copied in.
- Fresh network state — no background DNS resolution, no metrics-agent pushes.
The result is a much lower noise floor; 2% real wins re-emerge as detectable signal above the residual variance.
Critical caveat: within-sandbox A/B only¶
Ephemeral sandboxes typically don't guarantee dedicated hardware — they share physical hosts with other tenants. Canonical verbatim caveat from the post:
"Vercel Sandboxes don't guarantee dedicated hardware today. Comparing reports from different Sandbox instances might not be useful. All comparisons should come from a single instance where both binaries run under identical conditions."
So the isolation guarantee is within a sandbox instance: both binaries run on the same physical host at the same time, eliminating cross-host variance, cross-region variance, and noisy-neighbour variance as confounds for that specific A/B comparison. Cross-sandbox comparisons re-introduce noisy-neighbour confounds.
Canonical workflow (hyperfine + sandbox pair)¶
The post's full gist:
- Cross-compile both `main` and `branch` binaries locally (e.g. `zig cc -target x86_64-linux-gnu` + `cargo build --release`).
- Create a snapshot sandbox from a prebuilt image.
- Copy both binaries into the sandbox.
- Run hyperfine inside the sandbox: `hyperfine --warmup 2 --runs 15 'turbo-main run build --dry' 'turbo-branch run build --dry'`.
- Optionally also run `turbo-main --profile` and `turbo-branch --profile` to collect profiles.
- Copy reports + profiles back to the laptop for the agent / human to analyse.
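hyperfine can also emit its report as JSON (`--export-json`), which makes the copy-back-and-analyse step scriptable. A minimal sketch of the analysis side, using a synthetic report standing in for one produced inside a single sandbox instance (the timing numbers are invented; the `results`/`command`/`mean`/`stddev` fields follow hyperfine's JSON schema):

```python
import json

# Synthetic stand-in for a `hyperfine --export-json report.json ...`
# report produced inside one sandbox instance (times in seconds, invented).
report = json.loads("""
{
  "results": [
    {"command": "turbo-main run build --dry",   "mean": 0.1000, "stddev": 0.0005},
    {"command": "turbo-branch run build --dry", "mean": 0.0980, "stddev": 0.0005}
  ]
}
""")

main, branch = report["results"]
speedup = main["mean"] / branch["mean"]
print(f"{branch['command']} is {speedup:.3f}x faster "
      f"({(speedup - 1) * 100:.1f}% win, stddev {branch['stddev'] * 1e3:.1f} ms)")
```

Because both commands come from the same hyperfine invocation in the same instance, the comparison respects the within-sandbox-only caveat above.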
See patterns/ephemeral-sandbox-benchmark-pair for the canonical pattern.
When it matters¶
- Code has been optimised to the point where laptop noise is larger than real wins (the inflection point Shew hit).
- Regression detection — catching 2-3% regressions in CI requires much tighter signal than feature-development benchmarks.
- A/B choice between near-equivalent implementations — distinguishing "this one is 3 % faster" from "this one was a lucky run" requires reliable measurement.
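A rough rule of thumb for where this inflection point sits: a difference is only resolvable once it exceeds roughly twice the standard error of the difference between the two sample means. A sketch (the two-sigma criterion and both noise figures are illustrative assumptions, not numbers from the post):

```python
import math

def min_detectable_win_pct(rel_stddev_pct: float, runs: int) -> float:
    """Smallest relative win (%) resolvable at ~2x the standard error
    of the difference between two means of `runs` samples each."""
    return 2.0 * math.sqrt(2.0 / runs) * rel_stddev_pct

# At 15 runs (the post's hyperfine setting):
laptop = min_detectable_win_pct(5.0, 15)    # assume ~5% run-to-run noise
sandbox = min_detectable_win_pct(0.5, 15)   # assume ~0.5% noise
print(f"laptop floor ~ {laptop:.2f}%, sandbox floor ~ {sandbox:.2f}%")
```

Under these assumptions a 2-3% win sits below the ~3.7% laptop floor but an order of magnitude above the ~0.37% sandbox floor, which is exactly the regime the concept targets.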
When it doesn't¶
- Large absolute wins (>20%) from initial optimisation or algorithmic improvement are detectable even with a high noise floor.
- Single-machine benchmarking for coarse-grained hypothesis testing is fine — engineers don't need sandbox clean-rooms to identify obvious algorithmic wins.
- Benchmarks that need a production-representative workload shape, where workload fidelity matters more than pure noise-floor minimisation.
Seen in¶
- sources/2026-04-21-vercel-making-turborepo-96-faster-with-agents-sandboxes-and-humans — canonical instance; definitional source for the concept; Vercel Sandbox as the clean-signal substrate; enabling concept for the low-level wins that were invisible on laptop (stack-allocated OIDs, syscall elimination, move-instead-of-clone — all with 2-20 % wins dominated by noise on laptop but clear in sandbox).
Related¶
- systems/vercel-sandbox — canonical substrate; this post adds the benchmarking-substrate altitude to the prior agent-execution substrate framing from the Knowledge Agent Template ingest.
- systems/hyperfine — the benchmark driver that runs inside the sandbox.
- concepts/run-to-run-variance — the phenomenon sandbox benchmarking mitigates.
- concepts/benchmark-methodology-bias — adjacent concept covering the broader class of benchmarking methodological failures; sandbox isolation is one specific technique.
- patterns/ephemeral-sandbox-benchmark-pair — the canonical pattern.
- patterns/measurement-driven-micro-optimization — parent pattern class.