
CONCEPT

Sandbox benchmarking for signal isolation

Definition

Sandbox benchmarking for signal isolation is the practice of running A/B performance benchmarks inside an ephemeral minimal-dependency container — a sandbox with only the binaries being compared and the benchmark driver — specifically to eliminate the ambient-system noise floor that makes couple-percent wall-clock wins indistinguishable from variance on a developer laptop or shared CI runner.

Why it matters

As a codebase gets faster, the absolute size of real wins shrinks while the absolute size of system noise stays constant. Once functions are in the hundreds-of-microseconds range, ambient noise on a laptop (Slack notifications, Spotlight indexing, cron jobs, browser tabs, antivirus scans) produces run-to-run variance comparable to the real wins. Canonical verbatim framing from Anthony Shew's 2026-04-21 Turborepo post:

"The problem became measurement. I had been running all benchmarks on my MacBook, and the hyperfine reports were getting increasingly noisy. As the code gets faster, system noise matters more. Syscalls, memory, and disk I/O all have their variance. The profiles were noisy too. I had gotten the codebase to a point where the individual functions were fast enough that background activity on my laptop was drowning out any good signal. Was the change I made really 2% faster, or did I just get lucky with a quiet run? I couldn't confidently distinguish real improvements from noise. I needed a quieter lab for my science."

What the sandbox changes

Ephemeral minimal-dependency sandboxes (e.g. Vercel Sandbox) have:

  • No background daemons (no fsevents, no mds, no Slack helper, no Dropbox, no antivirus).
  • No Slack notifications / other interactive surfaces.
  • No browser tabs / indexers contending for CPU, disk, or memory.
  • Minimal container image — only what was explicitly copied in.
  • Fresh network state — no background DNS resolution, no metrics-agent pushes.

The result is a much lower noise floor; 2% real wins re-emerge as detectable signal above the residual variance.
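To make the noise-floor argument concrete, here is a minimal sketch — not from the post; the sample timings are fabricated for illustration — that applies a standard Welch-style z-check to two sets of wall-clock samples and asks whether the mean difference clears the combined run-to-run variance:

```python
import math
import statistics

def distinguishable(times_a, times_b, z_threshold=2.0):
    """Return (z_score, verdict): is the mean difference between two
    benchmark sample sets larger than their combined run-to-run noise?"""
    mean_a, mean_b = statistics.mean(times_a), statistics.mean(times_b)
    # Standard error of the difference of the two means, Welch-style.
    se = math.sqrt(
        statistics.variance(times_a) / len(times_a)
        + statistics.variance(times_b) / len(times_b)
    )
    z = abs(mean_a - mean_b) / se
    return z, z >= z_threshold

# Illustrative numbers: a real ~2% win (≈490 µs vs ≈500 µs mean)...
# ...on a noisy laptop: per-run jitter of tens of µs swamps the ~10 µs gap.
laptop_main   = [500 + n for n in (-40, 25, -10, 55, -30, 20, -15, 35, -25, 10)]
laptop_branch = [490 + n for n in (30, -45, 15, -20, 50, -10, 25, -35, 5, -20)]

# ...in a quiet sandbox: per-run jitter of a few µs, same ~10 µs gap.
sandbox_main   = [500 + n for n in (-3, 2, -1, 4, -2, 1, -4, 3, -1, 2)]
sandbox_branch = [490 + n for n in (2, -3, 1, -2, 4, -1, 3, -4, 1, -2)]

print(distinguishable(laptop_main, laptop_branch))   # low z: noise hides the win
print(distinguishable(sandbox_main, sandbox_branch)) # high z: the win is clear signal
```

The same 10 µs gap is invisible in the first comparison and unambiguous in the second — the only thing that changed is the noise floor.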

Critical caveat: within-sandbox A/B only

Ephemeral sandboxes typically don't guarantee dedicated hardware — they share physical hosts with other tenants. Canonical verbatim caveat from the post:

"Vercel Sandboxes don't guarantee dedicated hardware today. Comparing reports from different Sandbox instances might not be useful. All comparisons should come from a single instance where both binaries run under identical conditions."

So the isolation guarantee is within a sandbox instance: both binaries run on the same physical host at the same time, eliminating cross-host, cross-region, and noisy-neighbour variance as confounds for that specific A/B comparison. Cross-sandbox comparisons re-introduce the noisy-neighbour confound.

Canonical workflow (hyperfine + sandbox pair)

The post's full gist:

  1. Cross-compile both main and branch binaries locally (e.g. zig cc -target x86_64-linux-gnu + cargo build --release).
  2. Create a snapshot sandbox from a prebuilt image.
  3. Copy both binaries into the sandbox.
  4. Run hyperfine inside the sandbox: hyperfine --warmup 2 --runs 15 'turbo-main run build --dry' 'turbo-branch run build --dry'.
  5. Optionally also run turbo-main --profile and turbo-branch --profile to collect profiles.
  6. Copy reports + profiles back to laptop for the agent / human to analyse.
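Step 6 hands the exported report to an agent or human. A minimal sketch of that analysis step, assuming the report was produced with hyperfine's --export-json flag (the inline sample below is fabricated; only the "results", "command", "mean", and "stddev" fields of hyperfine's export shape are relied on):

```python
import json

def summarise(report: dict) -> str:
    """Summarise a hyperfine --export-json report: per-command mean ± stddev,
    plus the relative win of the fastest command over the slowest."""
    results = sorted(report["results"], key=lambda r: r["mean"])
    lines = [
        f"{r['command']}: {r['mean'] * 1000:.2f} ms ± {r['stddev'] * 1000:.2f} ms"
        for r in results
    ]
    fastest, slowest = results[0], results[-1]
    win = (slowest["mean"] - fastest["mean"]) / slowest["mean"] * 100
    lines.append(f"fastest: {fastest['command']} ({win:.1f}% faster)")
    return "\n".join(lines)

# Fabricated sample in hyperfine's export shape (times in seconds).
sample = {"results": [
    {"command": "turbo-main run build --dry",   "mean": 0.512, "stddev": 0.004},
    {"command": "turbo-branch run build --dry", "mean": 0.498, "stddev": 0.003},
]}
print(summarise(sample))
```

In practice the same script would also compare the mean gap against the reported stddevs before trusting a small percentage win.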

See patterns/ephemeral-sandbox-benchmark-pair for the canonical pattern.

When it matters

  • Code has been optimised to the point where laptop noise is larger than real wins (the inflection point Shew hit).
  • Regression detection — catching 2-3% regressions in CI requires much tighter signal than feature-development benchmarks.
  • A/B choice between near-equivalent implementations — distinguishing "this one is 3% faster" from "this one was a lucky run" requires reliable measurement.
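The run count needed to separate a real win from a lucky run scales with the square of the noise floor. A back-of-the-envelope sketch (standard sampling arithmetic, not from the post; all numbers illustrative) under the same Welch-style, equal-run-count assumptions as hyperfine-style A/B runs:

```python
import math

def runs_needed(mean_s, win_fraction, noise_stddev_s, z=2.0):
    """Closed-form estimate of per-command runs needed before a real
    relative win clears the noise at ~z standard errors (equal run
    counts, equal per-run stddev for both commands)."""
    delta = mean_s * win_fraction  # absolute size of the real win
    # se of the difference of two means over n runs each: sqrt(2) * sigma / sqrt(n)
    return math.ceil((z * math.sqrt(2) * noise_stddev_s / delta) ** 2)

# A 3% win on a 500 µs operation (a 15 µs gap):
print(runs_needed(500e-6, 0.03, 30e-6))  # noisy laptop (σ ≈ 30 µs): tens of runs
print(runs_needed(500e-6, 0.03, 3e-6))   # quiet sandbox (σ ≈ 3 µs): a single run pair
```

Cutting the noise stddev by 10x cuts the required run count by roughly 100x, which is why a quieter lab beats simply adding more --runs on a noisy laptop.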

When it doesn't

  • Large absolute wins (>20%), whether from initial optimisation or an algorithmic improvement, are detectable even with a high noise floor.
  • Single-machine benchmarking for coarse-grained hypothesis testing is fine — engineers don't need sandbox clean-rooms to identify obvious algorithmic wins.
  • Benchmarks that need a production-representative workload shape — there, workload fidelity may matter more than pure noise-floor minimisation.
