# Ephemeral sandbox benchmark pair
## Problem
Developer laptops and shared CI runners have unpredictable ambient load — Slack notifications, Spotlight indexing, cron jobs, browser tabs, antivirus scans, metrics agents, other services. Once the code under test is fast enough that individual functions run in the hundreds-of-microseconds range, the noise floor from ambient load becomes comparable to real performance wins, and benchmarks can no longer reliably distinguish a couple-percent improvement from a lucky run. See concepts/run-to-run-variance + concepts/sandbox-benchmarking-for-signal-isolation for the structural analysis.
## The pattern
Cross-compile both the `main` and the branch binaries locally; create a fresh, ephemeral, minimal-dependency sandbox; copy both binaries in; run the benchmark pair under hyperfine inside the sandbox; collect reports; destroy the sandbox.
The single-sandbox-instance invariant is load-bearing: because ephemeral sandbox instances typically don't guarantee dedicated hardware, different instances may land on different physical hosts with different noisy-neighbour loads. Running both binaries in one sandbox ensures the A/B comparison happens on identical hardware at roughly the same time.
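The "destroy the sandbox" step can be made unconditional with a shell `trap`, so a failed benchmark run doesn't leak a long-lived instance whose drifting ambient load would undermine later comparisons. A minimal sketch of the lifecycle; `sandbox` is stubbed out as a shell function here so the sketch runs anywhere (the real CLI is provider-specific), and the instance id is hypothetical:

```shell
# Create/run/destroy lifecycle with guaranteed teardown.
# `sandbox` is a stand-in stub, not a real CLI; substitute the provider's tool.
sandbox() { echo "sandbox $*"; }

set -u
sid="bench-$$"                       # hypothetical instance id
sandbox create --snapshot turbo-bench-snapshot
trap 'sandbox destroy "$sid"' EXIT   # fires even if the benchmark step fails
sandbox exec -- hyperfine 'turbo-main run build --dry' 'turbo-branch run build --dry'
```

The `trap … EXIT` is the point: teardown runs whether the benchmark succeeds, fails, or the script is interrupted, which keeps every measurement on a fresh, short-lived sandbox.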
## Canonical workflow
From Anthony Shew's 2026-04-21 Turborepo post (full gist):
```bash
# Cross-compile for the sandbox's Linux target
zig cc -target x86_64-linux-gnu ...
cargo build --release --target x86_64-unknown-linux-gnu

# Create ephemeral sandbox
sandbox create --snapshot turbo-bench-snapshot

# Copy both binaries into the same sandbox
sandbox cp ./target/release/turbo-main sandbox:/usr/local/bin/turbo-main
sandbox cp ./target/release/turbo-branch sandbox:/usr/local/bin/turbo-branch

# hyperfine A/B inside the sandbox
sandbox exec -- hyperfine \
  --warmup 2 --runs 15 \
  'turbo-main run build --dry' \
  'turbo-branch run build --dry'

# Also collect profiles if needed
sandbox exec -- turbo-main run build --profile=main-profile
sandbox exec -- turbo-branch run build --profile=branch-profile

# Pull reports + profiles back for local analysis
sandbox cp sandbox:/reports/ ./local-reports/
```
The agent / engineer then inspects hyperfine reports + Markdown profiles locally — the sandbox has done its job (clean A/B measurement) and can be torn down.
## Properties
- Noise-floor minimisation. Only the binaries you copied in are running — no background daemons competing for CPU, disk, memory, or network.
- Deterministic comparison within a sandbox. Both binaries experience the same ambient load (whatever residual noise exists on that host); differences in measurement reflect differences in the binaries.
- No cross-sandbox comparisons. Canonical caveat verbatim: "Vercel Sandboxes don't guarantee dedicated hardware today. Comparing reports from different Sandbox instances might not be useful. All comparisons should come from a single instance where both binaries run under identical conditions."
- Cross-compiled locally, benchmarked remotely. The local dev environment still owns the code and build; the remote sandbox is purely a measurement substrate.
## Why hyperfine specifically
hyperfine composes naturally with this pattern because:
- `--warmup N` discards cold-start effects (disk reads, code loading, memory layout).
- `--runs M` gives enough timed samples for statistical reporting (mean ± σ, confidence intervals).
- A/B syntax (`hyperfine 'cmd-a' 'cmd-b'`) produces the two-binary comparison directly with relative reporting.
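hyperfine can also emit machine-readable reports (`--export-json`, `--export-markdown`), which is what makes the "pull reports back for local analysis" step scriptable. A sketch of extracting the two per-command means from an exported report; the JSON below is a made-up stand-in for a real export, not output from the Turborepo runs:

```shell
# Made-up stand-in for a hyperfine --export-json report (times in seconds).
cat > ab.json <<'EOF'
{"results": [
  {"command": "turbo-main run build --dry", "mean": 0.4123, "stddev": 0.0061},
  {"command": "turbo-branch run build --dry", "mean": 0.4018, "stddev": 0.0057}
]}
EOF

# Pull the per-command means out with awk (jq works too, if available).
awk -F'"mean": ' '/"mean"/ { split($2, a, ","); print a[1] }' ab.json
# → 0.4123
# → 0.4018
```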
## Composition with the supervised agent loop
The pattern is the validation gate in Plan-Mode-then-implement:
1. Agent profiles in the sandbox → emits Markdown profile.
2. Plan-Mode agent proposes optimisations from the profile.
3. Human approves a proposal.
4. Agent implements the change on a branch.
5. Sandbox hyperfine A/B of `main` vs `branch` binaries validates end-to-end wall-clock.
6. If real win → PR merge.
Step 5 is this pattern's canonical role.
## What sandbox benchmarking enabled in the Turborepo campaign
The post's low-level wins that were invisible on laptop but clear in sandbox:
- PR #11984 — Stack-allocated git OIDs. `new_from_gix_index` self-time dropped 15 %; run-to-run variance dropped 48 % / 57 % / 61 % across three repo sizes.
- PR #11985 — Syscall elimination. `fetch` self-time dropped 35 % (200.5 ms → 129.6 ms over 962 cache fetches) by removing a legacy `.tar` probe.
- PR #11986 — Move instead of clone. A zero-cost per-task `HashMap::remove()` replaced a deep clone of ~1,700 per-task maps.
Each of these moves end-to-end wall-clock time by only a couple of percent; without sandbox-level signal isolation the wins would have been indistinguishable from noise.
## Anti-patterns
- Cross-sandbox A/B. Comparing reports from two different sandbox instances re-introduces cross-host variance — worse than laptop A/B for many workloads.
- Running both binaries at different times in the same sandbox if ambient load varies. If the sandbox is long-lived and shared, ambient load inside the sandbox itself can drift; hyperfine's warmup + many-runs discipline mitigates but doesn't eliminate this. Fresh short-lived sandboxes are safer.
- Ignoring statistical significance. hyperfine's output gives mean ± σ; a 2 % faster binary with overlapping σ ranges is not a real win. Discipline is the engineer's, not the sandbox's.
- Using sandboxes before the noise floor is the problem. If laptop A/B shows a 30 % win, sandbox adds no value; the complexity is only worth it when real wins have gotten small relative to noise.
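The significance discipline above can be mechanised. A rough sketch using a deliberately conservative σ-overlap rule on numbers copied by hand from hyperfine's summary; the values are invented for illustration, and hyperfine itself prints a proper relative comparison that should be preferred:

```shell
# Crude noise-floor check: treat a delta as real only if it exceeds the
# combined sigmas. All numbers are made up for illustration (ms).
mean_a=412.3; sigma_a=6.1    # turbo-main
mean_b=401.8; sigma_b=5.7    # turbo-branch
awk -v ma="$mean_a" -v sa="$sigma_a" -v mb="$mean_b" -v sb="$sigma_b" 'BEGIN {
  delta  = ma - mb            # positive => branch is faster
  spread = sa + sb            # sigma ranges overlap if |delta| <= spread
  printf "delta=%.1fms spread=%.1fms => %s\n", delta, spread,
         (delta > spread ? "likely real win" : "within noise")
}'
# → delta=10.5ms spread=11.8ms => within noise
```

Note how a ~2.5 % improvement fails the check here: exactly the regime where the sandbox's lower noise floor (smaller σ) is what turns the same delta into a detectable win.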
## Alternatives
- Dedicated benchmarking hardware. Bare-metal machines with tuned BIOS settings, pinned CPU frequencies, no ambient services. Higher-fidelity but higher operational cost.
- Pinned-CPU bare-metal benchmarking frameworks (Netflix's Java Vector-API work used HotSpot-aware bare-metal benchmarks). Appropriate for library-level micro-optimisation; overkill for end-to-end CLI performance.
- Stats-heavy benchmarking on noisy laptop. More runs + more statistical rigour can compensate for noise but at rapidly diminishing returns; past a certain noise level no number of runs is enough.
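The diminishing-returns claim can be made concrete with the standard error of the mean, which shrinks only as σ/√n: halving the effect size you want to resolve quadruples the runs needed. A back-of-envelope sketch with invented noise levels:

```shell
# Runs needed so that 2*(sigma/sqrt(n)) <= win, i.e. n >= 4*(sigma/win)^2.
# win = effect size to resolve; sigma = noise floor; both as % of the mean.
# All numbers are illustrative, not measured.
awk 'BEGIN {
  win = 2.0
  split("1 2 4 8 16", noise, " ")
  for (i = 1; i <= 5; i++) {
    n = 4 * (noise[i] / win) ^ 2
    printf "sigma=%2d%% -> runs >= %d\n", noise[i], n
  }
}'
```

At a 16 % noise floor, resolving a 2 % win needs hundreds of runs per binary; cutting σ (the sandbox's job) is far cheaper than raising n.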
## Seen in
- Making Turborepo 96 % faster (Vercel, 2026-04-21) — canonical wiki instance; definitional source for this pattern; workflow gist linked directly from the post; multiple concrete PRs (#11984, #11985, #11986) that were only detectable with sandbox signal isolation.
## Related
- concepts/sandbox-benchmarking-for-signal-isolation — parent concept; structural analysis of why this pattern works.
- concepts/run-to-run-variance — the measurement phenomenon this pattern minimises.
- systems/vercel-sandbox — canonical substrate.
- systems/hyperfine — canonical benchmark driver.
- patterns/plan-mode-then-implement-agent-loop — the supervised-agent loop where this pattern is the validation gate.
- patterns/measurement-driven-micro-optimization — parent pattern class; end-to-end validation discipline canonicalised here at the agent-assisted altitude.