PATTERN

Ephemeral sandbox benchmark pair

Problem

Developer laptops and shared CI runners have unpredictable ambient load — Slack notifications, Spotlight indexing, cron jobs, browser tabs, antivirus scans, metrics agents, other services. Once the code under test is fast enough that individual functions run in the hundreds-of-microseconds range, the noise floor from ambient load becomes comparable to real performance wins, and benchmarks can no longer reliably distinguish a couple-percent improvement from a lucky run. See concepts/run-to-run-variance + concepts/sandbox-benchmarking-for-signal-isolation for the structural analysis.

The pattern

Cross-compile both the main and the branch binaries locally; create a fresh ephemeral minimal-dependency sandbox; copy both binaries in; run the benchmark pair under hyperfine inside the sandbox; collect reports; destroy the sandbox.

The single-sandbox-instance invariant is load-bearing: because ephemeral sandbox instances typically don't guarantee dedicated hardware, different instances may land on different physical hosts with different noisy-neighbour loads. Running both binaries in one sandbox ensures the A/B comparison happens on identical hardware at roughly the same time.

Canonical workflow

From Anthony Shew's 2026-04-21 Turborepo post (full gist):

# Cross-compile for the sandbox's Linux target
zig cc -target x86_64-linux-gnu ...
cargo build --release --target x86_64-unknown-linux-gnu

# Create ephemeral sandbox
sandbox create --snapshot turbo-bench-snapshot

# Copy both binaries into the same sandbox
sandbox cp ./target/release/turbo-main   sandbox:/usr/local/bin/turbo-main
sandbox cp ./target/release/turbo-branch sandbox:/usr/local/bin/turbo-branch

# hyperfine A/B inside the sandbox
sandbox exec -- hyperfine \
  --warmup 2 --runs 15 \
  'turbo-main run build --dry' \
  'turbo-branch run build --dry'

# Also collect profiles if needed
sandbox exec -- turbo-main   run build --profile=main-profile
sandbox exec -- turbo-branch run build --profile=branch-profile

# Pull reports + profiles back for local analysis
sandbox cp sandbox:/reports/ ./local-reports/

The agent / engineer then inspects hyperfine reports + Markdown profiles locally — the sandbox has done its job (clean A/B measurement) and can be torn down.
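The teardown step can be made failure-proof with a shell trap, so a crashed benchmark run never leaks a sandbox. A minimal sketch — `create_sandbox` and `destroy_sandbox` are hypothetical stand-ins for whatever sandbox CLI is in use, stubbed here so the control flow is visible:

```shell
# Stubs standing in for the real sandbox CLI (hypothetical names).
create_sandbox()  { echo "sb-1"; }
destroy_sandbox() { echo "destroyed $1"; }

run_bench_pair() {
  sb=$(create_sandbox)
  # The EXIT trap fires on every exit path from the enclosing subshell,
  # so the sandbox is destroyed even if a benchmark step fails.
  trap 'destroy_sandbox "$sb"' EXIT
  "$@"   # e.g. the hyperfine A/B command
}

( run_bench_pair true )           # prints: destroyed sb-1
( run_bench_pair false ) || true  # still prints: destroyed sb-1
```

Running each pair in a subshell keeps the trap scoped to that benchmark, so one wrapper can drive several fresh sandboxes in sequence.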

Properties

  • Noise-floor minimisation. Only the binaries you copied in are running — no background daemons competing for CPU, disk, memory, or network.
  • Deterministic comparison within a sandbox. Both binaries experience the same ambient load (whatever residual noise exists on that host), so measured differences reflect differences in the binaries.
  • No cross-sandbox comparisons. Canonical caveat verbatim: "Vercel Sandboxes don't guarantee dedicated hardware today. Comparing reports from different Sandbox instances might not be useful. All comparisons should come from a single instance where both binaries run under identical conditions."
  • Cross-compiled-locally, benchmarked-remotely. The local dev env still owns the code and build; the remote sandbox is purely a measurement substrate.

Why hyperfine specifically

hyperfine composes naturally with this pattern because:

  • --warmup N discards cold-start effects (disk reads, code loading, memory layout).
  • --runs M gives enough timed samples for statistical reporting (mean ± σ, confidence intervals).
  • A/B syntax (hyperfine 'cmd-a' 'cmd-b') produces the two-binary comparison directly with relative reporting.
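For the collect-reports step, the same invocation can write machine-readable output; --export-json and --export-markdown are standard hyperfine flags, while the report paths here are illustrative:

```shell
hyperfine \
  --warmup 2 --runs 15 \
  --export-json /reports/ab.json \
  --export-markdown /reports/ab.md \
  'turbo-main run build --dry' \
  'turbo-branch run build --dry'
```

The JSON export carries per-command mean, stddev, and raw run times, which is what makes the significance check below the Anti-patterns section mechanical rather than eyeballed.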

Composition with the supervised agent loop

The pattern is the validation gate in Plan-Mode-then-implement:

  1. Agent profiles in the sandbox → emits Markdown profile.
  2. Plan-Mode agent proposes optimisations from the profile.
  3. Human approves a proposal.
  4. Agent implements the change on a branch.
  5. Sandbox hyperfine A/B of main vs branch binaries validates end-to-end wall-clock.
  6. If real win → PR merge.

Step 5 is this pattern's canonical role.

What sandbox benchmarking enabled in the Turborepo campaign

The post's low-level wins, invisible on a laptop but clear in the sandbox:

  • PR #11984 — Stack-allocated git OIDs. new_from_gix_index self-time dropped 15 %; run-to-run variance dropped 48 % / 57 % / 61 % across three repo sizes.
  • PR #11985 — Syscall elimination. fetch self-time dropped 35 % (200.5 ms → 129.6 ms over 962 cache fetches) by removing a legacy .tar probe.
  • PR #11986 — Move instead of clone. A near-free per-task HashMap::remove() replaced a deep clone of ~1,700 per-task maps.

Each of these moves end-to-end wall-clock time by only a couple of percent; without sandbox-level signal isolation the wins would have been indistinguishable from noise.

Anti-patterns

  • Cross-sandbox A/B. Comparing reports from two different sandbox instances re-introduces cross-host variance — worse than laptop A/B for many workloads.
  • Running the two binaries far apart in time inside one sandbox. If the sandbox is long-lived and shared, ambient load inside the sandbox itself can drift; hyperfine's warmup + many-runs discipline mitigates but doesn't eliminate this. Fresh, short-lived sandboxes are safer.
  • Ignoring statistical significance. hyperfine's output gives mean ± σ; a 2 % faster binary with overlapping σ ranges is not a real win. Discipline is the engineer's, not the sandbox's.
  • Using sandboxes before the noise floor is the problem. If laptop A/B shows a 30 % win, sandbox adds no value; the complexity is only worth it when real wins have gotten small relative to noise.
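The "overlapping σ ranges" discipline above can be mechanised rather than eyeballed. A minimal sketch using awk for the floating-point comparison; the mean/stddev values are made-up stand-ins for numbers read off a hyperfine report:

```shell
# Hypothetical hyperfine results, in seconds (mean and stddev).
main_mean=1.302;   main_sd=0.014
branch_mean=1.271; branch_sd=0.011

# Call it a win only if the branch interval sits wholly below the main one;
# overlapping mean±σ intervals are treated as inconclusive.
awk -v m="$main_mean" -v ms="$main_sd" \
    -v b="$branch_mean" -v bs="$branch_sd" \
    'BEGIN { if (b + bs < m - ms) print "win"; else print "inconclusive" }'
# prints: win   (1.282 < 1.288)
```

With hyperfine's --export-json output, the same check can read `results[].mean` and `results[].stddev` directly instead of hand-copied numbers.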

Alternatives

  • Dedicated benchmarking hardware. Bare-metal machines with tuned BIOS settings, pinned CPU frequencies, no ambient services. Higher-fidelity but higher operational cost.
  • Pinned-CPU bare-metal benchmarking frameworks (Netflix's Java Vector-API work used HotSpot-aware bare-metal benchmarks). Appropriate for library-level micro-optimisation; overkill for end-to-end CLI performance.
  • Stats-heavy benchmarking on noisy laptop. More runs + more statistical rigour can compensate for noise but at rapidly diminishing returns; past a certain noise level no number of runs is enough.

Seen in

  • Making Turborepo 96 % faster (Vercel, 2026-04-21) — canonical wiki instance; definitional source for this pattern; workflow gist linked directly from the post; multiple concrete PRs (#11984, #11985, #11986) that were only detectable with sandbox signal isolation.