Vercel — Making Turborepo 96% faster with agents, sandboxes, and humans¶
Summary¶
Anthony Shew's 2026-04-21 Vercel engineering post documents an eight-day performance campaign that improved Turborepo's task-graph construction time by 81-91 % on Vercel's internal repositories (up to 96 % on some external customer repos). The load-bearing disclosure is that Time to First Task on Vercel's 1,000+ package monorepo dropped from 8.1 s → 716 ms (91 % faster, an 11× speedup); the post discloses the full v2.8.0 → v2.9.0 regression table across three repo sizes.
The post is simultaneously a performance retrospective (three categories of wins: parallelisation, allocation elimination, syscall reduction, each with linked PR numbers) and an engineering-process retrospective on what unattended and supervised AI agents actually delivered at the limits of current agent tooling. The technical wins are pedestrian; the interesting payload is the five concrete lessons about agent-assisted performance engineering the author earned across 8 days of iteration.
Headline numbers verbatim from the v2.8.0 → v2.9.0 comparison table:
| Repo size | v2.8.0 | v2.9.0 | Improvement |
|---|---|---|---|
| ~1,000 packages | 8.1 s | 0.716 s | 91 % |
| 132 packages | 1.9 s | 0.361 s | 81 % |
| 6 packages | 0.676 s | 0.132 s | 80 % |
Five load-bearing architectural lessons the post canonicalises (each is a new wiki primitive):
- Markdown profile output beats Chrome Trace Event JSON for agent consumption. Shew added a `turborepo-profile-md` crate (PR #11880) that emits a companion `.md` file alongside every Chrome Trace JSON. "Same model, same codebase, same data, same agent harness. Different format, radically better optimization suggestions." Canonical heuristic verbatim: "if something is poorly designed for me to work with, it's poorly designed for an agent, too."
- Vercel Sandbox provides clean-signal benchmarking that a laptop can't. Once the code is fast enough, MacBook background noise — Slack notifications, cron jobs, Spotlight indexing — drowns out real improvements. Vercel Sandboxes are "ephemeral Linux containers that only have what you put in them. No background daemons, no Slack notifications pulling CPU, no background programs making network requests." Caveat: Sandboxes don't guarantee dedicated hardware, so all A/B comparisons must come from a single sandbox instance running both binaries.
- Your own source code is the best agent feedback. Once a correction is merged into the codebase, the agent naturally adopts the corrected pattern in future sessions, even without memory or context carrying across chats. "Once I corrected one instance, the agent followed the correction going forward. In future conversations, without any memory or context carrying across chats, the agent would see the merged improvements in the source and stop reproducing the old patterns." The codebase becomes implicit long-term memory.
- Unattended agents fall short in five distinctive ways. Shew spun up 8 background coding agents from his phone; 3 produced shippable PRs. The other 5 produced nothing, and the 3 successful agents still exhibited five named failure modes: (a) no end-to-end dogfooding even when the system supported it (Turborepo builds Turborepo, but the agent never used that loop), (b) hyperfixation on the first idea, (c) microbenchmark-chasing that didn't translate to real-world wins, (d) no regression tests, (e) never using the `--profile` flag. Canonical Ralph-Wiggum-loop disclosure: "I did try to turn this into a Ralph Wiggum loop but it repeatedly made too many mistakes. The combination of the model, the harness, and the loop simply weren't dependable enough."
- Plan-mode-then-implement beats unattended spawn for production-critical work. Shew's supervised loop explicitly separates propose (agent in Plan Mode generating hotspot analysis against a profile) from execute (agent making approved changes) from validate (human-in-the-loop via `hyperfine` end-to-end). 20+ performance PRs in 4 days using this loop.
Key takeaways¶
- 91 % faster Time to First Task on 1,000+ packages is the one canonical operational number: Turborepo v2.8.0 took 8.1 s to construct the task graph on Vercel's 1,000-package monorepo before the first task could run; v2.9.0 takes 716 ms. "Building the task graph is overhead you pay before your repository's work begins. The larger the repo, the higher the cost." This is the `turbo run` startup tax — not the build itself — and it is paid on every invocation. 11× speedup on the largest repo; the smaller repos also improved (80-81 %), but the absolute impact is heavily tilted to large monorepos because that's where the cost lives. (Source: sources/2026-04-21-vercel-making-turborepo-96-faster-with-agents-sandboxes-and-humans)
- 8 background agents → 3 shippable PRs is the unattended-agent baseline. Shew spawned 8 background coding agents from his phone with variations on the same prompt ("Look for a performance speedup in our Rust code. It has to be something that is well-tested, and on our hot path. Make sure to add benches to check your work. I'm particularly interested in our hashing code."), rotating the area of interest in each variant. 3 of 8 produced output that became real PRs: PR #11872 (~25 % wall-clock via hashing by reference instead of cloning a `HashMap`), PR #11874 (~6 % win from `twox-hash` → `xxhash-rust`), PR #11878 (unnecessary Floyd-Warshall → multi-source DFS from an existing `TODO` comment, off the hot path). The 5-of-8 failure rate is the canonical unattended-agent yield at Vercel-internal prompt quality with current models + harnesses as of 2026-04. (Source: sources/2026-04-21-vercel-making-turborepo-96-faster-with-agents-sandboxes-and-humans)
- Markdown profile format produces "radically better optimization suggestions" than Chrome Trace JSON at the same token count. The post shows both formats side by side. Chrome Trace Event Format (the ubiquitous format for Perfetto-loadable profiles, emitted by default by Turborepo, Chromium's `chrome://tracing`, Go's pprof-to-trace, Node.js `--perf-basic-prof-only`, Rust's `tracing-chrome`) puts function identifiers split across lines, interleaves irrelevant metadata with timing data, and is not grep-friendly. The new `turborepo-profile-md` crate emits a companion `.md` with "Hot functions sorted by self-time, call trees sorted by total-time, caller/callee relationships. All greppable, all on single lines." Verbatim: "Same model, same codebase, same data, same agent harness. Different format, radically better optimization suggestions. The profile data was finally in a format that both I and the agent could read at a glance." The reframing also led Shew to a load-bearing heuristic: "if something is poorly designed for me to work with, it's poorly designed for an agent, too." Precedent: Bun's `--cpu-prof-md` flag (Jarred Sumner, 2026-04 — Shew credits the tweet as motivating his own work). (Source: sources/2026-04-21-vercel-making-turborepo-96-faster-with-agents-sandboxes-and-humans)
- hyperfine-on-laptop stops distinguishing real wins from noise once the code is fast enough. The second breakthrough is signal isolation via Vercel Sandbox as a clean measurement substrate. Verbatim: "I had been running all benchmarks on my MacBook, and the hyperfine reports were getting increasingly noisy. As the code gets faster, system noise matters more. Syscalls, memory, and disk I/O all have their variance. The profiles were noisy too. I had gotten the codebase to a point where the individual functions were fast enough that background activity on my laptop was drowning out any good signal. Was the change I made really 2% faster, or did I just get lucky with a quiet run? I couldn't confidently distinguish real improvements from noise. I needed a quieter lab for my science." The sandbox workflow (full gist linked from the post) does a `zig cc`-cross-compiled Linux build of the main and branch binaries, creates a snapshot sandbox, copies both binaries in, runs `hyperfine --warmup 2 --runs 15` over `turbo-main run build --dry` and `turbo-branch run build --dry`, collects profiles for both, and copies reports back. Important caveat disclosed verbatim: "Vercel Sandboxes don't guarantee dedicated hardware today. Comparing reports from different Sandbox instances might not be useful. All comparisons should come from a single instance where both binaries run under identical conditions." — A/B comparison works within one sandbox but not across sandboxes. (Source: sources/2026-04-21-vercel-making-turborepo-96-faster-with-agents-sandboxes-and-humans)
- Source code is implicit long-term memory for agents across sessions. Shew's most interesting observation is non-obvious: "In places where the existing code had a sloppy pattern, the agent would write new code in the same style. Once I corrected one instance, the agent followed the correction going forward. In future conversations, without any memory or context carrying across chats, the agent would see the merged improvements in the source and stop reproducing the old patterns." This is the implicit-memory-via-source-code pattern: corrections merged to mainline become the agent's long-term preference, even across new conversations with no explicit memory transfer. Related framing verbatim: "Over time, I noticed the agent spontaneously writing tests when I wasn't expecting it to. I saw it creating abstractions that matched what I would have done ... your own source code is the best reinforcement learning out there." Alignment with precomputed agent context files (Figma's CONTEXT.md canonicalisation): both canonicalise the codebase / repository artefacts as an agent-facing substrate, but Shew's framing is implicit (no dedicated context file — the agent reads the actual source and infers) while Figma's is explicit (a curated `CONTEXT.md` is precomputed). (Source: sources/2026-04-21-vercel-making-turborepo-96-faster-with-agents-sandboxes-and-humans)
- Five failure modes of unattended Rust-performance agents. Verbatim from the review of the 8-agent phone-spawn experiment: (1) "The agent never realized it could benchmark the improvements on the Turborepo codebase itself. Turborepo dogfoods Turborepo, so it could have easily built a binary and run it right on the source code to get end-to-end results." — no dogfood-loop awareness. (2) "The agent would hyperfixate on the first idea that it came up with and force it to work, rather than backing up and thinking abstractly about the problem (even though the chat logs showed it trying to do so)." — hyperfixation on the first hypothesis. (3) "The agent would chase the biggest number it could get, creating microbenchmarks that were relatively meaningless when it came to real-world performance. It would then crank out a 97% improvement for the benchmark, which actually amounted to a 0.02% real-world improvement." — microbenchmark-vs-end-to-end disconnect. (4) "Never once did an agent write a regression test." (5) "Never once did an agent use the `--profile` flag in the `turbo` CLI." These five failure modes motivate the supervised Plan-Mode-then-implement loop that produced the 20+ later PRs. (Source: sources/2026-04-21-vercel-making-turborepo-96-faster-with-agents-sandboxes-and-humans)
- Eight days total; supervised loop produced 20+ performance PRs in 4 days. The post sequences the campaign: Phase 1 (nights before sleep) = 8 unattended agents; Phase 2 (Markdown profile tooling + supervised Plan-Mode loop) = ~20 PRs over 4 days; Phase 3 (Sandbox benchmarking) = the final low-level wins that were invisible on laptop. Canonical supervised-loop verbatim: "Put the agent in Plan Mode with instructions to create a profile and find hotspots in the Markdown output → Review the proposed optimizations and decide which ones were worth pursuing → Have the agent implement the good proposal(s) → Validate with end-to-end `hyperfine` benchmarks → Make a PR → Repeat." Time-budget claim: "I estimate this would have taken at least two months without agents, but I hope this article shows you that they didn't do the work for me. I was leading the entire time." (Source: sources/2026-04-21-vercel-making-turborepo-96-faster-with-agents-sandboxes-and-humans)
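The "hot functions sorted by self-time, all on single lines" idea is simple to illustrate. A minimal sketch in Rust, assuming a flattened (function name, self-time) sample model rather than the real `turborepo-profile-md` crate's trace parsing:

```rust
use std::collections::HashMap;

// Sketch of the "greppable markdown profile" idea: aggregate self-time per
// function and emit one markdown row per function, hottest first. The input
// model is a simplification, not the real crate's Chrome Trace input.
fn hot_functions_md(samples: &[(&str, u64)]) -> String {
    let mut totals: HashMap<&str, u64> = HashMap::new();
    for &(name, self_time_us) in samples {
        *totals.entry(name).or_insert(0) += self_time_us;
    }
    let mut rows: Vec<(&str, u64)> = totals.into_iter().collect();
    rows.sort_by(|a, b| b.1.cmp(&a.1)); // sort by self-time, descending

    let mut out = String::from("| Function | Self time (µs) |\n|---|---|\n");
    for (name, us) in rows {
        out.push_str(&format!("| {} | {} |\n", name, us));
    }
    out
}
```

Because each function lands on a single line, a `grep` for a function name returns its full row with timing attached — the structural property the post credits for the better agent suggestions.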
Systems extracted¶
- systems/turborepo (new) — Vercel's Rust-written task runner and build system for JavaScript monorepos; the task graph is the primary scheduling artefact; v2.9.0 is the performance release this post covers.
- systems/perfetto (new) — the Chromium project's successor to `chrome://tracing`; consumes Chrome Trace Event Format JSON for flame-graph + trace visualisation; Turborepo's `--profile` flag emits Perfetto-compatible JSON.
- systems/hyperfine (new) — `sharkdp/hyperfine`, the Rust-written command-line benchmarking tool (warmup runs + many timed runs + confidence intervals) Shew used for end-to-end validation both on laptop and inside Vercel Sandbox.
- systems/xxhash-rust (new) — the faster hashing crate Turborepo migrated to from `twox-hash` (PR #11874, ~6 % win).
- systems/vercel-sandbox (extends) — established by the 2026-04-21 Knowledge Agent Template ingest as a per-request ephemeral Linux sandbox; this post adds the benchmarking-substrate altitude: the same primitive used for signal isolation in performance engineering, not just agent execution.
Concepts extracted¶
- concepts/markdown-as-agent-friendly-format (new) — Markdown's line-per-record + natural column alignment + grep-friendliness makes it a structurally better format for agent-readable data than line-broken JSON. Applies beyond profiles: diff output, error reports, search results.
- concepts/chrome-trace-event-format (new) — the ubiquitous JSON format for CPU / trace profiles across Chromium, Node.js, Go pprof, Rust tracing-chrome, and Turborepo. The structure is optimised for UI consumption (flame graphs in Perfetto / `chrome://tracing`), not agent consumption.
- concepts/sandbox-benchmarking-for-signal-isolation (new) — the structural benefit of running A/B benchmarks inside an ephemeral minimal-dependency container: no background daemons, no Slack, no indexing, minimal system noise. Caveat: ephemeral sandboxes typically don't guarantee dedicated hardware, so cross-sandbox comparisons aren't valid — only within-sandbox A/B.
- concepts/source-code-as-agent-feedback-loop (new) — merged source code becomes the agent's implicit long-term memory across sessions. No per-session context transfer needed — corrections propagate via the agent reading the current state of the repo.
- concepts/agent-hyperfixation-failure-mode (new) — agents commit to their first hypothesis and force it to work rather than stepping back to reconsider the problem abstractly. Observable in chat logs even when the agent verbalises the need to reconsider.
- concepts/microbenchmark-vs-end-to-end-gap (new) — an optimisation that shows 97 % improvement in a narrow microbenchmark can amount to 0.02 % real-world improvement. Agents are susceptible to chasing microbenchmark numbers without end-to-end validation.
- concepts/run-to-run-variance (new) — the noise floor in benchmark timing that increases relative to real wins as code gets faster. The structural motivation for sandbox-based signal isolation plus many-runs-plus-warmup (`hyperfine`).
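The run-to-run-variance concept reduces to basic statistics. A toy sketch, assuming a crude population-stddev noise-floor test rather than hyperfine's actual estimator:

```rust
// Mean and population standard deviation over raw wall-clock times.
fn mean_stddev(runs_ms: &[f64]) -> (f64, f64) {
    let n = runs_ms.len() as f64;
    let mean = runs_ms.iter().sum::<f64>() / n;
    let var = runs_ms.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    (mean, var.sqrt())
}

// Crude criterion for "was the 2% win real?": the gap between the two
// means must exceed the larger of the two measured noise floors.
fn win_is_distinguishable(main: &[f64], branch: &[f64]) -> bool {
    let (m_mean, m_sd) = mean_stddev(main);
    let (b_mean, b_sd) = mean_stddev(branch);
    (m_mean - b_mean).abs() > m_sd.max(b_sd)
}
```

As the code gets faster, the means shrink while laptop-induced stddev does not, which is exactly the regime where this check starts returning false and a quieter lab is needed.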
Patterns extracted¶
- patterns/markdown-profile-output-for-agents (new) — emit a companion `.md` file alongside machine-readable profile formats; agent consumption quality goes up materially without changing model or harness.
- patterns/ephemeral-sandbox-benchmark-pair (new) — cross-compile both main and branch binaries, copy both into a single ephemeral sandbox, run the benchmark pair there under identical conditions, copy reports back. Isolates A/B from background noise.
- patterns/plan-mode-then-implement-agent-loop (new) — separate propose (agent in Plan Mode) / execute (agent applies changes) / validate (human-gated end-to-end benchmark) into distinct agent sessions. The human gates the implementation; the agent handles the mechanical work.
- patterns/agent-spawn-parallel-exploration (new) — fan out N unattended agents with prompt variations across the hypothesis space; review outputs in the morning; extract the subset that survives reality. Tolerates ~60 % failure rate because survivors are independently useful.
- patterns/codebase-correction-as-implicit-feedback (new) — merge corrections to the codebase once; future agent sessions read the corrected pattern and adopt it without explicit context transfer.
Operational numbers disclosed¶
| Metric | Value |
|---|---|
| Turborepo Time to First Task (1000 packages) v2.8.0 → v2.9.0 | 8.1 s → 716 ms (91 %) |
| Turborepo Time to First Task (132 packages) v2.8.0 → v2.9.0 | 1.9 s → 361 ms (81 %) |
| Turborepo Time to First Task (6 packages) v2.8.0 → v2.9.0 | 676 ms → 132 ms (80 %) |
| External-repo peak improvement | up to 96 % |
| Campaign duration | 8 days |
| Unattended agents spawned | 8 |
| Unattended agents yielding shippable PRs | 3 of 8 (~37 %) |
| Supervised Plan-Mode-loop PRs | 20+ in 4 days |
| `new_from_gix_index` self-time reduction from stack-allocated OidHash (PR #11984) | 15 % |
| `get_package_file_hashes_from_index` self-time reduction | 17 % |
| Syscall-elimination fetch self-time reduction (PR #11985) | 200.5 ms → 129.6 ms (35 %) over 962 cache fetches |
| `twox-hash` → `xxhash-rust` win (PR #11874) | ~6 % |
| Reference-hashing-instead-of-clone (PR #11872) | ~25 % wall-clock |
| OidHash stack-alloc run-to-run variance reduction | 48 % (1000 pkg) / 57 % (125 pkg) / 61 % (6 pkg) |
| Without-agents estimated time | ≥ 2 months (Shew's claim) |
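The stack-allocated OidHash rows in the table come from replacing heap-allocated `String`s with an inline array. A hedged sketch of the shape — the real type's constructor and trait surface may differ:

```rust
use std::ops::Deref;

// Sketch of the PR #11984 idea: a 40-char hex SHA-1 OID stored inline on
// the stack instead of as a heap-allocated String. Copy, no allocation.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct OidHash([u8; 40]);

impl OidHash {
    // Hypothetical constructor; validating here keeps Deref below safe.
    fn from_hex(s: &str) -> Option<Self> {
        let bytes = s.as_bytes();
        if bytes.len() != 40 || !bytes.iter().all(|b| b.is_ascii_hexdigit()) {
            return None;
        }
        let mut buf = [0u8; 40];
        buf.copy_from_slice(bytes);
        Some(OidHash(buf))
    }
}

impl Deref for OidHash {
    type Target = str;
    fn deref(&self) -> &str {
        // Construction guarantees ASCII hex digits, so this cannot fail.
        std::str::from_utf8(&self.0).expect("ascii hex")
    }
}
```

`Deref<Target = str>` lets existing string-consuming call sites keep working while hashing and copying become flat 40-byte operations, which also explains the variance reduction: no allocator involvement per OID.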
Three categories of wins¶
Shew's supervised-loop PRs fell into three named categories:
- Parallelisation (PR #11889, #11902, #11927, #11918) — git index reads, filesystem glob walks, lockfile parsing, and package.json loading had all been sequential; they now run concurrently.
- Allocation elimination (PR #11916, #11891, #11929) — reference-based hashing in SCM, pre-compiled glob exclusion filters, a shared HTTP client instead of per-request construction.
- Syscall reduction (PR #11887, #11938, #11950) — per-package git subprocess calls → a single repo-wide index; git subprocesses → libgit2 library calls → gix-index.
Low-level Sandbox-signal-only wins:
- PR #11984 — Stack-allocated git OIDs. SHA-1 40-char hex strings had been heap-allocated `String`s; the new `OidHash` is `[u8; 40]` + `Deref<Target = str>`. `new_from_gix_index` self-time dropped 15 %; `get_package_file_hashes_from_index` self-time dropped 17 %. Canonical run-to-run variance reduction disclosures (48 % / 57 % / 61 % across three repo sizes) demonstrate the performance-stability win alongside the absolute win.
- PR #11985 — Syscall elimination. Cache fetch was doing `stat(.tar)` (returns `ENOENT`) + `stat(.tar.zst)` + `open(.tar.zst)`. The `.tar` path was Turborepo's Golang-era (2021-2022) cache-format fallback; no modern version writes uncompressed cache. Removing the `.tar` probe → a 35 % reduction over 962 cache fetches.
- PR #11986 — Move instead of clone. The visitor dispatch loop deep-cloned a `(String, HashMap<String, String>)` from a precomputed map for each of ~1,700 tasks; each task ID appears exactly once in the dispatch stream, so `HashMap::remove()` moves the value out at zero cost.
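The move-instead-of-clone pattern is easy to show in miniature. A hedged sketch with hypothetical task-detail types, not Turborepo's actual ones:

```rust
use std::collections::HashMap;

// Sketch of the PR #11986 dispatch pattern: when each key is consumed
// exactly once, HashMap::remove moves the value out of the map instead of
// deep-cloning it per task. A get() + clone() here would copy the String
// and the entire inner HashMap for every one of the ~1,700 tasks.
fn dispatch_all(
    order: &[&str],
    mut details: HashMap<String, (String, HashMap<String, String>)>,
) -> Vec<(String, HashMap<String, String>)> {
    let mut dispatched = Vec::with_capacity(order.len());
    for task_id in order {
        // remove() transfers ownership; no allocation, no copying.
        if let Some(detail) = details.remove(*task_id) {
            dispatched.push(detail);
        }
    }
    dispatched
}
```

The precondition — each task ID appears exactly once in the dispatch stream — is what makes draining the map safe; if an ID could repeat, the second lookup would silently find nothing.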
Caveats¶
- Vendor-stake framing. Vercel operates Sandbox; the post is a first-person Vercel engineering post (Anthony Shew, Turborepo maintainer). The sandbox-benchmarking story is load-bearing on substance but the vendor has a commercial interest in surfacing Sandbox as a benchmarking substrate.
- Single-maintainer-single-product. The eight-day campaign is a single-engineer campaign on a single well-understood codebase. Generalisability of the agent-yield numbers (3/8 unattended, 20+ supervised in 4 days) to other codebases is asserted, not measured.
- "Ralph Wiggum loop" is an informal reference to ghuntley.com/ralph — a fully-autonomous agent loop pattern; Shew explicitly rejects it for production-critical code but the rejection depends on the current model/harness dependability ("repeatedly made too many mistakes"), which may shift.
- Sandbox non-dedicated-hardware caveat is explicit in the post: cross-sandbox comparisons aren't valid. The sandbox-benchmarking claim is specifically about within-sandbox A/B isolation.
- Zero discussion of agent costs — the compute cost of 8 background agents for 1 night, 20+ supervised-loop sessions over 4 days, and many Sandbox instances is not disclosed. "Without agents = at least 2 months" is the time-cost comparison; a monetary cost comparison is absent.
- No model family disclosed. The post does not name which LLM family / model (Claude / GPT / Gemini / something else) was used across any phase of the campaign. "Same model" is asserted for the markdown-vs-JSON A/B but the model isn't named.
- No agent harness disclosed. Similarly, the harness (Cursor / Claude Code / Codex / v0 / custom) is not named. "Same agent harness" is asserted but the harness isn't named.
- Microbenchmark-vs-end-to-end-gap observation is anecdotal — the specific "97 % microbench, 0.02 % real-world" datapoint is cited but not tied to a specific PR; the underlying bad change was presumably rejected before review.
- Post is Vercel-marketing-adjacent. Closes with a CTA to Turborepo 2.9 release post and "These performance gains are now stable and ready for you to use." Vendor-launch voice bracketing an otherwise substantive engineering retrospective.
- Chrome Trace Event Format limitations claim may be a narrow interpretation — modern Perfetto's protobuf-encoded trace format is more compact and richer than the JSON Chrome Trace Event Format the post critiques. The critique is specifically of the JSON variant, not of all trace formats.
Source¶
- Original: https://vercel.com/blog/making-turborepo-ninety-six-percent-faster-with-agents-sandboxes-and-humans
- Raw markdown: raw/vercel/2026-04-21-making-turborepo-96-faster-with-agents-sandboxes-and-humans-21cc5e16.md
Related¶
- companies/vercel — parent company page; this is the tenth Vercel ingest (six 2026-04-21 Vercel same-day cluster posts + 2024-08-01 Google rendering + 2026-01-08 v0 composite-pipeline + 2026-04-21 BotID Deep Analysis + this post); the campaign opens the Vercel agent-assisted-engineering axis — the ninth Vercel axis, after SEO/rendering, agent-reliability, bot-management, platform-runtime, knowledge-agent, content-negotiation, routing-service, and workflow-devkit.
- systems/turborepo — the subject; new system page.
- systems/perfetto / systems/hyperfine — the two profiling + benchmarking tools Shew composes.
- systems/vercel-sandbox — the clean-measurement substrate; prior canonicalisation (from the Knowledge Agent Template ingest) was per-request agent sandbox; this post adds benchmarking-substrate altitude.
- concepts/markdown-as-agent-friendly-format — load-bearing new concept; framework for the markdown-over-JSON shift.
- concepts/chrome-trace-event-format — the thing being replaced; documents the format's structural limitations for agent consumption.
- concepts/sandbox-benchmarking-for-signal-isolation — framework for the Sandbox-based clean-signal story.
- concepts/source-code-as-agent-feedback-loop — the implicit-memory-via-source-code pattern; canonicalises Shew's "your own source code is the best reinforcement learning" framing.
- concepts/agent-hyperfixation-failure-mode / concepts/microbenchmark-vs-end-to-end-gap / concepts/run-to-run-variance — the three named agent/measurement pathologies the campaign canonicalises.
- patterns/markdown-profile-output-for-agents — canonical pattern, with precedent in Bun's `--cpu-prof-md`.
- patterns/ephemeral-sandbox-benchmark-pair — cross-compile + sandbox + hyperfine A/B workflow.
- patterns/plan-mode-then-implement-agent-loop — canonical supervised-agent performance-engineering loop.
- patterns/agent-spawn-parallel-exploration — unattended fan-out baseline pattern with ~37 % yield at Vercel scale.
- patterns/codebase-correction-as-implicit-feedback — implicit-memory pattern that composes with explicit agent context files.
- patterns/measurement-driven-micro-optimization — parent pattern; this post's supervised Plan-Mode loop is a canonical agent-augmented instance.
- concepts/flamegraph-profiling — prior canonical instance of profile-driven target selection; this post extends the discipline to the markdown-format-for-agent-consumption altitude.
- concepts/monorepo — task-graph construction cost is a monorepo-scale tax the post canonicalises at 8.1 s for 1000+ packages pre-optimisation.
- patterns/agent-driven-benchmark-loop — Cloudflare Agent Memory's agent-proposal loop; Shew's Plan-Mode-then-implement pattern is a performance- engineering sibling with wall-clock A/B measurement as the validator (instead of benchmark-score as validator).