

Microbenchmark-vs-end-to-end gap

Definition

The microbenchmark-vs-end-to-end gap is the gulf between an optimisation's measured improvement on a narrow, isolated benchmark (e.g. a single function's loop) and the same optimisation's improvement on the end-to-end system's wall-clock performance under representative load.

The canonical worst case, quoted verbatim from Anthony Shew's 2026-04-21 Turborepo post reviewing unattended coding-agent output:

"The agent would chase the biggest number it could get, creating microbenchmarks that were relatively meaningless when it came to real-world performance. It would then crank out a 97% improvement for the benchmark, which actually amounted to a 0.02% real-world improvement."

A 97% microbench win / 0.02% end-to-end win is the extreme form — the optimised function was on the hot path in the microbenchmark (because the microbenchmark was constructed to exercise it) but off the hot path in the real program (where other code dominates).

Structural causes

  • Amdahl's law. Optimising a function that contributes 1% of end-to-end runtime can save at most 1% end-to-end, regardless of how fast the function becomes.
  • Cold vs hot caches. Microbenchmarks with pre-warmed caches don't reflect first-run costs; fixes that win on warm benchmarks may not help the realistic cold path.
  • Allocation patterns. A function that looks identical in isolation may have different allocation behaviour when composed — e.g. a refactor that replaces one large allocation with many small ones may win in the microbench (each small allocation is fast) but lose end-to-end (aggregate allocator pressure + GC frequency).
  • Workload shape. Microbenchmarks typically exercise a single code path; end-to-end performance depends on the distribution of code paths, which can be very different.
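The Amdahl point can be made concrete with a two-line calculation. The hot-path fraction below is back-solved from the quoted numbers, not a figure reported in the post:

```python
def end_to_end_improvement(hot_fraction: float, micro_improvement: float) -> float:
    """Fraction of total runtime saved when a function occupying
    `hot_fraction` of end-to-end time gets `micro_improvement` faster
    (both expressed as fractions, e.g. 0.97 for a 97% win)."""
    return hot_fraction * micro_improvement

# A 97% microbenchmark win on a function that is only ~0.02% of real
# runtime collapses to roughly a 0.02% end-to-end win.
print(end_to_end_improvement(0.0002, 0.97))  # ≈ 0.000194, i.e. ≈ 0.02%
```

However large `micro_improvement` gets, the end-to-end saving is capped by `hot_fraction` — which is the whole point of profiling before optimising.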

Agent-specific pathology

Shew's observation is that unattended coding agents are particularly susceptible to this gap because they chase the biggest available number without an end-to-end validation gate. For context, the same review named five pathologies:

  1. No dogfood-loop awareness (Turborepo builds Turborepo; the agent didn't use this for end-to-end testing).
  2. Hyperfixation.
  3. Microbenchmark-vs-end-to-end gap (this concept).
  4. No regression tests.
  5. No --profile flag usage.

Without end-to-end validation, an agent reporting "97% faster!" looks like a triumph, but the PR delivers nothing. An engineer reviewing the PR may spot this; an agent auto-merging does not.

Mitigation: end-to-end A/B as the gate

The canonical mitigation is a sandboxed hyperfine A/B comparison of the full binary under a representative workload, measuring the metric the system cares about (Time to First Task, end-to-end p99, RPS) rather than the microbench score. An optimisation proposal merges only if the end-to-end number moves.
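A minimal sketch of such a gate, assuming hyperfine was run with `--export-json` on the baseline and candidate binaries. The file name, command ordering, and threshold are illustrative assumptions, not part of the post:

```python
import json

def passes_gate(results_path: str, min_improvement: float = 0.01) -> bool:
    """Read a hyperfine --export-json file containing two commands
    (baseline first, candidate second) and gate the merge on the
    end-to-end mean, not the microbench number."""
    with open(results_path) as f:
        results = json.load(f)["results"]
    baseline, candidate = results[0], results[1]
    improvement = 1 - candidate["mean"] / baseline["mean"]
    # Require the win to clear measurement noise, not merely zero.
    noise = (baseline["stddev"] + candidate["stddev"]) / baseline["mean"]
    return improvement > max(min_improvement, noise)
```

Typical usage would be something like `hyperfine --warmup 3 './baseline build' './candidate build' --export-json results.json` followed by `passes_gate("results.json")`; the command lines here are placeholders for whatever representative workload the system cares about.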

This aligns with the measurement-driven micro-optimisation parent pattern: pick the target from the profile (functions whose self-time is a material fraction of end-to-end), apply the fix, validate against end-to-end wall-clock.

Related concepts

  • Benchmark representativeness failure (concepts/benchmark-representativeness) — the benchmark itself doesn't reflect production workload shape; end-to-end improvement doesn't generalise.
  • Benchmark methodology bias (concepts/benchmark-methodology-bias) — the benchmark construction unfairly favours one implementation over another.
  • Run-to-run variance (concepts/run-to-run-variance) — even valid end-to-end measurements need enough runs to reject the null hypothesis; single-run wins may be noise.

Microbenchmark-vs-end-to-end gap is orthogonal to these — it's specifically about the mapping between microbench and real workload, not about benchmark construction or measurement noise.
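The run-to-run-variance caveat applies to the end-to-end gate itself. A minimal sketch of rejecting single-run noise, assuming the per-run wall-clock times are available (e.g. from hyperfine's `times` array); the threshold is a rough assumption, not a calibrated significance level:

```python
from statistics import mean, stdev

def clearly_faster(baseline: list[float], candidate: list[float],
                   t_threshold: float = 2.0) -> bool:
    """Welch's t statistic over two sets of wall-clock runs: only claim
    a win when the difference in means dwarfs the run-to-run spread."""
    nb, nc = len(baseline), len(candidate)
    se = (stdev(baseline) ** 2 / nb + stdev(candidate) ** 2 / nc) ** 0.5
    t = (mean(baseline) - mean(candidate)) / se
    return t > t_threshold
```

With only one run per binary there is no spread to estimate, which is exactly why single-run "wins" cannot be distinguished from noise.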

Canonical mitigation pattern verbatim

Shew's supervised loop:

"Put the agent in Plan Mode with instructions to create a profile and find hotspots in the Markdown output → Review the proposed optimizations and decide which ones were worth pursuing → Have the agent implement the good proposal(s) → Validate with end-to-end hyperfine benchmarks → Make a PR → Repeat."

The validation step (end-to-end hyperfine benchmarks) is the gate that catches the microbench-vs-end-to-end gap.
