PATTERN Cited by 5 sources

Measurement-driven micro-optimization

Intent

Pick the code worth optimizing by production profiling, not by taste; validate each candidate change against a repeatable benchmark; ship; then re-measure in production to confirm the predicted impact. Do nothing that isn't in that loop.

Context

On any large fleet there are thousands of functions and nobody can hand-pick the CPU hot spots. Developer intuition about "what's slow" is almost always wrong:

  • Pleasant-looking one-line helpers can dominate CPU.
  • Asymptotically better code can lose to asymptotically worse code because of cache locality (concepts/small-map-as-sorted-vec).
  • Stdlib / popular-crate structures may be tuned for a different workload shape than yours.

The only way through is measurement all the way down.

Mechanism

Three complementary instruments, in this order:

  1. Production profiling. Stack-trace sampling on real traffic surfaces the functions with meaningful CPU share. Pick a threshold worth engineering time (e.g., 1 % of fleet CPU).
  2. Per-candidate microbench. Write a criterion-style (or custom) benchmark on a workload-realistic input distribution, run every candidate implementation, and compare.
  3. Post-ship production verification. Deploy the winner. Re-sample production. Compare the measured reduction (1.0 - new/old, from production CPU share) against the predicted reduction (1.0 - new-bench/old-bench). If the predicted and measured numbers match within a fraction of a percent, the benchmark is trustworthy: you can use it as the decision basis for the next optimization.

The loop compounds: each validated win makes the benchmark more trusted, and each trusted benchmark lets you move faster on the next one.
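
Step 1 can be sketched as follows. This is a minimal illustration in Python, not the tooling any of the cited sources used: the function name hot_functions, the input shape (one tuple of function names per sampled stack) and the sample data are all hypothetical. Shares are inclusive, meaning a function is counted once for every sample in which it appears.

```python
from collections import Counter


def hot_functions(stack_samples, threshold_pct=1.0):
    """Return (function, share %) pairs at or above threshold_pct, hottest first.

    stack_samples: iterable of sampled stacks, each a sequence of function names.
    """
    samples = list(stack_samples)
    if not samples:
        return []
    hits = Counter()
    for stack in samples:
        # Deduplicate within one stack so recursion is not counted twice.
        hits.update(set(stack))
    total = len(samples)
    shares = ((fn, 100.0 * count / total) for fn, count in hits.items())
    return sorted(
        (item for item in shares if item[1] >= threshold_pct),
        key=lambda item: item[1],
        reverse=True,
    )


# Hypothetical profile: 200 sampled stacks from a made-up service.
samples = (
    [("main", "serve")] * 195
    + [("main", "serve", "strip_headers")] * 4
    + [("main", "parse_config")] * 1
)
for fn, pct in hot_functions(samples, threshold_pct=1.0):
    print(f"{fn}: {pct:.1f} % of samples")
```

Inclusive shares put entry points such as main at the top of the list. The rows worth a closer look are usually the small helpers that still clear the threshold.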

Complementary pattern: scaling CPU-% linearly

Cloudflare's formula for predicting a candidate's production CPU share:

predicted CPU % = current CPU % × (new-bench-time / old-bench-time)

i.e. scale the profiled CPU share by the ratio of bench times. For pingora-origin clear_internal_headers (2024-09-10):

  • Original: 1.71 % CPU.
  • HashMap-based: predicted 1.71 % × 1.53/3.65 = 0.72 %; measured 0.82 %.
  • trie-hard: predicted 1.71 % × 0.93/3.65 = 0.43 %; measured 0.34 %.

Predictions matched production within ~0.1 percentage points in both cases, so criterion and the formula together form a trustworthy methodology for this workload.
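
The scaling formula and the prediction-vs-measurement check can be replayed with the figures quoted above. The sketch below is only an illustration: the function names predicted_cpu_pct and prediction_gap are made up here, and the inputs are the rounded numbers from this page. With rounded inputs the trie-hard prediction comes out near 0.436, so the last digit can differ from the 0.43 % quoted above.

```python
def predicted_cpu_pct(current_cpu_pct, old_bench_time, new_bench_time):
    """Scale the profiled CPU share by the ratio of benchmark times.

    Both benchmark times must be in the same unit; only their ratio matters.
    """
    if old_bench_time <= 0:
        raise ValueError("old_bench_time must be positive")
    return current_cpu_pct * (new_bench_time / old_bench_time)


def prediction_gap(predicted_pct, measured_pct):
    """Absolute gap between prediction and measurement, in percentage points."""
    return abs(predicted_pct - measured_pct)


ORIGINAL_CPU_PCT = 1.71   # profiled share of the original implementation
ORIGINAL_BENCH = 3.65     # benchmark time of the original implementation

# candidate name -> (benchmark time, CPU share measured in production)
candidates = {
    "HashMap": (1.53, 0.82),
    "trie-hard": (0.93, 0.34),
}

for name, (bench_time, measured_pct) in candidates.items():
    predicted = predicted_cpu_pct(ORIGINAL_CPU_PCT, ORIGINAL_BENCH, bench_time)
    gap = prediction_gap(predicted, measured_pct)
    print(f"{name}: predicted {predicted:.3f} %, measured {measured_pct:.2f} %, "
          f"gap {gap:.3f} points")
```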

Canonical instance (Cloudflare, 2024-09-10)

End-to-end:

  1. Stack-trace sampling on pingora-origin flagged clear_internal_headers at 1.71 % of total CPU = ~680 cores across a 40,000-core fleet — a helper that looked routine by inspection.
  2. Criterion microbench swept HashMap / BTreeSet / FST / regex / radix_trie / custom trie-hard on a synthetic request distribution that matched the real workload shape.
  3. trie-hard won at 0.93 µs (vs 3.65 µs original).
  4. Predicted CPU share: 0.43 %. Actual after July 2024 rollout: 0.34 %. Delta: 0.09 percentage points. Methodology trusted for future runs.
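
The core count in step 1 is plain arithmetic: CPU share times fleet size. A small sketch makes it easy to run the same conversion for a smaller deployment before deciding whether an optimization is worth the engineering time. The helper name cores_for_share and the 8-core service are made up for illustration.

```python
def cores_for_share(cpu_share_pct, fleet_cores):
    """Convert a CPU share (in percent) into an equivalent number of cores."""
    return cpu_share_pct / 100.0 * fleet_cores


# Hot spot from step 1: 1.71 % of a 40,000-core fleet.
print(round(cores_for_share(1.71, 40_000)))  # 684, i.e. the ~680 cores quoted above

# Hypothetical small service: the same kind of share is a sliver of one core.
print(f"{cores_for_share(1.0, 8):.2f} cores")
```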

Total saving: 1.28 % of pingora-origin CPU = ~550 cores — for what ended up as one new open-source crate (systems/trie-hard) and a handful of lines changed in clear_internal_headers.

Anti-patterns

  • "We think it's slow" optimization. If the function isn't on the flame graph / sample profile, don't touch it. Developer intuition is systematically biased toward elegant-looking code, not CPU-dominating code.
  • Microbenching without a realistic input distribution. Uniform-random inputs can flatter certain structures; match the workload shape (miss rate, key-length distribution, hit clustering).
  • Single-run benchmarks. Use a harness that reports medians + outliers + CI (criterion does this by default).
  • Skipping the production-verification step. If prediction and measurement diverge, you've either benchmarked the wrong thing or the workload shape is different in production — either way, find out before the next optimization.
  • Premature optimization at low request rates. The engineering time an optimization justifies scales with the CPU it saves. At 1k QPS, 1 % CPU is a fraction of a core; don't write a crate for it.
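
The two benchmarking anti-patterns above (unrealistic inputs and single runs) can be avoided with very little code. The sketch below is written in Python for brevity, whereas the canonical instance used criterion in Rust. Everything in it is an assumption made for illustration: the key names, the 90 % miss rate, and the two candidate implementations. It reports only a median and a range; a full harness such as criterion also reports outliers and confidence intervals.

```python
import random
import statistics
import timeit

# Hypothetical key set standing in for the names a hot helper looks up.
KNOWN = [f"x-internal-{i}" for i in range(100)]
KNOWN_SET = frozenset(KNOWN)


def make_workload(known_keys, n, miss_rate, seed=0):
    """Build n lookup keys, of which roughly miss_rate are absent from known_keys."""
    rng = random.Random(seed)  # fixed seed keeps the benchmark repeatable
    keys = list(known_keys)
    workload = []
    for i in range(n):
        if rng.random() < miss_rate:
            workload.append(f"x-miss-{i}")  # not a member of KNOWN
        else:
            workload.append(rng.choice(keys))
    return workload


def count_hits_list(workload):
    """Candidate A: linear scan of a list for every key."""
    return sum(1 for key in workload if key in KNOWN)


def count_hits_set(workload):
    """Candidate B: hash lookup in a frozenset."""
    return sum(1 for key in workload if key in KNOWN_SET)


def time_candidate(fn, workload, repeats=5, number=5):
    """Return (median, fastest, slowest) seconds per call across repeated runs."""
    runs = timeit.repeat(lambda: fn(workload), repeat=repeats, number=number)
    per_call = [t / number for t in runs]
    return statistics.median(per_call), min(per_call), max(per_call)


workload = make_workload(KNOWN, n=1000, miss_rate=0.9, seed=1)
for name, fn in (("list scan", count_hits_list), ("frozenset", count_hits_set)):
    median, fastest, slowest = time_candidate(fn, workload)
    print(f"{name}: median {median * 1e6:.1f} us "
          f"(range {fastest * 1e6:.1f} to {slowest * 1e6:.1f} us)")
```

Changing miss_rate or the key-length mix in make_workload is the cheap way to check whether a candidate's win depends on the input shape.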

Seen in

  • sources/2024-09-10-cloudflare-a-good-day-to-trie-hard — Cloudflare's pingora-origin hot-path optimization. Stack-trace sampling + criterion + production re-sampling produced a 1.28 %-CPU / ~550-core saving with the prediction and measurement matching within 0.1 %.
  • sources/2025-10-14-cloudflare-unpacking-cloudflare-workers-cpu-performance-benchmarks: Cloudflare applies the same discipline to Workers / V8 / OpenNext / Next.js / React. Profiling showed GC at 10-25 % of request time on the cf-vs-vercel-bench benchmark, which led to (a) the V8 young-gen un-tune (~25 % benchmark win globally), (b) four concurrent upstream fixes (V8 JSON.parse-with-reviver patch, Node.js trig compile flag, OpenNext buffer-copy PRs, benchmark-repo PR), and (c) a re-tune of warm-isolate routing for CPU-bound workloads. Same methodology, different substrate. It also demonstrates a sibling class of benchmark bias: a benchmark not matched to the workload's shape can mis-attribute a performance disparity.
  • sources/2026-02-18-datadog-how-we-reduced-agent-go-binaries-up-to-77-percent: binary-size variant of the same loop with a different instrument set. Profile with systems/go-size-analyzer (byte cost per dep), explain with systems/goda (reach(main, target) import paths), then make a hack-first bounding move (comment out every optimization-disabler in the codebase to see the ceiling) before doing the real source-level fix to reach it. Yielded 56-77 % per-binary reductions without feature removal. Demonstrates that the loop transfers cleanly from CPU / memory profiling to size engineering. Pair with patterns/upstream-the-fix for the ecosystem multiplier (Kubernetes inherited 16-37 % for free).
  • (Related methodology, different substrate) sources/2026-04-21-figma-the-search-for-speed-in-figma — Figma used a different instrument (fix observability first) then a custom Go harness to reproduce the same feedback loop for OpenSearch tuning.
  • sources/2026-04-21-figma-supporting-faster-file-load-times-memory-optimizations-rust — Figma's Multiplayer memory profile identified the BTreeMap<u16, u64> as 60 % of per-file server memory; fix was the flat sorted Vec that beat BTreeMap on both memory and deserialize time.
  • sources/2026-03-03-netflix-optimizing-recommendation-systems-with-jdks-vector-api: SIMD variant of the same loop on a JVM service. Netflix Performance Engineering identified Ranker's video serendipity-scoring operator at 7.5% of total CPU via flamegraph profiling. Rather than one optimization round, Netflix ran five per-step canary comparisons: batched matmul (regression: +5%), flat buffers + ThreadLocal (recovered), BLAS via netlib-java (regression), JDK Vector API SIMD (win). Each step was measured against production traffic before the next step landed; BLAS was discarded because "in the full pipeline" it lost despite microbench promise. Final production measurement confirmed ~7% node CPU drop, ~12% latency drop, ~10% CPU/RPS improvement; per-operator share dropped from 7.5% to ~1%. Canonical wiki instance of the measurement loop applied to (a) a JVM service and (b) a SIMD kernel experiment, with the distinctive caveat that microbench wins don't necessarily survive full-pipeline integration. The BLAS sub-experiment illustrates why production verification is step 3 of the loop: a microbench win has to be confirmed on real traffic.

  • sources/2026-04-21-vercel-making-turborepo-96-faster-with-agents-sandboxes-and-humans: Agent-augmented altitude. Anthony Shew's 2026-04-21 Turborepo retrospective canonicalises the supervised Plan-Mode-then-implement loop as an agent-augmented variant of this pattern: agent proposes hotspot analysis from Markdown profile output → human reviews proposals → agent implements selected change → hyperfine end-to-end A/B inside ephemeral Vercel Sandbox validates. 20+ PRs in 4 days. The agent-mediated proposal / implementation axis is new; the parent pattern's profile-target-fix-validate discipline stays intact. Names and canonicalises three specific failure modes that the human gate and the end-to-end A/B step filter out: hyperfixation, the microbenchmark-vs-end-to-end gap, and noise-floor variance that makes 2 % wins indistinguishable from lucky runs on a developer laptop. The 91 % Time-to-First-Task improvement on Vercel's 1,000-package monorepo is the one canonical operational datum demonstrating the loop's throughput at agent-augmented altitude.
