PATTERN Cited by 3 sources
Measurement-driven micro-optimization¶
Intent¶
Pick the code worth optimizing by production profiling, not by taste; validate each candidate change against a repeatable benchmark; ship; then re-measure in production to confirm the predicted impact. Do nothing that isn't in that loop.
Context¶
On any large fleet there are thousands of functions and nobody can hand-pick the CPU hot spots. Developer intuition about "what's slow" is almost always wrong:
- Pleasant-looking one-line helpers can dominate CPU.
- Asymptotically superior code can lose to asymptotically worse code because of cache locality (concepts/small-map-as-sorted-vec).
- Stdlib / popular-crate structures may be tuned for a different workload shape than yours.
The only way through is measurement all the way down.
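As a hypothetical illustration of the cache-locality bullet above (names invented, not Cloudflare's code): a flat sorted list with binary-search lookup keeps its data contiguous in memory, which is why it can beat pointer-chasing tree maps in practice even when the big-O looks no better.

```python
import bisect

class SortedVecMap:
    """Flat sorted-list map: O(log n) lookup via binary search.
    For small n this often wins on locality despite looking 'worse'
    than a hash or tree map on paper. (Illustrative sketch only.)"""

    def __init__(self, items):
        pairs = sorted(items)
        self._keys = [k for k, _ in pairs]
        self._vals = [v for _, v in pairs]

    def get(self, key, default=None):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._vals[i]
        return default

m = SortedVecMap({"cf-ray": 1, "cf-connecting-ip": 2}.items())
assert m.get("cf-ray") == 1
assert m.get("x-missing") is None
```

Whether this actually wins for a given workload is exactly what the measurement loop below is for.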
Mechanism¶
Three complementary instruments, in this order:
- Production profiling. Stack-trace sampling on real traffic surfaces the functions with meaningful CPU share. Pick a threshold worth engineering time (e.g., 1 % of fleet CPU).
- Per-candidate microbench. Write a criterion-style (or custom) benchmark on a workload-realistic input distribution, run every candidate implementation, and compare.
- Post-ship production verification. Deploy the winner, then re-sample production. Compare the measured improvement 1.0 - new/old against the predicted 1.0 - new-bench/old-bench. If the predicted and measured numbers match within a fraction of a percent, the benchmark is trustworthy — you can use it as a decision substrate for the next optimization.
The loop compounds: each validated win makes the benchmark more trusted, and each trusted benchmark lets you move faster on the next one.
Complementary pattern: scaling CPU-% linearly¶
Cloudflare's formula for predicting a candidate's production CPU share:
predicted CPU % = current CPU % × (new-bench-time / old-bench-time)
i.e. scale the profiled CPU share by the ratio of bench times. For pingora-origin clear_internal_headers (2024-09-10):
- Original: 1.71 % CPU.
- HashMap-based: predicted 1.71 % × 1.53/3.65 = 0.72 %; measured 0.82 %.
- trie-hard: predicted 1.71 % × 0.93/3.65 = 0.43 %; measured 0.34 %.
Predictions matched production within ~0.1 % in both cases — criterion and the formula together are a trustworthy methodology.
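The formula is plain arithmetic; a quick sanity check of the numbers above in Python:

```python
def predicted_cpu_share(current_pct: float, new_bench: float, old_bench: float) -> float:
    """Scale the profiled CPU share by the ratio of bench times."""
    return current_pct * (new_bench / old_bench)

# clear_internal_headers: original 1.71 % CPU, original bench 3.65 µs.
hashmap = predicted_cpu_share(1.71, 1.53, 3.65)    # ~0.72 %; measured 0.82 %
trie_hard = predicted_cpu_share(1.71, 0.93, 3.65)  # ~0.43 %; measured 0.34 %
assert abs(hashmap - 0.72) < 0.01
assert abs(trie_hard - 0.43) < 0.01
```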
Canonical instance (Cloudflare, 2024-09-10)¶
End-to-end:
- Stack-trace sampling on pingora-origin flagged clear_internal_headers at 1.71 % of total CPU = ~680 cores across a 40,000-core fleet — a helper that looked routine by inspection.
- Criterion microbench swept HashMap / BTreeSet-FST / regex / radix_trie / custom trie-hard on a synthetic request distribution that matched the real workload shape.
- trie-hard won at 0.93 µs (vs 3.65 µs original).
- Predicted CPU share: 0.43 %. Actual after July 2024 rollout: 0.34 %. Delta: 0.09 %. Methodology trusted for future runs.
Total saving: 1.28 % of pingora-origin CPU = ~550 cores — for what ended up as one new open-source crate (systems/trie-hard) and a handful of lines changed in clear_internal_headers.
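Percent-of-fleet figures convert to cores by simple proportion; a quick check of the arithmetic above (Python; fleet size from the source):

```python
def cpu_pct_to_cores(share_pct: float, fleet_cores: int) -> float:
    """Convert a fleet-wide CPU share (in percent) to absolute cores."""
    return share_pct / 100.0 * fleet_cores

fleet = 40_000
cores = cpu_pct_to_cores(1.71, fleet)          # pre-fix cost of the function
assert round(cores) == 684                     # reported above as ~680 cores
saved = cpu_pct_to_cores(1.71 - 0.34, fleet)   # pre-fix minus measured post-rollout
assert round(saved) == 548                     # the ~550-core saving
```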
Anti-patterns¶
- "We think it's slow" optimization. If the function isn't on the flame graph / sample profile, don't touch it. Developer intuition is systematically biased toward elegant-looking code, not CPU-dominating code.
- Microbenching without a realistic input distribution. Uniform-random inputs can flatter certain structures; match the workload shape (miss rate, key-length distribution, hit clustering).
- Single-run benchmarks. Use a harness that reports medians + outliers + CI (criterion does this by default).
- Skipping the production-verification step. If prediction and measurement diverge, you've either benchmarked the wrong thing or the workload shape is different in production — either way, find out before the next optimization.
- Premature optimization at low req rates. The budget of engineering time this justifies scales with the CPU savings. At 1k QPS, 1 % CPU is fractions of a core; don't write a crate for it.
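To make the distribution and multi-run points concrete, a minimal hypothetical harness (Python standing in for criterion; names, miss rate, and key shapes invented for illustration):

```python
import random
import statistics
import time

def workload_inputs(n, miss_rate=0.9, seed=42):
    """Synthetic header names with a realistic miss rate: in
    clear-internal-headers-style code, most lookups do NOT hit
    an internal header, and uniform-random keys would hide that."""
    rng = random.Random(seed)
    internal = [f"cf-int-{i}" for i in range(30)]
    keys = []
    for _ in range(n):
        if rng.random() < miss_rate:
            keys.append("x-external-" + str(rng.randrange(10_000)))
        else:
            keys.append(rng.choice(internal))
    return keys, set(internal)

def bench(fn, inputs, runs=30):
    """Report the median wall time across runs, never a single sample."""
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn(inputs)
        times.append(time.perf_counter() - t0)
    return statistics.median(times)

keys, internal = workload_inputs(10_000)
t = bench(lambda ks: [k in internal for k in ks], keys)
assert t > 0
```

A real harness (criterion) adds outlier classification and confidence intervals on top of this; the sketch only shows why both the input shape and the statistics matter.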
Seen in¶
- sources/2024-09-10-cloudflare-a-good-day-to-trie-hard — Cloudflare's pingora-origin hot-path optimization. Stack-trace sampling + criterion + production re-sampling produced a 1.28 %-CPU / ~550-core saving with the prediction and measurement matching within 0.1 %.
- sources/2025-10-14-cloudflare-unpacking-cloudflare-workers-cpu-performance-benchmarks — Cloudflare applies the same discipline to Workers / V8 / OpenNext / Next.js / React. Profiling GC at 10-25 % of request time on the cf-vs-vercel-bench benchmark leads to (a) the V8 young-gen un-tune (~25 % benchmark win globally), (b) four concurrent upstream fixes (V8 JSON.parse-with-reviver patch, Node.js trig compile flag, OpenNext buffer-copy PRs, benchmark-repo PR), and (c) a re-tune of warm-isolate routing for CPU-bound workloads. Same methodology, different substrate. Also demonstrates the sibling bias-class concept — a benchmark not matched to the workload's shape can mis-attribute disparity.
- sources/2026-02-18-datadog-how-we-reduced-agent-go-binaries-up-to-77-percent — binary-size variant of the same loop with a different instrument set: profile with systems/go-size-analyzer (byte cost per dep), explain with systems/goda (reach(main, target) import paths), then a hack-first bounding move — comment out every optimization-disabler in the codebase to see the ceiling, then do the real source-level fix to reach it. Yielded 56-77 % per-binary reductions without feature removal. Demonstrates that the loop transfers cleanly from CPU / memory profiling to size engineering. Pair with patterns/upstream-the-fix for the ecosystem multiplier (Kubernetes inherited 16-37 % for free).
- (Related methodology, different substrate) sources/2026-04-21-figma-the-search-for-speed-in-figma — Figma used a different instrument (fix observability first), then a custom Go harness, to reproduce the same feedback loop for OpenSearch tuning.
- sources/2026-04-21-figma-supporting-faster-file-load-times-memory-optimizations-rust — Figma's Multiplayer memory profile identified the BTreeMap<u16, u64> as 60 % of per-file server memory; the fix was the flat sorted Vec that beat BTreeMap on both memory and deserialize time.
Related¶
- concepts/stack-trace-sampling-profiling — the production instrument.
- systems/criterion-rust — the Rust microbench substrate.
- patterns/custom-benchmarking-harness — the workload-level complement when a per-function microbench doesn't capture the right thing.
- patterns/performance-comparison-with-scientist — GitHub's Scientist library for safe A/B comparison under production traffic.
- concepts/observability — the umbrella discipline.