
PATTERN Cited by 3 sources

Measurement-driven micro-optimization

Intent

Pick the code worth optimizing by production profiling, not by taste; validate each candidate change against a repeatable benchmark; ship; then re-measure in production to confirm the predicted impact. Do nothing that isn't in that loop.

Context

On any large fleet there are thousands of functions and nobody can hand-pick the CPU hot spots. Developer intuition about "what's slow" is almost always wrong:

  • Pleasant-looking one-line helpers can dominate CPU.
  • Code with better asymptotic complexity can lose to asymptotically worse code because of cache locality (concepts/small-map-as-sorted-vec).
  • Stdlib / popular-crate structures may be tuned for a different workload shape than yours.
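The cache-locality point above can be made concrete. A minimal sketch (names and sizes hypothetical, not from the source): a linear scan over a tiny contiguous `Vec` gives the same answers as a `HashMap` lookup, and for small maps it often wins in practice because the whole structure fits in a cache line or two and there is no hashing cost.

```rust
use std::collections::HashMap;

// O(n) lookup over a tiny contiguous slice. For small n this can beat
// an O(1) HashMap lookup: no hashing, and the data is cache-resident.
fn vec_lookup<'a>(v: &'a [(u32, &'a str)], key: u32) -> Option<&'a str> {
    v.iter().find(|(k, _)| *k == key).map(|(_, s)| *s)
}

fn main() {
    let small: Vec<(u32, &str)> = (0..8).map(|i| (i, "v")).collect();
    let map: HashMap<u32, &str> = small.iter().copied().collect();

    // Same answers; only a measurement can say which is faster for
    // your key distribution and map size.
    assert_eq!(vec_lookup(&small, 3), map.get(&3).copied());
}
```

Which one actually wins is exactly the question step 2 of the Mechanism below answers with a benchmark, not by inspection.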

The only way through is measurement all the way down.

Mechanism

Three complementary instruments, in this order:

  1. Production profiling. Stack-trace sampling on real traffic surfaces the functions with meaningful CPU share. Pick a threshold worth engineering time (e.g., 1 % of fleet CPU).
  2. Per-candidate microbench. Write a criterion-style (or custom) benchmark on a workload-realistic input distribution, run every candidate implementation, and compare.
  3. Post-ship production verification. Deploy the winner. Re-sample production. Compare the measured improvement, 1.0 - new/old CPU share, against the predicted improvement, 1.0 - new-bench/old-bench. If the two match within a fraction of a percent, the benchmark is trustworthy — you can use it as a decision substrate for the next optimization.
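Step 2's real tool is criterion, but the shape of a per-candidate microbench is simple enough to sketch with only the standard library (harness and candidate names here are hypothetical): run each candidate many times on workload-realistic inputs and report the median, so one comparison number exists per candidate.

```rust
use std::time::Instant;

/// Median wall-clock time in nanoseconds for `f` over `runs` runs.
/// A stand-in for a criterion benchmark; a real harness also reports
/// outliers and confidence intervals.
fn median_ns(mut f: impl FnMut(), runs: usize) -> u128 {
    let mut samples: Vec<u128> = (0..runs)
        .map(|_| {
            let t = Instant::now();
            f();
            t.elapsed().as_nanos()
        })
        .collect();
    samples.sort_unstable();
    samples[samples.len() / 2]
}

fn main() {
    // Hypothetical workload: header names shaped like the real traffic.
    let headers: Vec<String> = (0..64).map(|i| format!("x-internal-{i}")).collect();

    // One candidate implementation; each alternative gets the same treatment.
    let linear_scan = || {
        std::hint::black_box(
            headers.iter().filter(|h| h.starts_with("x-internal-")).count(),
        );
    };

    println!("linear scan median: {} ns", median_ns(linear_scan, 101));
}
```

The key discipline is the input distribution: the slice fed to every candidate must match production's shape, or the medians are comparing the wrong thing.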

The loop compounds: each validated win makes the benchmark more trusted, and each trusted benchmark lets you move faster on the next one.

Complementary pattern: scaling CPU % linearly

Cloudflare's formula for predicting a candidate's production CPU share:

predicted CPU % = current CPU % × (new-bench-time / old-bench-time)

i.e. scale the profiled CPU share by the ratio of bench times. For pingora-origin clear_internal_headers (2024-09-10):

  • Original: 1.71 % CPU.
  • HashMap-based: predicted 1.71 % × 1.53/3.65 = 0.72 %; measured 0.82 %.
  • trie-hard: predicted 1.71 % × 0.93/3.65 = 0.43 %; measured 0.34 %.
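The arithmetic above can be checked directly; a one-function sketch of the formula (function name mine, numbers from the case above):

```rust
/// Scale the profiled production CPU share by the ratio of bench times.
fn predicted_cpu_pct(current_pct: f64, new_bench_us: f64, old_bench_us: f64) -> f64 {
    current_pct * (new_bench_us / old_bench_us)
}

fn main() {
    // pingora-origin clear_internal_headers, original at 1.71 % CPU,
    // original bench time 3.65 µs.
    let hashmap = predicted_cpu_pct(1.71, 1.53, 3.65);   // ≈ 0.717 %, quoted as 0.72 %
    let trie_hard = predicted_cpu_pct(1.71, 0.93, 3.65); // ≈ 0.436 %, quoted as 0.43 %

    println!("hashmap: {hashmap:.2} %, trie-hard: {trie_hard:.2} %");
}
```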

Predictions matched production within ~0.1 % in both cases — criterion and the formula together are a trustworthy methodology.

Canonical instance (Cloudflare, 2024-09-10)

End-to-end:

  1. Stack-trace sampling on pingora-origin flagged clear_internal_headers at 1.71 % of total CPU = ~680 cores across a 40,000-core fleet — a helper that looked routine by inspection.
  2. Criterion microbench swept HashMap / BTreeSet-FST / regex / radix_trie / custom trie-hard candidates on a synthetic request distribution matched to the real workload shape.
  3. trie-hard won at 0.93 µs (vs 3.65 µs original).
  4. Predicted CPU share: 0.43 %. Actual after July 2024 rollout: 0.34 %. Delta: 0.09 %. Methodology trusted for future runs.

Total saving: 1.28 % of pingora-origin CPU = ~550 cores — for what ended up as one new open-source crate (systems/trie-hard) and a handful of lines changed in clear_internal_headers.
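The fleet-scale arithmetic in step 1 is worth making explicit, since it is what decides whether a 1-2 % function clears the "worth engineering time" threshold. A trivial helper (name mine):

```rust
/// Cores consumed by a function holding the given fleet-wide CPU share.
fn cores_for_share(fleet_cores: f64, cpu_pct: f64) -> f64 {
    fleet_cores * cpu_pct / 100.0
}

fn main() {
    // 1.71 % of a 40,000-core fleet: ~684 cores (quoted above as ~680).
    println!("{:.0} cores", cores_for_share(40_000.0, 1.71));
}
```

The same helper, run against the post-rollout share, is the "Actual" side of step 4.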

Anti-patterns

  • "We think it's slow" optimization. If the function isn't on the flame graph / sample profile, don't touch it. Developer intuition is systematically biased toward elegant-looking code, not CPU-dominating code.
  • Microbenching without a realistic input distribution. Uniform-random inputs can flatter certain structures; match the workload shape (miss rate, key-length distribution, hit clustering).
  • Single-run benchmarks. Use a harness that reports medians, outliers, and confidence intervals (criterion does this by default).
  • Skipping the production-verification step. If prediction and measurement diverge, you've either benchmarked the wrong thing or the workload shape is different in production — either way, find out before the next optimization.
  • Premature optimization at low request rates. The engineering time this justifies scales with the absolute CPU savings. At 1k QPS, 1 % of CPU is a fraction of a core; don't write a crate for it.
