
CLOUDFLARE 2024-09-10 Tier 1


Cloudflare — A good day to trie-hard: saving compute 1% at a time

Summary

Cloudflare's pingora-origin service — the last Rust-proxy hop before a request leaves Cloudflare's CDN for the customer origin — runs at 35 M requests/sec globally and consumes compute at a rate of 40,000 saturated CPU cores. Stack-trace sampling (19 of 1,111 sampled traces) flagged a tiny, pleasant-looking helper clear_internal_headers — which loops over ~100 internal Cloudflare headers and calls remove_header on each — as consuming 1.71 % of total CPU (≈680 cores). Two successive optimizations brought it down: (1) flip the lookup direction — iterate the request's headers (avg 10-30) and check each against the internal set (a HashMap) instead of iterating the internal list and probing the request — 3.65 µs → 1.53 µs (2.39×, 1.0 % CPU saved); (2) ship a custom Rust trie crate (trie-hard) optimized for the workload (node relationships packed into the bits of unsigned integers, entire tree in one contiguous memory region) — 1.53 µs → 0.93 µs and 1.71 % → 0.43 % CPU (1.28 % saved, ≈550 cores). Production-sampled results match the criterion bench predictions within a fraction of a percent. The meta-lesson: knowing where your code is slow matters more than how you optimize — thank your observability team and read your flame graphs.

Key takeaways

  1. Flame-graph-driven optimization at CDN scale. Cloudflare runs ~60 M HTTP req/sec across the network; pingora-origin handles 35 M/sec of the outbound-to-origin subset, consuming compute at a rate of 40,000 saturated CPU cores. Statistical stack-trace sampling surfaces per-function CPU usage directly: 19 samples / 1,111 → 1.71 % → ≈680 cores on one helper. A 1 % saving at this tier is 400 cores, worth real engineering time. (concepts/stack-trace-sampling-profiling, patterns/measurement-driven-micro-optimization)
  2. Flip lookup direction when the lists are asymmetric. The original code looped over 100+ internal headers and called request_header.remove_header(h) for each — 100+ probes per request. Requests carry 10-30 headers, so iterating the request and checking each against the internal set (a HashMap) is an order of magnitude fewer reads. First step alone: 3.65 µs → 1.53 µs (2.39×), producing ~1.0 % CPU savings. The direction flip is the structural win — the container choice that follows is a constant factor. (patterns/flip-lookup-direction)
  3. O(1) hashmap read is a lie at key-length scale. Big-O notation hides that a hashmap read is constant over table size but linear over key length (O(L)) — you have to hash every byte of the key. For ~20-char HTTP header names checked 10-30× per request at 35 M req/sec, O(L) is the actual cost. The goal was a structure that does better than O(L), especially on misses (most headers aren't internal). (concepts/big-o-over-key-length)
  4. Tries give better-than-O(L) behaviour on misses. A trie is a prefix-tree structure where each edge is a character of the key set; at every depth you can eliminate all keys that don't start with that prefix. Read complexity: O(log(L)) on misses, O(L) on hits. >90 % of header checks are misses, so the miss path dominates — exactly the trie's strength. State machines / regex have the same property but feel "between a data structure and a parser"; a trie is the data-structure realization. (concepts/trie-data-structure)
  5. Off-the-shelf tries were too slow. crates.io tries (notably radix_trie, the fastest they benchmarked) are optimized for keyboard / autocomplete workloads, not tens of millions of reads per second on the hot path. radix_trie clocked a full microsecond slower than HashMap, so the abstract properties didn't pay out in practice. Writing a custom structure was the only path to the goal. (patterns/custom-data-structure-for-hot-path)
  6. trie-hard: bit-packed node relationships + contiguous memory. Cloudflare's new open-source systems/trie-hard crate stores inter-node edges as bits of unsigned integers and keeps the entire tree in one contiguous chunk of memory — both cache-locality and branch-prediction wins over pointer-chased node-per-allocation tries. Runtime: 0.93 µs (3.93× vs original, 1.65× vs HashMap), production sampling: 4 / 1,171 → 0.34 % CPU. Total saving: 1.71 % → 0.43 % = 1.28 % ≈ 550 cores on a 40,000-core fleet. The same cache-locality-over-asymptotic-complexity pattern as Figma's flat-sorted-Vec for Multiplayer property maps at a different workload scale.
  7. Benchmark predictions matched production measurement. Predicted CPU by scaling 1.71 % × (new-bench-time / old-bench-time): HashMap 0.72 % (actual 0.82 %), trie-hard 0.43 % (actual 0.34 %). Criterion-based microbench ≈ stack-trace-sampling production reading — the benchmarking harness is trustworthy enough to be the primary decision substrate. (systems/criterion-rust)
  8. The meta-lesson is observability, not optimization. "Knowing where your code is slow and by how much is more important than how you go about optimizing it." Flame graphs / stack-trace sampling turned what looked like a routine header-cleanup helper into a 680-core optimization target. Without the profile, nobody would touch clear_internal_headers; with it, the optimization-per-LOC ratio was extreme.
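The direction flip in takeaway 2 can be sketched in a few lines of Rust. This is a minimal illustration, not Pingora's actual API: header names are plain `String`s, the internal set is a `HashSet` (the post's intermediate step used a `HashMap`), and `strip_old` / `strip_new` are hypothetical names.

```rust
use std::collections::HashSet;

/// Old direction: probe the request once per internal header — 100+
/// probes per request regardless of how few headers the request carries.
fn strip_old(request: &mut Vec<String>, internal: &[String]) {
    for h in internal {
        request.retain(|r| r != h);
    }
}

/// New direction: walk the request's 10-30 headers once and test each
/// against a prebuilt set of internal names — an order of magnitude
/// fewer membership checks for typical requests.
fn strip_new(request: &mut Vec<String>, internal: &HashSet<String>) {
    request.retain(|h| !internal.contains(h));
}

fn main() {
    let internal: Vec<String> = ["cf-ray", "cf-connecting-ip"]
        .iter().map(|s| s.to_string()).collect();
    let internal_set: HashSet<String> = internal.iter().cloned().collect();

    let mut a = vec!["host".to_string(), "cf-ray".to_string(), "accept".to_string()];
    let mut b = a.clone();
    strip_old(&mut a, &internal);
    strip_new(&mut b, &internal_set);
    assert_eq!(a, b); // both leave only the non-internal headers
    println!("{:?}", a);
}
```

Both functions produce the same result; the flip changes which list drives the loop, which is the structural win before any container choice.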
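The miss-path property in takeaway 4 is easy to demonstrate with a toy trie. This pedagogical version deliberately uses pointer-chased `HashMap` children — the opposite of trie-hard's bit-packed contiguous layout — just to show how a miss bails out after inspecting only the first non-matching byte instead of hashing the whole key.

```rust
use std::collections::HashMap;

#[derive(Default)]
struct TrieNode {
    children: HashMap<u8, TrieNode>,
    terminal: bool,
}

#[derive(Default)]
struct Trie {
    root: TrieNode,
}

impl Trie {
    fn insert(&mut self, key: &str) {
        let mut node = &mut self.root;
        for b in key.bytes() {
            node = node.children.entry(b).or_default();
        }
        node.terminal = true;
    }

    /// Returns (hit, bytes_examined). A miss stops at the first byte with
    /// no matching edge, so most misses inspect far fewer than L bytes —
    /// the trie's advantage over a hash that must consume the whole key.
    fn contains(&self, key: &str) -> (bool, usize) {
        let mut node = &self.root;
        for (i, b) in key.bytes().enumerate() {
            match node.children.get(&b) {
                Some(next) => node = next,
                None => return (false, i + 1),
            }
        }
        (node.terminal, key.len())
    }
}

fn main() {
    let mut t = Trie::default();
    t.insert("cf-ray");
    t.insert("cf-connecting-ip");
    // A hit walks the full key; a miss like "accept" is rejected after
    // a single byte because no internal header starts with 'a'.
    assert_eq!(t.contains("cf-ray"), (true, 6));
    assert_eq!(t.contains("accept"), (false, 1));
}
```

With >90 % of header checks being misses, it is this early-rejection path that carries the workload.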

Operational numbers

  • 60 M HTTP req/sec — Cloudflare network total at time of post.
  • 35 M req/sec — pingora-origin outbound-to-origin rate.
  • 40,000 saturated CPU cores — pingora-origin compute-seconds/sec.
  • 100+ — count of Cloudflare internal headers to strip.
  • 10-30 — average HTTP headers per request (uniform distribution of internal vs non-internal in the synthetic bench).
  • clear_internal_headers benchmark (Rust criterion, one request's worth of headers):
  • Original loop over internal list: 3.65 µs.
  • Flip direction + HashMap<HeaderName>: 1.53 µs (2.39×).
  • BTreeSet via FST: ~50 ns slower than HashMap.
  • Regex / state machine: ~2× slower than HashMap.
  • radix_trie from crates.io: ~1 µs slower than HashMap.
  • trie-hard: 0.93 µs (3.93× vs original).
  • Production CPU-share (stack-trace sampling):
    Implementation   Samples with clear_internal_headers   Actual CPU %   Predicted CPU %
    Original         19 / 1,111                            1.71 %         n/a
    HashMap           9 / 1,103                            0.82 %         0.72 %
    trie-hard         4 / 1,171                            0.34 %         0.43 %
  • Net savings: 1.71 % → 0.43 % = 1.28 % of total pingora-origin CPU = ≈ 550 cores on a 40,000-core baseline.
  • Deployment: trie-hard has been running in production since July 2024 (post published 2024-09-10).
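The prediction arithmetic behind the table above is a single proportional scaling, shown here as a sketch (`predicted_cpu` is a hypothetical helper name; the rounded bench times give values close to, not exactly, the post's figures):

```rust
/// Predicted CPU share = measured old share × (new bench time / old bench time).
fn predicted_cpu(old_share_pct: f64, old_us: f64, new_us: f64) -> f64 {
    old_share_pct * (new_us / old_us)
}

fn main() {
    // Original: 3.65 µs at 1.71 % CPU. With the rounded published timings
    // these land near the post's 0.72 % and 0.43 % predictions.
    println!("HashMap:   {:.2} %", predicted_cpu(1.71, 3.65, 1.53));
    println!("trie-hard: {:.2} %", predicted_cpu(1.71, 3.65, 0.93));
}
```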

Caveats

  • Single-workload post. The "flip lookup direction" and "custom trie" wins are specific to an asymmetric-set membership hot path (100 internal names vs 10-30 request headers, >90 % miss rate). At a different cardinality balance, HashMap could be the global optimum.
  • Microbenchmarks are not workloads. Criterion measures an isolated function call with a synthetic request distribution (uniform internal/non-internal header mix). Cloudflare validates the extrapolation with production stack-trace sampling — the crucial step most optimization posts skip.
  • trie-hard is a niche crate. It beats radix_trie on this specific read-heavy, static-key-set, short-string workload. For write-heavy or very-long-string or dynamic key set workloads, off-the-shelf tries likely still win — the crate's README is explicit about the target shape.
  • No breakdown of power / dollar savings. The 1.28 % CPU saving translates to ~550 cores in the fleet, but the post doesn't convert that into $/year or W/year — readers should estimate themselves ($X/core/year × 550 at Cloudflare edge pricing).
