Cloudflare — A good day to trie-hard: saving compute 1% at a time¶
Summary¶
Cloudflare's pingora-origin service — the last Rust-proxy hop before a
request leaves Cloudflare's CDN for the customer origin — runs at
35 M requests/sec globally and consumes the equivalent of
40,000 fully saturated CPU cores of compute. Stack-trace sampling
(19 of 1,111 sampled stacks) flagged a tiny, pleasant-looking
helper clear_internal_headers
— which loops over ~100 internal Cloudflare headers and calls
remove_header on each — as consuming 1.71 % of total CPU (≈680
cores). Two successive optimizations brought it down:
(1) flip the lookup direction — iterate the request's headers (typically 10-30)
and check each against the internal set (a HashMap) instead of
iterating the internal list and probing the request — 3.65 µs → 1.53 µs
(2.39× / 1.0 % CPU saved); (2) ship a custom Rust trie crate
(trie-hard) optimized for the workload (node relationships packed in
unsigned-integer bits, entire tree in one contiguous memory region) —
1.53 µs → 0.93 µs, cutting CPU share from 1.71 % to 0.43 % (1.28 % saved,
≈550 cores). Production-sampled results match the criterion bench
predictions within a fraction of a percent. The meta-lesson: knowing
where your code is slow matters more than how you optimize — thank
your observability team and read your flame graphs.
Key takeaways¶
- Flame-graph-driven optimization at CDN scale. Cloudflare runs ~60 M HTTP req/sec across the network; pingora-origin handles 35 M/sec of the outbound-to-origin subset, consuming the equivalent of 40,000 saturated CPU cores. Statistical stack-trace sampling surfaces per-function CPU usage directly: 19 samples / 1,111 → 1.71 % → ≈680 cores on one helper. A 1 % saving at this tier is 400 cores, worth real engineering time. (concepts/stack-trace-sampling-profiling, patterns/measurement-driven-micro-optimization)
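The sample-count arithmetic above (19 of 1,111 stacks → 1.71 % → roughly 680 of 40,000 cores) can be checked directly; a minimal sketch:

```rust
/// Convert a stack-trace sample count into a CPU share, as in the
/// post's back-of-envelope math: share = hits / total samples.
fn cpu_share(hits: u64, total_samples: u64) -> f64 {
    hits as f64 / total_samples as f64
}

/// Translate a CPU share into a core-equivalent figure for a fleet.
fn core_equivalent(share: f64, fleet_cores: f64) -> f64 {
    share * fleet_cores
}

fn main() {
    // 19 of 1,111 samples contained clear_internal_headers.
    let share = cpu_share(19, 1111);
    println!("CPU share: {:.2} %", share * 100.0);                // ~1.71 %
    println!("cores: {:.0}", core_equivalent(share, 40_000.0));   // ~684, the post rounds to ≈680
}
```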
- Flip lookup direction when the lists are asymmetric. The original code looped over 100+ internal headers and called request_header.remove_header(h) for each — 100+ probes per request. Requests carry 10-30 headers, so iterating the request and checking each against the internal set (a HashMap) is an order of magnitude fewer reads. First step alone: 3.65 µs → 1.53 µs (2.39×), producing ~1.0 % CPU savings. The direction flip is the structural win — the container choice that follows is a constant factor. (patterns/flip-lookup-direction)
- O(1) hashmap reads are a lie at key-length scale. Big-O notation hides that a hashmap read is constant over table size but linear over key length (O(L)) — you have to hash every byte of the key. For ~20-char HTTP header names checked 10-30× per request at 35 M req/sec, O(L) is the actual cost. The goal was a structure that does better than O(L), especially on misses (most headers aren't internal). (concepts/big-o-over-key-length)
- Tries give better-than-O(L) behaviour on misses. A trie is a prefix-tree structure where each edge is a character of the key set; at every depth you can eliminate all keys that don't start with that prefix. Read complexity: O(log L) on misses, O(L) on hits. >90 % of header checks are misses, so the miss path dominates — exactly the trie's strength. State machines / regexes have the same property but sit "between a data structure and a parser"; a trie is the data-structure realization. (concepts/trie-data-structure)
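The lookup flip can be sketched with std types (a minimal sketch: plain (name, value) pairs and a HashSet stand in for pingora's request-header type and the post's HashMap-backed set; the header names are hypothetical):

```rust
use std::collections::HashSet;

/// Sketch of the flipped lookup: walk the request's 10-30 headers once
/// and probe the static internal-name set, instead of probing the
/// request once per each of the 100+ internal names.
fn clear_internal_headers(headers: &mut Vec<(String, String)>, internal: &HashSet<&str>) {
    headers.retain(|(name, _)| !internal.contains(name.as_str()));
}

fn main() {
    // Hypothetical internal header names, for illustration only.
    let internal: HashSet<&str> = ["x-internal-a", "x-internal-b"].into_iter().collect();
    let mut headers = vec![
        ("host".to_string(), "example.com".to_string()),
        ("x-internal-a".to_string(), "secret".to_string()),
        ("accept".to_string(), "*/*".to_string()),
    ];
    clear_internal_headers(&mut headers, &internal);
    // Only the two non-internal headers survive.
    assert_eq!(
        headers.iter().map(|(n, _)| n.as_str()).collect::<Vec<_>>(),
        vec!["host", "accept"]
    );
}
```

Each request now costs one set probe per request header rather than one request probe per internal name, which is where the 2.39× comes from before any container tuning.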
- Off-the-shelf tries were too slow. crates.io tries (notably radix_trie, the fastest they benchmarked) are optimized for keyboard/autocomplete workloads, not tens of millions of reads per second on the hot path. radix_trie clocked a full microsecond slower than HashMap, so the abstract properties didn't pay out in practice. Writing a custom structure was the only path to the goal. (patterns/custom-data-structure-for-hot-path)
- trie-hard: bit-packed node relationships + contiguous memory. Cloudflare's new open-source systems/trie-hard crate stores inter-node edges as bits of unsigned integers and keeps the entire tree in one contiguous chunk of memory — both cache-locality and branch-prediction wins over pointer-chased node-per-allocation tries. Runtime: 0.93 µs (3.93× vs original, 1.65× vs HashMap); production sampling: 4 / 1,171 → 0.34 % CPU. Total saving: 1.71 % → 0.43 % = 1.28 % ≈ 550 cores on a 40,000-core fleet. The same cache-locality-over-asymptotic-complexity pattern as Figma's flat-sorted-Vec for Multiplayer property maps, at a different workload scale.
- Benchmark predictions matched production measurement. Predicted CPU by scaling 1.71 % × (new bench time / old bench time): HashMap 0.72 % (actual 0.82 %), trie-hard 0.43 % (actual 0.34 %). Criterion-based microbench ≈ stack-trace-sampling production reading — the benchmarking harness is trustworthy enough to be the primary decision substrate. (systems/criterion-rust)
- The meta-lesson is observability, not optimization. "Knowing where your code is slow and by how much is more important than how you go about optimizing it." Flame graphs / stack-trace sampling turned what looked like a routine header-cleanup helper into a 680-core optimization target. Without the profile, nobody would touch clear_internal_headers; with it, the optimization-per-LOC ratio was extreme.
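A minimal sketch of the bit-packing idea the trie-hard bullet describes: all nodes in one contiguous Vec, each node's child edges as bits of an unsigned integer, child index recovered via popcount. This is an assumed simplification for illustration, not the crate's actual layout or API:

```rust
/// One trie node: a 27-bit mask over the alphabet [a-z, '-'] plus a
/// popcount-ordered list of child indices into the flat node arena.
struct Node {
    mask: u32,          // bit i set => a child exists for alphabet index i
    children: Vec<u32>, // indices into FlatTrie::nodes, in bit order
    terminal: bool,     // true if a key ends at this node
}

/// All nodes live in one contiguous Vec (nodes[0] is the root),
/// rather than one heap allocation per node.
struct FlatTrie {
    nodes: Vec<Node>,
}

/// Map a byte to its alphabet index; None for out-of-alphabet bytes.
fn index_of(b: u8) -> Option<u32> {
    match b {
        b'a'..=b'z' => Some((b - b'a') as u32),
        b'-' => Some(26),
        _ => None,
    }
}

impl FlatTrie {
    fn new(keys: &[&str]) -> Self {
        let mut trie = FlatTrie {
            nodes: vec![Node { mask: 0, children: Vec::new(), terminal: false }],
        };
        for key in keys {
            let mut cur = 0usize;
            for &b in key.as_bytes() {
                let i = index_of(b).expect("key outside toy alphabet");
                let bit = 1u32 << i;
                // Rank = number of set bits below this one = child slot.
                let rank = (trie.nodes[cur].mask & (bit - 1)).count_ones() as usize;
                if trie.nodes[cur].mask & bit == 0 {
                    let next = trie.nodes.len() as u32;
                    trie.nodes.push(Node { mask: 0, children: Vec::new(), terminal: false });
                    trie.nodes[cur].mask |= bit;
                    trie.nodes[cur].children.insert(rank, next);
                    cur = next as usize;
                } else {
                    cur = trie.nodes[cur].children[rank] as usize;
                }
            }
            trie.nodes[cur].terminal = true;
        }
        trie
    }

    /// Misses bail out at the first absent edge (one AND + popcount),
    /// often long before every byte of the key has been touched —
    /// the miss-path advantage the post relies on.
    fn contains(&self, key: &str) -> bool {
        let mut cur = 0usize;
        for &b in key.as_bytes() {
            let Some(i) = index_of(b) else { return false };
            let bit = 1u32 << i;
            let node = &self.nodes[cur];
            if node.mask & bit == 0 {
                return false; // early miss
            }
            let rank = (node.mask & (bit - 1)).count_ones() as usize;
            cur = node.children[rank] as usize;
        }
        self.nodes[cur].terminal
    }
}

fn main() {
    // Hypothetical internal names, for illustration only.
    let trie = FlatTrie::new(&["x-internal-a", "x-internal-b"]);
    assert!(trie.contains("x-internal-a"));
    assert!(!trie.contains("host")); // rejected at the very first byte
}
```

The contiguous arena is what buys cache locality; the bitmask-plus-popcount child lookup is what replaces pointer chasing with branch-predictable integer ops.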
Operational numbers¶
- 60 M HTTP req/sec — Cloudflare network total at time of post.
- 35 M req/sec — pingora-origin outbound-to-origin rate.
- 40,000 saturated CPU cores — pingora-origin compute-seconds/sec.
- 100+ — count of Cloudflare internal headers to strip.
- 10-30 — average HTTP headers per request (uniform distribution of internal vs non-internal in the synthetic bench).
- clear_internal_headers benchmark (Rust criterion, one request's worth of headers):
  - Original loop over internal list: 3.65 µs.
  - Flip direction + HashMap<HeaderName>: 1.53 µs (2.39×).
  - BTreeSet via FST: ~50 ns slower than HashMap.
  - Regex / state machine: ~2× slower than HashMap.
  - radix_trie from crates.io: ~1 µs slower than HashMap.
  - trie-hard: 0.93 µs (3.93× vs original).
- Production CPU-share (stack-trace sampling):
| Implementation | Samples containing clear_internal_headers | Actual CPU % | Predicted CPU % |
|---|---|---|---|
| Original | 19 / 1,111 | 1.71 % | n/a |
| HashMap | 9 / 1,103 | 0.82 % | 0.72 % |
| trie-hard | 4 / 1,171 | 0.34 % | 0.43 % |
- Net savings: 1.71 % → 0.43 % = 1.28 % of total pingora-origin CPU ≈ 550 cores on a 40,000-core baseline.
- Deployment: trie-hard has been running in production since July 2024 (post published 2024-09-10).
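The predicted-CPU column above can be reproduced by scaling the measured baseline share by the criterion bench-time ratio (a sketch of the post's extrapolation):

```rust
/// Predict the post-change production CPU share by scaling the measured
/// baseline share by the ratio of criterion bench times.
fn predict_cpu_pct(baseline_pct: f64, old_us: f64, new_us: f64) -> f64 {
    baseline_pct * new_us / old_us
}

fn main() {
    // Baseline: 1.71 % CPU at 3.65 µs per call.
    let hashmap = predict_cpu_pct(1.71, 3.65, 1.53);   // ~0.72 % (measured 0.82 %)
    let trie_hard = predict_cpu_pct(1.71, 3.65, 0.93); // ~0.44 %, the post rounds to 0.43 % (measured 0.34 %)
    println!("HashMap: {hashmap:.2} %, trie-hard: {trie_hard:.2} %");
}
```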
Caveats¶
- Single-workload post. The "flip lookup direction" and "custom trie" wins are specific to an asymmetric-set membership hot path (100 internal names vs 10-30 request headers, >90 % miss rate). At a different cardinality balance, HashMap could be the global optimum.
- Microbenchmarks are not workloads. Criterion measures an isolated function call with a synthetic request distribution (uniform internal/non-internal header mix). Cloudflare validates the extrapolation with production stack-trace sampling — the crucial step most optimization posts skip.
- trie-hard is a niche crate. It beats radix_trie on this specific read-heavy, static-key-set, short-string workload. For write-heavy, very-long-string, or dynamic-key-set workloads, off-the-shelf tries likely still win — the crate's README is explicit about the target shape.
- No breakdown of power / dollar savings. The 1.28 % CPU saving translates to ~550 cores in the fleet, but the post doesn't convert that into $/year or W/year — readers should estimate themselves ($X/core/year × 550 at Cloudflare edge pricing).
Links¶
- Original: https://blog.cloudflare.com/pingora-saving-compute-1-percent-at-a-time/
- Raw: raw/cloudflare/2024-09-10-a-good-day-to-trie-hard-saving-compute-1-at-a-time-f90bdca0.md
- Pingora open-source: https://github.com/cloudflare/pingora
- Pingora announcement: https://blog.cloudflare.com/pingora-open-source/
- trie-hard open-source: https://github.com/cloudflare/trie-hard
- Criterion: https://crates.io/crates/criterion
- HN discussion: https://news.ycombinator.com/item?id=41501496 (614 points)
Cross-wiki notes¶
- Cache-locality-beats-asymptotic-complexity is a recurring wiki theme — Figma's Multiplayer flat-sorted-Vec (replacing BTreeMap, O(n)-insert but cache-local on ~640 bytes of contiguous entries) is the second canonical wiki instance after this one. Both cases hinge on a schema-bounded small N and a known workload shape.
- Measurement-driven optimization is a wiki-level principle. Related instances: patterns/custom-benchmarking-harness (Figma's OpenSearch Go harness written in an afternoon), concepts/metric-granularity-mismatch (Figma's 8 ms/1 s gap resolved by reading
took instead of per-shard timings), patterns/performance-comparison-with-scientist (GitHub's Scientist library for production A/B).
- Pingora + trie-hard anchor the Cloudflare company page's pre-AI systems shelf — companies/cloudflare now has both its high-throughput Rust proxy (this post) and its AI engineering stack (sources/2026-04-20-cloudflare-internal-ai-engineering-stack).