
CLOUDFLARE 2024-09-10 Tier 1


Cloudflare — A good day to trie-hard: saving compute 1% at a time

Summary

Cloudflare's pingora-origin service — the last Rust-proxy hop before a request leaves Cloudflare's CDN for the customer origin — runs at 35 M requests/sec globally and consumes compute at a rate of 40,000 saturated CPU cores. Stack-trace sampling (19 of 1,111 sampled traces) flagged a tiny, pleasant-looking helper clear_internal_headers — which loops over ~100 internal Cloudflare headers and calls remove_header on each — as consuming 1.71 % of total CPU (≈680 cores). Two successive optimizations brought it down: (1) flip the lookup direction — iterate the request's headers (avg 10-30) and check each against the internal set (a HashMap) instead of iterating the internal list and probing the request — 3.65 µs → 1.53 µs (2.39×, 1.0 % CPU saved); (2) ship a custom Rust trie crate (trie-hard) optimized for the workload (node relationships packed into the bits of unsigned integers, entire tree in one contiguous memory region) — 1.53 µs → 0.93 µs and 1.71 % → 0.43 % CPU (1.28 % saved, ≈550 cores). Production-sampled results match the criterion bench predictions within a fraction of a percent. The meta-lesson: knowing where your code is slow matters more than how you optimize — thank your observability team and read your flame graphs.

Key takeaways

  1. Flame-graph-driven optimization at CDN scale. Cloudflare runs ~60 M HTTP req/sec across the network; pingora-origin handles 35 M/sec of the outbound-to-origin subset, consuming compute at a rate of 40,000 saturated CPU cores. Statistical stack-trace sampling surfaces per-function CPU usage directly: 19 samples / 1,111 → 1.71 % → ≈680 cores on one helper. A 1 % saving at this tier is 400 cores, worth real engineering time. (concepts/stack-trace-sampling-profiling, patterns/measurement-driven-micro-optimization)
  2. Flip lookup direction when the lists are asymmetric. The original code looped over 100+ internal headers and called request_header.remove_header(h) for each — 100+ probes per request. Requests carry 10-30 headers, so iterating the request and checking each against the internal set (a HashMap) is an order of magnitude fewer reads. First step alone: 3.65 µs → 1.53 µs (2.39×), producing ~1.0 % CPU savings. The direction flip is the structural win — the container choice that follows is a constant factor. (patterns/flip-lookup-direction)
  3. O(1) hashmap read is a lie at key-length scale. Big-O notation hides that a hashmap read is constant over table size but linear over key length (O(L)) — you have to hash every byte of the key. For ~20-char HTTP header names checked 10-30× per request at 35 M req/sec, O(L) is the actual cost. The goal was a structure that does better than O(L), especially on misses (most headers aren't internal). (concepts/big-o-over-key-length)
  4. Tries give better-than-O(L) behaviour on misses. A trie is a prefix-tree structure where each edge is a character of the key set; at every depth you can eliminate all keys that don't start with that prefix. Read complexity: O(log(L)) on misses, O(L) on hits. >90 % of header checks are misses, so the miss path dominates — exactly the trie's strength. State machines / regex have the same property but feel "between a data structure and a parser"; a trie is the data-structure realization. (concepts/trie-data-structure)
  5. Off-the-shelf tries were too slow. crates.io tries (notably radix_trie, the fastest they benchmarked) are optimized for keyboard / autocomplete workloads, not tens of millions of reads per second on the hot path. radix_trie clocked a full microsecond slower than HashMap, so the abstract properties didn't pay out in practice. Writing a custom structure was the only path to the goal. (patterns/custom-data-structure-for-hot-path)
  6. trie-hard: bit-packed node relationships + contiguous memory. Cloudflare's new open-source systems/trie-hard crate stores inter-node edges as bits of unsigned integers and keeps the entire tree in one contiguous chunk of memory — both cache-locality and branch-prediction wins over pointer-chased node-per-allocation tries. Runtime: 0.93 µs (3.93× vs original, 1.65× vs HashMap), production sampling: 4 / 1,171 → 0.34 % CPU. Total saving: 1.71 % → 0.43 % = 1.28 % ≈ 550 cores on a 40,000-core fleet. The same cache-locality-over-asymptotic-complexity pattern as Figma's flat-sorted-Vec for Multiplayer property maps at a different workload scale.
  7. Benchmark predictions matched production measurement. Predicted CPU by scaling 1.71 % × (new-bench-time / old-bench-time): HashMap 0.72 % (actual 0.82 %), trie-hard 0.43 % (actual 0.34 %). Criterion-based microbench ≈ stack-trace-sampling production reading — the benchmarking harness is trustworthy enough to be the primary decision substrate. (systems/criterion-rust)
  8. The meta-lesson is observability, not optimization. "Knowing where your code is slow and by how much is more important than how you go about optimizing it." Flame graphs / stack-trace sampling turned what looked like a routine header-cleanup helper into a 680-core optimization target. Without the profile, nobody would touch clear_internal_headers; with it, the optimization-per-LOC ratio was extreme.
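The direction flip in takeaway 2 can be sketched in a few lines of Rust. This is a minimal illustration, not Pingora's actual API: header names are plain `String`s, the internal set is a `HashSet` (the post's intermediate step used a `HashMap`), and `strip_old` / `strip_new` are hypothetical names.

```rust
use std::collections::HashSet;

/// Old direction: probe the request once per internal header — 100+
/// probes per request regardless of how few headers the request carries.
fn strip_old(request: &mut Vec<String>, internal: &[String]) {
    for h in internal {
        request.retain(|r| r != h);
    }
}

/// New direction: walk the request's 10-30 headers once and test each
/// against a prebuilt set of internal names — an order of magnitude
/// fewer membership checks for typical requests.
fn strip_new(request: &mut Vec<String>, internal: &HashSet<String>) {
    request.retain(|h| !internal.contains(h));
}

fn main() {
    let internal: Vec<String> = ["cf-ray", "cf-connecting-ip"]
        .iter().map(|s| s.to_string()).collect();
    let internal_set: HashSet<String> = internal.iter().cloned().collect();

    let mut a = vec!["host".to_string(), "cf-ray".to_string(), "accept".to_string()];
    let mut b = a.clone();
    strip_old(&mut a, &internal);
    strip_new(&mut b, &internal_set);
    assert_eq!(a, b); // both leave only the non-internal headers
    println!("{:?}", a);
}
```

Both functions produce the same result; the flip changes which list drives the loop, which is the structural win before any container choice.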
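The miss-path property in takeaway 4 is easy to demonstrate with a toy trie. This pedagogical version deliberately uses pointer-chased `HashMap` children — the opposite of trie-hard's bit-packed contiguous layout — just to show how a miss bails out after inspecting only the first non-matching byte instead of hashing the whole key.

```rust
use std::collections::HashMap;

#[derive(Default)]
struct TrieNode {
    children: HashMap<u8, TrieNode>,
    terminal: bool,
}

#[derive(Default)]
struct Trie {
    root: TrieNode,
}

impl Trie {
    fn insert(&mut self, key: &str) {
        let mut node = &mut self.root;
        for b in key.bytes() {
            node = node.children.entry(b).or_default();
        }
        node.terminal = true;
    }

    /// Returns (hit, bytes_examined). A miss stops at the first byte with
    /// no matching edge, so most misses inspect far fewer than L bytes —
    /// the trie's advantage over a hash that must consume the whole key.
    fn contains(&self, key: &str) -> (bool, usize) {
        let mut node = &self.root;
        for (i, b) in key.bytes().enumerate() {
            match node.children.get(&b) {
                Some(next) => node = next,
                None => return (false, i + 1),
            }
        }
        (node.terminal, key.len())
    }
}

fn main() {
    let mut t = Trie::default();
    t.insert("cf-ray");
    t.insert("cf-connecting-ip");
    // A hit walks the full key; a miss like "accept" is rejected after
    // a single byte because no internal header starts with 'a'.
    assert_eq!(t.contains("cf-ray"), (true, 6));
    assert_eq!(t.contains("accept"), (false, 1));
}
```

With >90 % of header checks being misses, it is this early-rejection path that carries the workload.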

Operational numbers

  • 60 M HTTP req/sec — Cloudflare network total at time of post.
  • 35 M req/sec — pingora-origin outbound-to-origin rate.
  • 40,000 saturated CPU cores — pingora-origin compute-seconds/sec.
  • 100+ — count of Cloudflare internal headers to strip.
  • 10-30 — average HTTP headers per request (uniform distribution of internal vs non-internal in the synthetic bench).
  • clear_internal_headers benchmark (Rust criterion, one request's worth of headers):
  • Original loop over internal list: 3.65 µs.
  • Flip direction + HashMap<HeaderName>: 1.53 µs (2.39×).
  • BTreeSet via FST: ~50 ns slower than HashMap.
  • Regex / state machine: ~2× slower than HashMap.
  • radix_trie from crates.io: ~1 µs slower than HashMap.
  • trie-hard: 0.93 µs (3.93× vs original).
  • Production CPU-share (stack-trace sampling):
    Implementation   Samples with clear_internal_headers   Actual CPU %   Predicted CPU %
    Original         19 / 1,111                            1.71 %         n/a
    HashMap           9 / 1,103                            0.82 %         0.72 %
    trie-hard         4 / 1,171                            0.34 %         0.43 %
  • Net savings: 1.71 % → 0.43 % = 1.28 % of total pingora-origin CPU = ≈ 550 cores on a 40,000-core baseline.
  • Deployment: trie-hard has been running in production since July 2024 (post published 2024-09-10).
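The prediction arithmetic behind the table above is a single proportional scaling, shown here as a sketch (`predicted_cpu` is a hypothetical helper name; the rounded bench times give values close to, not exactly, the post's figures):

```rust
/// Predicted CPU share = measured old share × (new bench time / old bench time).
fn predicted_cpu(old_share_pct: f64, old_us: f64, new_us: f64) -> f64 {
    old_share_pct * (new_us / old_us)
}

fn main() {
    // Original: 3.65 µs at 1.71 % CPU. With the rounded published timings
    // these land near the post's 0.72 % and 0.43 % predictions.
    println!("HashMap:   {:.2} %", predicted_cpu(1.71, 3.65, 1.53));
    println!("trie-hard: {:.2} %", predicted_cpu(1.71, 3.65, 0.93));
}
```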

Caveats

  • Single-workload post. The "flip lookup direction" and "custom trie" wins are specific to an asymmetric-set membership hot path (100 internal names vs 10-30 request headers, >90 % miss rate). At a different cardinality balance, HashMap could be the global optimum.
  • Microbenchmarks are not workloads. Criterion measures an isolated function call with a synthetic request distribution (uniform internal/non-internal header mix). Cloudflare validates the extrapolation with production stack-trace sampling — the crucial step most optimization posts skip.
  • trie-hard is a niche crate. It beats radix_trie on this specific read-heavy, static-key-set, short-string workload. For write-heavy or very-long-string or dynamic key set workloads, off-the-shelf tries likely still win — the crate's README is explicit about the target shape.
  • No breakdown of power / dollar savings. The 1.28 % CPU saving translates to ~550 cores in the fleet, but the post doesn't convert that into $/year or W/year — readers should estimate themselves ($X/core/year × 550 at Cloudflare edge pricing).
