Hot-cold code splitting¶
Definition¶
Hot-cold code splitting is a compiler / binary-optimiser transformation that separates frequently-executed and rarely-executed parts of a function (or of the whole binary) into distinct memory regions. The effect: rarely-executed code (error handlers, rare branches, debug-only paths) is evicted from the hot execution path's instruction-cache footprint, improving i-cache density and reducing frontend-bound stalls.
Enabled by PGO, LLVM BOLT, and similar feedback-directed optimisation passes.
Two scopes¶
Hot-cold splitting operates at two scopes:
- Intra-function splitting — within a single function, the rarely-taken branches (e.g. the `fail:` label of an error-check pattern) are moved out of the function's hot body and into a separate `.text.cold` section. The hot body becomes smaller and contiguous.
- Function-level splitting — across the binary, rarely-called functions are relocated to a cold region of `.text`, leaving only hot functions in the main region. The i-cache "working set" shrinks.
Redpanda's canonical framing (Source: sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization):
"PGO addresses this directly. Using profile data, the compiler identifies which functions and branches are hit most often, then reorganizes code accordingly by grouping hot blocks together and splitting functions into hot and cold segments."
Why splitting works¶
The L1 instruction cache is a small, fixed-capacity resource (~32 KB on modern x86). Every byte of code in the hot path competes for that capacity. If an `if (unlikely(err)) { ... 100 lines of error handling ... }` block sits inline in a hot function, those 100 lines of rarely-executed code occupy i-cache every time the function runs — paying for their capacity footprint without delivering executed-instruction value.
Splitting moves that 100-line block to a cold section. When the error path is taken (rarely), the i-cache fetch incurs a miss — acceptable because it's rare. When the error path is not taken (the common case), the cold block never occupies the hot footprint.
The iTLB (instruction TLB) benefits similarly: fewer unique 4 KB pages touched in hot execution → higher iTLB hit rate. Verbatim from Redpanda 2026-04-02 (Source: sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization): "Tighter hot path packing improves instruction cache locality and cuts down on iTLB lookups, which means the CPU spends less time fetching code and more time executing it."
Visual evidence: the binary heatmap¶
BOLT provides a tool that visualises per-12-KiB code-access frequency across the binary. Before-and-after heatmaps from Redpanda's 2026-04-02 post show the effect in pictures:
- Baseline: "access is scattered throughout the binary. While there are bands of hotter code, there are many individual hot chunks."
- PGO-optimized: "all hot functions are packed tightly at the start of the binary, not because the start is special, but because hot code is now concentrated in one place rather than scattered."
Verbatim explanation (Source: sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization): "yellow is significantly hotter in the PGO case, confirming denser, more concentrated code access despite there being less red."
Enabling this in practice¶
- Clang / LLVM PGO: the `-fprofile-use` flag on the phase-2 rebuild automatically splits hot and cold code at both scopes.
- GCC: `-freorder-blocks-and-partition` (enabled by default at `-O2` with profile data).
- LLVM BOLT: the `-split-functions` flag applies intra-function splitting post-link.
The cost¶
Splitting is not free:
- Binary size grows — rarely-executed cold code is still present, now with extra section boundaries. Typical impact: +0-5%.
- Cold-path first-call latency — when an error path or rare branch fires for the first time, it incurs an i-cache miss (possibly paging the cold section in from disk). Acceptable trade-off for rare paths.
- Debug symbol complexity — split functions may require debugger awareness to rejoin hot + cold segments for stack traces.
Seen in¶
- sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization — canonical wiki source. Redpanda 26.1 C++ broker uses clang PGO's hot-cold splitting + block grouping + profile-driven inlining; measured 13-point reduction in TMA frontend-bound.
Related¶
- concepts/profile-guided-optimization — the compile-time vehicle.
- concepts/llvm-bolt-post-link-optimizer — the post-link vehicle.
- concepts/instruction-cache-locality — the property splitting optimises.
- concepts/frontend-bound-vs-backend-bound-cpu-stall — the TMA axis that indicates splitting will pay off.
- concepts/feedback-directed-optimization — the umbrella family.
- systems/clang / systems/llvm-bolt / systems/meta-bolt-binary-optimizer — the tooling.
- systems/redpanda — Tier-3 canonical example.
- patterns/pgo-for-frontend-bound-application — the apply pattern.