
CONCEPT

Instruction-cache locality

Definition

Instruction-cache locality is the property that a program's hot execution path fits in the CPU's instruction cache and iTLB and is laid out contiguously enough that sequential prefetching keeps the fetch pipeline fed. Equivalently: on the hot path, instruction fetches hit the i-cache rather than miss.

This is the instruction-side sibling of data-cache locality. The two operate on the same CPU microarchitecture principles (capacity, set associativity, TLB translation, prefetch-ahead) but target different caches and are optimised by different tooling.

Why it matters

A modern x86 L1 instruction cache is ~32 KB; the L1 iTLB holds a few hundred page translations. When a hot path's byte footprint exceeds i-cache capacity, or spans more code pages than the iTLB can map, every branch becomes a potential miss:

  • L1-i miss — fetch stalls waiting for L2 / L3 / memory. A miss-to-memory can cost 100+ cycles.
  • iTLB miss — a page-table walk is required to resolve the virtual-to-physical code-page translation. Mapping code with (transparent) huge pages cuts the number of translations needed, but any walk that still occurs is expensive.
  • Branch resteer — a taken branch the predictor mishandled forces a fetch from a different cache line / page, compounding the costs above.

The result is a frontend-bound workload (see concepts/frontend-bound-vs-backend-bound-cpu-stall) — decoded instructions aren't reaching the execution backend fast enough.

The Redpanda datum

Redpanda Streaming's 2026-04-02 disclosure (Source: sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization) is the canonical wiki example. A C++ streaming broker with many code paths, small batches, and per-request CPU overhead exhibits:

  • 51% frontend-bound on the baseline build.
  • 37.9% frontend-bound after PGO applies code-layout optimisations.

Verbatim diagnosis: "frontend-bound means the CPU can't load instructions fast enough for the backend to execute. The root cause is code locality: the hot path is scattered across the executable rather than packed tightly together. This fragments the instruction cache, leading to high-latency memory fetches."

How compilers fix it

Without profile data, compilers guess hot paths from heuristics (loop bodies, __builtin_expect hints, nesting depth). With profile data, they decide from measured execution frequencies:

  1. Basic-block reordering — pack frequently executed basic blocks into sequential cache lines.
  2. Hot-cold splitting — move rarely-executed blocks / functions to a separate cold section that doesn't compete for hot i-cache capacity.
  3. Aggressive inlining of hot callees — eliminate call / return instructions that disrupt fetch prefetching.
  4. Conservative inlining of cold callees — don't bloat the hot footprint with code that rarely runs.

Redpanda verbatim (Source: sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization):

"Using profile data, the compiler identifies which functions and branches are hit most often, then reorganizes code accordingly by grouping hot blocks together and splitting functions into hot and cold segments. Inlining decisions are also profile-driven, allowing frequently called functions to be inlined more aggressively."

Visual evidence

BOLT provides a heatmap tool that visualises per-12-KiB code access frequency. On Redpanda Streaming:

  • Baseline — hot accesses "scattered throughout the binary" with "many individual hot chunks."
  • PGO-optimized — "all hot functions are packed tightly at the start of the binary, not because the start is special, but because hot code is now concentrated in one place rather than scattered. Access to the rest of the binary is minimal."

The visual shows instruction-cache locality as a packing density property. Verbatim mechanism: "Tighter hot path packing improves instruction cache locality and cuts down on iTLB lookups, which means the CPU spends less time fetching code and more time executing it."

Compared to data-cache locality

Axis                  | Data-cache locality                    | Instruction-cache locality
--------------------- | -------------------------------------- | --------------------------
Cache                 | L1-d (32 KB), L2 (shared)              | L1-i (32 KB), iTLB, uop cache
Optimisation target   | Data layout (SoA, cache-line padding)  | Code layout (hot-cold splitting, reordering)
Canonical signature   | Memory-bound stalls                    | Frontend-bound stalls
Compiler pass         | Loop interchange, prefetch             | PGO basic-block ordering
Post-link tool        | Rare (data layout is link-time final)  | BOLT
Canonical wiki source | Expedia Kafka Streams colocation       | Redpanda PGO post

Both share capacity- and associativity-dominated behaviour and TLB implications; both benefit from contiguous access patterns, and neither is helped by purely arithmetic optimisations.

When this matters most

  • Large C++ services (streaming brokers, databases, application servers) — hot path spans hundreds of functions across millions of LoC.
  • Interpreters / VMs — the dispatch loop plus bytecode handlers form a natural scatter pattern.
  • Polymorphic / virtual-call-heavy code — indirect calls defeat sequential prefetching, and vtable dispatch adds its own indirect-branch and i-cache pressure.
  • Microservices with many small RPC handlers — lots of per-call infrastructure code (serialisation, logging, auth, metrics) that dwarfs the actual business logic.
