

CPU cache hierarchy (L1 / L2 / L3)

Definition

Modern CPUs do not read directly from RAM. They read from a hierarchy of small, fast on-die caches that sit between the execution units and main memory:

  • L1 — closest to the core (~4 cycles access), smallest (typically 32–64 KB per core, split I-cache / D-cache).
  • L2 — still per-core or shared across a small group (~10 cycles access), larger (typically 256 KB – 1 MB per core).
  • L3 — shared across all cores of a socket (~30–40 cycles access), largest (typically tens of MB per socket).
  • RAM — ~100–300 cycles access, GB scale. The backing store the CPU cache hierarchy sits on top of.

The principle is identical to every other caching tier on the wiki: pair a small amount of expensive fast storage with a large amount of cheap slow storage. (Source: sources/2025-07-08-planetscale-caching.)

Ben Dicken's framing

From the 2025-07-08 post:

Modern CPUs have one or more cache layers for RAM. Though RAM is fast, a cache built directly into the CPU is even faster, so frequently used values and variables can be stored there while a program is running to improve performance.

Most modern CPUs have multiple of these cache layers, referred to as L1, L2, and L3 cache. L1 is faster than L2 which is faster than L3, but L1 has less capacity than L2, which has less capacity than L3.

And the general principle he names:

This is often the tradeoff with caching — Faster data lookup means more cost or more size limitations due to how physically close the data needs to be to the requester. It's all tradeoffs. Getting great performance out of a computer is a careful balancing act of tuning cache layers to be optimal for the workload.

Why it exists: physics

  • Wire delay. The speed of electrical signals across silicon limits round-trip time. L1 caches sit adjacent to the execution pipeline; L3 caches span the die; RAM is on a separate DIMM on the motherboard. Each hop costs picoseconds-to-nanoseconds of propagation alone.
  • Capacity ↔ area. SRAM cells (used for caches) are ~6 transistors per bit; DRAM cells (used for RAM) are 1 transistor + 1 capacitor per bit. SRAM is ~6× less dense, so a larger cache eats proportionally more die area (and therefore cost and power).
  • Power. Closer caches have shorter wires + smaller arrays → much lower per-access energy.

The layered design trades size for speed at each step — exactly the same trade-off storage media make further down the hierarchy (RAM → NVMe → EBS → HDD → tape).

Cache lines and spatial locality

The CPU cache doesn't store individual bytes — it stores fixed-size cache lines (typically 64 bytes on x86-64). Reading any byte pulls its entire line into cache; for a line-aligned byte x, that means bytes x through x+63 arrive together.

This is why data layout matters for performance: row-major vs column-major traversal, struct-of-arrays vs array-of-structs, contiguous buffers vs pointer-chasing. The algorithm's theoretical complexity is not the only factor — a cache-unfriendly layout can multiply real-world latency by 10× or more. See concepts/spatial-locality-prefetching and the Cloudflare trie-hard + Netflix Ranker examples on the wiki.

Coherence and the illusion of shared memory

When multiple cores each have their own L1/L2, they don't see each other's in-flight writes until cache-coherence traffic propagates them. The MESI protocol (and variants) make the multi-core system behave as though it shares memory, at a real cost — coherence traffic, cache-line "ping-ponging" on contended writes, and false sharing (unrelated variables on the same cache line causing spurious invalidation traffic).

This is why concepts/concurrent-data-structure performance is so hardware-sensitive: what looks like trivial atomic increments can be dominated by cache-line contention rather than by the atomic op itself.

Programmer-visible consequences

  • Hot-loop code should fit in L1-I. Large inlined functions can displace other hot code.
  • Working-set sizing. If your inner-loop data fits in L1 you barely touch the rest of the memory hierarchy; if it fits in L2 but spills L1 you pay ~10× more per access; if it spills L3 you pay ~50–100× more.
  • Mispredicted branches flush the pipeline and can pollute the cache with speculatively fetched lines.
  • False sharing on adjacent atomic variables can silently tank scaling curves.
  • Memory ordering matters. Stronger memory models (x86 TSO) let you assume more; weaker ones (ARM, POWER) require explicit barriers.

Seen in

  • sources/2025-07-08-planetscale-caching — Ben Dicken canonicalises the three-tier CPU cache vocabulary + size-vs-speed framing + the general "faster = closer = more expensive and smaller" trade-off.

Wiki corpus examples that hit this layer in production:
