
CONCEPT Cited by 3 sources

KV cache (transformer inference)

Definition

The KV cache is the per-layer, per-token Key and Value projection tensor store that a transformer decoder reuses across autoregressive generation steps. When generating token t, the model computes the K/V projections for token t once and caches them — every subsequent token t+1, t+2, ... reads those K/V values via attention without recomputing them.

Without the cache, every new token would re-project every prior token (quadratic recompute); with the cache, per-step work stays linear in prompt length and constant per generated token.
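A minimal sketch of the mechanism in plain NumPy (illustrative only, not any serving engine's API): each decode step projects the new token's K/V once, appends them to the cache, and attends over everything cached so far.

```python
import numpy as np

d = 4                                  # head_dim (tiny, for illustration)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []              # grows by one row per generated token

def decode_step(x):                    # x: (d,) hidden state of the new token
    q = x @ Wq
    k_cache.append(x @ Wk)             # project the new token's K/V once...
    v_cache.append(x @ Wv)
    K = np.stack(k_cache)              # (t, d) -- all past K, read from cache
    V = np.stack(v_cache)
    scores = K @ q / np.sqrt(d)        # attend over all cached positions
    w = np.exp(scores - scores.max())  # softmax
    w /= w.sum()
    return w @ V                       # (d,) attention output

for t in range(3):
    out = decode_step(rng.standard_normal(d))

print(len(k_cache))                    # 3 -- one cached K/V row per token
```

Without the cache, each `decode_step` would recompute `x @ Wk` and `x @ Wv` for every prior token; with it, each position is projected exactly once.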

Why it dominates serving-side memory

Per-request KV-cache memory scales as:

KV_bytes ≈ 2 × num_layers × num_heads × head_dim × seq_len × dtype_bytes

(the 2 is K and V; different model architectures change the constants — GQA / MQA / multi-head-latent-attention reduce the per-head multiplier, but the shape is the same.) For a 70B model with a long context window (100K+ tokens), KV cache per active request can run into tens of GB, which is why GPU memory capacity, not compute, is the binding resource for long-context inference.
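The formula above turns into a back-of-envelope sizing helper. The configuration below is a hypothetical 70B-class dense model (80 layers, 64 query heads, head_dim 128, fp16); the GQA variant with 8 KV heads shows why the per-head multiplier matters so much at long context.

```python
# KV_bytes = 2 x layers x kv_heads x head_dim x seq_len x dtype_bytes
def kv_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Hypothetical 70B-class config at a 100K-token context, fp16:
full_mha = kv_bytes(80, 64, 128, 100_000)   # all 64 heads carry K/V
gqa_8    = kv_bytes(80, 8, 128, 100_000)    # GQA: 8 shared KV heads

print(f"{full_mha / 2**30:.0f} GiB")        # ~244 GiB per request
print(f"{gqa_8 / 2**30:.0f} GiB")           # ~31 GiB per request
```

Even with GQA's 8x reduction, a single 100K-token request holds tens of GiB of KV state, which is the "tens of GB per active request" regime described above.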

Consequence: the KV cache is the largest movable thing in an LLM-serving replica. Its management dominates:

  • Admission control (how many concurrent requests fit).
  • Paging policy (evict per-request cache when memory is oversubscribed).
  • Cross-request reuse (prefix caching: a shared prompt prefix can share its KV tensors across different requests — see concepts/prefix-aware-routing).
  • Cross-step prefetch (tiered storage — see below).

Tiered KV cache (memory hierarchy)

A tiered KV cache spreads the cache across a memory hierarchy:

Tier       | Access latency      | Capacity            | Use
GPU HBM    | ~ns                 | tens of GB          | hot: currently-decoding requests
Host DRAM  | ~100 ns–µs (PCIe)   | hundreds of GB–TBs  | warm: paused / preemptible requests
Local NVMe | ~ms                 | multiple TB         | cold: long-lived prefix caches, idle sessions

The server moves KV blocks between tiers on demand: evict from HBM to DRAM when a request is paused, pull back to HBM when it resumes, spill to NVMe for long-lived prefix caches that many requests share. Tier boundaries let serving systems oversubscribe HBM (more concurrent sessions than HBM alone would fit) at the cost of per-step tier-fetch latency.

The primitive is conceptually identical to OS-level VM paging, but applied at the LLM-KV-block granularity with cache-aware eviction policies (LRU tuned for prefix-reuse is the common baseline).
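A toy sketch of that paging loop: three tiers with LRU demotion down the hierarchy and promotion back to the hot tier on access. Capacities, names, and the plain-LRU policy are illustrative assumptions, not any vendor's actual design; a real system sizes tiers in bytes and tunes eviction for prefix reuse.

```python
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, caps=(2, 4, 8)):                # HBM, DRAM, NVMe (in blocks)
        self.tiers = [OrderedDict() for _ in caps]     # tier 0 = HBM (hot)
        self.caps = caps

    def put(self, block_id, kv, tier=0):
        self.tiers[tier][block_id] = kv
        self.tiers[tier].move_to_end(block_id)         # mark most-recently-used
        if len(self.tiers[tier]) > self.caps[tier]:
            victim, data = self.tiers[tier].popitem(last=False)  # LRU victim
            if tier + 1 < len(self.tiers):
                self.put(victim, data, tier + 1)       # demote one tier down

    def get(self, block_id):
        for tier, store in enumerate(self.tiers):
            if block_id in store:
                kv = store.pop(block_id)
                self.put(block_id, kv, 0)              # promote to HBM on hit
                return tier, kv                        # tier index ~ fetch cost
        return None                                    # miss: must recompute

cache = TieredKVCache()
for b in range(4):                                     # 4 blocks, HBM cap = 2
    cache.put(f"blk{b}", kv=object())
print(sorted(cache.tiers[0]))                          # ['blk2', 'blk3'] hot
print(sorted(cache.tiers[1]))                          # ['blk0', 'blk1'] warm
```

A `get` on a demoted block returns its tier index, which is where the per-step tier-fetch latency enters: a tier-1 or tier-2 hit is cheaper than recompute but slower than HBM.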

Managed KV cache as a platform feature

Historically, KV-cache management has lived inside the model-serving library — vLLM's PagedAttention, TensorRT-LLM's kv-cache reuse, HuggingFace Text-Generation-Inference's cache policies. Different serving libraries expose different knobs; the KV cache was a per-deployment concern.

A managed, platform-provided KV cache — bundled as an add-on capability at cluster-install time — is a shift: the KV-cache policies become platform-wide defaults, memory allocation is tuned per-instance-type by the vendor, and applications don't pick the KV-cache strategy when picking the serving library.

(Source: sources/2026-04-06-aws-unlock-efficient-model-deployment-simplified-inference-operator-setup-on-amazon-sagemaker-hyperpod)

"During installation, customers can optionally enable managed tiered KV cache with intelligent memory allocation based on instance types. This feature can reduce inference latency by up to 40% for long-context workloads while optimizing memory utilization across the cluster."

The "up to 40%" figure comes with no methodology (baseline, workload, percentile, and context-length distribution are all unstated) and should not be treated as a benchmark. But the presence of managed KV cache as a platform feature is noteworthy as a primitive-location shift (library → cluster).

Why it pairs with prefix-aware routing

Managed KV cache is only useful if requests that share prefixes land on the same replica — otherwise each replica has to rebuild the shared prefix's KV state from scratch. Prefix-aware routing (concepts/prefix-aware-routing) is the companion primitive that sends same-prefix requests to the same replica so the KV cache hits.
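The routing idea can be sketched as: hash a leading block of prompt tokens and use the hash to pick the replica, so same-prefix requests land together. The block size, hash choice, and replica names are illustrative assumptions, not any particular router's policy.

```python
import hashlib

REPLICAS = ["replica-0", "replica-1", "replica-2"]
PREFIX_BLOCK = 4                        # route on the first N prompt tokens

def route(prompt_tokens):
    prefix = tuple(prompt_tokens[:PREFIX_BLOCK])
    h = hashlib.sha256(repr(prefix).encode()).digest()
    return REPLICAS[int.from_bytes(h[:8], "big") % len(REPLICAS)]

shared = [101, 7, 7, 42]                # e.g. a shared system-prompt prefix
a = route(shared + [1, 2, 3])
b = route(shared + [9, 9])
print(a == b)                           # True: same prefix, same replica
```

A consistent hash over the prefix is the simplest version; production routers also weigh replica load and actual cache occupancy, which is the "KV-aware" variant mentioned below.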

The HyperPod Inference Operator ships both at install time — managed tiered KV cache (optional) + intelligent-routing strategy (prefix-aware / KV-aware / round-robin) — as a coupled feature envelope.

Adjacent tradition

  • Prompt caching (Anthropic, OpenAI — exposed to users as a billing-line) is the API-surface of prefix-KV-cache reuse.
  • Speculative decoding uses a small drafter model to propose N tokens in a single forward pass; the large expert model then verifies them in parallel. Speculative decoding and KV-cache reuse are orthogonal optimisations — speculative decoding reduces the number of large-model forward passes; KV caching reduces the work per forward pass. The parallel verification step is load-bearing on the KV cache: the expert populates K/V for all N draft positions in one pass, which is strictly cheaper than N sequential single-token forwards — see concepts/token-verification and the speculative cascades hybrid (Google Research 2025-09-11) which reuses the same primitive with a probabilistic acceptance rule.
  • Paged KV cache (vLLM / PagedAttention) — page-granularity allocation of KV state so per-request fragmentation doesn't bloat the cache. Complements tiering: a paged layout within a tier makes cross-tier migration block-granular.
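The paged layout can be sketched as a block allocator with refcounted physical blocks, so a shared prefix maps to the same blocks across requests. Names and numbers here are illustrative, not vLLM's actual data structures.

```python
BLOCK_TOKENS = 16                                  # tokens per KV block

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refs = {}                             # physical block -> refcount

    def alloc(self):
        b = self.free.pop()
        self.refs[b] = 1
        return b

    def fork(self, block_table):                   # share a prefix's blocks
        for b in block_table:
            self.refs[b] += 1
        return list(block_table)

    def free_table(self, block_table):
        for b in block_table:
            self.refs[b] -= 1
            if self.refs[b] == 0:                  # last reader gone:
                del self.refs[b]
                self.free.append(b)                # block returns to the pool

alloc = BlockAllocator(num_blocks=8)
prefix = [alloc.alloc() for _ in range(2)]         # 32-token shared prefix
req_a = prefix + [alloc.alloc()]                   # two requests extend it;
req_b = alloc.fork(prefix) + [alloc.alloc()]       # prefix blocks stored once
print(len(alloc.free))                             # 4 blocks still free
alloc.free_table(req_a)
print(len(alloc.free))                             # 5: prefix kept alive by req_b
```

Block granularity is also what makes cross-tier migration tractable: a demoted or prefetched unit is one block, not a whole request's cache.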

Open questions the post doesn't answer

  • Exact tier composition (HBM + DRAM? HBM + DRAM + NVMe? What about EBS / FSx Lustre?).
  • Block granularity and eviction policy.
  • Behaviour under memory pressure — is the tier boundary soft (best-effort prefetch) or hard (block until fetched)?
  • Per-request vs shared-prefix cache accounting.
  • Interaction with heterogeneous replicas under instance-type fallback — different instance types have different HBM sizes; does the managed policy tune per-replica or uniformly?
  • What exactly the 40% baseline is compared against (un-managed KV? non-tiered KV? first-token vs end-to-end latency?).

Cluster-wide shared KV cache over RDMA (Cloudflare Workers AI, 2026-04-16)

A parallel industrial instance of platform-provided KV-cache management, at a different layer from SageMaker HyperPod's managed tiered cache: cluster-wide shared KV cache achieved via RDMA transport + persistent NVMe tier + a cache-lookup layer above the serving engine. The Cloudflare Workers AI stack for Kimi K2.5 (>1T parameters, 8× H100 minimum) composes:

  • Transport: Moonshot AI's Mooncake Transfer Engine moves KV blocks GPU↔GPU / node↔node over NVLink or NVMe-oF RDMA without CPU involvement.
  • Persistent tier: Mooncake Store extends the cache onto NVMe storage — sessions survive GPU process lifetime; long-lived shared prefixes stay resident far longer.
  • Cache-lookup layer: LMCache or SGLang HiCache exposes cluster-shared KV to the serving engine (Infire / vLLM / SGLang).

Operational consequence: "When paired with LMCache or SGLang HiCache, the cache is shared across all nodes in the cluster, allowing a prefill node to identify and re-use a cache from a previous request that was originally pre-filled on a different node. This eliminates the need for session aware routing within a cluster and allows us to load balance the traffic much more evenly." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

Cross-cluster routing still depends on the client-signalled x-session-affinity header. See patterns/kv-aware-routing for the composed routing model.

KV cache under prefill/decode disaggregation

Under PD disaggregation, the KV cache is the state artifact that crosses the inter-stage boundary: the prefill server produces it, the decode server needs it. This makes inter-stage KV transfer latency directly additive to TTFT — requiring the same RDMA substrate as cluster-wide sharing above. See concepts/prefill-decode-disaggregation, patterns/disaggregated-inference-stages.
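The additive cost is simple arithmetic: the prompt's KV bytes divided by the inter-stage link bandwidth. The model config (GQA, 80 layers, 8 KV heads, head_dim 128, fp16, 8K-token prompt) and link speeds below are illustrative assumptions, not measured numbers.

```python
# KV_bytes = 2 x layers x kv_heads x head_dim x seq_len x dtype_bytes
def kv_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

prompt_kv = kv_bytes(80, 8, 128, 8_000)            # ~2.6 GB for an 8K prompt

# Time to ship prefill KV to the decode server, added straight onto TTFT:
for name, bytes_per_s in [("NVLink-class  ~400 GB/s", 400e9),
                          ("RDMA NIC       ~50 GB/s", 50e9),
                          ("PCIe gen4 x16  ~25 GB/s", 25e9)]:
    print(f"{name}: {prompt_kv / bytes_per_s * 1e3:.1f} ms added to TTFT")
```

On the slower links the transfer alone costs tens of milliseconds per request, which is why the RDMA substrate is load-bearing for disaggregated prefill/decode.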

Cloudflare's measured effect of combining PD disaggregation, cluster-wide KV sharing, and session affinity: p90 intertoken latency dropped from 100 ms to 20-30 ms (a 3× or better reduction) at the same GPU count, with higher volume and reduced tail variance. See concepts/intertoken-latency.
