CONCEPT Cited by 3 sources
KV cache (transformer inference)¶
Definition¶
The KV cache is the per-layer, per-token Key and Value
projection tensor store that a transformer decoder reuses across
autoregressive generation steps. When generating token t, the
model computes the K/V projections for token t once and caches
them — every subsequent token t+1, t+2, ... reads those K/V
values via attention without recomputing them.
Without the cache, every new token would re-project every prior token (quadratic recompute); with the cache, per-step work stays linear in prompt length and constant per generated token.
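The cache-once, reuse-forever mechanic can be shown with a toy single-layer, single-head decoder (a minimal sketch; all names — `W_q`, `d_model`, `decode_step` — are illustrative, not any framework's API):

```python
import numpy as np

# Toy single-layer, single-head attention step illustrating KV caching.
rng = np.random.default_rng(0)
d_model = 8
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))

k_cache, v_cache = [], []          # grows by one entry per generated token

def decode_step(x_t):
    """Project K/V for token t exactly once; attend over all cached K/V."""
    k_cache.append(x_t @ W_k)      # computed once, read by every later step
    v_cache.append(x_t @ W_v)
    q = x_t @ W_q
    K = np.stack(k_cache)          # (t+1, d_model) -- no re-projection of history
    V = np.stack(v_cache)
    scores = K @ q / np.sqrt(d_model)
    w = np.exp(scores - scores.max())
    w /= w.sum()                   # softmax over cached positions
    return w @ V                   # attention output for token t

for t in range(5):
    out = decode_step(rng.standard_normal(d_model))

assert len(k_cache) == 5           # one K/V entry per token, never recomputed
```

Each `decode_step` does O(t) attention reads but only O(1) new projection work — the linear-per-step behaviour the paragraph above describes.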
Why it dominates serving-side memory¶
Per-request KV-cache memory scales as:

KV bytes ≈ 2 × n_layers × n_kv_heads × d_head × seq_len × bytes_per_element

(The 2 is K and V; different model architectures change the constants — GQA, MQA, and multi-head latent attention reduce the per-head multiplier, but the shape is the same.) For a 70B model with a long context window (100K+ tokens), KV cache per active request can run into tens of GB, which is why GPU memory capacity, not compute, is the binding resource for long-context inference.
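Plugging in an assumed Llama-2-70B-like configuration (80 layers, 8 KV heads under GQA, head dimension 128, fp16 — illustrative numbers, not disclosed by the source) makes the "tens of GB" claim concrete:

```python
# Back-of-envelope KV-cache sizing under an assumed 70B-class config.
n_layers, n_kv_heads, head_dim = 80, 8, 128
bytes_per_elem = 2                      # fp16/bf16

# 2 = one K tensor + one V tensor, per layer, per token
bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
print(bytes_per_token)                  # 327,680 B ~= 320 KiB per token

ctx = 100_000                           # long-context request
gb = bytes_per_token * ctx / 1e9
print(f"{gb:.1f} GB")                   # ~32.8 GB for a single active request
```

A single 100K-token request at ~33 GB already exceeds half of an 80 GB H100's HBM, which is why capacity, not compute, binds first.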
Consequence: the KV cache is the largest movable thing in an LLM-serving replica. Its management dominates:
- Admission control (how many concurrent requests fit).
- Paging policy (evict per-request cache when memory is oversubscribed).
- Cross-request reuse (prefix caching: a shared prompt prefix can share its KV tensors across different requests — see concepts/prefix-aware-routing).
- Cross-step prefetch (tiered storage — see below).
Tiered KV cache (memory hierarchy)¶
A tiered KV cache spreads the cache across a memory hierarchy:
| Tier | Access latency | Capacity | Use |
|---|---|---|---|
| GPU HBM | ~ns | tens of GB | hot: currently-decoding requests |
| Host DRAM | ~100 ns — µs (PCIe) | hundreds of GB — TBs | warm: paused / preemptible requests |
| Local NVMe | ~ms | multiple TB | cold: long-lived prefix caches, idle sessions |
The server moves KV blocks between tiers on demand: evict from HBM to DRAM when a request is paused, pull back to HBM when it resumes, spill to NVMe for long-lived prefix caches that many requests share. Tier boundaries let serving systems oversubscribe HBM (more concurrent sessions than HBM alone would fit) at the cost of per-step tier-fetch latency.
The primitive is conceptually identical to OS-level VM paging, but applied at the LLM-KV-block granularity with cache-aware eviction policies (LRU tuned for prefix-reuse is the common baseline).
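The demote-on-pressure, promote-on-hit loop can be sketched as an LRU-per-tier block store (a conceptual sketch only — tier names, capacities, and the `TieredKVStore` API are assumptions, not any serving engine's implementation):

```python
from collections import OrderedDict

# Minimal tiered KV-block store: each tier is an LRU map with a block budget;
# overflowing a tier demotes its coldest block to the next-slower tier, and a
# hit in a slower tier pages the block back into HBM.
class TieredKVStore:
    def __init__(self, capacities):      # e.g. {"hbm": 2, "dram": 4, "nvme": 100}
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities.items()]

    def put(self, block_id, kv_block, tier=0):
        name, cap, store = self.tiers[tier]
        store[block_id] = kv_block
        store.move_to_end(block_id)               # mark most-recently-used
        if len(store) > cap:                      # oversubscribed: demote coldest
            victim, data = store.popitem(last=False)
            if tier + 1 < len(self.tiers):
                self.put(victim, data, tier + 1)

    def get(self, block_id):
        for tier, (name, cap, store) in enumerate(self.tiers):
            if block_id in store:
                data = store.pop(block_id)
                self.put(block_id, data, 0)       # promote hit back into HBM
                return name, data                 # tier name ~ fetch-latency class
        return None, None

store = TieredKVStore({"hbm": 2, "dram": 4, "nvme": 100})
for b in range(4):
    store.put(f"blk{b}", object())                # blk0, blk1 spill HBM -> DRAM
hit_tier, _ = store.get("blk0")
assert hit_tier == "dram"                          # warm hit, paged back to HBM
```

The `get` path is where the "per-step tier-fetch latency" cost lands: a DRAM or NVMe hit saves recompute but pays a PCIe or storage round-trip before decoding resumes.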
Managed KV cache as a platform feature¶
Historically, KV-cache management has lived inside the model-serving library — vLLM's PagedAttention, TensorRT-LLM's kv-cache reuse, Hugging Face Text Generation Inference's cache policies. Different serving libraries expose different knobs; the KV cache was a per-deployment concern.
A managed, platform-provided KV cache — bundled as an add-on capability at cluster-install time — is a shift: the KV-cache policies become platform-wide defaults, memory allocation is tuned per-instance-type by the vendor, and applications don't pick the KV-cache strategy when picking the serving library.
"During installation, customers can optionally enable managed tiered KV cache with intelligent memory allocation based on instance types. This feature can reduce inference latency by up to 40% for long-context workloads while optimizing memory utilization across the cluster."
The "up to 40%" figure comes with no stated methodology (baseline, workload, percentile, and context-length distribution are all unspecified) and should not be treated as a benchmark — but the presence of managed KV cache as a platform feature is noteworthy as a primitive-location shift (library → cluster).
Why it pairs with prefix-aware routing¶
Managed KV cache is only useful if requests that share prefixes land on the same replica — otherwise each replica has to rebuild the shared prefix's KV state from scratch. Prefix-aware routing (concepts/prefix-aware-routing) is the companion primitive that sends same-prefix requests to the same replica so the KV cache hits.
The HyperPod Inference Operator ships both at install time — managed tiered KV cache (optional) + intelligent-routing strategy (prefix-aware / KV-aware / round-robin) — as a coupled feature envelope.
Adjacent tradition¶
- Prompt caching (Anthropic, OpenAI — exposed to users as a billing-line) is the API-surface of prefix-KV-cache reuse.
- Speculative decoding uses a small drafter model to propose N tokens in a single forward pass; the large expert model then verifies them in parallel. Speculative decoding and KV-cache reuse are orthogonal optimisations — speculative decoding reduces the number of large-model forward passes; KV caching reduces the work per forward pass. The parallel verification step is load-bearing on the KV cache: the expert populates K/V for all N draft positions in one pass, which is strictly cheaper than N sequential single-token forwards — see concepts/token-verification and the speculative cascades hybrid (Google Research 2025-09-11) which reuses the same primitive with a probabilistic acceptance rule.
- Paged KV cache (vLLM / PagedAttention) — page-granularity allocation of KV state so per-request fragmentation doesn't bloat the cache. Complements tiering: a paged layout within a tier makes cross-tier migration block-granular.
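The paged layout in the last bullet reduces, at its core, to a per-request block table over a shared physical pool — in the spirit of PagedAttention, though heavily simplified (vLLM's real block manager also handles copy-on-write, prefix sharing, and swapping):

```python
BLOCK_TOKENS = 16                               # tokens per KV block

free_blocks = list(range(64))                   # shared physical block pool
block_tables = {}                               # request -> [physical block ids]

def append_token(req_id, pos):
    """Map a request's logical token position to a physical KV block."""
    table = block_tables.setdefault(req_id, [])
    if pos % BLOCK_TOKENS == 0:                 # block boundary: allocate a page
        table.append(free_blocks.pop())
    return table[pos // BLOCK_TOKENS]

def free_request(req_id):
    free_blocks.extend(block_tables.pop(req_id))  # whole-block reclaim

for pos in range(40):                           # 40 tokens -> 3 blocks of 16
    append_token("req-a", pos)
assert len(block_tables["req-a"]) == 3

free_request("req-a")
assert len(free_blocks) == 64                   # no fragmentation left behind
```

Because allocation and reclaim happen at block granularity, the same block IDs are natural units for cross-tier migration in a tiered design.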
Open questions the post doesn't answer¶
- Exact tier composition (HBM + DRAM? HBM + DRAM + NVMe? What about EBS / FSx Lustre?).
- Block granularity and eviction policy.
- Behaviour under memory pressure — is the tier boundary soft (best-effort prefetch) or hard (block until fetched)?
- Per-request vs shared-prefix cache accounting.
- Interaction with heterogeneous replicas under instance-type fallback — different instance types have different HBM sizes; does the managed policy tune per-replica or uniformly?
- What exactly the 40% baseline is compared against (un-managed KV? non-tiered KV? first-token vs end-to-end latency?).
Cluster-wide shared KV cache over RDMA (Cloudflare Workers AI, 2026-04-16)¶
A parallel industrial instance of platform-provided KV-cache management, at a different layer from SageMaker HyperPod's managed tiered cache: cluster-wide shared KV cache achieved via RDMA transport + persistent NVMe tier + a cache-lookup layer above the serving engine. The Cloudflare Workers AI stack for Kimi K2.5 (>1T parameters, 8× H100 minimum) composes:
- Transport — Moonshot AI's Mooncake Transfer Engine moves KV blocks GPU↔GPU / node↔node over NVLink or NVMe-oF RDMA without CPU involvement.
- Persistent tier — Mooncake Store extends the cache onto NVMe storage — sessions survive GPU process lifetime; long-lived shared prefixes stay resident far longer.
- Cache-lookup layer — LMCache or SGLang HiCache exposes cluster-shared KV to the serving engine (Infire / vLLM / SGLang).
Operational consequence: "When paired with LMCache or SGLang HiCache, the cache is shared across all nodes in the cluster, allowing a prefill node to identify and re-use a cache from a previous request that was originally pre-filled on a different node. This eliminates the need for session aware routing within a cluster and allows us to load balance the traffic much more evenly." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
Across-cluster routing still depends on client-signalled x-session-affinity. See patterns/kv-aware-routing for the composed routing model.
KV cache under prefill/decode disaggregation¶
Under PD disaggregation, the KV cache is the state artifact that crosses the inter-stage boundary: the prefill server produces it, the decode server needs it. This makes inter-stage KV transfer latency directly additive to TTFT — requiring the same RDMA substrate as cluster-wide sharing above. See concepts/prefill-decode-disaggregation, patterns/disaggregated-inference-stages.
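How much the cross-boundary transfer adds to TTFT is a straightforward bandwidth calculation. The numbers below are assumptions for illustration (~320 KiB of KV per prompt token for a 70B-class fp16/GQA model, 50 GB/s of effective RDMA bandwidth), not figures from either source:

```python
# Back-of-envelope: inter-stage KV transfer time is additive to TTFT
# under prefill/decode disaggregation.
bytes_per_token = 320 * 1024            # assumed KV footprint per prompt token
rdma_bw = 50e9                          # assumed effective fabric bandwidth, B/s

def kv_transfer_ms(prompt_tokens):
    return bytes_per_token * prompt_tokens / rdma_bw * 1e3

print(f"{kv_transfer_ms(8_000):.0f} ms")     # ~52 ms added to TTFT
print(f"{kv_transfer_ms(100_000):.0f} ms")   # ~655 ms if not overlapped
```

At long context lengths the transfer alone is user-visible, which is why overlapping it with prefill (or keeping it on an RDMA path that bypasses the CPU) matters.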
Cloudflare's measured effect combining PD disaggregation + cluster-wide KV sharing + session affinity: p90 intertoken latency 100 ms → 20-30 ms (3×), same GPU count, higher volume, reduced tail variance. See concepts/intertoken-latency.
Seen in¶
- sources/2026-04-06-aws-unlock-efficient-model-deployment-simplified-inference-operator-setup-on-amazon-sagemaker-hyperpod — primary source for the managed-tiered-KV-cache-as-platform-feature framing. Sole source at time of writing; expand when future posts disclose tier-internal mechanics.
- sources/2025-09-11-google-speculative-cascades-hybrid-approach-llm-inference — complementary angle: the KV cache is the structural reason parallel N-token verification is cheaper than sequential decoding in speculative decoding and speculative cascades; the expert's forward pass populates K/V over the entire draft prefix in one pass.
- sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models — cluster-wide KV-cache sharing via Mooncake Transfer Engine + Mooncake Store + LMCache / SGLang HiCache; KV as the cross-stage artifact under PD disaggregation; session-affinity as the cross-cluster routing-to-warm-cache primitive; concrete peak cache-hit ratios (60% → 80%) and tail-latency numbers (p90 ITL 100 ms → 20-30 ms).
Related¶
- concepts/prefix-aware-routing — the companion routing primitive that makes KV-cache reuse pay off.
- systems/sagemaker-hyperpod-inference-operator — the canonical managed-KV-cache consumer.
- concepts/instance-type-fallback — heterogeneous replica compositions interact with KV-cache sizing.
- concepts/speculative-decoding — the optimisation that relies on parallel-populate of the KV cache.
- concepts/token-verification — the per-position accept/reject primitive on top of that parallel pass.
- systems/speculative-cascades — Google Research's hybrid that keeps the parallel-verify primitive with a generalised acceptance rule.
- concepts/prefill-decode-disaggregation — PD disaggregation makes KV transfer an inter-stage primitive.
- concepts/session-affinity-prompt-caching / patterns/session-affinity-header — client-signal routing for cross-cluster warm-cache reuse.
- concepts/rdma-kv-transfer — the transport substrate for cluster-wide KV sharing.
- systems/mooncake-transfer-engine / systems/mooncake-store — the Moonshot-AI-developed KV substrate Cloudflare consumes.
- systems/lmcache / systems/sglang — cluster-shared-KV cache-lookup layers.
- systems/infire — serving engine above all these primitives.