
CONCEPT Cited by 3 sources

KV cache (transformer inference)

Definition

The KV cache is the per-layer, per-token Key and Value projection tensor store that a transformer decoder reuses across autoregressive generation steps. When generating token t, the model computes the K/V projections for token t once and caches them — every subsequent token t+1, t+2, ... reads those K/V values via attention without recomputing them.

Without the cache, every new token would re-project every prior token (quadratic recompute); with the cache, per-step work stays linear in prompt length and constant per generated token.
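A minimal sketch of the mechanism in plain NumPy (illustrative only, not any serving engine's API): each decode step projects the new token's K/V once, appends them to the cache, and attends over everything cached so far.

```python
import numpy as np

d = 4                                  # head_dim (tiny, for illustration)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []              # grows by one row per generated token

def decode_step(x):                    # x: (d,) hidden state of the new token
    q = x @ Wq
    k_cache.append(x @ Wk)             # project the new token's K/V once...
    v_cache.append(x @ Wv)
    K = np.stack(k_cache)              # (t, d) -- all past K, read from cache
    V = np.stack(v_cache)
    scores = K @ q / np.sqrt(d)        # attend over all cached positions
    w = np.exp(scores - scores.max())  # softmax
    w /= w.sum()
    return w @ V                       # (d,) attention output

for t in range(3):
    out = decode_step(rng.standard_normal(d))

print(len(k_cache))                    # 3 -- one cached K/V row per token
```

Without the cache, each `decode_step` would recompute `x @ Wk` and `x @ Wv` for every prior token; with it, each position is projected exactly once.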

Why it dominates serving-side memory

Per-request KV-cache memory scales as:

KV_bytes ≈ 2 × num_layers × num_heads × head_dim × seq_len × dtype_bytes

(the 2 is K and V; different model architectures change the constants — GQA / MQA / multi-head-latent-attention reduce the per-head multiplier, but the shape is the same.) For a 70B model with a long context window (100K+ tokens), KV cache per active request can run into tens of GB, which is why GPU memory capacity, not compute, is the binding resource for long-context inference.
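The formula above turns into a back-of-envelope sizing helper. The configuration below is a hypothetical 70B-class dense model (80 layers, 64 query heads, head_dim 128, fp16); the GQA variant with 8 KV heads shows why the per-head multiplier matters so much at long context.

```python
# KV_bytes = 2 x layers x kv_heads x head_dim x seq_len x dtype_bytes
def kv_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Hypothetical 70B-class config at a 100K-token context, fp16:
full_mha = kv_bytes(80, 64, 128, 100_000)   # all 64 heads carry K/V
gqa_8    = kv_bytes(80, 8, 128, 100_000)    # GQA: 8 shared KV heads

print(f"{full_mha / 2**30:.0f} GiB")        # ~244 GiB per request
print(f"{gqa_8 / 2**30:.0f} GiB")           # ~31 GiB per request
```

Even with GQA's 8x reduction, a single 100K-token request holds tens of GiB of KV state, which is the "tens of GB per active request" regime described above.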

Consequence: the KV cache is the largest movable thing in an LLM-serving replica. Its management dominates:

  • Admission control (how many concurrent requests fit).
  • Paging policy (evict per-request cache when memory is oversubscribed).
  • Cross-request reuse (prefix caching: a shared prompt prefix can share its KV tensors across different requests — see concepts/prefix-aware-routing).
  • Cross-step prefetch (tiered storage — see below).

Tiered KV cache (memory hierarchy)

A tiered KV cache spreads the cache across a memory hierarchy:

Tier       | Access latency      | Capacity            | Use
GPU HBM    | ~ns                 | tens of GB          | hot: currently-decoding requests
Host DRAM  | ~100 ns–µs (PCIe)   | hundreds of GB–TBs  | warm: paused / preemptible requests
Local NVMe | ~ms                 | multiple TB         | cold: long-lived prefix caches, idle sessions

The server moves KV blocks between tiers on demand: evict from HBM to DRAM when a request is paused, pull back to HBM when it resumes, spill to NVMe for long-lived prefix caches that many requests share. Tier boundaries let serving systems oversubscribe HBM (more concurrent sessions than HBM alone would fit) at the cost of per-step tier-fetch latency.

The primitive is conceptually identical to OS-level VM paging, but applied at the LLM-KV-block granularity with cache-aware eviction policies (LRU tuned for prefix-reuse is the common baseline).
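A toy sketch of that paging loop: three tiers with LRU demotion down the hierarchy and promotion back to the hot tier on access. Capacities, names, and the plain-LRU policy are illustrative assumptions, not any vendor's actual design; a real system sizes tiers in bytes and tunes eviction for prefix reuse.

```python
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, caps=(2, 4, 8)):                # HBM, DRAM, NVMe (in blocks)
        self.tiers = [OrderedDict() for _ in caps]     # tier 0 = HBM (hot)
        self.caps = caps

    def put(self, block_id, kv, tier=0):
        self.tiers[tier][block_id] = kv
        self.tiers[tier].move_to_end(block_id)         # mark most-recently-used
        if len(self.tiers[tier]) > self.caps[tier]:
            victim, data = self.tiers[tier].popitem(last=False)  # LRU victim
            if tier + 1 < len(self.tiers):
                self.put(victim, data, tier + 1)       # demote one tier down

    def get(self, block_id):
        for tier, store in enumerate(self.tiers):
            if block_id in store:
                kv = store.pop(block_id)
                self.put(block_id, kv, 0)              # promote to HBM on hit
                return tier, kv                        # tier index ~ fetch cost
        return None                                    # miss: must recompute

cache = TieredKVCache()
for b in range(4):                                     # 4 blocks, HBM cap = 2
    cache.put(f"blk{b}", kv=object())
print(sorted(cache.tiers[0]))                          # ['blk2', 'blk3'] hot
print(sorted(cache.tiers[1]))                          # ['blk0', 'blk1'] warm
```

A `get` on a demoted block returns its tier index, which is where the per-step tier-fetch latency enters: a tier-1 or tier-2 hit is cheaper than recompute but slower than HBM.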

Managed KV cache as a platform feature

Historically, KV-cache management has lived inside the model-serving library — vLLM's PagedAttention, TensorRT-LLM's kv-cache reuse, HuggingFace Text-Generation-Inference's cache policies. Different serving libraries expose different knobs; the KV cache was a per-deployment concern.

A managed, platform-provided KV cache — bundled as an add-on capability at cluster-install time — is a shift: the KV-cache policies become platform-wide defaults, memory allocation is tuned per-instance-type by the vendor, and applications don't pick the KV-cache strategy when picking the serving library.

(Source: sources/2026-04-06-aws-unlock-efficient-model-deployment-simplified-inference-operator-setup-on-amazon-sagemaker-hyperpod)

"During installation, customers can optionally enable managed tiered KV cache with intelligent memory allocation based on instance types. This feature can reduce inference latency by up to 40% for long-context workloads while optimizing memory utilization across the cluster."

The "up to 40%" figure comes with no methodology (baseline, workload, percentile, and context-length distribution are all unstated) and should not be treated as a benchmark. But the presence of managed KV cache as a platform feature is noteworthy as a primitive-location shift (library → cluster).

Why it pairs with prefix-aware routing

Managed KV cache is only useful if requests that share prefixes land on the same replica — otherwise each replica has to rebuild the shared prefix's KV state from scratch. Prefix-aware routing (concepts/prefix-aware-routing) is the companion primitive that sends same-prefix requests to the same replica so the KV cache hits.
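The routing idea can be sketched as: hash a leading block of prompt tokens and use the hash to pick the replica, so same-prefix requests land together. The block size, hash choice, and replica names are illustrative assumptions, not any particular router's policy.

```python
import hashlib

REPLICAS = ["replica-0", "replica-1", "replica-2"]
PREFIX_BLOCK = 4                        # route on the first N prompt tokens

def route(prompt_tokens):
    prefix = tuple(prompt_tokens[:PREFIX_BLOCK])
    h = hashlib.sha256(repr(prefix).encode()).digest()
    return REPLICAS[int.from_bytes(h[:8], "big") % len(REPLICAS)]

shared = [101, 7, 7, 42]                # e.g. a shared system-prompt prefix
a = route(shared + [1, 2, 3])
b = route(shared + [9, 9])
print(a == b)                           # True: same prefix, same replica
```

A consistent hash over the prefix is the simplest version; production routers also weigh replica load and actual cache occupancy, which is the "KV-aware" variant mentioned below.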

The HyperPod Inference Operator ships both at install time — managed tiered KV cache (optional) + intelligent-routing strategy (prefix-aware / KV-aware / round-robin) — as a coupled feature envelope.

Adjacent tradition

  • Prompt caching (Anthropic, OpenAI — exposed to users as a billing-line) is the API-surface of prefix-KV-cache reuse.
  • Speculative decoding uses a small drafter model to propose N tokens in a single forward pass; the large expert model then verifies them in parallel. Speculative decoding and KV-cache reuse are orthogonal optimisations — speculative decoding reduces the number of large-model forward passes; KV caching reduces the work per forward pass. The parallel verification step is load-bearing on the KV cache: the expert populates K/V for all N draft positions in one pass, which is strictly cheaper than N sequential single-token forwards — see concepts/token-verification and the speculative cascades hybrid (Google Research 2025-09-11) which reuses the same primitive with a probabilistic acceptance rule.
  • Paged KV cache (vLLM / PagedAttention) — page-granularity allocation of KV state so per-request fragmentation doesn't bloat the cache. Complements tiering: a paged layout within a tier makes cross-tier migration block-granular.
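The paged layout can be sketched as a block allocator with refcounted physical blocks, so a shared prefix maps to the same blocks across requests. Names and numbers here are illustrative, not vLLM's actual data structures.

```python
BLOCK_TOKENS = 16                                  # tokens per KV block

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refs = {}                             # physical block -> refcount

    def alloc(self):
        b = self.free.pop()
        self.refs[b] = 1
        return b

    def fork(self, block_table):                   # share a prefix's blocks
        for b in block_table:
            self.refs[b] += 1
        return list(block_table)

    def free_table(self, block_table):
        for b in block_table:
            self.refs[b] -= 1
            if self.refs[b] == 0:                  # last reader gone:
                del self.refs[b]
                self.free.append(b)                # block returns to the pool

alloc = BlockAllocator(num_blocks=8)
prefix = [alloc.alloc() for _ in range(2)]         # 32-token shared prefix
req_a = prefix + [alloc.alloc()]                   # two requests extend it;
req_b = alloc.fork(prefix) + [alloc.alloc()]       # prefix blocks stored once
print(len(alloc.free))                             # 4 blocks still free
alloc.free_table(req_a)
print(len(alloc.free))                             # 5: prefix kept alive by req_b
```

Block granularity is also what makes cross-tier migration tractable: a demoted or prefetched unit is one block, not a whole request's cache.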

Open questions the post doesn't answer

  • Exact tier composition (HBM + DRAM? HBM + DRAM + NVMe? What about EBS / FSx Lustre?).
  • Block granularity and eviction policy.
  • Behaviour under memory pressure — is the tier boundary soft (best-effort prefetch) or hard (block until fetched)?
  • Per-request vs shared-prefix cache accounting.
  • Interaction with heterogeneous replicas under instance-type fallback — different instance types have different HBM sizes; does the managed policy tune per-replica or uniformly?
  • What exactly the 40% baseline is compared against (un-managed KV? non-tiered KV? first-token vs end-to-end latency?).

Cluster-wide shared KV cache over RDMA (Cloudflare Workers AI, 2026-04-16)

A parallel industrial instance of platform-provided KV-cache management, at a different layer from SageMaker HyperPod's managed tiered cache: cluster-wide shared KV cache achieved via RDMA transport + persistent NVMe tier + a cache-lookup layer above the serving engine. The Cloudflare Workers AI stack for Kimi K2.5 (>1T parameters, 8× H100 minimum) composes:

  • Transport: Moonshot AI's Mooncake Transfer Engine moves KV blocks GPU↔GPU / node↔node over NVLink or NVMe-oF RDMA without CPU involvement.
  • Persistent tier: Mooncake Store extends the cache onto NVMe storage — sessions survive GPU process lifetime; long-lived shared prefixes stay resident far longer.
  • Cache-lookup layer: LMCache or SGLang HiCache exposes cluster-shared KV to the serving engine (Infire / vLLM / SGLang).

Operational consequence: "When paired with LMCache or SGLang HiCache, the cache is shared across all nodes in the cluster, allowing a prefill node to identify and re-use a cache from a previous request that was originally pre-filled on a different node. This eliminates the need for session aware routing within a cluster and allows us to load balance the traffic much more evenly." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

Cross-cluster routing still depends on the client-signalled x-session-affinity header. See patterns/kv-aware-routing for the composed routing model.

KV cache under prefill/decode disaggregation

Under PD disaggregation, the KV cache is the state artifact that crosses the inter-stage boundary: the prefill server produces it, the decode server needs it. This makes inter-stage KV transfer latency directly additive to TTFT — requiring the same RDMA substrate as cluster-wide sharing above. See concepts/prefill-decode-disaggregation, patterns/disaggregated-inference-stages.
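The additive cost is simple arithmetic: the prompt's KV bytes divided by the inter-stage link bandwidth. The model config (GQA, 80 layers, 8 KV heads, head_dim 128, fp16, 8K-token prompt) and link speeds below are illustrative assumptions, not measured numbers.

```python
# KV_bytes = 2 x layers x kv_heads x head_dim x seq_len x dtype_bytes
def kv_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

prompt_kv = kv_bytes(80, 8, 128, 8_000)            # ~2.6 GB for an 8K prompt

# Time to ship prefill KV to the decode server, added straight onto TTFT:
for name, bytes_per_s in [("NVLink-class  ~400 GB/s", 400e9),
                          ("RDMA NIC       ~50 GB/s", 50e9),
                          ("PCIe gen4 x16  ~25 GB/s", 25e9)]:
    print(f"{name}: {prompt_kv / bytes_per_s * 1e3:.1f} ms added to TTFT")
```

On the slower links the transfer alone costs tens of milliseconds per request, which is why the RDMA substrate is load-bearing for disaggregated prefill/decode.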

Cloudflare's measured effect of combining PD disaggregation, cluster-wide KV sharing, and session affinity: p90 intertoken latency dropped from 100 ms to 20-30 ms (a 3× or better reduction) at the same GPU count, with higher volume and reduced tail variance. See concepts/intertoken-latency.
