
Session-affinity prompt caching (x-session-affinity)

Definition

Session-affinity prompt caching is the LLM-serving pattern of routing subsequent turns of a session back to the same replica / region that served the previous turn, so the previously computed KV cache for the growing prompt prefix is reused rather than recomputed. The routing key is a client-signalled opaque session token, typically carried as an HTTP header.

Canonical wiki instance: Cloudflare Workers AI uses an x-session-affinity header for this purpose. "We leverage a header called x-session-affinity in order to help requests route to the right region that previously had the computed input tensors." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

See pattern: patterns/session-affinity-header.
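A minimal client-side sketch of the pattern: mint one opaque token per conversation and send it unchanged on every turn. The token format, endpoint, and payload shape here are illustrative assumptions, not Cloudflare's actual API — only the `x-session-affinity` header name comes from the source.

```python
import uuid

def new_session_token() -> str:
    """Opaque, client-chosen token identifying one multi-turn session."""
    return uuid.uuid4().hex

def build_headers(session_token: str) -> dict:
    # The same token is reused verbatim on every subsequent turn of the
    # session, so upstream routing can pin the session to one region.
    return {
        "Content-Type": "application/json",
        "x-session-affinity": session_token,
    }

token = new_session_token()
turn1 = build_headers(token)
turn2 = build_headers(token)  # a later turn carries the identical key
assert turn1["x-session-affinity"] == turn2["x-session-affinity"]
```

The only requirement on the client is stability: the token must not change mid-session, or the affinity (and the warm KV cache) is lost.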

Why a client signal is necessary

A server-side-only cache strategy has two problems:

  1. The server can't tell that a request is a continuation of an earlier session without some signal — the second turn's payload looks unrelated to the first unless they share an exact prefix hash (see concepts/prefix-aware-routing). In multi-turn agent conversations the prefix grows and shifts each turn; an internal prefix hash is one tool, but an explicit client signal is strictly cheaper to route on than recomputing prefix hashes on every request.
  2. Cross-region placement is a decision that upstream proxies make before the serving stack sees the request; the client must stamp the session identity upstream so the routing layer can steer.

From the post:

"While we have KV-aware routing internally, we also rely on clients sending the x-session-affinity in order to be explicit about prompt caching. We incentivize the use of the header by offering discounted cached tokens." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

The discounted cached-token pricing aligns incentives: clients that send the header pay less per input token, and the provider avoids redundant prefill work.

Agentic workloads — why this matters so much

Agent sessions have large, growing, highly-reused prompt prefixes:

  • System prompt (often thousands of tokens).
  • Tool descriptions + MCP server declarations.
  • Full conversation history including all prior assistant messages + tool calls.

Each new user turn sends the entire accumulated context. If the request lands on a replica that's never seen this session, the server rebuilds the KV cache for the whole prefix from scratch — a full prefill-pass over thousands-to-hundreds-of-thousands of tokens. With session affinity + warm cache, only the delta (the new user turn) needs to prefill.
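The asymmetry is easy to see with back-of-envelope arithmetic. The token counts below are illustrative assumptions (the post gives no per-session numbers); the point is only that the delta is tiny relative to the accumulated prefix.

```python
# Back-of-envelope sketch of prefill work saved by a warm session cache.
# All token counts are illustrative, not measurements from the post.

system_and_tools = 8_000   # system prompt + tool/MCP declarations
history = 40_000           # accumulated conversation so far
new_turn = 300             # the delta the user just sent

cold_prefill = system_and_tools + history + new_turn  # cache miss: everything
warm_prefill = new_turn                               # cache hit: delta only

savings = 1 - warm_prefill / cold_prefill
assert savings > 0.99  # nearly all prefill work avoided on a warm hit
```

Under these assumptions a single mis-routed turn costs over a hundred times the prefill compute of a warm one, which is why small shifts in cache-hit ratio translate into whole GPUs at fleet scale.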

Cloudflare captures the cost framing directly: "A small difference in prompt caching from our users can sum to a factor of additional GPUs needed to run a model." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

Measured effect

"We worked with our heaviest internal users to adopt this header. The result was an increase in input token cache hit ratios from 60% to 80% during peak times. This significantly increases the request throughput that we can handle, while offering better performance for interactive or time-sensitive sessions like OpenCode or AI code reviews."

Peak input-cache-hit ratio: 60% → 80%. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

Client adoption pattern

Cloudflare works with agent-harness maintainers to add the header at the harness level rather than per end user.

This is cheaper than trying to retrofit cache-recognition purely server-side, and scales: each harness integration propagates the pattern across its user base.

Relationship to other KV-reuse primitives

  • Prefix-aware routing — server-side alternative: hash the request prefix, look for a replica with that hash hot. Works without client cooperation, costlier per-request.
  • KV cache — the resource being hit.
  • Mooncake Store — the NVMe-persistent tier that extends effective session residency so session affinity pays off even for sessions idle for minutes.
  • LMCache / SGLang HiCache — cluster-wide cache layers that make same-cluster session affinity unnecessary (a different replica in the same cluster can still hit the shared KV).
  • PD disaggregation — session affinity must preserve the prefix-KV lineage across the prefill→decode hand-off; the LB has to know which prefill node's KV to point the decode node at.

Where session affinity still matters after cluster-wide KV sharing

Cluster-wide KV sharing (via LMCache / SGLang HiCache + Mooncake) makes the within-cluster session-affinity requirement soft — any replica in the cluster can hit shared cache. But:

  • Across regions, KV cache is not shared; session affinity routes to the region that holds the hot cache.
  • Under load-balanced failover, if a session crosses cluster boundaries, client-side x-session-affinity keeps the routing coherent.

Caveats

  • No raw adoption numbers across all clients — only the 60%→80% peak hit ratio after heavy-internal-user rollout.
  • Cached-token discount rate not disclosed — incentive mechanism cited, size not.
  • Header format and scope (per-conversation? per-tenant? per-model?) not specified in the post beyond the name.
  • Failure mode when session affinity points at a drained region is not discussed.
  • Interaction with Mooncake-Store NVMe tier — whether session affinity and NVMe residency compose multiplicatively not characterised.
