CONCEPT
Session-affinity prompt caching (x-session-affinity)¶
Definition¶
Session-affinity prompt caching is the LLM-serving pattern of routing subsequent turns of a session back to the same replica / region that served the previous turn, so the previously-computed KV cache for the growing prompt prefix is hit rather than recomputed. The routing key is a client-signalled opaque session token, typically carried as an HTTP header.
Canonical wiki instance: Cloudflare Workers AI uses an x-session-affinity header for this purpose. "We leverage a header called x-session-affinity in order to help requests route to the right region that previously had the computed input tensors." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
See pattern: patterns/session-affinity-header.
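A minimal client-side sketch of the pattern. The `x-session-affinity` header name is from the Cloudflare post; treating the value as an opaque UUID minted on the first turn and reused verbatim on every later turn is an assumption, since the post does not specify a token format:

```python
import uuid

def session_headers(existing_token=None):
    """Headers for a Workers-AI-style chat request. The x-session-affinity
    name is from the Cloudflare post; the UUID token format is an assumption,
    since the post treats the value as opaque."""
    token = existing_token or uuid.uuid4().hex
    return {"x-session-affinity": token, "content-type": "application/json"}

# Turn 1 mints the token; turns 2..n reuse it so routing stays pinned.
turn1 = session_headers()
turn2 = session_headers(turn1["x-session-affinity"])
```

The only client obligation is to echo the same opaque value on every turn of the session; everything else is the routing layer's job.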
Why a client signal is necessary¶
A server-side-only cache strategy has two problems:
- The server can't tell that a request continues an earlier session without some signal: the second turn's payload looks unrelated to the first unless they share an exact prefix hash (see concepts/prefix-aware-routing). With multi-turn agent conversations the prefix grows each turn but also shifts, so an internal prefix hash is one tool, but an explicit client signal is strictly cheaper to route on than recomputing prefix hashes on every request.
- Cross-region routing is decided by upstream proxies before the serving stack ever sees the request; the client has to stamp the session identity where the routing layer can read it, so that layer can steer.
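The upstream steering described above can be sketched as a sticky map keyed on the token. This is a hypothetical router for illustration, not Cloudflare's implementation:

```python
class AffinityRouter:
    """Hypothetical upstream proxy logic: pin a session token to the first
    region chosen for it, so later turns of the session land where the KV
    cache is warm. Illustrative only, not Cloudflare's implementation."""

    def __init__(self, regions):
        self.regions = list(regions)
        self.pins = {}  # session token -> pinned region

    def route(self, headers):
        token = headers.get("x-session-affinity")
        if token is None:
            return self.regions[0]  # no signal: plain load-balancer choice
        # First sighting of a token pins a region; later turns reuse the pin.
        choice = self.regions[hash(token) % len(self.regions)]
        return self.pins.setdefault(token, choice)
```

Note the routing decision needs nothing but the header value: no payload inspection, no prefix hashing, which is exactly why the explicit signal is cheap to route on.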
From the post:
"While we have KV-aware routing internally, we also rely on clients sending the x-session-affinity in order to be explicit about prompt caching. We incentivize the use of the header by offering discounted cached tokens." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
The discounted cached-token pricing aligns incentives: sending the header is cheaper for the client as well as for the serving stack.
Agentic workloads — why this matters so much¶
Agent sessions have large, growing, highly-reused prompt prefixes:
- System prompt (often thousands of tokens).
- Tool descriptions + MCP server declarations.
- Full conversation history including all prior assistant messages + tool calls.
Each new user turn sends the entire accumulated context. If the request lands on a replica that has never seen this session, the server rebuilds the KV cache for the whole prefix from scratch: a full prefill pass over thousands to hundreds of thousands of tokens. With session affinity and a warm cache, only the delta (the new user turn) needs to prefill.
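The cold-vs-warm arithmetic can be made concrete. The session shape below (a 4,000-token system prompt plus tool schema, then nine 500-token turns) is illustrative, not from the post:

```python
def prefill_tokens(turn_deltas, warm_cache):
    """Total tokens prefilled across a session. Cold: every turn
    re-prefills the whole accumulated prefix. Warm (session-affinity hit):
    only each turn's new delta is prefilled."""
    total, prefix = 0, 0
    for delta in turn_deltas:
        prefix += delta
        total += delta if warm_cache else prefix
    return total

# Assumed session: 4,000-token system prompt + tools, then nine 500-token turns.
turns = [4000] + [500] * 9
cold = prefill_tokens(turns, warm_cache=False)  # 62,500 tokens prefilled
warm = prefill_tokens(turns, warm_cache=True)   # 8,500 tokens prefilled
```

Even with this modest prefix, missing the cache on every turn costs roughly 7x the prefill work; the gap widens as the history grows.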
Cloudflare captures the cost framing directly: "A small difference in prompt caching from our users can sum to a factor of additional GPUs needed to run a model." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
Measured effect¶
"We worked with our heaviest internal users to adopt this header. The result was an increase in input token cache hit ratios from 60% to 80% during peak times. This significantly increases the request throughput that we can handle, while offering better performance for interactive or time-sensitive sessions like OpenCode or AI code reviews."
Peak input-cache-hit ratio: 60% → 80%. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
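To a first-order approximation, prefill work scales with the miss ratio, so the measured 60%→80% move halves per-input-token prefill compute. The proportionality model below is an assumption (it ignores decode and batching effects), not a claim from the post:

```python
def prefill_miss_fraction(hit_ratio):
    """Fraction of input tokens that still need a prefill pass at a given
    cache-hit ratio. First-order model: prefill GPU work is taken as
    proportional to this fraction (decode and batching effects ignored)."""
    return 1.0 - hit_ratio

before = prefill_miss_fraction(0.60)  # 40% of input tokens prefilled
after = prefill_miss_fraction(0.80)   # 20% of input tokens prefilled
# The 60% -> 80% move halves per-input-token prefill work.
```

This is the mechanism behind the "factor of additional GPUs" framing: hit-ratio points translate directly into prefill capacity.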
Client adoption pattern¶
Cloudflare works with agent-harness maintainers to add the header:
- OpenCode PR #20744 — session-affinity header added to OpenCode's Workers-AI client path.
- Documented publicly as a supported feature.
This is cheaper than trying to retrofit cache-recognition purely server-side, and scales: each harness integration propagates the pattern across its user base.
Relationship to other KV-reuse primitives¶
- Prefix-aware routing — server-side alternative: hash the request prefix, look for a replica with that hash hot. Works without client cooperation, costlier per-request.
- KV cache — the resource being hit.
- Mooncake Store — the NVMe-persistent tier that extends effective session residency so session affinity pays off even for sessions idle for minutes.
- LMCache / SGLang HiCache — cluster-wide cache layers that make same-cluster session affinity unnecessary (a different replica in the same cluster can still hit the shared KV).
- PD disaggregation — session affinity must preserve the prefix-KV lineage across the prefill→decode hand-off; the LB has to know which prefill node's KV to point the decode node at.
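The prefix-aware-routing alternative in the list above is commonly built on chained block hashing of the prompt. A hypothetical sketch; block size, hashing scheme, and function names here are assumptions, not Cloudflare's internals:

```python
import hashlib

def prefix_keys(token_ids, block=256):
    """Hypothetical prefix-aware lookup keys: hash the prompt in fixed-size
    blocks, chaining each block's hash onto the previous ones, and return
    the keys longest-prefix-first so a router can probe for the replica
    holding the longest matching cached prefix. Block size and scheme are
    illustrative assumptions."""
    keys, h = [], hashlib.sha256()
    usable = len(token_ids) - len(token_ids) % block  # whole blocks only
    for i in range(0, usable, block):
        h.update(repr(token_ids[i : i + block]).encode())
        keys.append(h.hexdigest())
    return keys[::-1]
```

Two sessions sharing a 512-token prefix produce the same leading keys, so the router can find the warm replica without any client signal; the per-request hashing is the cost that an explicit session token avoids.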
Where session affinity still matters after cluster-wide KV sharing¶
Cluster-wide KV sharing (via LMCache / SGLang HiCache + Mooncake) makes the within-cluster session-affinity requirement soft — any replica in the cluster can hit shared cache. But:
- Across regions, KV cache is not shared; session affinity routes to the region that holds the hot cache.
- Under load-balanced failover, if a session crosses cluster boundaries, client-side x-session-affinity keeps the routing coherent.
Caveats¶
- No raw adoption numbers across all clients — only the 60%→80% peak hit ratio after heavy-internal-user rollout.
- Cached-token discount rate not disclosed — incentive mechanism cited, size not.
- Header format and scope (per-conversation? per-tenant? per-model?) not specified in the post beyond the name.
- Failure mode when session affinity points at a drained region is not discussed.
- Interaction with Mooncake-Store NVMe tier — whether session affinity and NVMe residency compose multiplicatively not characterised.
Seen in¶
- sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models — canonical wiki instance; introduces the x-session-affinity header, measured 60%→80% hit ratio, incentive (discounted cached tokens), agent-harness integration (OpenCode PR), load-bearing role in agentic workload economics.