
Session-affinity prompt caching (x-session-affinity)

Definition

Session-affinity prompt caching is the LLM-serving pattern of routing subsequent turns of a session back to the same replica / region that served the previous turn, so the previously computed KV cache for the growing prompt prefix is reused rather than recomputed. The routing key is a client-signalled opaque session token, typically carried as an HTTP header.

Canonical wiki instance: Cloudflare Workers AI uses an x-session-affinity header for this purpose. "We leverage a header called x-session-affinity in order to help requests route to the right region that previously had the computed input tensors." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

See pattern: patterns/session-affinity-header.
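A minimal client-side sketch of the pattern: mint one opaque token per conversation and send it unchanged on every turn. The token format, endpoint, and payload shape here are illustrative assumptions, not Cloudflare's actual API — only the `x-session-affinity` header name comes from the source.

```python
import uuid

def new_session_token() -> str:
    """Opaque, client-chosen token identifying one multi-turn session."""
    return uuid.uuid4().hex

def build_headers(session_token: str) -> dict:
    # The same token is reused verbatim on every subsequent turn of the
    # session, so upstream routing can pin the session to one region.
    return {
        "Content-Type": "application/json",
        "x-session-affinity": session_token,
    }

token = new_session_token()
turn1 = build_headers(token)
turn2 = build_headers(token)  # a later turn carries the identical key
assert turn1["x-session-affinity"] == turn2["x-session-affinity"]
```

The only requirement on the client is stability: the token must not change mid-session, or the affinity (and the warm KV cache) is lost.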

Why a client signal is necessary

A server-side-only cache strategy has two problems:

  1. The server can't tell that a request is a continuation of an earlier session without some signal — the second turn's payload looks unrelated to the first unless they share an exact prefix hash (see concepts/prefix-aware-routing). In multi-turn agent conversations the prefix grows and shifts each turn; an internal prefix hash is one tool, but an explicit client signal is strictly cheaper to route on than recomputing prefix hashes on every request.
  2. Cross-region placement is a decision that upstream proxies make before the serving stack sees the request; the client must stamp the session identity upstream so the routing layer can steer.

From the post:

"While we have KV-aware routing internally, we also rely on clients sending the x-session-affinity in order to be explicit about prompt caching. We incentivize the use of the header by offering discounted cached tokens." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

The discounted cached-token pricing aligns incentives: clients that send the header pay less per input token, and the provider avoids redundant prefill work.

Agentic workloads — why this matters so much

Agent sessions have large, growing, highly-reused prompt prefixes:

  • System prompt (often thousands of tokens).
  • Tool descriptions + MCP server declarations.
  • Full conversation history including all prior assistant messages + tool calls.

Each new user turn sends the entire accumulated context. If the request lands on a replica that's never seen this session, the server rebuilds the KV cache for the whole prefix from scratch — a full prefill-pass over thousands-to-hundreds-of-thousands of tokens. With session affinity + warm cache, only the delta (the new user turn) needs to prefill.
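The asymmetry is easy to see with back-of-envelope arithmetic. The token counts below are illustrative assumptions (the post gives no per-session numbers); the point is only that the delta is tiny relative to the accumulated prefix.

```python
# Back-of-envelope sketch of prefill work saved by a warm session cache.
# All token counts are illustrative, not measurements from the post.

system_and_tools = 8_000   # system prompt + tool/MCP declarations
history = 40_000           # accumulated conversation so far
new_turn = 300             # the delta the user just sent

cold_prefill = system_and_tools + history + new_turn  # cache miss: everything
warm_prefill = new_turn                               # cache hit: delta only

savings = 1 - warm_prefill / cold_prefill
assert savings > 0.99  # nearly all prefill work avoided on a warm hit
```

Under these assumptions a single mis-routed turn costs over a hundred times the prefill compute of a warm one, which is why small shifts in cache-hit ratio translate into whole GPUs at fleet scale.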

Cloudflare captures the cost framing directly: "A small difference in prompt caching from our users can sum to a factor of additional GPUs needed to run a model." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

Measured effect

"We worked with our heaviest internal users to adopt this header. The result was an increase in input token cache hit ratios from 60% to 80% during peak times. This significantly increases the request throughput that we can handle, while offering better performance for interactive or time-sensitive sessions like OpenCode or AI code reviews."

Peak input-cache-hit ratio: 60% → 80%. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

Client adoption pattern

Cloudflare works with agent-harness maintainers to add the header at the harness level rather than per end user.

This is cheaper than trying to retrofit cache-recognition purely server-side, and scales: each harness integration propagates the pattern across its user base.

Relationship to other KV-reuse primitives

  • Prefix-aware routing — server-side alternative: hash the request prefix, look for a replica with that hash hot. Works without client cooperation, costlier per-request.
  • KV cache — the resource being hit.
  • Mooncake Store — the NVMe-persistent tier that extends effective session residency so session affinity pays off even for sessions idle for minutes.
  • LMCache / SGLang HiCache — cluster-wide cache layers that make same-cluster session affinity unnecessary (a different replica in the same cluster can still hit the shared KV).
  • PD disaggregation — session affinity must preserve the prefix-KV lineage across the prefill→decode hand-off; the LB has to know which prefill node's KV to point the decode node at.

Where session affinity still matters after cluster-wide KV sharing

Cluster-wide KV sharing (via LMCache / SGLang HiCache + Mooncake) makes the within-cluster session-affinity requirement soft — any replica in the cluster can hit shared cache. But:

  • Across regions, KV cache is not shared; session affinity routes to the region that holds the hot cache.
  • Under load-balanced failover, if a session crosses cluster boundaries, client-side x-session-affinity keeps the routing coherent.

Caveats

  • No raw adoption numbers across all clients — only the 60%→80% peak hit ratio after heavy-internal-user rollout.
  • Cached-token discount rate not disclosed — incentive mechanism cited, size not.
  • Header format and scope (per-conversation? per-tenant? per-model?) not specified in the post beyond the name.
  • Failure mode when session affinity points at a drained region is not discussed.
  • Interaction with Mooncake-Store NVMe tier — whether session affinity and NVMe residency compose multiplicatively not characterised.
