PATTERN
Session-affinity header¶
Pattern¶
Session-affinity header is the LLM-serving pattern in which clients carry a per-session opaque token as an HTTP header on every turn. The serving infrastructure uses that token to route each request back to the replica / region that served the previous turn, so the previously computed KV cache is hit rather than rebuilt. Adoption is incentivised via pricing (discounted cached tokens).
Canonical wiki instance: Cloudflare Workers AI's x-session-affinity header. "We leverage a header called x-session-affinity in order to help requests route to the right region that previously had the computed input tensors." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
Why a client-side signal (vs server-side only)¶
Pure server-side prefix-based KV routing (prefix-aware routing) is possible but has two costs:
- Prefix-hashing every request adds CPU cost at the LB tier.
- Cross-region routing decisions are made by upstream proxies before the serving stack; those proxies can route on simple headers but not on computed prefix hashes without heavy instrumentation.
A client-signalled affinity header is cheaper to route on and propagates naturally through proxy chains.
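The cost asymmetry can be sketched in a few lines of Python. The region names, hashing scheme, and function names below are illustrative assumptions, not Cloudflare's implementation:

```python
import hashlib
import uuid

REGIONS = ["wnam", "enam", "weur"]  # hypothetical region pool

def route_by_prefix(prompt: str) -> str:
    """Server-side prefix-aware routing: the LB must hash the
    (potentially very long) prompt body on every request."""
    h = hashlib.sha256(prompt.encode()).digest()
    return REGIONS[h[0] % len(REGIONS)]

def route_by_header(headers: dict) -> str:
    """Client-signalled affinity: the proxy inspects only a short,
    fixed-size header value -- no body parsing needed."""
    token = headers.get("x-session-affinity")
    if token is None:
        token = str(uuid.uuid4())  # no affinity: pick arbitrarily
    h = hashlib.sha256(token.encode()).digest()
    return REGIONS[h[0] % len(REGIONS)]

# The same token always maps to the same region:
hdrs = {"x-session-affinity": "session-123"}
assert route_by_header(hdrs) == route_by_header(hdrs)
```

The header variant also works at an upstream proxy that never sees the request body, which is the propagation property the pattern relies on.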
Cloudflare's framing:
"While we have KV-aware routing internally, we also rely on clients sending the x-session-affinity in order to be explicit about prompt caching."
Pricing incentive¶
Without a pricing incentive, clients have no reason to invest engineering effort to add the header. Cloudflare offers discounted cached tokens — the per-token price when the session hits a warm cache is lower than the full-prefill price, passing some of the operator savings back to the customer:
"We incentivize the use of the header by offering discounted cached tokens. We highly encourage users to leverage prompt caching in order to have faster inference and cheaper pricing."
The incentive mechanism flips the adoption calculus: adding the header is strictly cheaper for the customer and produces lower latency. Adoption becomes viral.
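With hypothetical per-token prices (the post cites the discount mechanism but not its magnitude), the per-session arithmetic looks like this for a growing-prompt agent session:

```python
# Hypothetical prices -- the source cites a discount but not its size.
FULL_PRICE = 1.00e-6    # $ per full-prefill input token (assumed)
CACHED_PRICE = 0.25e-6  # $ per cached input token (assumed discount)

def turn_cost(prompt_tokens: int, cached_tokens: int) -> float:
    """One turn's input cost: cached tokens billed at the discounted
    rate, the uncached remainder at the full prefill rate."""
    uncached = prompt_tokens - cached_tokens
    return cached_tokens * CACHED_PRICE + uncached * FULL_PRICE

# A 10-turn session whose prompt grows by 2k tokens per turn.
# With the header, each turn only pays full price on the 2k delta;
# without it, every turn re-prefills the whole prompt.
with_header = sum(turn_cost(2000 * t, 2000 * (t - 1)) for t in range(1, 11))
without = sum(turn_cost(2000 * t, 0) for t in range(1, 11))
assert with_header < without
```

Under these assumed prices the header cuts the session's input bill by more than half; the gap widens with turn count because the cached prefix keeps growing.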
Agent-harness adoption pattern¶
Cloudflare works with maintainers of popular agent harnesses to integrate the header:
- OpenCode PR #20744 — session-affinity header added to OpenCode's Workers-AI client path; "we noticed a significant increase in total throughput."
The pattern scales: one harness-level PR propagates the header across that harness's entire user base. Similar integrations exist in AI code reviewers and other tools.
Result¶
"We worked with our heaviest internal users to adopt this header. The result was an increase in input token cache hit ratios from 60% to 80% during peak times."
Peak input-cache-hit ratio 60% → 80%. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
Implementation contract¶
The minimum viable implementation:
- Client — mints a session token at conversation start (UUID, opaque); sends x-session-affinity: <token> on every request of that conversation.
- Routing layer — hashes the token to a region / replica; records which replica served a given token. Subsequent requests with the same token route to the same target.
- Fallback — if the target replica is drained or overloaded, route to an alternative; the session loses its warm cache for that turn but the request still succeeds.
- Server — serves the request. If the KV cache for the session prefix is hot, reuse it; otherwise, prefill.
- Pricing — the server meters cached-input-tokens vs full-input-tokens separately, applying the discount.
The header is opaque to the server — it only needs to identify same-session requests, not carry any semantic payload.
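A minimal sketch of both sides of the contract. The replica names are illustrative, and an in-memory assignment table stands in for whatever the real routing layer uses:

```python
import uuid

class AffinityRouter:
    """Routing-layer side of the contract (sketch, not a real LB)."""

    def __init__(self, replicas):
        self.replicas = replicas       # replica id -> healthy?
        self.assignments = {}          # session token -> replica id

    def route(self, headers: dict) -> str:
        token = headers.get("x-session-affinity")
        if token is None:
            return self._pick_any()    # no affinity: any replica
        target = self.assignments.get(token)
        if target is not None and self.replicas.get(target, False):
            return target              # warm cache: same replica
        # Fallback: pinned replica drained or unknown -> re-pin.
        # The session's cache is cold for this turn, but the
        # request still succeeds.
        target = self._pick_any()
        self.assignments[token] = target
        return target

    def _pick_any(self) -> str:
        return next(r for r, ok in self.replicas.items() if ok)

# Client side: mint an opaque token once per conversation and
# attach it to every turn of that conversation.
session = str(uuid.uuid4())
headers = {"x-session-affinity": session}

router = AffinityRouter({"replica-a": True, "replica-b": True})
first = router.route(headers)
assert router.route(headers) == first   # same token, same replica
router.replicas[first] = False          # drain the pinned replica
assert router.route(headers) != first   # graceful fallback
```

Note the token carries no semantic payload: the router only ever compares it for equality and hashes it, which is what makes the opaque-UUID choice safe.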
Design considerations¶
- Token scope — per-conversation, per-tenant, or per-application-session? Smaller scope = more routing churn but better cache locality; larger scope = worse locality but more routing flexibility.
- Cross-region routing — the header has to be visible to the edge proxy layer, not only to the LB inside a region.
- Failure-mode — when the target replica is down, fall back gracefully without erroring the request.
- Privacy — the token must not leak session content; opaque is the right default.
- Interaction with cluster-wide KV cache (Mooncake + LMCache) — intra-cluster the header becomes soft because any replica can reach the shared KV; cross-cluster it remains hard.
Caveats¶
- Client-side changes required — customers must opt in.
- Adoption bottleneck — customers who don't integrate pay full prefill cost and pressure the operator's GPU budget.
- Pricing discount size not disclosed in the Cloudflare post — incentive mechanism cited, dollar magnitude not.
- Header token format not specified beyond the name.
When to use¶
- LLM-serving stacks with multi-region / multi-cluster deployments where pure within-cluster KV-sharing isn't enough.
- Agent workloads where the same session makes many turns with growing prompts; the per-turn delta cost dominates.
When not to use¶
- Single-cluster, single-replica deployments where session affinity is trivial (everything is local).
- Workloads where sessions are short (one or two turns) — the payoff grows in proportion to turn count.
Seen in¶
- sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models — canonical wiki instance of the pattern; Cloudflare Workers AI x-session-affinity with measured 60%→80% peak hit-ratio and OpenCode PR integration.
Related¶
- concepts/session-affinity-prompt-caching — the concept page.
- concepts/kv-cache — the resource the hit is against.
- concepts/prefix-aware-routing — server-side alternative.
- concepts/agent-context-window — the growing-prompt shape that makes session-affinity valuable.
- systems/workers-ai
- companies/cloudflare