

KV-aware routing

Pattern

KV-aware routing is the load-balancing pattern for LLM serving in which the router's target-selection decision is driven by which replica holds the warm KV cache for the request's prefix, rather than by simple request-count or token-count load metrics. The goal is maximising the KV-cache hit ratio, which translates directly into throughput gains, lower TTFT (time to first token), and reduced GPU cost per request.

Canonical wiki reference: Cloudflare Workers AI uses an internal KV-aware router that sits behind the (optional) client-signalled x-session-affinity header. "While we have KV-aware routing internally, we also rely on clients sending the x-session-affinity in order to be explicit about prompt caching." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

Two flavours

Client-signalled (session-affinity)

The client stamps an opaque per-session token on every turn; the router keeps a table mapping token → replica and routes subsequent requests back to the same replica. See patterns/session-affinity-header + concepts/session-affinity-prompt-caching.
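A minimal sketch of this flavour, assuming an in-memory routing table; the class and method names are illustrative, not from the Cloudflare source:

```python
# Hypothetical client-signalled session affinity: the router keeps a
# token -> replica table and pins repeat turns to the warm replica.
import random

class AffinityRouter:
    def __init__(self, replicas):
        self.replicas = list(replicas)
        self.affinity = {}  # opaque session token -> replica

    def route(self, session_token=None):
        # A known token goes straight back to the replica with the warm KV.
        if session_token in self.affinity:
            return self.affinity[session_token]
        # Otherwise fall back to plain load balancing and pin future turns.
        replica = random.choice(self.replicas)
        if session_token is not None:
            self.affinity[session_token] = replica
        return replica
```

The lookup is O(1) per request, which is why the client signal is the cheapest of the granularity options listed under Design considerations below.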

Server-side (prefix-hash based)

The router hashes the request's prompt prefix (or a prefix n-gram), compares it against a shared table of which replicas hold which prefix hashes warm, and sends the request to the warmest candidate. See concepts/prefix-aware-routing.
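The server-side flavour can be sketched as chained block hashing over the prompt prefix, with the replica covering the longest warm prefix winning; the block size, the chained-SHA-256 scheme, and all names are assumptions for illustration, not Cloudflare's implementation:

```python
# Illustrative prefix-hash routing: hash fixed-size blocks of the prompt
# prefix and route to the replica whose warm hash set covers the most blocks.
import hashlib

BLOCK = 16  # tokens per hashed block (assumed granularity)

def block_hashes(tokens):
    """Chained hash per complete prefix block, so hash i commits to blocks 0..i."""
    hashes, prev = [], b""
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        prev = hashlib.sha256(prev + repr(tokens[i:i + BLOCK]).encode()).digest()
        hashes.append(prev)
    return hashes

def pick_replica(tokens, warm):
    """warm: {replica: set of block hashes it holds hot}. Longest warm prefix wins."""
    def covered(held):
        n = 0
        for h in block_hashes(tokens):
            if h not in held:
                break
            n += 1
        return n
    return max(warm, key=lambda r: covered(warm[r]))
```

Chaining the hashes means a match on hash i implies the whole prefix up to block i matches, so the router never needs to compare token content.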

Cloudflare uses both as complementary layers:

  • Server-side KV-aware routing as the primary mechanism ("we have KV-aware routing internally").
  • Client-signalled session affinity as a hint layer on top that cuts the routing cost and aligns cross-region routing.

Relationship to cluster-wide shared KV cache

With a cluster-wide shared KV cache (Mooncake Transfer Engine + LMCache / SGLang HiCache), within-cluster KV-aware routing becomes optional — any replica in the cluster can reach the shared cache:

"This eliminates the need for session aware routing within a cluster and allows us to load balance the traffic much more evenly."

But KV-aware routing remains hard-required across clusters / regions, where the shared-KV substrate doesn't span. The two layers stack:

  Scope                       Mechanism
  Within cluster              LMCache / SGLang HiCache + Mooncake — KV-aware routing soft / optional
  Across clusters / regions   x-session-affinity + server-side KV-aware routing — hard requirement

Companion: PD disaggregation

Under PD disaggregation, the router has to:

  • Pick a prefill replica for the new request.
  • Pick a decode replica for the second hop.
  • Pass the KV cache (identified by reference / hash / token) from prefill to decode.

KV-aware routing in this context means the LB picks a prefill replica that can hit its own cache (for the prefix) and a decode replica reachable over the right KV-transfer substrate. See concepts/token-aware-load-balancing for the companion admission-control axis.
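The two-hop pick can be sketched as follows; the pool shapes, the kv_links reachability map, and every name here are assumptions for the sketch, not a real router API:

```python
# Minimal sketch of KV-aware routing under PD disaggregation: pick a warm
# prefill replica, then a decode replica on the same KV-transfer substrate.

def route_pd(prefix_hash, prefill_pool, decode_pool, kv_links):
    """prefill_pool: {replica: set of warm prefix hashes};
    kv_links: {prefill replica: set of decode replicas it can ship KV to}."""
    # 1. Prefer a prefill replica that can hit its own cache for the prefix.
    warm = [p for p, hashes in prefill_pool.items() if prefix_hash in hashes]
    prefill = warm[0] if warm else next(iter(prefill_pool))
    # 2. Pick a decode replica reachable over the KV-transfer substrate.
    reachable = kv_links.get(prefill, set()) & set(decode_pool)
    decode = next(iter(reachable)) if reachable else next(iter(decode_pool))
    # 3. The KV cache moves by reference (its hash), not by re-prefilling.
    return prefill, decode, {"kv_ref": prefix_hash}
```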

Design considerations

  • Hash granularity — full-prefix hash (rare hits), n-gram / sliding-window hash (more hits but more storage), client-signal (cheapest lookup).
  • Consistency model — routing-table staleness tolerance; how long after a replica evicts a prefix before the router learns of it.
  • Capacity guardrail — routing to the "ideal" replica only works if the replica has capacity; KV-aware routing must compose with token-aware admission to avoid pile-on.
  • Metric-emitter — per-replica cache-hit-ratio telemetry drives the routing decision.
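The capacity guardrail composing KV-aware choice with token-aware admission can be sketched as "warmest replica that still has token budget"; the threshold and names are assumptions for the sketch:

```python
# Illustrative guardrail: prefer the warmest replica, but only if admitting
# the request keeps it under a token-in-flight budget, to avoid pile-on.

def choose(candidates, request_tokens, max_tokens_in_flight=8192):
    """candidates: [(replica, cache_hit_score, tokens_in_flight)],
    pre-sorted warmest-first by cache_hit_score."""
    for replica, _score, in_flight in candidates:
        if in_flight + request_tokens <= max_tokens_in_flight:
            return replica  # warmest replica with headroom
    # Everyone is over budget: shed to the least-loaded replica.
    return min(candidates, key=lambda c: c[2])[0]
```

The point of the fallback is that the "ideal" (warmest) replica is skipped the moment the token-aware admission check fails, trading a cache miss for latency headroom.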
