PATTERN Cited by 1 source

Stateful LLM session routing

Pattern

Route each LLM workload's requests to a dynamically-assigned subset of replicas (not the full fleet), preserving the session-to-subset binding across requests via an auto-sharder. The subset is large enough for load balancing within it, small enough for KV-prefix-cache locality and bounded blast radius.

Canonical wiki disclosure (Source: sources/2026-05-27-databricks-reliable-llm-inference-at-scale):

"Dicer also provides stateful sessions, making request routing sticky. A workload's requests go to only a subset of servers, which improves cache hit rates (crucial for latency-sensitive workloads like coding agents) and limits blast radius."

The pattern is the LLM-serving instance of a more general sticky-routing-for-stateful-systems pattern (already documented as patterns/sticky-routing-for-aggregator-state for metric aggregators on Dicer; this page is the LLM-specific application).

When to use it

  • Latency-sensitive workloads with strong prefix-cache locality — coding agents, conversational LLMs with long history, RAG applications where the same retrieval prefix is reused. KV-cache hit rate dominates p99 TTFT for these.
  • Multi-tenant LLM platforms where one workload's failure modes (silent hangs, CPU spikes from multimodal traffic, OOMs) must not cascade to other tenants. Sticky sessions bound the blast radius.
  • Inference engines that benefit from per-session warm state — speculative decoding draft caches, structured-output state, any per-request engine optimisation that pays a cold-start cost on the first request and amortises across subsequent ones.

When NOT to use it

  • Workloads with no per-session state and no prefix-cache reuse — the stickiness adds complexity without payoff.
  • Very low replica counts (e.g. 3 replicas) where any subset is effectively the full fleet.
  • Workloads that need strict per-request load balancing (rare for LLMs, common for stateless CPU services).

Two purposes simultaneously

The pattern serves two distinct goals on the same primitive:

Purpose 1: Prefix-cache locality

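KV-cache (and prefix-cache) is per-replica, not shared. If a workload's requests can land on any of N replicas, each replica sees only about 1/N of that workload's traffic, so a given prefix is much less likely to still be resident when the next request for it arrives. As a rough heuristic, the per-replica cache hit rate falls towards 1/N of what it would be if the workload landed on one replica. For long-prefix workloads (coding agents):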

  • p99 TTFT under random routing: prefill cost on every request except the rare cache hit.
  • p99 TTFT under sticky-to-1: prefill cost on first request, near-zero on subsequent.

Sticky-to-subset is the middle ground: limit dispersion to a small subset of k replicas (k much smaller than N), keep k-fold throughput within the subset, and keep a ~80%+ cache hit rate per replica within the subset.
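The effect of dispersion on hit rate can be illustrated with a toy simulation. This is a minimal sketch, not anything disclosed by the source: it assumes one shared prefix per workload, a per-replica LRU cache counted in whole prefixes, uniform traffic, and a deterministic stand-in for subset assignment. The function name `simulate_hit_rate` and all parameters are hypothetical.

```python
import random
from collections import OrderedDict


def simulate_hit_rate(num_replicas, num_workloads, cache_slots, subset_size,
                      num_requests=20000, seed=0):
    """Return the overall prefix-cache hit rate for one routing configuration.

    Assumptions (illustrative only): each workload has a single reusable
    prefix, each replica keeps an LRU cache of `cache_slots` prefixes, and a
    workload's subset is `subset_size` consecutive replicas. Setting
    subset_size == num_replicas models fleet-wide random routing.
    """
    rng = random.Random(seed)
    caches = [OrderedDict() for _ in range(num_replicas)]
    hits = 0
    for _ in range(num_requests):
        workload = rng.randrange(num_workloads)
        # Deterministic stand-in for the sharder's workload -> subset mapping.
        start = (workload * subset_size) % num_replicas
        replica = (start + rng.randrange(subset_size)) % num_replicas
        cache = caches[replica]
        if workload in cache:
            hits += 1
            cache.move_to_end(workload)      # refresh LRU position
        else:
            cache[workload] = True           # prefill, then cache the prefix
            if len(cache) > cache_slots:
                cache.popitem(last=False)    # evict least recently used
    return hits / num_requests


if __name__ == "__main__":
    for k in (1, 2, 4, 16):
        rate = simulate_hit_rate(num_replicas=16, num_workloads=64,
                                 cache_slots=8, subset_size=k)
        print(f"subset size {k:2d}: hit rate {rate:.3f}")
```

With 16 replicas, 64 workloads and 8 cache slots, a subset size of 2 puts exactly 8 workloads on each replica, so only cold misses remain. Fleet-wide routing makes every replica compete for all 64 prefixes and the hit rate settles near 8/64.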

See concepts/kv-cache, concepts/session-affinity-prompt-caching, concepts/prefix-aware-routing.

Purpose 2: Bounded blast radius

A workload with a problematic request shape (triggers a silent hang, saturates CPU preprocessing, OOMs the GPU pod) can only affect its assigned subset, not the full fleet. Sticky routing bounds the operational impact of one tenant's failure mode on other tenants.

Without sticky routing, a workload-induced fault propagates across the fleet as the bad requests fan out under power-of-two-choices (P2C) load balancing. With sticky routing, only the workload's subset is impacted and the rest of the fleet keeps serving.
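The blast-radius bound can be made concrete by counting which other workloads share at least one replica with a misbehaving workload. This is a sketch under assumed names (`assigned_subset`, `impacted_workloads`) and an assumed deterministic assignment of consecutive replicas. It is not how Dicer assigns slices.

```python
def assigned_subset(workload_id, num_replicas, subset_size):
    """Hypothetical assignment: `subset_size` consecutive replicas."""
    start = (workload_id * subset_size) % num_replicas
    return {(start + j) % num_replicas for j in range(subset_size)}


def impacted_workloads(bad_workload, num_workloads, num_replicas, subset_size):
    """Workloads that share at least one replica with the bad workload."""
    bad_pods = assigned_subset(bad_workload, num_replicas, subset_size)
    return [
        w for w in range(num_workloads)
        if w != bad_workload
        and assigned_subset(w, num_replicas, subset_size) & bad_pods
    ]


if __name__ == "__main__":
    for k in (2, 16):
        hit = impacted_workloads(0, num_workloads=64, num_replicas=16,
                                 subset_size=k)
        print(f"subset size {k:2d}: {len(hit)} of 63 other workloads exposed")
```

Under these assumptions a subset size of 2 exposes only the workloads co-located on the same two replicas, while a subset equal to the fleet exposes every other workload.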

See concepts/blast-radius.

Structural shape (Dicer-backed)

                 ┌─────────────────────────────────────┐
                 │ Workload (customer / endpoint /     │
                 │ session id)                         │
                 └──────────────┬──────────────────────┘
                                │ key for SliceKey hash
                 ┌─────────────────────────────────────┐
                 │ Dicer Assignment                     │
                 │ - Slice (range of SliceKeys) → pod   │
                 │   subset                             │
                 │ - Reshard via split / merge /        │
                 │   replicate as load shifts           │
                 └──────────────┬──────────────────────┘
                                │ assigned subset
                 ┌─────────────────────────────────────┐
                 │ Axon (router)                        │
                 │ - Within subset, pick best pod by    │
                 │   MU load (cost-based-LB-LLM)        │
                 │ - Forward request                    │
                 └──────────────┬──────────────────────┘
                 ┌─────────────────────────────────────┐
                 │ Inference Runtime                    │
                 │ - Serves request                     │
                 │ - Maintains per-pod prefix cache     │
                 └─────────────────────────────────────┘

The pattern requires four pieces:

  1. A session/workload key that is stable per workload — customer-id, endpoint-id, session-id, or composite.
  2. Dynamic shard assignment mapping the keyspace to pod subsets, with reshard support (split / merge / replicate / move) for adapting to load changes.
  3. A router that does the keyspace lookup and forwards within the assigned subset.
  4. A subset sizing policy that balances cache-hit-rate and blast-radius bound against load-balance.

Dicer provides pieces 2 and 4 directly, plus the assignment lookup that piece 3 relies on; Axon is the router that performs piece 3.
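The four pieces can be sketched end to end in a few lines. Everything here is an illustrative assumption rather than Dicer's or Axon's actual API: the `Assignment` class, the `slice_key` and `route` functions, the SHA-256-based key hash, and the per-pod load dictionary are hypothetical stand-ins for the real components.

```python
import bisect
import hashlib
from dataclasses import dataclass
from typing import Dict, List


def slice_key(workload_key: str) -> int:
    """Piece 1: hash a stable workload key into a 64-bit keyspace."""
    digest = hashlib.sha256(workload_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


@dataclass
class Assignment:
    """Piece 2 (hypothetical shape): contiguous key ranges -> pod subsets.

    starts[i] is the inclusive lower bound of slice i; starts[0] must be 0.
    subsets[i] lists the pods serving keys in [starts[i], starts[i + 1]).
    """
    starts: List[int]
    subsets: List[List[str]]

    def subset_for(self, key: int) -> List[str]:
        index = bisect.bisect_right(self.starts, key) - 1
        return self.subsets[index]


def route(workload_key: str, assignment: Assignment,
          load: Dict[str, float]) -> str:
    """Piece 3: look up the subset, then pick its least-loaded pod."""
    pods = assignment.subset_for(slice_key(workload_key))
    return min(pods, key=lambda pod: load.get(pod, 0.0))


if __name__ == "__main__":
    assignment = Assignment(
        starts=[0, 2 ** 63],
        subsets=[["pod-a", "pod-b"], ["pod-c", "pod-d"]],
    )
    load = {"pod-a": 0.9, "pod-b": 0.1, "pod-c": 0.5, "pod-d": 0.2}
    print(route("customer-42", assignment, load))
```

The subset lookup depends only on the workload key, so repeated requests stay within the same pods. The load signal only breaks ties inside that subset. Piece 4, the sizing policy, would decide how many pods each entry of `subsets` holds.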

Composition

  • With patterns/cost-based-load-balancing-llm: cost-based routing decides which pod within the subset gets the request; sticky sessions decide which subset is eligible.
  • With patterns/sticky-routing-for-aggregator-state: the same Dicer primitive used for metric-aggregator state preservation is reused for LLM session routing. Different application, same underlying pattern.
  • Below: Dicer's state transfer on reshard preserves prefix-cache continuity across rolling restarts (the same property Softstore uses to preserve ~85% hit rate during restarts).

Subset sizing

The Databricks post does not disclose subset-sizing policy. The trade-off curve has three axes:

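Subset size          | Cache hit rate | Blast radius                                                | Load balance
---------------------+----------------+-------------------------------------------------------------+-----------------------
1 (sticky-to-pod)    | maximal        | one pod (but a single point of failure for the workload)    | poor
Small (e.g. 3-5)     | high           | bounded to subset                                           | adequate within subset
Medium (e.g. 10-20)  | medium         | broader                                                     | good
Full fleet           | low            | unbounded                                                   | best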

A reasonable design point picks a small-to-medium subset based on:

  • Prefix-cache size on the pod (smaller cache = larger subset acceptable, since there is little cache to lose).
  • Workload's request rate (a low rate spread over a large subset leaves pods idle and caches cold, so the subset can be smaller; a high rate needs a larger subset for capacity).
  • Operational risk tolerance (very-bad-request-shape workloads pinned to small subset for blast-radius bound).
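Since the source does not disclose a sizing policy, the following is only one plausible heuristic, written as a sketch: size the subset for capacity at a target utilisation, then clamp it between a floor (for availability) and a cap (for blast radius). The function `subset_size` and every parameter name are hypothetical, and the heuristic ignores cache size entirely.

```python
import math


def subset_size(request_rate, per_pod_capacity, target_utilisation=0.6,
                min_size=2, max_size=8):
    """Hypothetical sizing rule: capacity-driven, clamped to [min, max].

    request_rate and per_pod_capacity share a unit (e.g. requests/second).
    min_size avoids a single point of failure; max_size caps blast radius
    and can be lowered for workloads with risky request shapes.
    """
    needed = math.ceil(request_rate / (per_pod_capacity * target_utilisation))
    return max(min_size, min(max_size, needed))


if __name__ == "__main__":
    for rate in (1, 50, 1000):
        print(rate, subset_size(rate, per_pod_capacity=20))
```

A low-rate workload is held at the floor rather than shrunk to one pod, and a very high-rate workload is held at the cap. At that point the operator would have to choose between raising the cap and accepting degraded latency.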

Trade-offs

Compared to…                                                       | Wins                                   | Loses
-------------------------------------------------------------------+----------------------------------------+----------------------------------------------------------------------
Random / P2C routing                                               | Cache locality; bounded blast radius   | Worse load balance globally
Hash-based sharding (consistent hashing)                           | Adapts to load via dynamic resharding  | More complex; needs shard manager
Sticky-to-single-pod                                               | Maximal cache hit rate                 | Single-point-of-failure per workload; no within-workload load balance
Sticky at the engine level (engine-internal prefix-aware-routing)  | Finer granularity (request-level)      | Doesn't bound blast radius across workloads; needs engine support

Risks and mitigations

  • Subset becomes overloaded by workload growth → workload's latency degrades. Mitigation: dynamic resharding (Dicer's split / replicate) to expand the subset.
  • Hot pod within a subset: one customer's traffic is unevenly spread across the pods in its subset. Mitigation: cost-based-LB-LLM on top of the sticky layer; each pod reports its MU load and within-subset routing picks the lowest-loaded pod.
  • Reshard during load → prefix-cache cold start when slice moves → spike in p99 TTFT. Mitigation: Dicer's state transfer during reshard preserves cache continuity.
  • Blast-radius bound depends on subset size — too large a subset weakens the property. Mitigation: cap subset size for high-risk workloads.
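The first and last mitigations pull in opposite directions: grow the subset under load, but never past a cap. A minimal sketch of that tension follows, assuming a simple average-load threshold. The name `expand_if_overloaded`, the threshold and the cap are hypothetical, and Dicer's actual split / replicate logic is not described in the source.

```python
def expand_if_overloaded(subset, spare_pods, load,
                         high_water=0.8, max_size=8):
    """Return a possibly larger subset; never exceeds max_size.

    subset      -- pods currently assigned to the workload
    spare_pods  -- candidate pods outside the subset
    load        -- pod name -> utilisation in [0, 1] (hypothetical metric)
    """
    average = sum(load.get(pod, 0.0) for pod in subset) / len(subset)
    if average <= high_water or len(subset) >= max_size or not spare_pods:
        return list(subset)
    # Add the least-loaded spare pod. It starts with a cold prefix cache
    # unless state is transferred to it during the reshard.
    coldest = min(spare_pods, key=lambda pod: load.get(pod, 0.0))
    return list(subset) + [coldest]


if __name__ == "__main__":
    load = {"a": 0.9, "b": 0.95, "c": 0.3, "d": 0.1}
    print(expand_if_overloaded(["a", "b"], ["c", "d"], load))
```

Lowering `max_size` for a high-risk workload keeps its blast radius bounded even when it is overloaded. The cost is that its own latency degrades instead of spreading to more pods.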

Open questions

  • What is the workload key? Customer-id, endpoint-id, session-id, or composite — not disclosed.
  • What's the subset-size policy? Static per workload or dynamic? Sized on which signals?
  • How does state-transfer on reshard work for prefix cache? Per-key cache migration would be expensive on long contexts; partial migration unclear.
  • Spill-over when subset is overloaded — to neighbouring subset, to all-subsets fallback, or reject?

Seen in
