PATTERN Cited by 1 source
Stateful LLM session routing¶
Pattern¶
Route each LLM workload's requests to a dynamically-assigned subset of replicas (not the full fleet), preserving the session-to-subset binding across requests via an auto-sharder. The subset is large enough for load balancing within it, small enough for KV-prefix-cache locality and bounded blast radius.
Canonical wiki disclosure (Source: sources/2026-05-27-databricks-reliable-llm-inference-at-scale):
"Dicer also provides stateful sessions, making request routing sticky. A workload's requests go to only a subset of servers, which improves cache hit rates (crucial for latency-sensitive workloads like coding agents) and limits blast radius."
The pattern is the LLM-serving instance of a more general sticky-routing-for-stateful-systems pattern (already documented as patterns/sticky-routing-for-aggregator-state for metric aggregators on Dicer; this page is the LLM-specific application).
When to use it¶
- Latency-sensitive workloads with strong prefix-cache locality — coding agents, conversational LLMs with long history, RAG applications where the same retrieval prefix is reused. KV-cache hit rate dominates p99 TTFT for these.
- Multi-tenant LLM platforms where one workload's failure modes (silent hangs, CPU spikes from multimodal traffic, OOMs) must not cascade to other tenants. Sticky sessions bound the blast radius.
- Inference engines that benefit from per-session warm state — speculative decoding draft caches, structured-output state, any per-request engine optimisation that pays a cold-start cost on the first request and amortises across subsequent ones.
When NOT to use it¶
- Workloads with no per-session state and no prefix-cache reuse — the stickiness adds complexity without payoff.
- Very low replica counts (e.g. 3 replicas) where any subset is effectively the full fleet.
- Workloads that need strict per-request load balancing (rare for LLMs, common for stateless CPU services).
Two purposes simultaneously¶
The pattern serves two distinct goals on the same primitive:
Purpose 1: Prefix-cache locality¶
KV-cache (and prefix-cache) is per-replica, not shared. If a workload's requests can land on any of N replicas, the chance that a request reaches a replica already holding its prefix is roughly 1/N of what it would be if the workload were pinned to one replica (assuming uniform routing and a prefix that is warm on only one replica). For long-prefix workloads (coding agents):
- p99 TTFT under random routing: prefill cost on every request except the rare cache hit.
- p99 TTFT under sticky-to-1: prefill cost on first request, near-zero on subsequent.
Sticky-to-subset is the middle ground: limit dispersion to a small subset of k replicas, retain roughly k-fold throughput within the subset, and retain a high (~80%+) cache hit rate per replica within the subset.
See concepts/kv-cache, concepts/session-affinity-prompt-caching, concepts/prefix-aware-routing.
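The dispersion argument above can be made concrete with a small calculation. This is a minimal sketch under a deliberately simple model that is an assumption of this example, not something the source describes: a session sends a sequence of requests sharing one prefix, each request is routed uniformly at random over the eligible replicas, caches never evict, and a request counts as a hit when an earlier request of the same session already reached the same replica. The function name `expected_hit_rate` is hypothetical.

```python
def expected_hit_rate(eligible_replicas: int, requests: int) -> float:
    """Expected prefix-cache hit rate for one session (toy model).

    Assumptions (illustrative only): uniform random routing over
    `eligible_replicas`, no cache eviction, one shared prefix.
    """
    if eligible_replicas < 1 or requests < 1:
        raise ValueError("eligible_replicas and requests must be >= 1")
    # Probability that one earlier request went to a different replica.
    miss_per_prior = 1.0 - 1.0 / eligible_replicas
    # Request i (0-based) hits unless all i earlier requests went elsewhere.
    hits = sum(1.0 - miss_per_prior ** i for i in range(requests))
    return hits / requests


if __name__ == "__main__":
    for k in (1, 4, 64):
        print(k, round(expected_hit_rate(k, 20), 3))
```

In this model the hit rate falls as the eligible set grows, which is the reason for restricting a workload to a subset. Real routers are load-aware rather than uniform, and real caches evict, so the numbers are only directional.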
Purpose 2: Bounded blast radius¶
A workload with a problematic request shape (triggers a silent hang, saturates CPU preprocessing, OOMs the GPU pod) can only affect its assigned subset, not the full fleet. Sticky routing bounds the operational impact of one tenant's failure mode on other tenants.
Without sticky routing, a workload-induced fault propagates across the fleet as the bad requests fan out via P2C (power-of-two-choices) load balancing; with sticky routing, only the workload's subset is impacted. The remaining fleet keeps serving.
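The blast-radius property can be expressed as a set computation. The following is a hedged sketch: the `assignments` mapping, the workload and pod names, and the `blast_radius` helper are all hypothetical and only illustrate the reasoning, not how Dicer or Axon track assignments.

```python
def blast_radius(
    assignments: dict[str, set[str]], faulty: str
) -> tuple[set[str], set[str]]:
    """Return (pods exposed to the faulty workload, other workloads sharing them)."""
    bad_pods = set(assignments[faulty])
    collateral = {
        workload
        for workload, pods in assignments.items()
        if workload != faulty and pods & bad_pods
    }
    return bad_pods, collateral


if __name__ == "__main__":
    fleet = {"p1", "p2", "p3", "p4", "p5"}
    # Sticky routing: each workload is bound to a small subset.
    sticky = {"w1": {"p1", "p2"}, "w2": {"p3", "p4"}, "w3": {"p2", "p5"}}
    # Fleet-wide routing: every workload may reach every pod.
    fleet_wide = {w: set(fleet) for w in sticky}
    print(blast_radius(sticky, "w1"))
    print(blast_radius(fleet_wide, "w1"))
```

With sticky subsets, only workloads whose subsets overlap the faulty workload's pods are exposed; with fleet-wide routing, every workload is.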
Structural shape (Dicer-backed)¶
    ┌─────────────────────────────────────┐
    │ Workload (customer / endpoint /     │
    │           session id)               │
    └──────────────┬──────────────────────┘
                   │ key for SliceKey hash
                   ▼
    ┌─────────────────────────────────────┐
    │ Dicer Assignment                    │
    │ - Slice (range of SliceKeys) → pod  │
    │   subset                            │
    │ - Reshard via split / merge /       │
    │   replicate as load shifts          │
    └──────────────┬──────────────────────┘
                   │ assigned subset
                   ▼
    ┌─────────────────────────────────────┐
    │ Axon (router)                       │
    │ - Within subset, pick best pod by   │
    │   MU load (cost-based-LB-LLM)       │
    │ - Forward request                   │
    └──────────────┬──────────────────────┘
                   │
                   ▼
    ┌─────────────────────────────────────┐
    │ Inference Runtime                   │
    │ - Serves request                    │
    │ - Maintains per-pod prefix cache    │
    └─────────────────────────────────────┘
The pattern requires four pieces:
- A session/workload key that is stable per workload — customer-id, endpoint-id, session-id, or composite.
- Dynamic shard assignment mapping the keyspace to pod subsets, with reshard support (split / merge / replicate / move) for adapting to load changes.
- A router that does the keyspace lookup and forwards within the assigned subset.
- A subset sizing policy that balances cache-hit-rate and blast-radius bound against load-balance.
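Dicer provides the dynamic shard assignment and the reshard operations (the second piece, and the mechanism behind the fourth); Axon is the router (the third piece).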
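The flow in the diagram (key, then slice lookup, then a within-subset pick) can be sketched in a few lines. This is an illustrative sketch only: it does not use Dicer's or Axon's real APIs, and `slice_key`, `Assignment`, `route`, the 64-bit keyspace, and the SHA-256 hash are all assumptions made for the example. The `load` values stand in for whatever per-pod load signal the router receives.

```python
import hashlib
from bisect import bisect_right
from dataclasses import dataclass


def slice_key(workload_key: str) -> int:
    """Hash a stable workload key into a 64-bit keyspace (stand-in for a SliceKey)."""
    digest = hashlib.sha256(workload_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


@dataclass
class Assignment:
    """Maps contiguous key ranges (slices) to pod subsets.

    starts[i] is the inclusive lower bound of slice i; slice i ends where
    slice i + 1 begins. starts must be sorted and starts[0] must be 0.
    """

    starts: list[int]
    subsets: list[list[str]]

    def subset_for(self, key: int) -> list[str]:
        return self.subsets[bisect_right(self.starts, key) - 1]


def route(workload_key: str, assignment: Assignment, load: dict[str, float]) -> str:
    """Pick the least-loaded pod inside the workload's assigned subset."""
    subset = assignment.subset_for(slice_key(workload_key))
    return min(subset, key=lambda pod: load[pod])


if __name__ == "__main__":
    half = 1 << 63
    assignment = Assignment(
        starts=[0, half],
        subsets=[["pod-a", "pod-b", "pod-c"], ["pod-d", "pod-e", "pod-f"]],
    )
    load = {"pod-a": 0.7, "pod-b": 0.2, "pod-c": 0.5,
            "pod-d": 0.4, "pod-e": 0.9, "pod-f": 0.1}
    print(route("customer-42/endpoint-7", assignment, load))
```

The subset is fixed by the key, so repeated requests from the same workload stay on the same few pods, while the load-based pick still spreads requests inside that subset. Resharding would amount to editing `starts` and `subsets`; how a real shard manager does that is out of scope for this sketch.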
Composition¶
- With patterns/cost-based-load-balancing-llm: cost-based routing decides which pod within the subset gets the request; sticky sessions decide which subset is eligible.
- With patterns/sticky-routing-for-aggregator-state: the same Dicer primitive used for metric-aggregator state preservation is reused for LLM session routing. Different application, same underlying pattern.
- Below: Dicer's state transfer on reshard preserves prefix-cache continuity across rolling restarts (the same property Softstore uses to preserve ~85% hit rate during restarts).
Subset sizing¶
The Databricks post does not disclose subset-sizing policy. The trade-off curve has three axes:
| Subset size | Cache hit rate | Blast radius | Load balance |
|---|---|---|---|
| 1 (sticky-to-pod) | maximal | smallest (one pod), but a single point of failure for the workload | poor |
| Small (e.g. 3-5) | high | bounded to subset | adequate within subset |
| Medium (e.g. 10-20) | medium | broader | good |
| Full fleet | low | unbounded | best |
A reasonable design point picks a small-to-medium subset based on:
- Prefix-cache size on the pod (a smaller cache makes a larger subset acceptable, since there is less cached state to lose).
- Workload's request rate (a low rate spread over a large subset leaves pods idle and caches cold, so the subset can be smaller; a high rate needs a larger subset for capacity).
- Operational risk tolerance (workloads with known-bad request shapes are pinned to a small subset to keep the blast radius bounded).
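Since the source does not disclose a sizing policy, the following is purely a hypothetical heuristic showing how the request-rate and risk signals above might combine. Every name and default here (`subset_size`, `per_pod_capacity`, `headroom`, `min_size`, `max_size`) is an assumption of this sketch.

```python
import math


def subset_size(
    request_rate: float,
    per_pod_capacity: float,
    fleet_size: int,
    min_size: int = 2,
    max_size: int = 8,
    headroom: float = 0.7,
) -> int:
    """Hypothetical sizing rule, not a disclosed policy.

    - Capacity: enough pods to serve `request_rate` at `headroom` utilisation.
    - Floor (`min_size`): avoid a single point of failure.
    - Cap (`max_size`): keep blast radius and cache dispersion bounded.
    """
    needed = math.ceil(request_rate / (per_pod_capacity * headroom))
    bounded = max(min_size, min(needed, max_size))
    # A subset can never exceed the fleet and always has at least one pod.
    return max(1, min(fleet_size, bounded))


if __name__ == "__main__":
    print(subset_size(request_rate=30, per_pod_capacity=10, fleet_size=100))
```

Note that the cap wins over the capacity estimate in this sketch, so a very high-rate workload would be under-provisioned unless the cap is raised; a real policy would have to decide which constraint takes priority.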
Trade-offs¶
| Compared to… | This pattern wins | This pattern loses |
|---|---|---|
| Random / P2C routing | Cache locality; bounded blast radius | Worse load balance globally |
| Hash-based sharding (consistent hashing) | Adapts to load via dynamic resharding | More complex; needs shard manager |
| Sticky-to-single-pod | No single point of failure per workload; load balance within the workload | Lower cache hit rate than the single-pod maximum |
| Sticky at the engine level (engine-internal prefix-aware-routing) | Bounds blast radius across workloads; needs no engine support | Coarser granularity (workload-level rather than request-level) |
Risks and mitigations¶
- Subset becomes overloaded by workload growth → workload's latency degrades. Mitigation: dynamic resharding (Dicer's split / replicate) to expand the subset.
- Hot pod within a subset: one customer's traffic is unevenly spread across the pods in its subset. Mitigation: cost-based-LB-LLM on top of the sticky layer; each pod reports its MU load and within-subset routing picks the least-loaded pod.
- Reshard during load → prefix-cache cold start when slice moves → spike in p99 TTFT. Mitigation: Dicer's state transfer during reshard preserves cache continuity.
- Blast-radius bound depends on subset size — too large a subset weakens the property. Mitigation: cap subset size for high-risk workloads.
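The first and last mitigations pull in opposite directions: grow the subset when it is overloaded, but cap its size. A minimal sketch of that interaction follows. It is not Dicer's replicate operation; the function `expand_if_overloaded`, the mean-load trigger, and the default `threshold` and `max_size` values are assumptions chosen for illustration.

```python
def expand_if_overloaded(
    subset: list[str],
    spare_pods: list[str],
    load: dict[str, float],
    threshold: float = 0.8,
    max_size: int = 8,
) -> list[str]:
    """Add the least-loaded spare pod when the subset's mean load is too high.

    Returns a new list; the subset is left unchanged when it is under the
    threshold, already at the cap, or no spare pod is available.
    """
    mean_load = sum(load[pod] for pod in subset) / len(subset)
    if mean_load <= threshold or len(subset) >= max_size or not spare_pods:
        return list(subset)
    coolest = min(spare_pods, key=lambda pod: load.get(pod, 0.0))
    return list(subset) + [coolest]


if __name__ == "__main__":
    load = {"a": 0.9, "b": 0.95, "c": 0.2, "d": 0.1}
    print(expand_if_overloaded(["a", "b"], ["c", "d"], load))
```

The newly added pod starts with a cold prefix cache, which is the reshard risk listed above; this sketch does nothing to transfer state.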
Open questions¶
- What is the workload key? Customer-id, endpoint-id, session-id, or composite — not disclosed.
- What's the subset-size policy? Static per workload or dynamic? Sized on which signals?
- How does state-transfer on reshard work for prefix cache? Per-key cache migration would be expensive on long contexts; partial migration unclear.
- Spill-over when subset is overloaded — to neighbouring subset, to all-subsets fallback, or reject?
Seen in¶
- sources/2026-05-27-databricks-reliable-llm-inference-at-scale — first canonical wiki disclosure of stateful (sticky) session routing for LLM serving on Databricks. Two-purpose framing (cache-hit-rate + blast-radius) on Dicer substrate via Axon. Implicit case for coding-agent latency-sensitive workloads.
Related¶
- concepts/sticky-routing — the broader concept.
- concepts/kv-cache / concepts/session-affinity-prompt-caching / concepts/prefix-aware-routing — the cache-locality companion concepts.
- concepts/blast-radius — the operational property being bounded.
- concepts/multi-tenant-llm-capacity-allocation — the customer-facing context.
- systems/databricks-axon / systems/dicer — the canonical implementation.
- systems/databricks-model-serving — the parent platform.
- patterns/cost-based-load-balancing-llm — composes within the subset.
- patterns/sticky-routing-for-aggregator-state — the sibling Dicer-application pattern at the metrics altitude.