PATTERN Cited by 1 source

Cost-based load balancing (LLM)

Pattern

Route LLM requests across replicas using server load measured in model units — not active request count, not RPS — and route through an auto-sharder that supports stateful sessions for prefix-cache locality and blast-radius bounding.

The canonical wiki implementation: Axon, the Databricks LLM data-plane router built on top of Dicer, with model units as the load metric:

"Today, we use Dicer, Databricks' auto-sharder, to dynamically route workloads across servers. Without load-aware routing, long-context requests cause individual servers to become hotspots while others sit underutilized. We integrated model units with Dicer so that routing decisions are based on server load in model units rather than traditional request-based heuristics." (Source: sources/2026-05-27-databricks-reliable-llm-inference-at-scale)

When to use it

  • High request-cost variance: when request cost can vary by 10× or more across the natural distribution (long-context coding, multimodal, agentic). See concepts/non-uniform-llm-request-cost.
  • Low server count per model: when there are tens to low hundreds of replicas per model deployment, not thousands. Power-of-two-choices (P2C) works well at large scale; cost-based routing wins at small server counts, where each routing decision matters.
  • High base latency: when LLM latency is hundreds of milliseconds or seconds, not milliseconds. A bad placement penalises the request for its whole lifetime, while the extra routing work is small relative to request latency, so the decision is worth more thought.
  • Capacity contracts to enforce: when the platform offers per-customer reserved capacity in MUs/min — see concepts/multi-tenant-llm-capacity-allocation. The same currency drives admission and routing.
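
To make the first criterion concrete, here is a minimal sketch of how an active-request count and an MU-load scalar can disagree on the same fleet state. The `Pod` class, the pod names, and the MU values are illustrative assumptions, not figures from the source.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    name: str
    # MU cost of each in-flight request on this pod (illustrative values).
    in_flight_mu: list[float] = field(default_factory=list)

    @property
    def active_requests(self) -> int:
        return len(self.in_flight_mu)

    @property
    def mu_load(self) -> float:
        return sum(self.in_flight_mu)


def pick_by_active_requests(pods: list[Pod]) -> Pod:
    """Classical heuristic: fewest in-flight requests wins."""
    return min(pods, key=lambda p: p.active_requests)


def pick_by_mu_load(pods: list[Pod]) -> Pod:
    """Cost-based heuristic: lowest in-flight model-unit load wins."""
    return min(pods, key=lambda p: p.mu_load)


pods = [
    Pod("pod-a", [40.0]),           # one long-context request
    Pod("pod-b", [1.0, 1.5, 2.0]),  # three short requests
]

print(pick_by_active_requests(pods).name)  # pod-a: fewest requests, heaviest load
print(pick_by_mu_load(pods).name)          # pod-b: more requests, far less work
```

With uniform request cost the two functions would agree; the gap only opens when a single request can cost many times more than another.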

When NOT to use it

  • Uniform-cost workloads: classical CPU services where requests are roughly the same cost. P2C with active_requests is fine and simpler. (See concepts/power-of-two-choices.)
  • Very large server counts: thousands of replicas where the P2C-balls-in-bins probabilistic guarantee dominates other considerations.
  • Stateless serving with no cache locality: if there's no KV prefix cache to preserve and no per-customer affinity to enforce, the simpler P2C path is enough.
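
For comparison, the simpler P2C-with-active-requests path mentioned above can be sketched in a few lines. The replica names and the toy workload (requests that never complete) are assumptions made only to keep the example short.

```python
import random


def p2c_pick(active_requests: dict[str, int], rng: random.Random) -> str:
    """Sample two distinct replicas and return the one with fewer active requests."""
    first, second = rng.sample(sorted(active_requests), 2)
    return first if active_requests[first] <= active_requests[second] else second


rng = random.Random(0)
active = {f"replica-{i}": 0 for i in range(8)}
for _ in range(800):
    # Toy run: every request has the same cost and none of them finish.
    active[p2c_pick(active, rng)] += 1

spread = max(active.values()) - min(active.values())
print(active, spread)
```

The router needs no cost model and only two load lookups per request, which is why it stays attractive when request costs are roughly uniform.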

Structural pieces

        ┌─────────────────────────────────────┐
        │ Per-pod MU load (current MUs        │
        │ in flight, scalar)                  │
        └────┬────────────────────────────────┘
             │   reported by pod
        ┌─────────────────────────────────────┐
        │ Dicer Assigner                      │
        │ - Slicelet collects per-key load    │
        │ - Updates Assignment based on MU    │
        │   load + health + termination       │
        └────┬────────────────────────────────┘
             │   eventually-consistent assignment
        ┌─────────────────────────────────────┐
        │ Axon (router)                       │
        │ - Lookup: SliceKey(workload) →      │
        │   Assigned pod subset               │
        │ - Sample best pod by MU load        │
        │ - Forward request                   │
        └─────────────────────────────────────┘

The pattern requires three pieces:

  1. A request-cost predictor that emits a model-unit cost at admission time. Coefficients are model+hardware-specific and benchmarked offline.
  2. A pod-side load reporter that aggregates in-flight MUs and exposes the scalar to the router. In Dicer's case, the Slicelet reports per-key load to the Assigner asynchronously.
  3. A router that looks up the assigned pod subset for a request's session/workload key, and routes to the best pod by MU load within that subset. The Axon → Dicer Slicelet → Assigner chain implements this.
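
The three pieces can be sketched together as follows. This is an illustration under stated assumptions, not the Axon or Dicer implementation: the linear cost formula, the coefficient values, the class names, and the synchronous in-process load lookup are all hypothetical (the source describes load reporting as asynchronous and does not disclose the cost formula).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostCoefficients:
    """Hypothetical weights for one model + hardware pair, standing in for offline benchmarks."""
    mu_per_1k_prompt_tokens: float
    mu_per_1k_output_tokens: float


def predict_mu(coeffs: CostCoefficients, prompt_tokens: int, expected_output_tokens: int) -> float:
    """Piece 1: estimate a request's model-unit cost at admission time (assumed linear)."""
    return (coeffs.mu_per_1k_prompt_tokens * prompt_tokens
            + coeffs.mu_per_1k_output_tokens * expected_output_tokens) / 1000.0


class PodLoadReporter:
    """Piece 2: aggregates in-flight MUs on one pod into a single scalar."""

    def __init__(self) -> None:
        self._in_flight: dict[str, float] = {}

    def admit(self, request_id: str, mu: float) -> None:
        self._in_flight[request_id] = mu

    def finish(self, request_id: str) -> None:
        self._in_flight.pop(request_id, None)

    def load(self) -> float:
        return sum(self._in_flight.values())


class Router:
    """Piece 3: look up the assigned pod subset, then pick its lowest-MU pod."""

    def __init__(self, assignment: dict[str, list[str]], reporters: dict[str, PodLoadReporter]) -> None:
        self.assignment = assignment  # workload key -> assigned pod names
        self.reporters = reporters    # pod name -> load reporter

    def route(self, workload_key: str, request_id: str, mu: float) -> str:
        subset = self.assignment[workload_key]
        pod = min(subset, key=lambda name: self.reporters[name].load())
        self.reporters[pod].admit(request_id, mu)
        return pod


coeffs = CostCoefficients(mu_per_1k_prompt_tokens=1.0, mu_per_1k_output_tokens=4.0)
reporters = {name: PodLoadReporter() for name in ("pod-0", "pod-1", "pod-2")}
router = Router({"tenant-a": ["pod-0", "pod-1"]}, reporters)

long_req = predict_mu(coeffs, prompt_tokens=8000, expected_output_tokens=500)
short_req = predict_mu(coeffs, prompt_tokens=200, expected_output_tokens=100)
first = router.route("tenant-a", "r1", long_req)    # both pods idle: first in subset wins
second = router.route("tenant-a", "r2", short_req)  # goes to the idle pod
third = router.route("tenant-a", "r3", short_req)   # still avoids the pod holding r1
print(first, second, third)
```

Note that `pod-2` never receives traffic for `tenant-a` even though it is idle: the subset lookup bounds where a workload can land, and MU load only ranks pods inside that bound.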

Composition with stateful sessions

The pattern composes with patterns/stateful-llm-session-routing — sticky routing within a per-workload pod subset. Together:

  • Cost-based routing decides which pod within the subset gets this request, picking the lowest-MU pod.
  • Stateful sessions decide which subset is eligible, ensuring cache locality and bounded blast radius.

Without the sticky-session layer, cost-based routing is correct but loses prefix-cache hit rate (every request can land on any pod). Without cost-based routing, sticky sessions preserve cache but distribute load unevenly within the subset.
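
A minimal sketch of the two layers working together, assuming a hash-based window over the pod list as a toy stand-in for the assignment (Dicer's actual assignment algorithm is not described here; `eligible_subset`, `route`, and the key format are hypothetical names):

```python
import hashlib


def eligible_subset(workload_key: str, pods: list[str], subset_size: int) -> list[str]:
    """Sticky layer: the same key always maps to the same pods (assumes subset_size <= len(pods))."""
    digest = hashlib.sha256(workload_key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(pods)
    return [pods[(start + i) % len(pods)] for i in range(subset_size)]


def route(workload_key: str, pods: list[str], mu_load: dict[str, float], subset_size: int = 2) -> str:
    """Cost layer: choose the lowest-MU pod, but only among the eligible subset."""
    subset = eligible_subset(workload_key, pods, subset_size)
    return min(subset, key=lambda pod: mu_load[pod])


pods = [f"pod-{i}" for i in range(6)]
subset = eligible_subset("tenant-a/session-42", pods, subset_size=2)

mu_load = {pod: 0.0 for pod in pods}                 # pods outside the subset are idle
mu_load[subset[0]], mu_load[subset[1]] = 30.0, 12.0  # both subset pods are busy
chosen = route("tenant-a/session-42", pods, mu_load)
print(subset, chosen)
```

The request lands on the lighter of the two subset pods even though idle pods exist elsewhere, which is the trade the composition makes: cache locality and bounded blast radius first, load balance within that bound second.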

Composition with cost-based autoscaling

The same MU-load primitive feeds the autoscaler — see patterns/model-units-utilization-autoscaling. Same signal, two control loops: per-request routing reads per-pod load; per-fleet autoscaling reads averaged-across-pods load.
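
As a rough illustration of the two control loops reading one signal, the sketch below routes on the per-pod minimum and scales on the fleet average. The proportional scaling rule (desired = ceil(current × utilization / target)) is an assumption borrowed from common autoscaler designs; the source does not state which rule is used, and the capacity and target values are made up.

```python
import math


def desired_replicas(pod_mu_loads: list[float], mu_capacity_per_pod: float,
                     target_utilization: float) -> int:
    """Fleet loop: scale so that average MU utilization approaches the target."""
    avg_utilization = sum(pod_mu_loads) / (len(pod_mu_loads) * mu_capacity_per_pod)
    return max(1, math.ceil(len(pod_mu_loads) * avg_utilization / target_utilization))


def routing_target(pod_mu_loads: list[float]) -> int:
    """Request loop: index of the pod with the lowest in-flight MU load."""
    return min(range(len(pod_mu_loads)), key=pod_mu_loads.__getitem__)


loads = [90.0, 70.0, 80.0, 80.0]  # in-flight MUs per pod (illustrative)
scale_up = desired_replicas(loads, mu_capacity_per_pod=100.0, target_utilization=0.6)
scale_down = desired_replicas([20.0, 20.0, 20.0, 20.0], 100.0, 0.6)
print(routing_target(loads), scale_up, scale_down)
```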

Trade-offs

Compared to… | Wins | Loses
P2C-with-active-requests | Correct under non-uniform request cost; better tail latency | Coefficient maintenance cost; bad coefficient = bad routing
Round-robin | Massively better at high QPS | Same coefficient-maintenance cost
Token-aware-LB (Cloudflare) | Single scalar instead of per-dimension counts | Aggregation hides per-dimension capacity (e.g. can't see prefill-saturated vs decode-saturated separately)
Pure consistent-hashing | Better load balance | Loses cache-locality unless paired with sticky sessions

Implementation altitude

Cost-based routing in this pattern is implemented at the router altitude (Axon), not at the scheduler altitude (vLLM internal batcher). The two altitudes do different things:

  • Router-altitude (Axon): which pod gets this request?
  • Scheduler-altitude (engine): given requests on this pod, what order do they execute, what batch sizes, etc.?

A correct deployment uses both: cost-based routing at the router altitude, prefix-aware-routing / token-budget-aware batching at the scheduler altitude. The Databricks post discusses the router; vLLM and similar engines handle the scheduler.

Risks and mitigations

  • Bad cost coefficients → biased routing → invisible hotspots. Mitigation: re-benchmark on every model / hardware / kernel change; monitor per-pod p99 for drift.
  • Cost-prediction error on output length → admission-time MU estimate is wrong → in-flight MU load is biased. Mitigation: continuously update per-request cost from observed generation as decode progresses; correct pod load tracking in real time.
  • Subset sized too small → load imbalance within a customer's pods even with good routing. Mitigation: dynamic subset resizing based on workload demand (Dicer's split / merge / replicate primitives).
  • Subset sized too large → blast-radius bounding weakens. Mitigation: size the subset jointly against the cache-hit-rate target and the blast-radius bound.
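
The second mitigation (correcting the admission-time estimate as decode progresses) could look roughly like this. The update rule is one simple assumption: raise the output-length estimate whenever observed generation exceeds it, and drop the request's load when it finishes. `InFlightMuTracker` and its parameters are hypothetical names.

```python
class InFlightMuTracker:
    """Tracks a pod's in-flight MU load and corrects it from observed generation."""

    def __init__(self, mu_per_output_token: float) -> None:
        self.mu_per_output_token = mu_per_output_token
        # request id -> (prompt MU, current estimate of total output tokens)
        self._requests: dict[str, tuple[float, int]] = {}

    def admit(self, request_id: str, prompt_mu: float, predicted_output_tokens: int) -> None:
        self._requests[request_id] = (prompt_mu, predicted_output_tokens)

    def observe(self, request_id: str, generated_tokens: int, finished: bool = False) -> None:
        if finished:
            # Completion releases the load, which also corrects any overestimate.
            self._requests.pop(request_id, None)
            return
        prompt_mu, estimate = self._requests[request_id]
        # Underestimate correction: never believe fewer tokens than already generated.
        self._requests[request_id] = (prompt_mu, max(estimate, generated_tokens))

    def load(self) -> float:
        return sum(prompt_mu + self.mu_per_output_token * tokens
                   for prompt_mu, tokens in self._requests.values())


tracker = InFlightMuTracker(mu_per_output_token=0.01)
tracker.admit("r1", prompt_mu=8.0, predicted_output_tokens=100)
at_admission = tracker.load()
tracker.observe("r1", generated_tokens=300)   # decode ran past the prediction
after_correction = tracker.load()
tracker.observe("r1", generated_tokens=350, finished=True)
after_finish = tracker.load()
print(at_admission, after_correction, after_finish)
```

A production tracker would likely also project remaining tokens rather than only reacting once the prediction is exceeded; that refinement is left out here.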

Open questions

  • Subset sizing policy — how is the per-workload pod-subset size chosen? Cache-hit-rate vs blast-radius vs load-balance trade-off curve is not disclosed.
  • Spill-over policy — when the assigned subset is overloaded, does Axon route outside the subset, queue, or reject?
  • Cross-region routing — if a workload's primary region is saturated, can Axon route to another region with intact session semantics?

Seen in

  • sources/2026-05-27-databricks-reliable-llm-inference-at-scale — canonical wiki disclosure of cost-based load balancing for LLM serving via Dicer integrated with model units. Production substrate for 125T+ tokens/month of LLM traffic across frontier open-source (Kimi, Qwen) and proprietary (OpenAI, Gemini, Claude) models.