CONCEPT Cited by 1 source

Non-uniform LLM request cost¶

Definition¶

Non-uniform LLM request cost is the structural property of LLM inference where request cost varies by 1-3 orders of magnitude across the natural request distribution, in contrast to classical CPU-service request distributions where cost is approximately uniform.

The Databricks framing (Source: sources/2026-05-27-databricks-reliable-llm-inference-at-scale):

"the cost to serve a request is highly variable and hard to estimate a priori. […] additionally, latency is dominated by output token generation, but up-front estimation of cost is hard, since it's difficult to predict how long the model will talk for."

The post's Figure 3 caption nails the contrast:

"Cost of a request varies non-linearly and in complex multidimensional ways, depending on the input and output token distribution. This is in sharp contrast to classical AI systems where latency per request is roughly uniformly distributed."

Sources of non-uniformity¶

Five orthogonal dimensions of cost variance, all of which can co-occur on a single request:

Dimension	Effect	Magnitude
Input length	Longer prompts = more prefill compute	Linear in tokens; long-context (100K+) is ~1,000× short autocomplete (~100 tokens)
Output length	Longer responses = more decode steps	Linear in output tokens; coding-agent responses (~2K) are ~40× short responses (~50)
Prefill vs decode	Decode is more memory-bound, more expensive per token	Often ~2-5× per token
Prefix caching	Cached prefix → near-zero prefill cost	Discounts the entire prefill cost up to cache boundary
Multi-modality	Image / audio / video adds vision-encoder + CPU preprocessing	Multimodal request can be 5-20× cost of equivalent text

These compose multiplicatively. A long-context multimodal coding-agent request can be >100× the cost of a short text-only autocomplete request on the same model and hardware.

Why this breaks classical load metrics¶

Classical load metrics (request count, active requests, RPS) treat every request as cost 1. For non-uniform-cost workloads, this fails in two ways:

Pod load is misrepresented: a pod handling 10 long-context requests is at near-saturation; a pod handling 10 short requests is barely warm. The load metric reports both as "10 active requests".
Misrouting amplifies tail latency: if a router places several expensive requests on the same pod (because count-of-requests looks balanced), that pod's queue depth in real work explodes while neighbouring pods sit idle.
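
The two failure modes above can be illustrated with a minimal sketch. The `Pod` class, the two picker functions, and the per-request cost numbers are all hypothetical and chosen only for illustration; this is not the Databricks router.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    name: str
    # Estimated cost (arbitrary units) of each in-flight request.
    in_flight: list = field(default_factory=list)

    @property
    def active_requests(self) -> int:
        # Classical metric: every request counts as 1.
        return len(self.in_flight)

    @property
    def outstanding_cost(self) -> float:
        # Cost-aware metric: sum of estimated per-request work.
        return sum(self.in_flight)


def pick_by_count(pods):
    """Choose the pod with the fewest active requests."""
    return min(pods, key=lambda p: p.active_requests)


def pick_by_cost(pods):
    """Choose the pod with the least outstanding estimated work."""
    return min(pods, key=lambda p: p.outstanding_cost)


# Assumed costs: 10 long-context requests vs 11 short ones.
pod_a = Pod("a", [100_000.0] * 10)
pod_b = Pod("b", [100.0] * 11)

print(pick_by_count([pod_a, pod_b]).name)  # a
print(pick_by_cost([pod_a, pod_b]).name)   # b
```

The count-based picker sends the next request to the pod that is already near saturation, because 10 is less than 11; the cost-based picker sends it to the pod that is barely warm.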

This is the load-bearing argument the Databricks team gives for Axon retiring P2C-with-active-requests:

"LLM latencies tend to be high, server counts are lower than scaled out CPU systems, and the cost of misrouting is severe. Therefore, LLM serving necessitates a different approach."

The combination of high latency, low server count, and heavy misrouting cost compounds the structural error of a count-based metric on non-uniform-cost workloads.

Why the cost can't be predicted exactly¶

The post acknowledges:

"Up-front estimation of cost is hard, since it's difficult to predict how long the model will talk for."

Output length is input-dependent and stochastic:

Different prompts elicit responses of different lengths.
The same prompt at different temperatures may produce different lengths.
Stop-sequence / max-tokens parameters bound the output but the realised length within the bound is variable.

This means cost-based routing has to use a prediction, not a known cost: predict the expected output length from the input shape, route accordingly, and accept that some routing decisions will rest on estimates that turn out to be wrong. Model units are the productionised cost prediction Databricks chose.

Alternative cost predictors in the literature:

Token-based at admission (predict output ≈ K for a fixed K): simplest, but ignores input dependence.
Length-prediction model: train a small model to predict output length from input. More accurate, but costlier to build and run.
Adaptive bidding: the scheduler tracks per-request observed work and continuously updates the pod load estimate.
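
As a rough sketch of two of these strategies, the code below shows a fixed-K estimate at admission and one possible reading of the adaptive approach, where a pod's load estimate is revised as generated tokens are observed. The weights `ALPHA` and `BETA`, the default `k`, and every name here are assumptions for illustration; the source does not describe how any of these predictors are implemented.

```python
from typing import Optional

ALPHA, BETA = 1.0, 3.0  # hypothetical per-token weights for input and output


def fixed_k_estimate(input_tokens: int, k: int = 256,
                     max_tokens: Optional[int] = None) -> float:
    """Predict output of about k tokens for every prompt, capped by max_tokens."""
    predicted_output = k if max_tokens is None else min(k, max_tokens)
    return ALPHA * input_tokens + BETA * predicted_output


class AdaptivePodLoad:
    """Keeps per-request estimates and revises them as tokens are observed.

    Simplification: this tracks total estimated work per request, not the
    work remaining.
    """

    def __init__(self) -> None:
        # request_id -> (input_tokens, predicted_output_tokens)
        self._estimates = {}

    def admit(self, request_id: str, input_tokens: int,
              predicted_output: int) -> None:
        self._estimates[request_id] = (input_tokens, predicted_output)

    def observe(self, request_id: str, generated_so_far: int) -> None:
        # Once a request has generated more than predicted, the observed
        # count becomes the new lower bound on its output length.
        input_tokens, predicted = self._estimates[request_id]
        self._estimates[request_id] = (input_tokens,
                                       max(predicted, generated_so_far))

    def finish(self, request_id: str) -> None:
        self._estimates.pop(request_id, None)

    @property
    def load(self) -> float:
        return sum(ALPHA * i + BETA * o for i, o in self._estimates.values())


pod = AdaptivePodLoad()
pod.admit("r1", input_tokens=500, predicted_output=256)
print(pod.load)            # 1268.0
pod.observe("r1", 1000)    # the model talked longer than predicted
print(pod.load)            # 3500.0
```

The fixed-K estimate is cheap but blind to the prompt; the adaptive tracker corrects underestimates only after the request has already been placed, which is why it complements rather than replaces an admission-time prediction.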

The post does not specify which prediction strategy MUs use; the cost function α·input + β·output suggests output is also taken from a predictor or from the user-supplied max_tokens.

Quantifying the asymmetry (Databricks framing)¶

The post is explicit that β > α — "requests with long output cost more than those with long input" — but does not disclose the ratio. The cost asymmetry between prefill and decode is widely reported elsewhere as ~2-5× per-token decode-vs-prefill cost (memory-bandwidth-bound vs compute-bound), but exact numbers depend on the model, hardware, and batch policy.

For a hypothetical model with α = 1 and β = 3, a balanced 1,000-input / 1,000-output request costs 1 × 1000 + 3 × 1000 = 4,000 MUs, while a 5,000-input / 100-output coding-context request costs 1 × 5000 + 3 × 100 = 5,300 MUs: comparable load despite very different shapes. This is the structural property MU-based load balancing is built on.
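
The arithmetic above can be reproduced with a small helper. The function name `model_units` and the default weights are hypothetical; the source discloses the form α·input + β·output but not the actual coefficients.

```python
def model_units(input_tokens: int, output_tokens: int,
                alpha: float = 1.0, beta: float = 3.0) -> float:
    """Linear cost estimate: alpha per input token plus beta per output token."""
    return alpha * input_tokens + beta * output_tokens


balanced = model_units(1_000, 1_000)
coding_context = model_units(5_000, 100)

print(balanced)        # 4000.0
print(coding_context)  # 5300.0
```

Under these assumed weights the two requests differ by roughly a third in estimated cost, even though their input lengths differ by 5× and their output lengths by 10×.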

Operational consequences¶

A serving infrastructure that doesn't account for non-uniform request cost typically exhibits:

Bimodal latency distribution: most requests are fast, while a small fraction are extremely slow (the "expensive request stuck behind another expensive request" tail).
Hotspot pods: seemingly random pods become slow under load while others sit idle.
Autoscaler oscillation: a request-count-based autoscaler scales out under expensive-heavy traffic, then immediately scales back in when the expensive requests finish, creating flapping.
Capacity over-provisioning: operators size for the worst-case mix, leaving expensive tenants happy and cheap tenants paying for headroom they don't need.

Switching to cost-based routing + cost-based autoscaling typically delivers (per the Databricks disclosure):

Tail latency drop, especially for long-context workloads.
Higher utilisation at the same SLO (cheap requests pack denser).
Quieter autoscaler because the load metric reflects real work, not request count.
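
To see why a cost-based signal reflects real work better than a request count, consider a minimal sketch comparing the two on traffic windows with identical request counts. The weights, the per-pod targets, and the traffic mixes are all hypothetical; the source does not describe the autoscaling formula Databricks uses.

```python
import math

ALPHA, BETA = 1.0, 3.0  # hypothetical weights, matching the worked example on this page


def window_cost(requests) -> float:
    """Total estimated cost of a window of (input_tokens, output_tokens) pairs."""
    return sum(ALPHA * i + BETA * o for i, o in requests)


def desired_replicas(load: float, target_per_pod: float) -> int:
    """Replicas needed to keep per-pod load at or below the target."""
    return max(1, math.ceil(load / target_per_pod))


# Two windows with the same request count but very different shapes.
cheap = [(100, 50)] * 200
expensive = [(5_000, 2_000)] * 200

REQUESTS_PER_POD = 50     # assumed count-based target
COST_PER_POD = 100_000.0  # assumed cost-based target

by_count = [desired_replicas(len(w), REQUESTS_PER_POD) for w in (cheap, expensive)]
by_cost = [desired_replicas(window_cost(w), COST_PER_POD) for w in (cheap, expensive)]

print(by_count)  # [4, 4]
print(by_cost)   # [1, 22]
```

The count-based signal asks for the same number of replicas in both windows, so it over-provisions the cheap mix and under-provisions the expensive one; the cost-based signal separates them.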

Composition¶

Above: model units are the cost prediction; capacity allocation is the contract built on top.
Below: prefill/decode disaggregation is one architectural mitigation that physically separates the two cost regimes onto different pods.
Adjacent: concepts/multimodal-cpu-bottleneck is one of the cost dimensions; concepts/kv-cache's prefix-cache hit rate determines another (the prefix-caching discount on prefill).

Open questions¶

Output length prediction model: what does Databricks use, and how accurate is it?
Re-routing on cost-revision: if a request's actual cost diverges from prediction, can the platform re-route or evict? Standard answer is no; the request stays where it landed.
SLO-aware admission: does the platform reject requests that would push a pod over the SLO threshold, or only at the customer-budget level?

Seen in¶

sources/2026-05-27-databricks-reliable-llm-inference-at-scale — canonical wiki disclosure of non-uniform request cost as the structural property that necessitates a new load currency for LLM serving. Figure 3 contrast with classical-AI uniform-cost systems is the most direct framing.

concepts/model-units — the cost-prediction abstraction built on top.
concepts/prefill-decode-disaggregation — physical separation of the two cost regimes.
concepts/kv-cache — the prefix-cache state that contributes to cost variance.
concepts/multimodal-cpu-bottleneck — multimodal as a major cost-variance dimension.
concepts/multi-tenant-llm-capacity-allocation — the customer-facing reason cost prediction matters.
concepts/power-of-two-choices — the load-balancing primitive that's structurally unsuitable here.
concepts/token-aware-load-balancing — the structurally similar Cloudflare primitive that addresses non-uniformity at the token level.
patterns/cost-based-load-balancing-llm — the routing pattern that respects non-uniformity.
systems/databricks-axon — the production router that implements the pattern.