CONCEPT Cited by 1 source
Model units (LLM request cost abstraction)
Definition
A model unit (MU) is a Databricks-coined unit-of-account that quantifies the multi-dimensional cost of an LLM serving request as a single scalar load metric. Routing decisions, autoscaling decisions, and per-customer capacity allocations are all denominated in model units, not in request counts.
The defining quote (Source: sources/2026-05-27-databricks-reliable-llm-inference-at-scale):
"If we project that a replica can process a fixed number of model units per minute (e.g., 100), we can make the following assumptions: Requests with long input or output consume more model units, since fewer can complete in the same time window. Prefill and decode have different throughput characteristics, so requests with long output cost more than those with long input."
Why an LLM-specific load currency exists at all
In classical CPU serving, request cost is roughly uniform — a request that takes 10× longer than another is rare. In LLM serving the cost distribution is structurally non-uniform along several axes:
- Input length — prefill cost scales linearly with input tokens (compute-bound).
- Output length — decode cost scales linearly with output tokens (memory-bandwidth-bound), and output cost > input cost on a per-token basis because every decode step re-reads weights from HBM.
- Prefix caching — a request whose prefix is cached pays no prefill cost on the cached prefix, only on the unique tail.
- Multi-modality — image / audio / video inputs add separate vision-encoder cost on top of the language-model cost; preprocessing is CPU-bound on the same pod (see concepts/multimodal-cpu-bottleneck).
A request-count load metric is blind to all of this. A 5,000-input / 2,000-output coding-agent request and a 50-input / 50-output autocomplete request count as one request each, but their cost-on-server differs by >20×. The structural problem with request_count:
- A pod handling 10 short requests is at low load.
- A pod handling 10 long-context coding requests is at near-saturation load.
- Both look identical to a request-count router or autoscaler.
Model units address this asymmetry by assigning each request a cost in model units, computed at admission time from the request shape and the per-model, per-hardware cost coefficients. Routing and scaling then track aggregate model units in flight per replica, not aggregate requests in flight per replica. See concepts/non-uniform-llm-request-cost.
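The blind spot can be made concrete with a minimal Python sketch. Everything numeric here is an assumption for illustration: `ALPHA` and `BETA` are invented coefficients (the source does not disclose real values), and `estimate_mu` is a hypothetical two-term simplification that ignores prefix caching and multi-modality.

```python
from dataclasses import dataclass

# Hypothetical coefficients, for illustration only. Real values are
# benchmarked per (model, hardware) pair and are not disclosed.
ALPHA = 0.001  # MUs per input token (assumed)
BETA = 0.004   # MUs per output token (assumed; BETA > ALPHA)


@dataclass(frozen=True)
class Request:
    input_tokens: int
    output_tokens: int


def estimate_mu(req: Request) -> float:
    """Two-term cost estimate: prefill term plus decode term."""
    return ALPHA * req.input_tokens + BETA * req.output_tokens


def replica_load(requests):
    """Return (request count, aggregate MUs) for one replica."""
    return len(requests), sum(estimate_mu(r) for r in requests)


# Two replicas, each holding ten in-flight requests.
autocomplete = [Request(50, 50)] * 10
coding_agent = [Request(5000, 2000)] * 10

count_a, mu_a = replica_load(autocomplete)
count_b, mu_b = replica_load(coding_agent)

print(f"autocomplete replica: {count_a} requests, {mu_a:.2f} MUs")
print(f"coding-agent replica: {count_b} requests, {mu_b:.2f} MUs")
```

Under these assumed coefficients both replicas report ten requests, while the MU totals come out at about 2.5 and about 130. A router keyed on request count would treat the two replicas as interchangeable.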
The cost function (as disclosed)
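The cost of a request is modelled as a multi-dimensional function:

$$\mathrm{MU}(r) \approx \alpha \cdot \mathrm{input\_tokens}(r) + \beta \cdot \mathrm{output\_tokens}(r) + \gamma \cdot \mathrm{other\_features}(r)$$

where: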
- α, β, γ are coefficients determined by automated benchmarking for each model on each hardware type.
- β > α because decode dominates: "requests with long output cost more than those with long input."
- The other features term captures multi-modality contributions and is adjusted for prefix-caching savings: "Model units can be further adjusted for optimizations like prefix caching, and they must account for features like multi-modality."
The post is explicit about the limitation: "Such estimations are structurally imperfect." Model units are a good-enough scalar proxy, not an exact cost model. The structural-imperfection note is load-bearing — the autoscaler and router both budget for the imperfection rather than treating MUs as ground truth.
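A minimal sketch of such an estimator, assuming a linear form in α, β, and γ. The names `Coefficients`, `RequestShape`, and `model_units` are hypothetical, as are the coefficient values. Two further assumptions are not confirmed by the source: that prefix caching is handled by discounting cached input tokens, and that the "other features" term is a per-image charge. Output length is not known at admission, so the sketch takes an expected value as input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Coefficients:
    """Per-(model, hardware) coefficients. All values used below are invented."""
    alpha: float  # MUs per uncached input token (prefill)
    beta: float   # MUs per output token (decode)
    gamma: float  # MUs per unit of "other features" (assumed: per image)


@dataclass(frozen=True)
class RequestShape:
    input_tokens: int
    expected_output_tokens: int  # assumption: a limit or prediction known at admission
    cached_prefix_tokens: int = 0
    images: int = 0


def model_units(shape: RequestShape, c: Coefficients) -> float:
    """Estimate the MU cost of one request from its shape."""
    if c.beta <= c.alpha:
        raise ValueError("expected beta > alpha: decode costs more per token")
    # Assumed prefix-cache adjustment: only the uncached tail pays prefill cost.
    uncached = max(shape.input_tokens - shape.cached_prefix_tokens, 0)
    return (
        c.alpha * uncached
        + c.beta * shape.expected_output_tokens
        + c.gamma * shape.images
    )


coeffs = Coefficients(alpha=0.001, beta=0.004, gamma=0.5)
cold = model_units(RequestShape(4000, 500), coeffs)
warm = model_units(RequestShape(4000, 500, cached_prefix_tokens=3000), coeffs)
with_images = model_units(RequestShape(4000, 500, images=2), coeffs)

print(f"cold={cold:.2f} warm={warm:.2f} with_images={with_images:.2f}")
```

With these invented coefficients, a 3,000-token cache hit roughly halves the estimate for the same request, and two images add a fixed surcharge. The estimate remains a proxy: the expected output length can be wrong, which is one reason the consumers of MUs need headroom.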
What model units enable
Three distinct uses share the same primitive:
- Cost-based load balancing — the Axon router routes on server load measured in MUs, not in active request count. See patterns/cost-based-load-balancing-llm.
- Cost-based autoscaling — the autoscaler tracks the model unit utilisation ratio (current MUs / max MUs per replica) and scales when it crosses thresholds. See patterns/model-units-utilization-autoscaling.
- Multi-tenant capacity allocation as VM-equivalent guarantees — the unit-of-account that lets a multi-tenant LLM platform sell predictable capacity (e.g. "this customer has 1,000 MUs/min reserved") rather than best-effort capacity ("we'll try to serve you fast"). See concepts/multi-tenant-llm-capacity-allocation.
The single primitive serves both control loops (router + autoscaler) and the customer-facing capacity contract — an unusually broad abstraction. From the post: "a small number of expensive long-context requests can trigger different routing and scaling decisions than many cheap short requests." Same MU-aware view, two control loops.
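The autoscaling use can be sketched as a threshold rule on the utilisation ratio. This is a hedged illustration, not Databricks' policy: the thresholds (`scale_up_at`, `scale_down_at`), the `target` ratio, and the function names are all assumptions. The target sits below 1.0 to leave headroom for estimation error.

```python
import math


def utilisation_ratio(current_mus: float, max_mus_per_replica: float, replicas: int) -> float:
    """Fleet-wide MU utilisation: in-flight MUs over total MU capacity."""
    if replicas <= 0 or max_mus_per_replica <= 0:
        raise ValueError("replicas and max_mus_per_replica must be positive")
    return current_mus / (max_mus_per_replica * replicas)


def desired_replicas(
    current_mus: float,
    max_mus_per_replica: float,
    replicas: int,
    scale_up_at: float = 0.8,    # assumed threshold
    scale_down_at: float = 0.4,  # assumed threshold
    target: float = 0.6,         # assumed post-scaling utilisation
) -> int:
    """Return the replica count after one scaling decision."""
    ratio = utilisation_ratio(current_mus, max_mus_per_replica, replicas)
    if scale_down_at <= ratio <= scale_up_at:
        return replicas  # inside the dead band: no change
    # Resize so that utilisation lands near the target ratio.
    return max(1, math.ceil(current_mus / (max_mus_per_replica * target)))


print(desired_replicas(450, 100, 5))  # ratio 0.9: scale up
print(desired_replicas(300, 100, 5))  # ratio 0.6: hold
print(desired_replicas(100, 100, 5))  # ratio 0.2: scale down
```

The dead band between the two thresholds avoids oscillating on small load changes. A handful of long-context requests can push the ratio past the upper threshold even when the request count is flat.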
Relationship to neighbouring concepts
| Concept | Relationship |
|---|---|
| concepts/request-concurrency-as-autoscaling-signal | MUs are the next-evolution of request_concurrency for LLM serving. request_concurrency works when requests have approximately uniform cost; MUs are required when they don't. Same family (in-flight-cost autoscaling), different cost unit. |
| concepts/power-of-two-choices | P2C with active_requests is the previous load-metric-on-LLM design. The post explicitly retires it for LLM serving on the latency × server-count × misrouting-cost argument. P2C-with-MUs is conceptually possible but not the design Databricks chose. |
| concepts/token-aware-load-balancing | Cloudflare's per-endpoint prefill-tokens-in-flight + decode-tokens-in-flight signal is the closest sibling. MUs aggregate prefill, decode, prefix-cache, and multi-modality contributions into a single scalar; token-aware LB tracks the dimensions separately. In principle the scalar can be derived from the per-dimension signals given the coefficients, but the separate dimensions cannot be recovered from the scalar alone. |
| concepts/non-uniform-llm-request-cost | MUs are the quantification of non-uniform request cost; the concept page describes why requests are non-uniform, this page describes how Databricks turns that into a usable load metric. |
Composition
- MUs sit above the inference engine — measured per-request at admission, not derived from internal engine state.
- MUs sit below the customer-facing capacity contract — the customer sees MUs/min reserved, not raw QPS.
- MUs feed two control loops simultaneously:
  - Per-request loop: Axon's routing decision (current MU load by server, sample best server).
  - Per-fleet loop: autoscaler's scaling decision (mean MU utilisation across servers, scale up/down).
- MUs are re-derived per (model, hardware) pair — coefficient benchmarking happens automatically per model deployment.
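The per-request loop can be sketched as a router that tracks in-flight MUs per replica and sends each request to the least-utilised one. `MuRouter` and its methods are hypothetical names, and the selection rule (global minimum, ties broken by replica name) is an assumption. The source says only that Axon routes on MU load and does not describe its sampling strategy.

```python
from typing import Dict


class MuRouter:
    """Toy router: tracks in-flight MUs per replica and picks the least loaded."""

    def __init__(self, capacity_by_replica: Dict[str, float]) -> None:
        self._capacity = dict(capacity_by_replica)
        self._in_flight = {name: 0.0 for name in capacity_by_replica}

    def route(self, request_mus: float) -> str:
        """Choose the replica with the lowest MU utilisation and admit the request."""
        best = min(
            self._in_flight,
            key=lambda name: (self._in_flight[name] / self._capacity[name], name),
        )
        self._in_flight[best] += request_mus
        return best

    def complete(self, replica: str, request_mus: float) -> None:
        """Release the request's MUs when it finishes."""
        self._in_flight[replica] = max(self._in_flight[replica] - request_mus, 0.0)

    def load(self, replica: str) -> float:
        return self._in_flight[replica]


router = MuRouter({"a": 100.0, "b": 100.0})
first = router.route(13.0)                            # one expensive request
placements = [router.route(0.25) for _ in range(3)]   # three cheap requests
print(first, placements)
```

After the expensive request lands on one replica, the three cheap requests all go to the other, because its MU load stays lower. A request-count balancer would have alternated between the two.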
Open questions
- Coefficient values (α, β, γ) for any model are not disclosed.
- How often coefficients are re-derived — when models update, GPU types change, or kernel sets change.
- Multi-modal coefficient structure — separate per-input-image, per-input-audio coefficients, or rolled into γ.
- Prefix-caching adjustment formula — pre-discount on prefix-hit prediction, or post-discount on observed cache-hit?
- MU rate-limiting altitude — is rate-limiting denominated in MUs/min per customer, or in requests/min plus a separate cost budget? The post says "Requests go through rate limiting before reaching the data plane" but doesn't specify the unit.
- Spill-over behavior — if a customer exceeds their MU allocation, is it bounded (rejected at rate-limit) or best-effort-shed (queued, dropped if congested)?
Seen in
- sources/2026-05-27-databricks-reliable-llm-inference-at-scale: first canonical wiki disclosure of model units as the multi-dimensional LLM request-cost abstraction underlying Databricks' routing + autoscaling + capacity-allocation control loops at 125T+ tokens/month.
Related
- concepts/non-uniform-llm-request-cost — the structural property MUs quantify.
- concepts/model-unit-utilization-ratio — the autoscaling signal derived from MUs.
- concepts/multi-tenant-llm-capacity-allocation — the customer-facing reason MUs exist.
- concepts/prefill-decode-disaggregation — the underlying cost asymmetry MUs encode.
- concepts/kv-cache / concepts/prefix-aware-routing — prefix-caching is a model-unit modifier.
- concepts/request-concurrency-as-autoscaling-signal — the previous-generation signal MUs evolve from.
- concepts/power-of-two-choices — the previous-generation routing primitive MUs evolve from.
- concepts/token-aware-load-balancing — the structurally similar Cloudflare primitive.
- systems/databricks-axon — the router that consumes MUs.
- systems/dicer — the routing substrate; integrated with MUs as the Axon load metric.
- systems/databricks-model-serving — the parent platform.
- patterns/cost-based-load-balancing-llm
- patterns/model-units-utilization-autoscaling