CONCEPT Cited by 1 source
Model unit utilization ratio¶
Definition¶
The model unit utilization ratio is the autoscaling signal Databricks uses for LLM serving: a replica's current load in model units (MUs) expressed as a fraction of the maximum MUs that replica can sustain.
When this ratio approaches a configured upper threshold (the percent of maximum MUs at which the engine nears "peak throughput"), the autoscaler scales up; when it falls below a lower threshold, it scales down.
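A minimal sketch of the threshold decision described above, assuming the ratio is current MUs divided by the replica's maximum MUs. The function name and the threshold values (0.8 and 0.3) are hypothetical; the source does not disclose the real thresholds.

```python
from enum import Enum


class Decision(Enum):
    SCALE_UP = "scale_up"
    HOLD = "hold"
    SCALE_DOWN = "scale_down"


def scaling_decision(
    current_mu: float,
    max_mu: float,
    up_threshold: float = 0.8,    # hypothetical; not disclosed in the source
    down_threshold: float = 0.3,  # hypothetical; not disclosed in the source
) -> Decision:
    """Map a model unit utilization ratio onto a scaling decision."""
    if max_mu <= 0:
        raise ValueError("max_mu must be positive")
    ratio = current_mu / max_mu
    if ratio >= up_threshold:
        return Decision.SCALE_UP
    if ratio < down_threshold:
        return Decision.SCALE_DOWN
    return Decision.HOLD
```

The band between the two thresholds is a deliberate "hold" zone in this sketch, so small fluctuations around a single cut-off do not cause the replica count to flap.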
Verbatim disclosure (Source: sources/2026-05-27-databricks-reliable-llm-inference-at-scale):
"Using model units, our autoscaler can decide whether to scale up or down based on the model unit utilization ratio. When the inference engine is running close to some percent of its maximum model units (determined by hardware type and workload shape), it's approaching peak throughput, which triggers scale-up. The reverse triggers scale-down. Rather than manually adjusting auto-scaling rules for each model, this approach allows for model-agnostic scaling infrastructure."
Why this signal, not RPS or concurrency¶
Four load metrics could in principle drive an LLM serving autoscaler:
| Signal | Lies under | Why |
|---|---|---|
| CPU % | GPU-bound load | CPU% can be low even when the GPU is at the latency cliff |
| RPS / request_count | Variable request shape | A pod doing 10 long-context requests is at higher load than one doing 100 short ones |
| request_concurrency | Variable request shape (less so) | Better than RPS, but still treats a 5,000-token request the same as a 50-token request |
| Model unit utilisation | None of the above | Per-request cost is folded into the load value at admission time |
The structural advantage: a pod's non-uniform request cost is folded into the metric, not into the autoscaling rule. The pod emits its current MU load (a single scalar that already accounts for prefill/decode/prefix-cache/multi-modality contributions); the autoscaler reads it and acts.
This is structurally one rung above request_concurrency: both are reactive signals on in-flight work, but request_concurrency treats every request as cost 1, whereas MU utilisation treats each request at its true cost in MUs. For LLM serving, where request cost varies non-linearly, MU utilisation is the correct evolution.
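The difference between the two signals can be illustrated with a small sketch. The linear cost form and both weights below are assumptions made for illustration; they do not reproduce the benchmarked coefficients or the actual MU cost function, which the source does not spell out here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    prompt_tokens: int
    output_tokens: int


# Hypothetical weights, for illustration only.
PREFILL_WEIGHT = 0.001  # assumed MU cost per prompt token
DECODE_WEIGHT = 0.01    # assumed MU cost per output token


def request_mu(req: Request) -> float:
    """Assumed linear per-request cost in model units."""
    return PREFILL_WEIGHT * req.prompt_tokens + DECODE_WEIGHT * req.output_tokens


def concurrency_load(in_flight: list[Request]) -> int:
    """request_concurrency: every in-flight request counts as 1."""
    return len(in_flight)


def mu_load(in_flight: list[Request]) -> float:
    """MU load: every in-flight request counts at its MU cost."""
    return sum(request_mu(r) for r in in_flight)


# Ten long-context requests on one pod, a hundred short ones on another.
long_pod = [Request(prompt_tokens=5000, output_tokens=500)] * 10
short_pod = [Request(prompt_tokens=50, output_tokens=20)] * 100
```

Under these assumed weights, `concurrency_load` ranks the short-request pod as busier (100 in-flight requests against 10), while `mu_load` ranks the long-context pod as busier, which is the ordering the table's RPS and request_concurrency rows get wrong.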
The "model-agnostic" property¶
The post calls this approach "model-agnostic scaling infrastructure." The structural property:
- For each new model deployed on the platform, the team runs benchmarking once to derive the MU coefficients (α, β, γ) and the max-MUs-per-replica capacity number.
- After that, the same autoscaling control loop runs across every model — same code, same thresholds, same behaviour. The only thing that changes per-model is the per-replica capacity number.
Compare to a per-model request_concurrency autoscaler, which would
need a different per-pod target concurrency for each model — the
pod-target derivation has to be redone per model and updated whenever
the request shape changes. With MUs, the per-pod number is "max MUs
per replica", which is derived from hardware (GPU class) more
than from model architecture; the model-side knowledge is folded into
the per-request MU cost function.
This is the load-bearing reason the abstraction earns its implementation cost — infrastructure-team knobs collapse from N-models × M-thresholds to M-thresholds.
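One way to picture the knob collapse is a single shared policy plus one capacity number per model. This is a sketch under stated assumptions: the class names, model names, capacity values, and thresholds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    """Shared by every model; threshold values are hypothetical."""
    up_threshold: float = 0.8
    down_threshold: float = 0.3


@dataclass(frozen=True)
class ModelProfile:
    """The only per-model input: capacity from a one-time benchmark."""
    name: str
    max_mu_per_replica: float


def decide(policy: ScalingPolicy, profile: ModelProfile,
           avg_mu_per_replica: float) -> str:
    """Run the same control logic for any model."""
    ratio = avg_mu_per_replica / profile.max_mu_per_replica
    if ratio >= policy.up_threshold:
        return "scale_up"
    if ratio < policy.down_threshold:
        return "scale_down"
    return "hold"


POLICY = ScalingPolicy()
small = ModelProfile(name="model-a", max_mu_per_replica=100.0)  # hypothetical
large = ModelProfile(name="model-b", max_mu_per_replica=400.0)  # hypothetical
```

With the same 90 MUs of average load, `decide(POLICY, small, 90.0)` asks for scale-up and `decide(POLICY, large, 90.0)` asks for scale-down; no threshold was edited, only the capacity number differs.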
Composition with policy shape¶
- Aggregation: average across pods (the same shape as the request_concurrency-averaged-across-pods design from the Superhuman 2026-05-08 post). Resists single-hot-pod inflation.
- Asymmetric windows: paired with an asymmetric aggressive-up, conservative-down policy for spiky workloads.
- Bursty-traffic optimisation: the post discloses >80% GPU savings vs static-peak provisioning for bursty workloads on this autoscaler — "For models with bursty traffic, autoscaling kept replica counts close to actual demand, translating to over 80% GPU savings compared to static provisioning at peak."
- Same primitive feeds the router: MU load per server is what Axon reads for routing decisions; averaged-across-pods MU utilisation is what the autoscaler reads for scaling decisions. Same signal, two control loops: the same architectural property the concepts/request-concurrency-as-autoscaling-signal page notes for request_concurrency + P2C, here generalised to MUs + Axon.
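The aggregation and asymmetric-window items above could compose roughly as follows. This is a hedged sketch, not the disclosed implementation: the class name, thresholds, and window lengths are hypothetical, and the source does not publish the real window durations.

```python
from collections import deque
from statistics import mean


class AsymmetricAutoscaler:
    """Scale up on a short window, scale down only after a long quiet window.

    All thresholds and window lengths here are hypothetical.
    """

    def __init__(self, up_threshold=0.8, down_threshold=0.3,
                 up_window=1, down_window=5):
        if not 1 <= up_window <= down_window:
            raise ValueError("require 1 <= up_window <= down_window")
        self.up_threshold = up_threshold
        self.down_threshold = down_threshold
        self.up_window = up_window
        self.history = deque(maxlen=down_window)

    def observe(self, pod_ratios: list[float]) -> str:
        """Record one sample of per-pod MU utilization ratios and decide."""
        # Aggregation: average the ratio across pods.
        self.history.append(mean(pod_ratios))
        recent = list(self.history)[-self.up_window:]
        # Aggressive up: react as soon as the short window is hot.
        if len(recent) == self.up_window and min(recent) >= self.up_threshold:
            return "scale_up"
        # Conservative down: require the whole long window to be quiet.
        if (len(self.history) == self.history.maxlen
                and max(self.history) < self.down_threshold):
            return "scale_down"
        return "hold"
```

In this sketch a single hot sample is enough to scale up, while scale-down waits until every sample in the longer window is below the lower threshold, so one recent spike keeps capacity in place for several more samples.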
Risks and mitigations (carried over from request-concurrency)¶
The same failure modes apply — the cost-currency change doesn't fix them, just renormalises them:
- MU lag during a sharp ramp → aggressive scale-up window.
- MU drops under partial outage (errors complete fast) → treat error rate as a separate signal.
- A single hot pod inflates the average → robust-statistic alternatives or per-pod ceilings.
- MU = 0 across all pods during quiet → minimum-replica floor + scale-down hold-time.
A new failure mode unique to the MU formulation:
- Bad coefficient calibration — if α, β, γ are stale (model updated, kernels changed), the MU-load-on-pod is biased and the autoscaler scales to the wrong target. Mitigation: re-benchmark on every model/kernel/hardware change.
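The hot-pod mitigation listed above (robust statistics or per-pod ceilings) can be sketched with the standard library. The function name and the 0.95 ceiling are hypothetical; the source does not say which alternative, if any, is used in production.

```python
from statistics import mean, median


def fleet_signal(pod_ratios: list[float], per_pod_ceiling: float = 0.95) -> dict:
    """Summarise per-pod MU utilization ratios three ways.

    per_pod_ceiling is a hypothetical value, not one from the source.
    """
    return {
        "mean": mean(pod_ratios),      # pulled upward by one hot pod
        "median": median(pod_ratios),  # unaffected by one hot pod
        "pods_over_ceiling": [
            i for i, r in enumerate(pod_ratios) if r >= per_pod_ceiling
        ],
    }


# Four quiet pods and one saturated pod.
signal = fleet_signal([0.2, 0.2, 0.2, 0.2, 1.0])
```

Here the mean is pulled well above the median by the one saturated pod, and the ceiling check names that pod directly so it can be handled separately from the fleet-wide scaling decision.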
Open questions¶
- Threshold values (the percent of max MUs that triggers scale-up / scale-down) — not disclosed.
- Hold-time and aggregation window durations — not disclosed.
- Behaviour around prefix-cache invalidation — if cache-hit-rate collapses (e.g. on cache eviction), per-request MU costs spike; does the autoscaler respond, and how fast?
- Multi-tier model fleet — does the same threshold apply across hardware classes (H100, NVL72) or are thresholds per-hardware?
Seen in¶
- sources/2026-05-27-databricks-reliable-llm-inference-at-scale
— first canonical wiki disclosure of
model unit utilization ratio as the autoscaling signal across Databricks' LLM serving fleet. >80% GPU savings vs static-peak on bursty workloads. Model-agnostic infrastructure framing.
Related¶
- concepts/model-units — the load-currency concept this signal is built on.
- concepts/request-concurrency-as-autoscaling-signal — the previous-generation signal this evolves from.
- concepts/reactive-autoscaling — the broader family.
- concepts/spiky-traffic — the workload shape that motivates LLM autoscaling.
- concepts/non-uniform-llm-request-cost — the structural property the signal accommodates.
- patterns/model-units-utilization-autoscaling — the productionised pattern.
- patterns/asymmetric-aggressive-up-conservative-down-autoscaling — composes for spiky traffic.
- systems/databricks-axon — the router that reads the same per-server MU load Axon's autoscaler aggregates.
- systems/databricks-model-serving — the parent platform.