
CONCEPT Cited by 1 source

Model unit utilization ratio

Definition

The model unit utilization ratio is the autoscaling signal Databricks uses for LLM serving:

utilization = current_model_units_in_flight / max_model_units_per_replica

When the ratio approaches a configured upper threshold (the fraction of maximum model units at which the engine nears peak throughput), the autoscaler scales up; when it falls below a lower threshold, it scales down.
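The ratio and the two-threshold decision can be sketched as follows. This is a minimal illustration, not Databricks' implementation: the names `Thresholds`, `utilization`, and `decide` are hypothetical, and the 0.80 / 0.40 values are placeholders, since the source does not disclose the real thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Placeholder values (assumption); the source does not disclose them.
    scale_up: float = 0.80
    scale_down: float = 0.40


def utilization(in_flight_mu: float, max_mu_per_replica: float) -> float:
    """Ratio of in-flight model units to one replica's benchmarked capacity."""
    if max_mu_per_replica <= 0:
        raise ValueError("max_mu_per_replica must be positive")
    return in_flight_mu / max_mu_per_replica


def decide(ratio: float, thresholds: Thresholds = Thresholds()) -> str:
    """Map a utilization ratio onto a scaling action."""
    if ratio >= thresholds.scale_up:
        return "scale_up"
    if ratio < thresholds.scale_down:
        return "scale_down"
    return "hold"
```

The gap between the two thresholds acts as a dead band, so a ratio hovering around a single value does not cause the replica count to flap.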

Verbatim disclosure (Source: sources/2026-05-27-databricks-reliable-llm-inference-at-scale):

"Using model units, our autoscaler can decide whether to scale up or down based on the model unit utilization ratio. When the inference engine is running close to some percent of its maximum model units (determined by hardware type and workload shape), it's approaching peak throughput, which triggers scale-up. The reverse triggers scale-down. Rather than manually adjusting auto-scaling rules for each model, this approach allows for model-agnostic scaling infrastructure."

Why this signal, not RPS or concurrency

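Three conventional load metrics could in principle drive an LLM serving autoscaler. Each misreports load under some condition; model unit utilisation is listed last for comparison: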

| Signal | Lies under | Why |
| --- | --- | --- |
| CPU % | GPU-bound load | CPU% can be low even when the GPU is at the latency cliff |
| RPS / request_count | Variable request shape | A pod doing 10 long-context requests is at higher load than one doing 100 short ones |
| request_concurrency | Variable request shape (less so) | Better than RPS, but still treats a 5,000-token request the same as a 50-token request |
| Model unit utilisation | None of the above | Per-request cost is folded into the load value at admission time |

The structural advantage: a pod's non-uniform request cost is folded into the metric, not into the autoscaling rule. The pod emits its current MU load (a single scalar that already accounts for prefill/decode/prefix-cache/multi-modality contributions); the autoscaler reads it and acts.

This places it one rung above request_concurrency: both are reactive signals on in-flight work, but request_concurrency counts every request as cost 1, whereas MU utilisation weights each request by its benchmarked cost in MUs. Because LLM request cost varies non-linearly with request shape, MU utilisation is the more accurate of the two signals.
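A small sketch makes the difference concrete. The cost function here is an assumption for illustration only: a linear stand-in with hypothetical weights (`PREFILL_WEIGHT`, `DECODE_WEIGHT`, `FIXED_OVERHEAD`), not the benchmarked MU formula from the source, which this page does not reproduce.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    prompt_tokens: int
    output_tokens: int


# Hypothetical weights (assumption); real coefficients come from benchmarking.
PREFILL_WEIGHT = 1.0   # MU per prompt token
DECODE_WEIGHT = 4.0    # MU per generated token
FIXED_OVERHEAD = 50.0  # MU charged once per request


def mu_cost(req: Request) -> float:
    """Illustrative per-request cost, computed at admission time."""
    return (PREFILL_WEIGHT * req.prompt_tokens
            + DECODE_WEIGHT * req.output_tokens
            + FIXED_OVERHEAD)


def concurrency_load(in_flight: Sequence[Request]) -> int:
    """request_concurrency: every request counts as 1."""
    return len(in_flight)


def mu_load(in_flight: Sequence[Request]) -> float:
    """MU load: every request counts at its estimated cost."""
    return sum(mu_cost(r) for r in in_flight)


pod_long = [Request(prompt_tokens=5000, output_tokens=500)] * 10
pod_short = [Request(prompt_tokens=50, output_tokens=20)] * 100
```

Under these assumed weights, request_concurrency reports `pod_short` as ten times busier (100 vs 10), while the MU load reports `pod_long` as the heavier pod (70,500 vs 18,000 MUs), which matches the RPS row of the table above.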

The "model-agnostic" property

The post calls this approach "model-agnostic scaling infrastructure." The structural property:

  • For each new model deployed on the platform, the team runs benchmarking once to derive the MU coefficients (α, β, γ) and the max-MUs-per-replica capacity number.
  • After that, the same autoscaling control loop runs across every model — same code, same thresholds, same behaviour. The only thing that changes per-model is the per-replica capacity number.

Compare to a per-model request_concurrency autoscaler, which would need a different per-pod target concurrency for each model — the pod-target derivation has to be redone per model and updated whenever the request shape changes. With MUs, the per-pod number is "max MUs per replica", which is derived from hardware (GPU class) more than from model architecture; the model-side knowledge is folded into the per-request MU cost function.

This is the main reason the abstraction earns its implementation cost: the infrastructure team's tuning surface collapses from N models × M thresholds to M shared thresholds, plus one benchmarked capacity number per model.
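The split between shared control logic and per-model data can be sketched like this. It uses a target-tracking formula rather than the two-threshold trigger the post describes, purely to keep the example short; `ModelProfile`, `TARGET_UTILIZATION`, and `desired_replicas` are hypothetical names, and 0.75 is a placeholder value.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    name: str
    max_mu_per_replica: float  # benchmarked once per model (per-model data)


# One shared setting for every model (placeholder value, assumption).
TARGET_UTILIZATION = 0.75


def desired_replicas(total_in_flight_mu: float, profile: ModelProfile) -> int:
    """Replicas needed to keep average utilization at or below the target."""
    usable_mu_per_replica = profile.max_mu_per_replica * TARGET_UTILIZATION
    return max(1, math.ceil(total_in_flight_mu / usable_mu_per_replica))


# The same function serves both models; only the profile differs.
small = ModelProfile(name="small-model", max_mu_per_replica=1000.0)
large = ModelProfile(name="large-model", max_mu_per_replica=4000.0)
```

Onboarding a new model in this sketch means adding one `ModelProfile`; the control function and its target stay untouched.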

Composition with policy shape

  • Aggregation: average across pods (the same shape as the request_concurrency-averaged-across-pods design from the Superhuman 2026-05-08 post). Resists single-hot-pod inflation.
  • Asymmetric windows: paired with an aggressive scale-up, conservative scale-down policy for spiky workloads.
  • Bursty-traffic optimisation: the post discloses >80% GPU savings vs static-peak provisioning for bursty workloads on this autoscaler — "For models with bursty traffic, autoscaling kept replica counts close to actual demand, translating to over 80% GPU savings compared to static provisioning at peak."
  • Same primitive feeds the router: MU load per server is what Axon reads for routing decisions; averaged-across-pods MU utilisation is what the autoscaler reads for scaling decisions. Same signal, two control loops — the same architectural property the concepts/request-concurrency-as-autoscaling-signal page notes for request_concurrency + P2C, here generalised to MUs + Axon.
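The first two bullets (average across pods, asymmetric windows) can be combined in one small sketch. Window lengths and thresholds are placeholders, since the post discloses neither, and `AsymmetricAutoscaler` is a hypothetical name rather than anything from the source.

```python
from collections import deque
from statistics import fmean
from typing import Sequence


class AsymmetricAutoscaler:
    """Scale up on a short window of high load, down on a long window of low load."""

    def __init__(self, up_threshold: float = 0.8, down_threshold: float = 0.4,
                 up_window: int = 3, down_window: int = 12) -> None:
        # All four defaults are placeholder values (assumption).
        self.up_threshold = up_threshold
        self.down_threshold = down_threshold
        self.up_window = up_window
        self.history = deque(maxlen=down_window)

    def observe(self, per_pod_utilization: Sequence[float]) -> str:
        # Aggregate by averaging across pods, then record the sample.
        self.history.append(fmean(per_pod_utilization))
        recent = list(self.history)[-self.up_window:]
        # Aggressive up: only the last few samples must all be high.
        if len(recent) == self.up_window and min(recent) >= self.up_threshold:
            return "scale_up"
        # Conservative down: the entire long window must be low.
        window_full = len(self.history) == self.history.maxlen
        if window_full and max(self.history) < self.down_threshold:
            return "scale_down"
        return "hold"
```

With `up_window=2` and `down_window=4`, two consecutive high samples trigger a scale-up, but a scale-down waits until four consecutive low samples have pushed every high sample out of the history.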

Risks and mitigations (carried over from request-concurrency)

The same failure modes apply — the cost-currency change doesn't fix them, just renormalises them:

  • MU lag during a sharp ramp → aggressive scale-up window.
  • MU drops under partial outage (errors complete fast) → treat error rate as a separate signal.
  • A single hot pod inflates the average → robust-statistic alternatives or per-pod ceilings.
  • MU = 0 across all pods during quiet → minimum-replica floor + scale-down hold-time.
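Two of these mitigations are easy to sketch: a robust aggregate for the hot-pod case and a replica floor for the quiet case. The reading of "per-pod ceilings" as clipping each pod's contribution before averaging is an interpretation, not something the source spells out, and `aggregate` and `apply_floor` are hypothetical helpers.

```python
from statistics import fmean, median
from typing import Optional, Sequence


def aggregate(per_pod_utilization: Sequence[float],
              per_pod_ceiling: Optional[float] = None,
              robust: bool = False) -> float:
    """Aggregate per-pod utilization, optionally clipped and/or via the median."""
    values = list(per_pod_utilization)
    if per_pod_ceiling is not None:
        # Cap each pod's contribution so one outlier cannot dominate the mean.
        values = [min(v, per_pod_ceiling) for v in values]
    return median(values) if robust else fmean(values)


def apply_floor(desired_replicas: int, min_replicas: int = 2) -> int:
    """Never scale below the floor, even when MU load is zero everywhere."""
    return max(desired_replicas, min_replicas)


pods = [0.3, 0.3, 0.3, 2.5]  # one hot pod reporting far above the rest
```

For `pods`, the plain mean is about 0.85, which would cross a typical scale-up threshold, while the median stays at 0.3 and the mean clipped at 1.0 comes to about 0.475.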

A new failure mode unique to the MU formulation:

  • Bad coefficient calibration: if α, β, γ are stale (model updated, kernels changed), the per-pod MU load each pod reports is biased and the autoscaler scales to the wrong target. Mitigation: re-benchmark on every model, kernel, or hardware change.

Open questions

  • Threshold values (the percent of max MUs that triggers scale-up / scale-down) — not disclosed.
  • Hold-time and aggregation window durations — not disclosed.
  • Behaviour around prefix-cache invalidation — if cache-hit-rate collapses (e.g. on cache eviction), per-request MU costs spike; does the autoscaler respond, and how fast?
  • Multi-tier model fleet — does the same threshold apply across hardware classes (H100, NVL72) or are thresholds per-hardware?

Seen in

  • sources/2026-05-27-databricks-reliable-llm-inference-at-scale
