PATTERN Cited by 1 source

Model-units utilization autoscaling¶

Pattern¶

Drive LLM-serving autoscaling decisions from the model-unit utilisation ratio averaged across pods, with per-replica capacity targets benchmarked per (model, hardware) pair and the same control loop running across every model, with no per-model threshold tuning.

Canonical wiki disclosure (Source: sources/2026-05-27-databricks-reliable-llm-inference-at-scale):

"Using model units, our autoscaler can decide whether to scale up or down based on the model unit utilization ratio. When the inference engine is running close to some percent of its maximum model units (determined by hardware type and workload shape), it's approaching peak throughput, which triggers scale-up. The reverse triggers scale-down. Rather than manually adjusting auto-scaling rules for each model, this approach allows for model-agnostic scaling infrastructure."

When to use it¶

LLM serving where the platform serves many models on shared infrastructure, and per-model autoscaling-rule tuning would scale poorly with the model catalog.
Workloads with non-uniform request cost (long-context, multimodal, agentic) where request_concurrency would systematically misrepresent load.
Bursty traffic shapes where >80% GPU savings vs static-peak provisioning is the operational goal.
Multi-tenant platforms where the same currency (MUs) is used for capacity allocation and for autoscaling, so allocation and scaling agree on what "load" means.
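
To make the non-uniform-cost point concrete, here is a minimal sketch comparing two pods with identical request concurrency but very different MU load. The linear cost formula, the coefficient values, and the names `Request`, `request_mus` and `pod_load` are all assumptions for illustration; the source does not disclose how model units are computed.

```python
from dataclasses import dataclass

# Hypothetical cost coefficients: the source does not disclose how model units
# are computed, so these weights are placeholders for illustration only.
MU_PER_PROMPT_TOKEN = 1.0
MU_PER_OUTPUT_TOKEN = 4.0


@dataclass(frozen=True)
class Request:
    prompt_tokens: int
    max_output_tokens: int


def request_mus(req: Request) -> float:
    """Estimated model-unit cost of one in-flight request (assumed linear)."""
    return (req.prompt_tokens * MU_PER_PROMPT_TOKEN
            + req.max_output_tokens * MU_PER_OUTPUT_TOKEN)


def pod_load(requests: list[Request]) -> tuple[int, float]:
    """Return (request_concurrency, in_flight_MUs) for one pod."""
    return len(requests), sum(request_mus(r) for r in requests)


chat_pod = [Request(200, 100)] * 4        # four short chat turns
agent_pod = [Request(30_000, 2_000)] * 4  # four long-context agentic calls

chat_concurrency, chat_mus = pod_load(chat_pod)
agent_concurrency, agent_mus = pod_load(agent_pod)

print(chat_concurrency, chat_mus)    # same concurrency ...
print(agent_concurrency, agent_mus)  # ... very different MU load
```

Both pods report a concurrency of 4, so a request_concurrency signal treats them as equally loaded, while the MU totals differ by well over an order of magnitude under these assumed weights.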

When NOT to use it¶

Single-model serving with uniform request shape — simpler request_concurrency autoscaling is correct and cheaper.
Latency-bounded workloads where direct latency-SLO autoscaling is acceptable (and the team has the operational discipline to manage SLO-direct autoscaling's lagging-signal failure mode).
Workloads without benchmarked MU coefficients (i.e. before the benchmarking pass has been done for this model on this hardware).

Structural shape¶

                ┌──────────────────────────────────┐
                │ Per-pod MU load                  │
                │ (current_in_flight_MUs / max_MUs)│
                └────────────┬─────────────────────┘
                             │   per-pod scalar
                             ▼
                ┌──────────────────────────────────┐
                │ Aggregate (average across pods)  │
                │ over short window                │
                └────────────┬─────────────────────┘
                             │   utilisation ratio
                             ▼
                ┌──────────────────────────────────┐
                │ Asymmetric thresholds:           │
                │  - util > T_up sustained → up    │
                │  - util < T_down sustained → down│
                │  - else hold                     │
                └────────────┬─────────────────────┘
                             │   replica delta
                             ▼
                ┌──────────────────────────────────┐
                │ Provisioner (Kubernetes / equiv) │
                │ spawns or terminates pods        │
                └──────────────────────────────────┘

The pattern requires four pieces:

Per-pod MU load reporter — the inference engine emits current_in_flight_MUs continuously.
Per-replica MU capacity number — max_MUs_per_replica benchmarked per (model, hardware). Derived from a load test that ramps the pod until p99 latency starts to climb, then backs off slightly. Re-derived on model / kernel / hardware changes.
Aggregator — averages MU utilisation across pods over a short window. Average dampens single-hot-pod inflation.
Threshold-pair policy (asymmetric scale-up / scale-down thresholds with hold-times), typically paired with the asymmetric aggressive-up, conservative-down pattern.
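
The four pieces above can be sketched as a single control-loop step. This is a minimal illustration under stated assumptions, not the platform's implementation: the threshold values, hold counts, the one-replica step size, and the names `Policy` and `MUAutoscaler` are all hypothetical, since the source discloses none of them.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Policy:
    t_up: float = 0.8     # assumed; the source does not disclose thresholds
    t_down: float = 0.4   # assumed
    up_hold: int = 2      # consecutive ticks above t_up before scaling up
    down_hold: int = 10   # consecutive ticks below t_down before scaling down


class MUAutoscaler:
    def __init__(self, policy: Policy, max_mus_per_replica: float):
        self.policy = policy
        self.max_mus = max_mus_per_replica  # benchmarked per (model, hardware)
        self._above = 0
        self._below = 0

    def utilisation(self, in_flight_mus_per_pod: list[float]) -> float:
        """Average of per-pod in_flight_MUs / max_MUs (needs at least one pod)."""
        return mean(mus / self.max_mus for mus in in_flight_mus_per_pod)

    def tick(self, in_flight_mus_per_pod: list[float]) -> int:
        """Return the replica delta for this loop iteration: +1, -1 or 0.

        The single-replica step is an assumption; step size is undisclosed.
        """
        util = self.utilisation(in_flight_mus_per_pod)
        self._above = self._above + 1 if util > self.policy.t_up else 0
        self._below = self._below + 1 if util < self.policy.t_down else 0
        if self._above >= self.policy.up_hold:
            self._above = 0
            return 1
        if self._below >= self.policy.down_hold:
            self._below = 0
            return -1
        return 0


scaler = MUAutoscaler(Policy(), max_mus_per_replica=1000.0)
deltas = [scaler.tick([900.0, 850.0]) for _ in range(3)]
print(deltas)
```

The short `up_hold` and long `down_hold` are one way to express aggressive-up, conservative-down behaviour: a sustained high reading triggers growth quickly, while shrinking waits for a much longer quiet period.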

Composition¶

Same MU primitive feeds patterns/cost-based-load-balancing-llm — the router reads per-pod MU load for routing; the autoscaler reads averaged MU utilisation for scaling. Same signal, two control loops.
Composes with patterns/asymmetric-aggressive-up-conservative-down-autoscaling — the threshold-pair policy that handles spiky traffic without flapping.
Below the autoscaler: model units (the cost currency) and the utilisation ratio (the signal).
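
A minimal sketch of the "same signal, two control loops" idea follows. It assumes a least-loaded routing rule, which the source does not confirm, and the function names `pick_pod` and `fleet_utilisation` and the pod names are hypothetical.

```python
def pick_pod(in_flight_mus: dict[str, float], max_mus: float) -> str:
    """Router loop: send the next request to the pod with the lowest MU load.

    Least-loaded selection is an assumption made for this sketch.
    """
    return min(in_flight_mus, key=lambda pod: in_flight_mus[pod] / max_mus)


def fleet_utilisation(in_flight_mus: dict[str, float], max_mus: float) -> float:
    """Autoscaler loop: average the same per-pod ratio across the fleet."""
    return sum(mus / max_mus for mus in in_flight_mus.values()) / len(in_flight_mus)


# One snapshot of per-pod in-flight MUs feeds both loops.
snapshot = {"pod-a": 700.0, "pod-b": 250.0, "pod-c": 550.0}
target = pick_pod(snapshot, max_mus=1000.0)
util = fleet_utilisation(snapshot, max_mus=1000.0)
print(target, util)
```

Because both functions read the same per-pod ratio, the router and the autoscaler cannot disagree about which pods are busy; they differ only in how they reduce the signal (minimum versus average).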

Quantitative payoff¶

Disclosed: >80% GPU savings vs static provisioning at peak for bursty-traffic workloads:

"Building autoscaling on top of LLM inference patterns saved us from always scaling to max replicas. For models with bursty traffic, autoscaling kept replica counts close to actual demand, translating to over 80% GPU savings compared to static provisioning at peak."

The over-80% figure is the headline argument for this pattern's value: on workloads where peak is 5-10× average, static-peak provisioning leaves most of the capacity idle outside the peak, whereas cost-based autoscaling tracks demand. The savings are realised because:

Frontier GPUs are expensive (the post calls them out as a cost-prohibitive provisioning decision).
Bursty workloads (agentic, working-hours-coupled) idle most of the time.
The platform can pack MU-budget-respecting customers into shared pods elastically.
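
The arithmetic behind the peak-to-average argument can be sketched as follows. In the idealised case (no scaling lag, no headroom buffer) the saving is 1 - average/peak, so a peak of 5× average gives 80% and 10× gives 90%. The demand trace below is made up for illustration and is not data from the source.

```python
def static_peak_savings(demand_replicas: list[int]) -> float:
    """Fraction of replica-hours saved by tracking demand instead of
    provisioning for the peak in every interval (idealised: no lag, no buffer)."""
    peak = max(demand_replicas)
    static_cost = peak * len(demand_replicas)
    elastic_cost = sum(demand_replicas)
    return 1 - elastic_cost / static_cost


# Hypothetical day: 22 quiet hours at 1 replica, 2 burst hours at 10 replicas.
hourly_demand = [1] * 22 + [10] * 2
savings = static_peak_savings(hourly_demand)
print(f"{savings:.1%}")
```

Real savings would be somewhat lower than this bound, because a reactive autoscaler keeps a minimum-replica floor and scales down conservatively.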

Trade-offs¶

Compared to…	Wins	Loses
`request_concurrency`	Correct under non-uniform cost	Coefficient-benchmarking cost; bad coefficient = bad scaling
Latency-SLO direct autoscaling	Leading signal vs lagging	Less directly aligned with SLO; needs threshold tuning
CPU%-based	Tracks actual GPU load (CPU% lies for GPU work)	n/a
Static-peak provisioning	80%+ cost savings on bursty workloads	Can't deliver guaranteed peak-burst latency without buffer
Predictive (forecast-based)	Needs no forecast model; more robust on novel traffic shapes	Reactive signal lags the ramp

Risks and mitigations¶

Coefficient drift — model updates / kernel changes / GPU changes can invalidate the per-replica capacity number. Mitigation: re-benchmark on every change.
Burst overshoots target before scale-up reacts — the same reactive-autoscaling failure mode that exists for request_concurrency. Mitigation: aggressive scale-up window; small minimum-replica floor; combine with predictive autoscaling for known-spiky periods.
Errors complete fast and drop the signal — under partial outage, MU load drops and the autoscaler scales down spuriously. Mitigation: error rate as a separate scaling-veto signal.
Single hot pod inflates the average — pod doing one expensive request looks ~saturated; average is biased high. Mitigation: trimmed mean / median statistic instead of mean.
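
Two of the mitigations above, a robust aggregate statistic and an error-rate veto on scale-down, can be sketched briefly. The trim fraction, the 5% error ceiling, and the helper names `trimmed_mean` and `allow_scale_down` are assumptions chosen for illustration.

```python
from statistics import mean, median


def trimmed_mean(values: list[float], trim_fraction: float = 0.2) -> float:
    """Mean after dropping the top and bottom trim_fraction of samples."""
    k = int(len(values) * trim_fraction)
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return mean(kept)


def allow_scale_down(error_rate: float, max_error_rate: float = 0.05) -> bool:
    """Veto scale-down while errors are elevated: failed requests complete
    quickly and make MU load look artificially low."""
    return error_rate <= max_error_rate


# One pod holds a single expensive request and looks nearly saturated.
per_pod_util = [0.30, 0.32, 0.28, 0.31, 0.98]

print(mean(per_pod_util))          # pulled upward by the hot pod
print(median(per_pod_util))        # unaffected by the outlier
print(trimmed_mean(per_pod_util))  # outlier trimmed before averaging
```

The trade-off is that a trimmed or median statistic also hides a genuinely overloaded minority of pods, so it suits fleets where the router keeps per-pod load reasonably even.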

Open questions¶

Threshold values (T_up, T_down) — not disclosed.
Window durations (sustained-up, sustained-down) — not disclosed.
Scale-up step size — adds 1 pod per loop or many? Not disclosed.
Multi-tier hardware — does one threshold work across H100, NVL72, or are thresholds per-hardware?

Seen in¶

sources/2026-05-27-databricks-reliable-llm-inference-at-scale — canonical wiki disclosure of model-unit-utilisation-ratio autoscaling on Databricks' LLM serving platform. >80% GPU savings vs static-peak provisioning for bursty workloads. Model-agnostic scaling infrastructure framing.

concepts/model-units — the load currency.
concepts/model-unit-utilization-ratio — the autoscaling signal.
concepts/request-concurrency-as-autoscaling-signal — the previous-generation signal this evolves from.
concepts/spiky-traffic — the workload shape this pattern targets.
concepts/reactive-autoscaling — the broader family.
patterns/asymmetric-aggressive-up-conservative-down-autoscaling — the threshold-policy companion.
patterns/cost-based-load-balancing-llm — the routing pattern that consumes the same primitive.
systems/databricks-axon / systems/databricks-model-serving — the platform context.