PATTERN Cited by 1 source

Model-units utilization autoscaling

Pattern

Drive LLM-serving autoscaling decisions from the model-unit (MU) utilization ratio averaged across pods, with per-replica capacity targets benchmarked per (model, hardware) pair and the same control loop running for every model, so that no per-model threshold tuning is needed.

Canonical wiki disclosure (Source: sources/2026-05-27-databricks-reliable-llm-inference-at-scale):

"Using model units, our autoscaler can decide whether to scale up or down based on the model unit utilization ratio. When the inference engine is running close to some percent of its maximum model units (determined by hardware type and workload shape), it's approaching peak throughput, which triggers scale-up. The reverse triggers scale-down. Rather than manually adjusting auto-scaling rules for each model, this approach allows for model-agnostic scaling infrastructure."

When to use it

  • LLM serving where the platform serves many models on shared infrastructure, and per-model autoscaling-rule tuning would scale poorly with the model catalog.
  • Workloads with non-uniform request cost (long-context, multimodal, agentic) where request_concurrency would systematically misrepresent load.
  • Bursty traffic shapes where large GPU savings over static-peak provisioning are the operational goal (the source reports >80%).
  • Multi-tenant platforms where the same currency (MUs) is used for capacity allocation and for autoscaling, so allocation and scaling agree on what "load" means.
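
To make the non-uniform-cost point concrete, here is a minimal sketch of two pods that look identical under request_concurrency but carry very different MU load. The source does not disclose how MUs are computed, so the linear per-token cost and both coefficients below are purely hypothetical, as are the names Request, request_mus, and pod_load.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int
    output_tokens: int


def request_mus(req: Request,
                mu_per_prompt_token: float = 0.001,
                mu_per_output_token: float = 0.01) -> float:
    """Hypothetical linear MU cost; both coefficients are made up for illustration."""
    return (req.prompt_tokens * mu_per_prompt_token
            + req.output_tokens * mu_per_output_token)


def pod_load(requests: list[Request]) -> tuple[int, float]:
    """Return (request_concurrency, in-flight MUs) for one pod."""
    return len(requests), sum(request_mus(r) for r in requests)


# Two pods, each holding 8 in-flight requests.
short_chat = [Request(prompt_tokens=200, output_tokens=100)] * 8
long_context = [Request(prompt_tokens=50_000, output_tokens=500)] * 8

print(pod_load(short_chat))    # same concurrency as the pod below ...
print(pod_load(long_context))  # ... but a far larger MU load
```

Under these assumed coefficients both pods report a concurrency of 8, while the long-context pod carries roughly 45 times the MU load. That gap is what a concurrency-based signal would miss.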

When NOT to use it

  • Single-model serving with uniform request shape — simpler request_concurrency autoscaling is correct and cheaper.
  • Latency-bounded workloads where direct latency-SLO autoscaling is acceptable (and the team has the operational discipline to manage SLO-direct autoscaling's lagging-signal failure mode).
  • Workloads without benchmarked MU coefficients (i.e. before the benchmarking pass has been done for this model and hardware).

Structural shape

                ┌──────────────────────────────────┐
                │ Per-pod MU load                  │
                │ (current_in_flight_MUs / max_MUs)│
                └────────────┬─────────────────────┘
                             │   per-pod scalar
                ┌──────────────────────────────────┐
                │ Aggregate (average across pods)  │
                │ over short window                │
                └────────────┬─────────────────────┘
                             │   utilization ratio
                ┌──────────────────────────────────┐
                │ Asymmetric thresholds:           │
                │  - util > T_up sustained → up    │
                │  - util < T_down sustained → down│
                │  - else hold                     │
                └────────────┬─────────────────────┘
                             │   replica delta
                ┌──────────────────────────────────┐
                │ Provisioner (Kubernetes / equiv) │
                │ spawns or terminates pods        │
                └──────────────────────────────────┘
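
The flow in the diagram can be sketched as a small control loop. This is an illustrative sketch only: the threshold values, the hold-tick counts, the one-replica step size, and the names PodSample, fleet_utilization, and ThresholdPolicy are all assumptions, since the source discloses none of them.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class PodSample:
    in_flight_mus: float  # current_in_flight_MUs reported by the engine
    max_mus: float        # benchmarked capacity for this (model, hardware) pair


def fleet_utilization(pods: list[PodSample]) -> float:
    """Average of the per-pod in_flight / max ratios."""
    return mean(p.in_flight_mus / p.max_mus for p in pods)


class ThresholdPolicy:
    """Act only after a threshold has been breached for N consecutive ticks."""

    def __init__(self, t_up=0.8, t_down=0.4, up_ticks=2, down_ticks=6):
        # All four defaults are assumed values, not disclosed ones.
        # Fewer ticks to scale up than down gives the asymmetric behaviour.
        self.t_up, self.t_down = t_up, t_down
        self.up_ticks, self.down_ticks = up_ticks, down_ticks
        self._above = 0  # consecutive ticks with util > t_up
        self._below = 0  # consecutive ticks with util < t_down

    def decide(self, util: float) -> int:
        """Return the replica delta for this tick: +1, -1, or 0 (hold)."""
        self._above = self._above + 1 if util > self.t_up else 0
        self._below = self._below + 1 if util < self.t_down else 0
        if self._above >= self.up_ticks:
            self._above = 0
            return 1
        if self._below >= self.down_ticks:
            self._below = 0
            return -1
        return 0


policy = ThresholdPolicy()
util = fleet_utilization([PodSample(90, 100), PodSample(85, 100)])
print(policy.decide(util), policy.decide(util))  # hold first, then scale up
```

The returned delta would be handed to whatever provisioner is in use. Because nothing in the loop is model-specific apart from max_mus, the same policy object could in principle serve every model.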

The pattern requires four pieces:

  1. Per-pod MU load reporter: the inference engine emits current_in_flight_MUs continuously.
  2. Per-replica MU capacity number: max_MUs_per_replica, benchmarked per (model, hardware) pair. It is derived from a load test that ramps the pod until p99 latency starts to climb, then backs off slightly, and it is re-derived on model, kernel, or hardware changes.
  3. Aggregator: averages MU utilization across pods over a short window. Averaging dampens the effect of a single hot pod.
  4. Threshold-pair policy: asymmetric scale-up and scale-down thresholds with hold times, typically aggressive on scale-up and conservative on scale-down.
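
The capacity-benchmarking step in item 2 could look roughly like the sketch below. The measure_p99_ms callable is a hypothetical stand-in for a real load-test harness. The definition of "starts to climb" (p99 above 1.5x baseline) and the 10% back-off are assumptions chosen for illustration, not values from the source.

```python
from typing import Callable


def benchmark_max_mus(
    measure_p99_ms: Callable[[float], float],  # hypothetical: run one load step, return p99 in ms
    start_mus: float = 10.0,
    step_mus: float = 10.0,
    limit_mus: float = 10_000.0,
    climb_factor: float = 1.5,  # assumed: "climbing" means p99 > 1.5x the baseline
    backoff: float = 0.9,       # assumed: keep 90% of the last healthy load level
) -> float:
    """Ramp MU load until p99 latency climbs, then back off slightly."""
    baseline = measure_p99_ms(start_mus)
    last_healthy = start_mus
    load = start_mus + step_mus
    while load <= limit_mus:
        if measure_p99_ms(load) > climb_factor * baseline:
            break  # latency knee reached; stop ramping
        last_healthy = load
        load += step_mus
    return last_healthy * backoff


def fake_p99(load_mus: float) -> float:
    """Toy latency curve: flat up to 200 MUs, then sharply worse."""
    return 100.0 if load_mus <= 200 else 400.0


print(benchmark_max_mus(fake_p99))  # about 180 for this toy curve
```

In practice the result would be stored per (model, hardware) pair and recomputed whenever the model, kernels, or hardware change.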

Quantitative payoff

Disclosed: >80% GPU savings vs static provisioning at peak for bursty-traffic workloads:

"Building autoscaling on top of LLM inference patterns saved us from always scaling to max replicas. For models with bursty traffic, autoscaling kept replica counts close to actual demand, translating to over 80% GPU savings compared to static provisioning at peak."

The 80% figure is the headline argument for this pattern: on workloads where peak is 5-10× average, static-peak provisioning leaves most capacity idle outside the peak, whereas MU-based autoscaling tracks demand. The savings are realised because:

  • Frontier GPUs are expensive (the post flags provisioning them as cost-prohibitive).
  • Bursty workloads (agentic, working-hours-coupled) idle most of the time.
  • The platform can pack MU-budget-respecting customers into shared pods elastically.
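
The arithmetic behind a figure of this size can be checked with a small sketch. The demand profile below is hypothetical. It also assumes the autoscaler tracks demand perfectly, so the result is an upper bound that ignores headroom and scale-up lag.

```python
def gpu_savings(hourly_demand_replicas: list[int]) -> float:
    """Fraction of GPU-hours saved by tracking demand instead of holding peak all day."""
    static_gpu_hours = max(hourly_demand_replicas) * len(hourly_demand_replicas)
    autoscaled_gpu_hours = sum(hourly_demand_replicas)
    return 1 - autoscaled_gpu_hours / static_gpu_hours


# Hypothetical day: 2 replicas for 22 hours, 16 replicas during a 2-hour burst.
demand = [2] * 22 + [16] * 2
print(f"peak/average = {max(demand) / (sum(demand) / len(demand)):.1f}x")
print(f"savings      = {gpu_savings(demand):.1%}")
```

For this profile the peak is about 5× the average and the ideal savings come out near 80%, consistent with the relationship savings ≈ 1 − average/peak.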

Trade-offs

Compared to…                   | Wins                                    | Loses
-------------------------------|-----------------------------------------|-------------------------------------------------------------
request_concurrency            | Correct under non-uniform cost          | Coefficient-benchmarking cost; bad coefficient = bad scaling
Latency-SLO direct autoscaling | Leading signal vs lagging               | Less directly aligned with SLO; needs threshold tuning
CPU%-based                     | Always correct (CPU% lies for GPU work) | n/a
Static-peak provisioning       | 80%+ cost savings on bursty workloads   | Can't deliver guaranteed peak-burst latency without buffer
Predictive (forecast-based)    | Always correct (reactive lags ramp)     | Less accurate on novel traffic shapes; needs forecast model

Risks and mitigations

  • Coefficient drift — model updates / kernel changes / GPU changes can invalidate the per-replica capacity number. Mitigation: re-benchmark on every change.
  • Burst overshoots target before scale-up reacts — the same reactive-autoscaling failure mode that exists for request_concurrency. Mitigation: aggressive scale-up window; small minimum-replica floor; combine with predictive autoscaling for known-spiky periods.
  • Errors complete fast and drop the signal — under partial outage, MU load drops and the autoscaler scales down spuriously. Mitigation: error rate as a separate scaling-veto signal.
  • Single hot pod inflates the average — pod doing one expensive request looks ~saturated; average is biased high. Mitigation: trimmed mean / median statistic instead of mean.
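
Two of the mitigations above, a robust aggregate in place of the plain mean and an error-rate veto on scale-down, can be sketched as follows. The 10% trim fraction and the 5% error-rate cutoff are assumed values, and the function names are illustrative.

```python
from statistics import mean, median


def trimmed_mean(values: list[float], trim_fraction: float = 0.1) -> float:
    """Mean after dropping the lowest and highest trim_fraction of samples."""
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)  # samples to drop at each end
    kept = ordered[k:len(ordered) - k] if k else ordered
    return mean(kept)


def allow_scale_down(error_rate: float, max_error_rate: float = 0.05) -> bool:
    """Veto scale-down while errors are elevated; the 5% cutoff is assumed."""
    return error_rate <= max_error_rate


# Nine pods near 30% utilization plus one pod saturated by a single expensive request.
utils = [0.30, 0.32, 0.28, 0.31, 0.29, 0.30, 0.33, 0.27, 0.30, 0.98]
print(round(mean(utils), 3), round(trimmed_mean(utils), 3), round(median(utils), 3))
print(allow_scale_down(error_rate=0.20))  # elevated errors block scale-down
```

Here the plain mean is pulled up to about 0.37 by the one hot pod, while the trimmed mean and the median stay near 0.30. Which statistic suits a given fleet depends on pod count and how often single requests saturate a pod.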

Open questions

  • Threshold values (T_up, T_down) — not disclosed.
  • Window durations (sustained-up, sustained-down) — not disclosed.
  • Scale-up step size — adds 1 pod per loop or many? Not disclosed.
  • Multi-tier hardware — does one threshold work across H100, NVL72, or are thresholds per-hardware?
