CONCEPT Cited by 1 source

Multi-tenant LLM capacity allocation (VM-equivalent guarantees)

Definition

Multi-tenant LLM capacity allocation is the architectural problem of providing predictable, contractable capacity to individual customers on a shared LLM-serving substrate — analogous to how cloud VMs offer predictable per-customer compute capacity, in contrast to best-effort multi-tenant serving where capacity may be silently clawed back when other tenants spike.

The Databricks-coined framing (Source: sources/2026-05-27-databricks-reliable-llm-inference-at-scale):

"Such estimations are structurally imperfect, but they serve as a way for us to break a multi-tenant system into something more manageable that resembles cloud VMs. VMs have the desirable property of offering predictable performance that can be allocated to specific customers. For production agentic workloads, it's important to offer guarantees around low latency and capacity, and without such allocation systems, the best we can do is offer 'best-effort' capacity that could be clawed back if too many customers use the system."

The structural problem

Multi-tenant LLM serving on shared GPU infrastructure has a tension between three goals:

| Goal | Argument for | Argument against |
|---|---|---|
| High utilisation | GPUs are expensive; idle capacity wastes money | Tight packing means a tenant's bursts squeeze others |
| Per-customer SLO | Production agentic workloads need predictable p95 TTFT and OPTS | Reserving fixed capacity per customer means much idle capacity |
| Operational simplicity | One pool of replicas serving all customers is easy to operate | Hard to debug per-customer issues; noisy-neighbor problems |

Cloud VMs solve the analogous problem in classical compute by carving the host into per-customer slices — each VM gets a guaranteed minimum (CPU shares, RAM, IOPS), and the workload scheduler enforces it. The customer gets a predictable capacity contract.

LLM serving has historically not worked that way: customers are admitted into a shared pool sized by aggregate demand, with QoS enforced at the load-balancer level (rate limits per tenant) but no per-customer capacity guarantee at the GPU level.

Why "best-effort" is structurally inadequate for agentic workloads

The post is explicit that best-effort is insufficient for production agentic workloads:

"For production agentic workloads, it's important to offer guarantees around low latency and capacity."

The structural reason is that agentic workloads have characteristics that amplify variance:

  • Long-context coding agents: each turn replays a long prompt history; if the cache hit rate drops or the request is routed to a cold pod, p95 TTFT explodes.
  • Multi-step reasoning: the customer's product latency is the sum of several LLM calls; tail latency on any single call surfaces in the user experience.
  • Burstiness coupled to working hours: many agents' demand spikes during the same working hours (Figure 1 of the source — "dramatic spikes within hours"); without per-customer reservation, every spiky customer competes for the same pool.

Best-effort multi-tenant capacity that "could be clawed back if too many customers use the system" is incompatible with these properties: the customer can't build a reliable agent on top of serving capacity that disappears when other customers are busy.
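The multi-step point can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that each LLM call independently has a 5% chance of landing above its p95 latency; the function name is hypothetical, and in a contended pool slow calls tend to be correlated, so this is a simplification rather than a model of any real system.

```python
def p_any_call_slow(n_calls: int, per_call_quantile: float = 0.95) -> float:
    """Probability that at least one of n independent calls lands above
    the given per-call latency quantile (e.g. above p95)."""
    return 1.0 - per_call_quantile ** n_calls


for n in (1, 5, 10, 20):
    print(f"{n:>2} calls: {p_any_call_slow(n):.1%} chance of at least one tail-latency call")
```

Under that independence assumption, an agent turn made of ten calls sees at least one tail-latency call roughly 40% of the time, which is why a per-call p95 that looks acceptable in isolation can dominate the end-user experience.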

VM-equivalence as the design target

The Databricks design target: each customer's capacity contract is a number of model units per minute, plus a latency SLO (e.g. p95 TTFT < 500 ms, OPTS > 100 t/s). The platform guarantees the model-unit budget, just like a VM guarantees CPU shares.
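As a minimal sketch of what such a contract could look like as data, the following uses a hypothetical `CapacityContract` type. The field names, the nearest-rank percentile, and the choice to compare mean output throughput against the floor are all assumptions made for illustration; the source does not describe how the contract is represented or evaluated.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class CapacityContract:
    # Hypothetical fields; default thresholds mirror the examples in the text.
    customer: str
    model_units_per_min: float
    p95_ttft_ms_max: float = 500.0
    output_tokens_per_s_min: float = 100.0


def p95(samples: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def meets_slo(contract: CapacityContract,
              ttft_ms: Sequence[float],
              output_tps: Sequence[float]) -> bool:
    """True if observed p95 TTFT and mean output throughput satisfy the contract."""
    mean_tps = sum(output_tps) / len(output_tps)
    return (p95(ttft_ms) < contract.p95_ttft_ms_max
            and mean_tps > contract.output_tokens_per_s_min)


contract = CapacityContract(customer="acme", model_units_per_min=600.0)
print(meets_slo(contract, ttft_ms=[200.0] * 19 + [900.0], output_tps=[120.0, 110.0]))
```

Keeping the budget and the latency targets in one record reflects the VM analogy: the reservation and the performance promise are parts of the same contract.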

The structural pieces needed:

  • A unit of account: model units themselves. Capacity cannot be allocated in requests/min because requests vary in cost.
  • An admission controller that knows each customer's allocation and admits up to the budget, rejecting (or throttling) above it. "Requests go through rate limiting before reaching the data plane."
  • A capacity planner that ensures Σ(allocated MUs/min across customers) ≤ Σ(provisioned MUs/min across replicas) — minus some headroom for spikes within budget.
  • An autoscaler that reacts to aggregate utilisation (MU utilisation ratio) and grows the pool when total demand approaches total supply.
  • A router (Axon) that respects per-customer affinity (sticky sessions) so cache locality and blast radius are bounded per customer.

All five pieces are denominated in the same currency (MUs) so that allocation, admission, scheduling, scaling, and routing all agree on what "load" means.
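The admission, planning, and scaling pieces above can be sketched with all quantities in MUs. Everything here is hypothetical: the class and function names, the token-bucket mechanism, the 20% headroom, and the 70% target utilisation are illustrative assumptions, not details disclosed by the source.

```python
import math
from typing import Mapping, Sequence


class MUBucket:
    """Per-customer token bucket whose tokens are model units (MUs)."""

    def __init__(self, mu_per_min: float, now: float = 0.0):
        self.mu_per_min = mu_per_min
        self.tokens = mu_per_min  # start with one minute of budget
        self.last = now

    def try_admit(self, cost_mu: float, now: float) -> bool:
        # Refill in proportion to elapsed seconds, capped at one minute of budget.
        elapsed = max(0.0, now - self.last)
        self.last = max(self.last, now)
        self.tokens = min(self.mu_per_min,
                          self.tokens + elapsed * self.mu_per_min / 60.0)
        if cost_mu <= self.tokens:
            self.tokens -= cost_mu
            return True
        return False  # caller rejects or throttles the request


def plan_is_feasible(allocated_mu_per_min: Mapping[str, float],
                     replica_mu_per_min: Sequence[float],
                     headroom: float = 0.2) -> bool:
    """Check sum(allocations) <= sum(provisioned capacity) minus headroom."""
    supply = sum(replica_mu_per_min) * (1.0 - headroom)
    return sum(allocated_mu_per_min.values()) <= supply


def replicas_needed(demand_mu_per_min: float,
                    per_replica_mu_per_min: float,
                    target_utilisation: float = 0.7) -> int:
    """Replica count that keeps aggregate MU utilisation at or below the target."""
    return math.ceil(demand_mu_per_min / (per_replica_mu_per_min * target_utilisation))


bucket = MUBucket(mu_per_min=600.0)
print(bucket.try_admit(500.0, now=0.0))   # within budget
print(bucket.try_admit(200.0, now=0.0))   # exceeds what is left this instant
print(bucket.try_admit(200.0, now=30.0))  # admitted after 30 s of refill
print(plan_is_feasible({"a": 300.0, "b": 400.0}, [500.0, 500.0]))
print(replicas_needed(750.0, per_replica_mu_per_min=100.0))
```

Because the bucket, the feasibility check, and the scaling rule all take MUs as input, a change in how one request is costed propagates consistently through admission, planning, and scaling.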

Why request-count-based allocation is structurally insufficient

A simple request-count allocation (e.g. "this customer gets 1,000 requests/min") fails because request cost varies non-linearly with the input/output token shape. A customer spending their 1,000 requests/min on cheap, short autocomplete calls consumes far less compute than a customer spending the same 1,000 requests/min on long-context coding-agent calls. Either:

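  • The platform sizes each request slot for the worst-case shape, so it sells fewer requests than the hardware could serve and wastes capacity on cheap customers, or
  • The platform sizes each request slot for the average shape, so it promises more than it can serve and customers with expensive shapes blow the budget.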

Model units absorb the variance directly: a customer with X MUs/min gets X MUs/min worth of compute, regardless of request shape. Cheap customers fit more requests in their budget; expensive customers fit fewer; the platform's compute spend is bounded.
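To illustrate how far apart two request shapes can sit under the same request count, the sketch below uses a deliberately simplified, hypothetical cost function: a linear term per thousand tokens plus a quadratic term in prompt length. The source does not disclose how model units are computed, so the coefficients and token counts are made up purely to show the effect.

```python
import math


def request_cost_mu(input_tokens: int, output_tokens: int) -> float:
    """Hypothetical MU cost of one request; coefficients are illustrative only."""
    k_in = input_tokens / 1000
    k_out = output_tokens / 1000
    return 0.1 * k_in + 0.4 * k_out + 0.01 * k_in ** 2


def requests_in_budget(budget_mu_per_min: float,
                       input_tokens: int, output_tokens: int) -> int:
    """How many requests of a given shape fit in a per-minute MU budget."""
    return math.floor(budget_mu_per_min / request_cost_mu(input_tokens, output_tokens))


autocomplete = (500, 50)        # short prompt, short completion
coding_agent = (50_000, 1_000)  # long replayed history, longer completion

for name, shape in (("autocomplete", autocomplete), ("coding agent", coding_agent)):
    print(f"{name}: {request_cost_mu(*shape):.4f} MU/request, "
          f"{requests_in_budget(1000.0, *shape)} requests in a 1000 MU/min budget")
```

With these made-up coefficients the two shapes differ in cost by more than two orders of magnitude, so a shared "1,000 requests/min" limit would describe very different amounts of compute, while a 1,000 MU/min budget bounds both customers to the same spend.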

Adjacent designs (for comparison)

  • AWS Bedrock Provisioned Throughput — sells per-customer model units at a fixed dollar amount, with the customer reserving capacity ahead of time. Conceptually similar VM-like framing for LLM serving capacity, with explicit pricing.
  • OpenAI / Anthropic API tiers — best-effort with per-tier rate limits; closer to the "best-effort" shape Databricks is arguing against.
  • Cloudflare Workers AI — best-effort with auto-scaling at the endpoint level; per-customer reservation not exposed.
  • Cloud VM compute — the analogue Databricks names directly.

Open questions

  • Pricing model — is the customer billed per-MU-reserved (like reserved instances) or per-MU-consumed (like on-demand) or both? Not disclosed.
  • Burst-above-allocation policy — does the platform allow customers to use more MUs/min than reserved when capacity is available? Not disclosed.
  • Migration from best-effort — how does an existing customer migrate from a request-count rate limit to an MU-based reservation? Not described.
  • Public capacity contract surface — is the MU number exposed to customers, or is it an internal allocation that gets translated into request-shape commitments at the API boundary? Not disclosed.
