CONCEPT Cited by 1 source

Provisioned Throughput vs On-Demand (LLM serving)

Definition

Provisioned Throughput (PT) and On-Demand (OD) are the two canonical capacity primitives that a managed LLM serving substrate (Amazon Bedrock, GCP Vertex AI, Azure AI Foundry, etc.) offers:

  • Provisioned Throughput — customer reserves a fixed amount of throughput capacity (denominated in Model Units or per-provider equivalent) for a contract term (Bedrock: 1–6 months). Capacity is dedicated to the customer; performance is predictable; cost is fixed regardless of utilisation.
  • On-Demand — customer pays per token / per request / per inference call against a shared pool. Capacity is shared with other customers; performance has shared-resource variability; cost is proportional to usage.

The wiki's canonical framing comes from Slack's 2026-05-28 multi-cloud retrospective (Source: sources/2026-05-28-slack-slack-ai-the-path-to-multi-cloud):

"Bedrock introduced Provisioned Throughput (PT) and On Demand (OD) infrastructure options, allowing us to tailor compute to specific use cases. We utilized PT for interactive, latency-sensitive features like channel summaries, while leveraging OD for bursty, scheduled workloads like Recap to eliminate costs for idle compute."

Trade-off table

| Property | Provisioned Throughput | On-Demand |
| --- | --- | --- |
| Capacity model | Dedicated (reserved MUs) | Shared pool |
| Performance | Predictable; consistent latency | Shared-resource variability |
| Cost | Fixed regardless of utilisation | Proportional to usage |
| Commitment | 1–6 months (Bedrock) | None |
| Best for | Latency-sensitive, high-volume, predictable load | Bursty, scheduled, async, idle-prone |
| Failure mode | Over-provisioning during off-peak | Service-level variability + concentration risk |
| Slack canonical use | Channel summaries, AI Search | Nightly Recap, scheduled batch summarisation |

Why both exist (structural argument)

PT and OD optimise different points on the cost / performance / predictability surface. Slack's retrospective walks through the trade-offs in detail across its Phase 2 and Phase 3 discussion:

PT solves predictable-load latency consistency

  • Latency-sensitive interactive features (channel summaries, search) need consistent p99 latency to feel "snappy". PT's dedicated capacity guarantees this.
  • High base-load workloads that always have non-trivial utilisation — paying for dedicated capacity is more cost-efficient than per-token OD pricing at high consistent throughput.
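The cost argument above can be made concrete with a break-even calculation. The sketch below is illustrative only: the function names, the per-hour PT price, and the per-million-token OD price are hypothetical, and real provider pricing has more dimensions (input vs output tokens, commitment discounts).

```python
def break_even_utilisation(pt_cost_per_hour: float,
                           pt_capacity_mtok_per_hour: float,
                           od_price_per_mtok: float) -> float:
    """Return the utilisation (0..1) above which PT is cheaper than OD.

    PT costs the same every hour; OD cost scales with tokens served.
    The two are equal when: utilisation * capacity * od_price == pt_cost.
    """
    od_cost_at_full_capacity = pt_capacity_mtok_per_hour * od_price_per_mtok
    if od_cost_at_full_capacity <= 0:
        raise ValueError("capacity and OD price must be positive")
    return pt_cost_per_hour / od_cost_at_full_capacity


def cheaper_option(utilisation: float, break_even: float) -> str:
    """Pick the cheaper capacity primitive for a given average utilisation."""
    return "PT" if utilisation > break_even else "OD"


# Hypothetical numbers: PT at 40 per hour for 10M tokens/hour of capacity,
# OD at 10 per million tokens.
break_even = break_even_utilisation(40.0, 10.0, 10.0)
steady_choice = cheaper_option(0.7, break_even)   # high base load
idle_choice = cheaper_option(0.2, break_even)     # mostly idle
```

With these assumed prices the break-even sits near 40% utilisation, so a workload that keeps the reserved capacity busy most of the time favours PT, while a mostly idle one favours OD.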

PT introduces three failure modes

Slack's Phase 2 → 3 motivation enumerates three structural failure modes of PT:

  1. Over-provisioning cycle — peak-aware provisioning leads to substantial idle capacity during off-peak hours. "For features with a 10x variance between peak and off-peak hours, the efficiency gains [from OD] were substantial."
  2. Commitment lock-in — multi-month contracts slow down model upgrades.
  3. Capacity-planning friction — Slack had to "map the exact number of Model Units (MUs) required to match our SageMaker baseline across diverse traffic profiles."
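To see why the over-provisioning cycle bites at a 10x peak/off-peak variance, the following sketch estimates how much peak-sized capacity sits idle over a day. The demand profile is invented for illustration (it is not Slack's traffic), and `idle_fraction` is a hypothetical helper.

```python
def idle_fraction(hourly_demand: list[float]) -> float:
    """Fraction of reserved capacity left unused when provisioning for peak.

    Assumes capacity is sized to the single busiest hour and is paid for
    in every hour, which is the peak-aware provisioning described above.
    """
    if not hourly_demand:
        raise ValueError("need at least one hour of demand")
    provisioned = max(hourly_demand)
    if provisioned <= 0:
        raise ValueError("peak demand must be positive")
    average_used = sum(hourly_demand) / len(hourly_demand)
    return 1.0 - average_used / provisioned


# Hypothetical day: 12 peak hours at 10 units, 12 off-peak hours at 1 unit
# (a 10x variance between peak and off-peak).
demand = [10.0] * 12 + [1.0] * 12
wasted = idle_fraction(demand)  # roughly 0.45 for this profile
```

Under this assumed profile, close to half of the reserved capacity is paid for but unused, which is the gap that pay-per-use pricing removes.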

OD solves bursty / async workloads

  • 10×-variance peak/off-peak workloads benefit massively from pay-per-use.
  • Async / batch / scheduled workloads (nightly Recap) don't need the dedicated-capacity guarantee — their latency tolerance is generous.
  • Removes commitment lock-in — every model upgrade decision is per-request, not per-contract.

OD introduces three failure modes

Slack's Phase 3 → 4 motivation:

  1. Service-level variability: "OD operates on a shared-resource model, which typically carries different uptime characteristics."
  2. Regional capacity orchestration: "Success with OD relies on the cloud provider's ability to manage demand across their entire customer base in specific regions, rather than having specific hardware units explicitly reserved for Slack."
  3. Concentration risk: "Relying too heavily on a single provider's on-demand pool meant that any service-wide blip could have the potential to impact entire Slack AI features simultaneously."

The hybrid resolution

Slack's Phase 3 keeps PT for latency-sensitive features and moves bursty workloads to OD, with a spillover mechanism between them — see patterns/provisioned-throughput-with-on-demand-spillover. Verbatim:

"We engineered a Spillover Pattern: if a sudden surge pushed us beyond our reserved limits, excess requests automatically 'spilled over' to on-demand endpoints, ensuring we never dropped a request due to capacity ceilings."

This canonicalises PT-with-OD-spillover as the default hybrid posture: PT carries the floor; OD absorbs spikes; the routing layer makes the decision.
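A minimal sketch of the routing decision, assuming the reserved limit can be approximated as a request budget per fixed time window. The source does not describe how Slack's routing layer detects that it has exceeded its reserved limits, so the `SpilloverRouter` class, its counting scheme, and the endpoint labels are all hypothetical; a production router might instead react to throttling responses from the provisioned endpoint.

```python
from dataclasses import dataclass, field


@dataclass
class SpilloverRouter:
    """Send requests to PT until the window budget is spent, then to OD."""

    pt_limit_per_window: int          # assumed: requests PT can absorb per window
    window_seconds: float = 60.0      # assumed window length
    _window_start: float = field(default=0.0, init=False)
    _used: int = field(default=0, init=False)

    def route(self, now: float) -> str:
        # Start a fresh budget once the current window has elapsed.
        if now - self._window_start >= self.window_seconds:
            self._window_start = now
            self._used = 0
        # PT carries the floor while budget remains.
        if self._used < self.pt_limit_per_window:
            self._used += 1
            return "provisioned"
        # Excess requests spill over instead of being dropped.
        return "on_demand"


router = SpilloverRouter(pt_limit_per_window=2)
decisions = [router.route(now=t) for t in (0.0, 1.0, 2.0, 61.0)]
```

In this example the third request exceeds the assumed budget and spills to on-demand, and the request at 61 seconds lands in a new window and returns to provisioned capacity. Passing `now` explicitly keeps the sketch deterministic; a real caller would supply a monotonic clock reading.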

Composition with neighbouring concepts

| Concept | Relationship |
| --- | --- |
| concepts/model-units | The capacity primitive in which PT is denominated on Bedrock (OD is billed per token or per request instead). |
| concepts/multi-tenant-llm-capacity-allocation | PT is the multi-tenant-LLM-platform-side mechanism for offering predictable per-customer capacity. |
| concepts/llm-over-provisioning-cycle | The PT-specific failure mode driving the move to hybrid PT+OD. |
| concepts/llm-provider-commitment-lock-in | The PT-specific failure mode driving model-upgrade friction. |
| concepts/concentration-risk-single-cloud-llm | The OD-specific failure mode driving the move to multi-cloud. |

Seen in

  • sources/2026-05-28-slack-slack-ai-the-path-to-multi-cloud — canonical wiki disclosure of the PT/OD axis as the central capacity choice in Slack's Phase 2 (PT-only) → Phase 3 (Hybrid PT+OD with spillover) evolution; verbatim 10× peak/off-peak variance figure for OD-suitable workloads; three failure modes each side enumerated.