CONCEPT Cited by 1 source
Provisioned Throughput vs On-Demand (LLM serving)¶
Definition¶
Provisioned Throughput (PT) and On-Demand (OD) are the two canonical capacity primitives that a managed LLM serving substrate (Amazon Bedrock, GCP Vertex AI, Azure AI Foundry, etc.) offers:
- Provisioned Throughput: the customer reserves a fixed amount of throughput capacity (denominated in Model Units or a per-provider equivalent) for a contract term (on Bedrock, 1-month or 6-month commitments). Capacity is dedicated to the customer, performance is predictable, and cost is fixed regardless of utilisation.
- On-Demand: the customer pays per token, per request, or per inference call against a shared pool. Capacity is shared with other customers, performance is subject to shared-resource variability, and cost is proportional to usage.
The wiki's canonical framing comes from Slack's 2026-05-28 multi-cloud retrospective (Source: sources/2026-05-28-slack-slack-ai-the-path-to-multi-cloud):
"Bedrock introduced Provisioned Throughput (PT) and On Demand (OD) infrastructure options, allowing us to tailor compute to specific use cases. We utilized PT for interactive, latency-sensitive features like channel summaries, while leveraging OD for bursty, scheduled workloads like Recap to eliminate costs for idle compute."
Trade-off table¶
| Property | Provisioned Throughput | On-Demand |
|---|---|---|
| Capacity model | Dedicated (reserved MUs) | Shared pool |
| Performance | Predictable; consistent latency | Shared-resource variability |
| Cost | Fixed regardless of utilisation | Proportional to usage |
| Commitment | 1 month or 6 months (Bedrock) | None |
| Best for | Latency-sensitive, high-volume, predictable load | Bursty, scheduled, async, idle-prone |
| Failure mode | Over-provisioning during off-peak | Service-level variability + concentration risk |
| Slack canonical use | Channel summaries, AI Search | Nightly Recap, scheduled batch summarisation |
Why both exist (structural argument)¶
PT and OD optimise different points on the cost / performance / predictability surface. Slack's account of Phase 2 and Phase 3 walks through what each option solves and what it costs:
PT solves predictable-load latency consistency¶
- Latency-sensitive interactive features (channel summaries, search) need consistent p99 latency to feel "snappy". PT's dedicated capacity makes that latency predictable, because requests do not compete with other customers for the same pool.
- High base-load workloads keep reserved capacity busy, so paying a fixed price for dedicated capacity can be more cost-efficient than per-token OD pricing once throughput is consistently high.
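The cost argument in the second bullet can be made concrete with a break-even calculation. The following is a minimal sketch, not a pricing model from the source: the function names, the per-million-token OD price, the monthly PT cost, and the capacity figure are all hypothetical values chosen for illustration.

```python
def od_cost(tokens: float, od_price_per_million: float) -> float:
    """Usage-proportional cost: pay only for tokens actually processed."""
    return tokens / 1_000_000 * od_price_per_million


def break_even_utilisation(
    pt_fixed_cost: float,
    pt_capacity_tokens: float,
    od_price_per_million: float,
) -> float:
    """Fraction of reserved capacity that must be used before PT matches OD on cost."""
    od_cost_at_full_capacity = od_cost(pt_capacity_tokens, od_price_per_million)
    return pt_fixed_cost / od_cost_at_full_capacity


def cheaper_option(
    utilisation: float,
    pt_fixed_cost: float,
    pt_capacity_tokens: float,
    od_price_per_million: float,
) -> str:
    """Compare the fixed PT bill with the OD bill for the same traffic."""
    od_bill = od_cost(utilisation * pt_capacity_tokens, od_price_per_million)
    return "PT" if pt_fixed_cost <= od_bill else "OD"


# Hypothetical figures: a reservation costing 6000 per month that can serve
# 1 billion tokens per month, against an OD price of 10 per million tokens.
PT_COST = 6000.0
PT_CAPACITY = 1_000_000_000
OD_PRICE = 10.0

threshold = break_even_utilisation(PT_COST, PT_CAPACITY, OD_PRICE)
```

With these assumed numbers the threshold sits at 60% utilisation: a workload that keeps the reservation busier than that is cheaper on PT, and one that runs well below it is cheaper on OD. Real provider pricing will move the threshold, but the shape of the comparison stays the same.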
PT introduces three failure modes¶
Slack's Phase 2 → 3 motivation enumerates three structural failure modes of PT:
- Over-provisioning cycle — peak-aware provisioning leads to substantial idle capacity during off-peak hours. "For features with a 10x variance between peak and off-peak hours, the efficiency gains [from OD] were substantial."
- Commitment lock-in — multi-month contracts slow down model upgrades.
- Capacity-planning friction — Slack had to "map the exact number of Model Units (MUs) required to match our SageMaker baseline across diverse traffic profiles."
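The over-provisioning cycle and the capacity-planning step can be illustrated together. This is a rough sketch under stated assumptions: the per-MU throughput figure and the traffic profile are invented for illustration (the source gives only the 10x variance), and `model_units_needed` and `idle_fraction` are hypothetical helper names rather than anything Slack or Bedrock exposes.

```python
import math
from typing import Sequence


def model_units_needed(peak_tokens_per_min: float, tokens_per_min_per_mu: float) -> int:
    """Peak-aware sizing: reserve enough whole MUs to cover the busiest minute."""
    return math.ceil(peak_tokens_per_min / tokens_per_min_per_mu)


def idle_fraction(hourly_load: Sequence[float], reserved_capacity: float) -> float:
    """Share of reserved capacity that goes unused across the load profile."""
    used = sum(min(load, reserved_capacity) for load in hourly_load)
    return 1.0 - used / (reserved_capacity * len(hourly_load))


# Assumed throughput of one Model Unit (illustrative, not a Bedrock figure).
TOKENS_PER_MIN_PER_MU = 30_000

# Assumed daily profile with 10x variance: 8 peak hours, 16 off-peak hours.
profile = [100_000] * 8 + [10_000] * 16

mus = model_units_needed(max(profile), TOKENS_PER_MIN_PER_MU)
reserved = mus * TOKENS_PER_MIN_PER_MU
idle = idle_fraction(profile, reserved)
```

Under these assumptions, sizing for the peak requires 4 MUs, and about two thirds of the reserved capacity sits idle over a day. Part of that waste comes from rounding up to whole MUs and the rest from the off-peak trough, which is the portion OD pricing removes.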
OD solves bursty / async workloads¶
- Workloads with a 10× variance between peak and off-peak hours stop paying for idle reserved capacity; under pay-per-use, cost tracks actual traffic.
- Async / batch / scheduled workloads (nightly Recap) don't need the dedicated-capacity guarantee — their latency tolerance is generous.
- Removes commitment lock-in — every model upgrade decision is per-request, not per-contract.
OD introduces three failure modes¶
Slack's Phase 3 → 4 motivation:
- Service-level variability — "OD operates on a shared-resource model, which typically carries different uptime characteristics."
- Regional capacity orchestration — "Success with OD relies on the cloud provider's ability to manage demand across their entire customer base in specific regions, rather than having specific hardware units explicitly reserved for Slack."
- Concentration risk — "Relying too heavily on a single provider's on-demand pool meant that any service-wide blip could have the potential to impact entire Slack AI features simultaneously."
The hybrid resolution¶
Slack's Phase 3 keeps PT for latency-sensitive features and moves bursty workloads to OD, with a spillover mechanism between them — see patterns/provisioned-throughput-with-on-demand-spillover. Verbatim:
"We engineered a Spillover Pattern: if a sudden surge pushed us beyond our reserved limits, excess requests automatically 'spilled over' to on-demand endpoints, ensuring we never dropped a request due to capacity ceilings."
This canonicalises PT-with-OD-spillover as the default hybrid posture: PT carries the floor; OD absorbs spikes; the routing layer makes the decision.
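The routing decision can be sketched as a small budget check. This is only an illustration: the source does not say how Slack's routing layer detects that reserved limits have been exceeded, so the per-window token budget below is an assumption (a router could equally react to throttling responses from the PT endpoint), and `SpilloverRouter` is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass
class SpilloverRouter:
    """Send traffic to PT until the reserved budget for the window is spent, then to OD."""

    pt_tokens_per_window: int  # reserved capacity per window (assumed unit)
    used_in_window: int = 0

    def route(self, request_tokens: int) -> str:
        # PT carries the floor: use reserved capacity while the request still fits.
        if self.used_in_window + request_tokens <= self.pt_tokens_per_window:
            self.used_in_window += request_tokens
            return "PT"
        # OD absorbs the spike: the request is redirected rather than dropped.
        return "OD"

    def start_new_window(self) -> None:
        """Reset the budget at the start of each accounting window."""
        self.used_in_window = 0


router = SpilloverRouter(pt_tokens_per_window=1000)
decisions = [router.route(tokens) for tokens in (400, 400, 400, 100)]
```

In this example the third request would exceed the 1000-token budget and spills to OD, while the smaller fourth request still fits in the remaining reserved capacity and goes to PT. A production router would also need to handle concurrency and failures on either endpoint, which this sketch leaves out.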
Composition with neighbouring concepts¶
| Concept | Relationship |
|---|---|
| concepts/model-units | The capacity primitive denominating PT on Bedrock; OD is metered per token or per request instead. |
| concepts/multi-tenant-llm-capacity-allocation | PT is the mechanism a multi-tenant LLM platform uses to offer predictable per-customer capacity. |
| concepts/llm-over-provisioning-cycle | The PT-specific failure mode driving the move to hybrid PT+OD. |
| concepts/llm-provider-commitment-lock-in | The PT-specific failure mode driving model-upgrade friction. |
| concepts/concentration-risk-single-cloud-llm | The OD-specific failure mode driving the move to multi-cloud. |
Seen in¶
- sources/2026-05-28-slack-slack-ai-the-path-to-multi-cloud — canonical wiki disclosure of the PT/OD axis as the central capacity choice in Slack's Phase 2 (PT-only) → Phase 3 (Hybrid PT+OD with spillover) evolution; verbatim 10× peak/off-peak variance figure for OD-suitable workloads; three failure modes each side enumerated.
Related¶
- concepts/llm-over-provisioning-cycle
- concepts/llm-provider-commitment-lock-in
- concepts/model-units
- concepts/multi-tenant-llm-capacity-allocation
- concepts/concentration-risk-single-cloud-llm
- systems/amazon-bedrock
- systems/slack-ai
- patterns/provisioned-throughput-with-on-demand-spillover
- patterns/multi-cloud-llm-serving