PATTERN Cited by 1 source
Provisioned Throughput with On-Demand spillover
Pattern
Reserve dedicated Provisioned Throughput (PT) capacity for high-volume, latency-sensitive features that need consistent performance, route bursty / asynchronous workloads to On-Demand (OD) capacity to eliminate idle costs, and automatically spill excess PT requests over to OD when demand exceeds reserved limits — so the system never drops a request due to capacity ceilings.
The canonical wiki implementation: Slack's Phase 3 Hybrid Routing on Amazon Bedrock (Source: sources/2026-05-28-slack-slack-ai-the-path-to-multi-cloud), verbatim:
"We didn't simply flip a switch and move everything to On-Demand. To balance efficiency with a premium user experience, we implemented a Hybrid Routing strategy. We kept high-volume, latency-sensitive features on dedicated capacity (Provisioned Throughput) to ensure a consistent 'snappy' feel. Simultaneously, we moved asynchronous, bursty workloads – like nightly Recaps – to On-Demand capacity. To bridge the gap, we engineered a Spillover Pattern: if a sudden surge pushed us beyond our reserved limits, excess requests automatically 'spilled over' to on-demand endpoints, ensuring we never dropped a request due to capacity ceilings."
When to use it
- Mixed workload portfolio — some features are latency-sensitive with predictable load (PT-favourable), others are bursty / async / scheduled (OD-favourable).
- PT-only is over-provisioned for off-peak — see concepts/llm-over-provisioning-cycle.
- OD-only loses latency consistency — shared-pool variability is intolerable for the snappy-feeling features.
- Capacity-ceiling-driven request drops are unacceptable — the spillover specifically prevents 429s on PT exhaustion.
- Provider supports both primitives — Bedrock is canonical; GCP Vertex AI offers similar primitives; custom-built equivalent on self-hosted infra possible but more expensive.
When NOT to use it
- Workload is uniformly latency-sensitive and predictable — pure PT is simpler and the spillover never triggers.
- Workload is uniformly bursty and async — pure OD is simpler.
- Provider doesn't expose both primitives: fall back to whichever is available.
- Provider's spillover semantics are unclear — DIY spillover at the routing layer is feasible but requires health-aware routing already in place.
Three structural pieces
              ┌────────────────────────────┐
              │       Routing Layer        │
              │  - Workload-class tagging  │
              │  - PT capacity check       │
              │  - Spillover policy        │
              └─────────────┬──────────────┘
                            │
        ┌───────────────────┼───────────────────┐
        │                   │                   │
        ▼                   ▼                   ▼
    ┌────────┐          ┌────────┐          ┌────────┐
    │   PT   │          │   PT   │  ─────▶  │   OD   │
    │ slot 1 │   ...    │ slot N │ overflow │  pool  │
    │  for   │          │  for   │          │        │
    │ snappy │          │ snappy │          │  for   │
    │feature │          │feature │          │ bursty │
    └────────┘          └────────┘          └────────┘
   (latency-sensitive, predictable)       (async, bursty,
                                           spillover dest)
The pattern requires:
- Workload-class tagging: every feature is tagged as PT-default or OD-default at deployment time.
- PT capacity tracking: the routing layer knows each PT allocation's model-unit (MU) budget and tracks in-flight utilisation against it.
- Spillover policy: a defined trigger (e.g. "PT MU utilisation > 95% AND queue depth > X") that routes the next request to OD instead of letting it wait on PT.
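The three pieces above can be sketched as a small routing function. This is a minimal illustration, not Slack's implementation: the class names, the `search_answers` feature, and the threshold values are all assumptions made for the example, and a real router would also have to update in-flight MU as requests start and finish.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    PT = "provisioned"
    OD = "on_demand"


@dataclass
class PTAllocation:
    """Hypothetical tracker for one PT allocation's model-unit (MU) budget."""
    mu_budget: float
    mu_in_flight: float = 0.0

    def utilisation(self) -> float:
        return self.mu_in_flight / self.mu_budget


@dataclass
class SpilloverPolicy:
    """Illustrative trigger: high utilisation AND a backed-up queue."""
    max_utilisation: float = 0.95   # assumed threshold
    max_queue_depth: int = 10       # assumed value for "X"

    def should_spill(self, alloc: PTAllocation, queue_depth: int) -> bool:
        return (alloc.utilisation() > self.max_utilisation
                and queue_depth > self.max_queue_depth)


# Workload-class tagging: each feature declares its default tier at deploy time.
# Feature names are placeholders.
FEATURE_DEFAULTS = {"search_answers": Tier.PT, "nightly_recap": Tier.OD}


def route(feature: str, alloc: PTAllocation, queue_depth: int,
          policy: SpilloverPolicy) -> Tier:
    """Pick the capacity tier for the next request of `feature`."""
    if FEATURE_DEFAULTS[feature] is Tier.OD:
        return Tier.OD
    if policy.should_spill(alloc, queue_depth):
        return Tier.OD  # spill over instead of waiting on PT
    return Tier.PT
```

Keeping the policy as a separate object means the trigger can be tuned, or replaced with a different signal, without touching the tagging or the capacity tracker.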
Two routes from PT-only to PT+OD-with-spillover
Slack's Phase 2 → Phase 3 path
- Pre-PT: SageMaker self-managed (Phase 1).
- PT-only: Move to Bedrock PT (Phase 2). Latency consistency improves, but the over-provisioning cycle and commitment lock-in impose an efficiency tax.
- PT + OD with spillover: Add OD for bursty features. Spillover absorbs PT excess. "For features with a 10x variance between peak and off-peak hours, the efficiency gains were substantial."
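To see why a 10x peak-to-off-peak variance matters, it helps to work out how much of a peak-sized reservation is actually used. The sketch below does that arithmetic; the load figures and the eight-hour peak window are hypothetical, chosen only to illustrate the shape of the calculation, and are not taken from Slack's post.

```python
def pt_utilisation(peak_mu: float, off_peak_mu: float,
                   peak_hours: float, reserved_mu: float) -> float:
    """Fraction of reserved MU-hours actually used over a 24-hour day."""
    off_peak_hours = 24.0 - peak_hours
    # Demand above the reservation cannot be served by PT, so cap it.
    used = (min(peak_mu, reserved_mu) * peak_hours
            + min(off_peak_mu, reserved_mu) * off_peak_hours)
    return used / (reserved_mu * 24.0)


# Hypothetical feature: 100 MU at peak for 8 hours, 10 MU otherwise,
# with PT reserved at the peak level.
utilisation = pt_utilisation(peak_mu=100.0, off_peak_mu=10.0,
                             peak_hours=8.0, reserved_mu=100.0)
idle_fraction = 1.0 - utilisation
```

Under these assumed numbers the reservation is used for (800 + 160) of 2400 MU-hours, i.e. 40%, leaving 60% of the paid capacity idle. That idle share is what moving such a feature to OD removes.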
Alternative paths (not Slack's)
- OD-first → add PT as load grows — start cheap, reserve capacity for features whose load reaches predictable-PT scale.
- PT-only forever — if the workload portfolio is uniformly latency-sensitive, hybrid adds no value.
Trade-offs
| Compared to… | Wins | Loses |
|---|---|---|
| PT-only | Eliminates off-peak idle cost on bursty workloads; breaks commitment lock-in for OD-served features | Operational complexity of managing two capacity tiers + spillover logic |
| OD-only | Latency consistency for snappy features; predictable-cost floor | Cost does not fall when total demand drops; PT is still paid for during low-utilisation periods |
| Per-feature single-tier | Simpler routing | Either pays PT idle cost on bursty features or accepts OD variability for snappy ones |
| Multi-cloud LLM serving | Smaller operational tax; routing stays within one provider | No concentration-risk reduction or per-feature model binding on its own; the two can be composed, with PT+OD-spillover inside each cloud |
What spillover is NOT
- Not a hot/cold cache tier — both PT and OD serve every request type; the choice is per-request based on capacity, not per-data-class.
- Not a queue depth absorber — spillover routes the request to a different capacity tier; it doesn't queue inside PT.
- Not a reliability fallback — a failed PT call doesn't automatically spill over to OD; degradation is handled by the circuit breaker + model-fallback hierarchy.
- Not multi-cloud — spillover happens within one cloud's PT and OD tiers. Cross-cloud routing is the multi-cloud LLM serving pattern, which composes around spillover.
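The capacity-versus-reliability distinction can be made concrete with a short sketch. The function and exception names here are hypothetical; the point is only that the tier is chosen before the call, from capacity alone, and that a PT failure is left to propagate to whatever fallback layer wraps the router.

```python
class PTCallFailed(Exception):
    """Hypothetical error for a PT call that was accepted but then failed."""


def call_with_spillover(request, pt_has_capacity, call_pt, call_od):
    """Choose the tier before the call; never reroute after a failure."""
    if not pt_has_capacity():
        return call_od(request)  # capacity ceiling reached: spill over
    # No try/except here on purpose: a failed PT call is a health event,
    # handled by the circuit breaker and model-fallback layer, not by spillover.
    return call_pt(request)
```

Wrapping `call_pt` in a retry-to-OD handler would turn spillover into a reliability fallback, which is exactly the conflation the list above warns against.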
Composition with other patterns
- patterns/multi-cloud-llm-serving — composes outside; PT+OD with spillover happens inside each cloud's endpoint.
- patterns/cost-based-load-balancing-llm — composes inside the OD pool to route MU-weighted load. Slack doesn't describe their MU-load-balancing internals; the principle generalises.
- patterns/model-fallback-hierarchy-with-circuit-breaker — composes orthogonally; spillover handles capacity-ceiling events, fallback handles quality / health degradation.
Risks and mitigations
- OD pool saturation when spillover fires industry-wide: in Slack's concentration-risk framing, when many customers spill over simultaneously, the shared OD pool degrades. Mitigation: multi-cloud routing as the next layer.
- Cost surprise from prolonged spillover — bursty traffic becomes structural and PT capacity is under-sized. Mitigation: monitor spillover ratio; resize PT.
- Latency regression on spillover — OD endpoints have shared-resource variability vs PT's dedicated nature; users see latency change when spillover fires. Mitigation: keep spillover threshold high so it's a safety net, not the steady-state.
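The "monitor spillover ratio; resize PT" mitigation could be implemented with a sliding-window counter along these lines. This is a sketch under assumptions: the class name, the window size, and the 5% alert threshold are illustrative choices, not values from the source.

```python
from collections import deque


class SpilloverMonitor:
    """Tracks the share of PT-default requests that spilled to OD
    over the most recent `window` requests."""

    def __init__(self, window: int = 1000, alert_ratio: float = 0.05):
        self.events = deque(maxlen=window)  # True means the request spilled
        self.alert_ratio = alert_ratio      # assumed threshold

    def record(self, spilled: bool) -> None:
        self.events.append(spilled)

    def ratio(self) -> float:
        return sum(self.events) / len(self.events) if self.events else 0.0

    def should_resize_pt(self) -> bool:
        """A sustained high ratio suggests PT is under-sized."""
        return self.ratio() > self.alert_ratio
```

A ratio that stays above the threshold across many windows indicates that the burst has become structural, which is the signal to revisit the PT reservation rather than keep paying for prolonged spillover.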
Seen in
- sources/2026-05-28-slack-slack-ai-the-path-to-multi-cloud — canonical wiki disclosure of the PT+OD-with-spillover pattern as Slack's Phase 3 Hybrid Routing strategy on Amazon Bedrock, resolving the concepts/llm-over-provisioning-cycle and concepts/llm-provider-commitment-lock-in failure modes while preserving latency consistency for snappy features. Verbatim "never dropped a request due to capacity ceilings" framing.
Related
- concepts/provisioned-throughput-vs-on-demand-llm
- concepts/llm-over-provisioning-cycle
- concepts/llm-provider-commitment-lock-in
- concepts/model-units
- concepts/multi-tenant-llm-capacity-allocation
- systems/amazon-bedrock
- systems/slack-ai
- systems/slack-intelligent-routing-layer
- patterns/multi-cloud-llm-serving
- patterns/cost-based-load-balancing-llm