PATTERN Cited by 1 source

Provisioned Throughput with On-Demand spillover¶

Pattern¶

Reserve dedicated Provisioned Throughput (PT) capacity for high-volume, latency-sensitive features that need consistent performance, and route bursty or asynchronous workloads to On-Demand (OD) capacity to eliminate idle cost. When demand on PT exceeds the reserved limit, automatically spill the excess requests over to OD, so that the system never drops a request because of a capacity ceiling.

The canonical wiki implementation: Slack's Phase 3 Hybrid Routing on Amazon Bedrock (Source: sources/2026-05-28-slack-slack-ai-the-path-to-multi-cloud), verbatim:

"We didn't simply flip a switch and move everything to On-Demand. To balance efficiency with a premium user experience, we implemented a Hybrid Routing strategy. We kept high-volume, latency-sensitive features on dedicated capacity (Provisioned Throughput) to ensure a consistent 'snappy' feel. Simultaneously, we moved asynchronous, bursty workloads – like nightly Recaps – to On-Demand capacity. To bridge the gap, we engineered a Spillover Pattern: if a sudden surge pushed us beyond our reserved limits, excess requests automatically 'spilled over' to on-demand endpoints, ensuring we never dropped a request due to capacity ceilings."

When to use it¶

Mixed workload portfolio — some features are latency-sensitive with predictable load (PT-favourable), others are bursty / async / scheduled (OD-favourable).
PT-only is over-provisioned for off-peak — see concepts/llm-over-provisioning-cycle.
OD-only loses latency consistency — shared-pool variability is intolerable for the snappy-feeling features.
Capacity-ceiling-driven request drops are unacceptable — the spillover specifically prevents 429s on PT exhaustion.
Provider supports both primitives — Bedrock is canonical; GCP Vertex AI offers similar primitives; custom-built equivalent on self-hosted infra possible but more expensive.

When NOT to use it¶

Workload is uniformly latency-sensitive and predictable — pure PT is simpler and the spillover never triggers.
Workload is uniformly bursty and async — pure OD is simpler.
Provider doesn't expose both primitives — fallback to whichever is available.
Provider's spillover semantics are unclear — DIY spillover at the routing layer is feasible but requires health-aware routing already in place.

Three structural pieces¶

                ┌────────────────────────────┐
                │    Routing Layer           │
                │  - Workload-class tagging  │
                │  - PT capacity check       │
                │  - Spillover policy        │
                └─────────┬──────────────────┘
                          │
        ┌─────────────────┼──────────────────┐
        │                 │                  │
        ▼                 ▼                  ▼
   ┌────────┐         ┌────────┐         ┌────────┐
   │   PT   │         │   PT   │  ─────▶ │   OD   │
   │ slot 1 │  ...    │ slot N │ overflow│  pool  │
   │ for    │         │ for    │         │        │
   │snappy  │         │snappy  │         │ for    │
   │feature │         │feature │         │ bursty │
   └────────┘         └────────┘         └────────┘
   (latency-sensitive, predictable)      (async, bursty,
                                          spillover dest)

The pattern requires:

Workload-class tagging — every feature configures itself as PT-default or OD-default at deployment time.
PT capacity tracking — the routing layer knows the model-unit (MU) budget of each PT allocation and tracks in-flight utilisation against it.
Spillover policy — defined trigger (e.g. "PT MU utilisation > 95% AND queue depth > X") to route the next request to OD instead of waiting.
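
The three pieces above can be sketched as a small routing decision. This is a minimal illustration, not Slack's implementation: the names `PtAllocation`, `SpilloverPolicy`, `choose_tier` and the feature name `search_answers` are hypothetical, and the queue-depth threshold of 10 stands in for the unspecified "X" in the example trigger.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    PT = "provisioned"
    OD = "on_demand"


@dataclass
class PtAllocation:
    """Hypothetical routing-layer view of one PT allocation."""
    capacity_units: float          # reserved budget, e.g. in model units
    in_flight_units: float = 0.0   # units consumed by requests in flight

    def utilisation(self) -> float:
        return self.in_flight_units / self.capacity_units


@dataclass
class SpilloverPolicy:
    max_utilisation: float = 0.95  # mirrors the "> 95%" example trigger
    max_queue_depth: int = 10      # assumed value for the unspecified "X"

    def should_spill(self, alloc: PtAllocation, queue_depth: int) -> bool:
        return (alloc.utilisation() > self.max_utilisation
                and queue_depth > self.max_queue_depth)


# Workload-class tagging: each feature declares its default tier at deploy time.
FEATURE_DEFAULT_TIER = {
    "search_answers": Tier.PT,  # hypothetical latency-sensitive feature
    "nightly_recap": Tier.OD,   # bursty, asynchronous
}


def choose_tier(feature: str, alloc: PtAllocation, queue_depth: int,
                policy: SpilloverPolicy) -> Tier:
    """Pick the capacity tier for the next request of `feature`."""
    if FEATURE_DEFAULT_TIER[feature] is Tier.OD:
        return Tier.OD
    if policy.should_spill(alloc, queue_depth):
        return Tier.OD  # spill over instead of waiting on PT
    return Tier.PT
```

Keeping the trigger in its own policy object means the threshold can be tuned without touching the per-feature tagging.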

Two routes from PT-only to PT+OD-with-spillover¶

Slack's Phase 2 → Phase 3 path¶

Pre-PT: SageMaker self-managed (Phase 1).
PT-only: Move to Bedrock PT (Phase 2). Latency becomes consistent, but the over-provisioning cycle and commitment lock-in impose an efficiency tax.
PT + OD with spillover: Add OD for bursty features. Spillover absorbs PT excess. "For features with a 10x variance between peak and off-peak hours, the efficiency gains were substantial."
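
To see why a 10x peak/off-peak variance matters, a rough utilisation calculation helps. The load figures and the four peak hours per day below are invented for illustration and do not come from Slack's post; the function name `pt_utilisation` is likewise hypothetical.

```python
def pt_utilisation(peak_load: float, offpeak_load: float,
                   peak_hours: float, hours_per_day: float = 24.0) -> float:
    """Fraction of peak-sized reserved capacity actually used over a day."""
    used = peak_load * peak_hours + offpeak_load * (hours_per_day - peak_hours)
    reserved = peak_load * hours_per_day
    return used / reserved


# Hypothetical feature: 10x peak/off-peak variance, 4 peak hours per day,
# with PT sized to cover the peak.
utilisation = pt_utilisation(peak_load=100.0, offpeak_load=10.0, peak_hours=4.0)
idle_fraction = 1.0 - utilisation
print(f"utilisation={utilisation:.2f}, idle={idle_fraction:.2f}")
```

Under these assumed numbers, PT sized for peak sits idle for three quarters of its reserved capacity, which is the cost that moving the bursty feature to OD removes.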

Alternative paths (not Slack's)¶

OD-first → add PT as load grows — start cheap, reserve capacity for features whose load reaches predictable-PT scale.
PT-only forever — if the workload portfolio is uniformly latency-sensitive, hybrid adds no value.

Trade-offs¶

Compared to…	Wins	Loses
PT-only	Eliminates off-peak idle cost on bursty workloads; breaks commitment lock-in for OD-served features	Operational complexity of managing two capacity tiers + spillover logic
OD-only	Latency consistency for snappy features; predictable-cost floor	Less responsive to total demand drops; pays PT cost during low-utilisation periods
Per-feature single-tier	Simpler routing	Either pays PT idle cost on bursty features or accepts OD variability for snappy ones
Multi-cloud LLM serving	Concentration-risk reduction; per-feature model binding	Larger operational tax; can be composed with PT+OD-spillover inside each cloud

What spillover is NOT¶

Not a hot/cold cache tier — both PT and OD serve every request type; the choice is per-request based on capacity, not per-data-class.
Not a queue depth absorber — spillover routes the request to a different capacity tier; it doesn't queue inside PT.
Not a reliability fallback — a failed PT call doesn't automatically spill over to OD; degradation is handled by the circuit breaker + model-fallback hierarchy.
Not multi-cloud — spillover happens within one cloud's PT and OD tiers. Cross-cloud routing is the multi-cloud LLM serving pattern, which composes around spillover.
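
The "not a reliability fallback" distinction can be made concrete with a sketch that spills only on capacity rejections. It assumes a reactive router that learns about PT exhaustion from a rejected call, rather than from a proactive capacity check; the exception classes and `call_with_spillover` are hypothetical and do not correspond to any provider SDK.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class CapacityExceeded(Exception):
    """Hypothetical: PT rejected the request because reserved capacity is full."""


class ModelCallFailed(Exception):
    """Hypothetical: the call failed for a non-capacity reason (error, timeout)."""


def call_with_spillover(request: str,
                        call_pt: Callable[[str], T],
                        call_od: Callable[[str], T]) -> T:
    """Try PT first; spill to OD only when PT is at its capacity ceiling."""
    try:
        return call_pt(request)
    except CapacityExceeded:
        # Capacity-ceiling event: route the same request to the OD tier.
        return call_od(request)
    # ModelCallFailed is deliberately not caught here. It propagates to the
    # circuit breaker and model-fallback hierarchy, which own degradation.
```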

Composition with other patterns¶

patterns/multi-cloud-llm-serving — composes outside; PT+OD with spillover happens inside each cloud's endpoint.
patterns/cost-based-load-balancing-llm — composes inside the OD pool to route MU-weighted load. Slack doesn't describe their MU-load-balancing internals; the principle generalises.
patterns/model-fallback-hierarchy-with-circuit-breaker — composes orthogonally; spillover handles capacity-ceiling events, fallback handles quality / health degradation.

Risks and mitigations¶

OD pool saturation when spillover fires industry-wide — in Slack's concentration-risk framing, the shared OD pool degrades when many customers spill over at the same time. Mitigation: multi-cloud routing as the next layer.
Cost surprise from prolonged spillover — if bursty traffic becomes structural, PT capacity is under-sized and sustained spillover shows up as unplanned OD spend. Mitigation: monitor the spillover ratio and resize PT.
Latency regression on spillover — OD endpoints have shared-resource variability vs PT's dedicated nature; users see latency change when spillover fires. Mitigation: keep spillover threshold high so it's a safety net, not the steady-state.
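
The "monitor spillover ratio" mitigation could look like the rolling-window counter below. This is a sketch under assumptions: the class name, the window size and the 5% resize threshold are hypothetical choices, not values from Slack's post, and only PT-default requests are expected to be recorded.

```python
from collections import deque


class SpilloverMonitor:
    """Rolling spillover ratio over the last `window` PT-default requests."""

    def __init__(self, window: int = 1000, resize_threshold: float = 0.05):
        self._events = deque(maxlen=window)  # True means the request spilled
        self.resize_threshold = resize_threshold  # assumed alerting threshold

    def record(self, spilled: bool) -> None:
        self._events.append(spilled)

    def ratio(self) -> float:
        if not self._events:
            return 0.0
        return sum(self._events) / len(self._events)

    def pt_looks_undersized(self) -> bool:
        """Sustained spillover above the threshold suggests resizing PT."""
        return self.ratio() > self.resize_threshold
```

A ratio that stays near zero means spillover is acting as a safety net; a ratio that stays high means it has become the steady state.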

Seen in¶

sources/2026-05-28-slack-slack-ai-the-path-to-multi-cloud — canonical wiki disclosure of the PT+OD-with-spillover pattern as Slack's Phase 3 Hybrid Routing strategy on Amazon Bedrock, resolving the concepts/llm-over-provisioning-cycle and concepts/llm-provider-commitment-lock-in failure modes while preserving latency consistency for snappy features. Verbatim "never dropped a request due to capacity ceilings" framing.