
PATTERN Cited by 1 source

Provisioned Throughput with On-Demand spillover

Pattern

Reserve dedicated Provisioned Throughput (PT) capacity for high-volume, latency-sensitive features that need consistent performance. Route bursty or asynchronous workloads to On-Demand (OD) capacity to eliminate idle cost. When demand on PT exceeds the reserved limit, automatically spill the excess requests over to OD, so that no request is dropped because of a capacity ceiling.

The canonical wiki implementation: Slack's Phase 3 Hybrid Routing on Amazon Bedrock (Source: sources/2026-05-28-slack-slack-ai-the-path-to-multi-cloud), verbatim:

"We didn't simply flip a switch and move everything to On-Demand. To balance efficiency with a premium user experience, we implemented a Hybrid Routing strategy. We kept high-volume, latency-sensitive features on dedicated capacity (Provisioned Throughput) to ensure a consistent 'snappy' feel. Simultaneously, we moved asynchronous, bursty workloads – like nightly Recaps – to On-Demand capacity. To bridge the gap, we engineered a Spillover Pattern: if a sudden surge pushed us beyond our reserved limits, excess requests automatically 'spilled over' to on-demand endpoints, ensuring we never dropped a request due to capacity ceilings."

When to use it

  • Mixed workload portfolio — some features are latency-sensitive with predictable load (PT-favourable), others are bursty / async / scheduled (OD-favourable).
  • PT-only is over-provisioned for off-peak — see concepts/llm-over-provisioning-cycle.
  • OD-only loses latency consistency — shared-pool variability is intolerable for the snappy-feeling features.
  • Capacity-ceiling-driven request drops are unacceptable — the spillover specifically prevents 429s on PT exhaustion.
  • Provider supports both primitives — Bedrock is canonical; GCP Vertex AI offers similar primitives; custom-built equivalent on self-hosted infra possible but more expensive.

When NOT to use it

  • Workload is uniformly latency-sensitive and predictable — pure PT is simpler and the spillover never triggers.
  • Workload is uniformly bursty and async — pure OD is simpler.
  • Provider doesn't expose both primitives — fall back to whichever is available.
  • Provider's spillover semantics are unclear — DIY spillover at the routing layer is feasible but requires health-aware routing already in place.

Three structural pieces

                ┌────────────────────────────┐
                │    Routing Layer           │
                │  - Workload-class tagging  │
                │  - PT capacity check       │
                │  - Spillover policy        │
                └─────────┬──────────────────┘
        ┌─────────────────┼──────────────────┐
        │                 │                  │
        ▼                 ▼                  ▼
   ┌────────┐         ┌────────┐         ┌────────┐
   │   PT   │         │   PT   │  ─────▶ │   OD   │
   │ slot 1 │  ...    │ slot N │ overflow│  pool  │
   │ for    │         │ for    │         │        │
   │snappy  │         │snappy  │         │ for    │
   │feature │         │feature │         │ bursty │
   └────────┘         └────────┘         └────────┘
   (latency-sensitive, predictable)      (async, bursty,
                                          spillover dest)

The pattern requires:

  1. Workload-class tagging — every feature is tagged as PT-default or OD-default at deployment time.
  2. PT capacity tracking — the routing layer knows each PT allocation's model-unit (MU) budget and tracks its in-flight utilisation.
  3. Spillover policy — a defined trigger (e.g. "PT MU utilisation > 95% AND queue depth > X") that routes the next request to OD instead of leaving it to wait on PT.
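
The three pieces above can be sketched as a small routing function. This is a minimal illustration, not Slack's implementation and not a Bedrock API: `Tier`, `PTAllocation`, `SpilloverPolicy`, `route` and `release` are hypothetical names, `capacity_units` is a stand-in for an MU budget, and the 95% / queue-depth thresholds simply mirror the example trigger in item 3.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    PT = "provisioned_throughput"
    OD = "on_demand"


@dataclass
class PTAllocation:
    """Hypothetical bookkeeping for one PT allocation (not a Bedrock API)."""
    capacity_units: float          # reserved budget, a stand-in for the MU budget
    in_flight_units: float = 0.0   # units held by requests currently running
    queue_depth: int = 0           # requests waiting on this allocation

    @property
    def utilisation(self) -> float:
        return self.in_flight_units / self.capacity_units


@dataclass
class SpilloverPolicy:
    """Assumed trigger: utilisation above a threshold AND a non-trivial queue."""
    max_utilisation: float = 0.95
    max_queue_depth: int = 10

    def should_spill(self, pt: PTAllocation, cost_units: float) -> bool:
        # Hard ceiling: the request would not fit in the reserved budget at all.
        if pt.in_flight_units + cost_units > pt.capacity_units:
            return True
        # Soft trigger: PT is nearly full and requests are already queueing.
        return (pt.utilisation > self.max_utilisation
                and pt.queue_depth > self.max_queue_depth)


def route(default_tier: Tier, pt: PTAllocation,
          policy: SpilloverPolicy, cost_units: float) -> Tier:
    """Pick a tier for one request and reserve PT units if it stays on PT."""
    if default_tier is Tier.OD:
        return Tier.OD                      # OD-default features never touch PT
    if policy.should_spill(pt, cost_units):
        return Tier.OD                      # spill over instead of waiting
    pt.in_flight_units += cost_units
    return Tier.PT


def release(pt: PTAllocation, cost_units: float) -> None:
    """Return units to the budget when a PT request completes."""
    pt.in_flight_units = max(0.0, pt.in_flight_units - cost_units)
```

A caller would invoke `route(...)` before dispatching each request and `release(...)` once a PT-served request finishes. How request cost maps onto reserved units is provider-specific and is left as an assumption here.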

Two routes from PT-only to PT+OD-with-spillover

Slack's Phase 2 → Phase 3 path

  • Pre-PT: SageMaker self-managed (Phase 1).
  • PT-only: Move to Bedrock PT (Phase 2). Latency consistency wins, but over-provisioning cycle + commitment lock-in expose efficiency taxes.
  • PT + OD with spillover: Add OD for bursty features. Spillover absorbs PT excess. "For features with a 10x variance between peak and off-peak hours, the efficiency gains were substantial."
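
To see why a 10x peak-to-off-peak variance matters, the arithmetic can be sketched as below. The demand profile is entirely made up for illustration (it is not Slack's data), and `pt_utilisation` and `spilled_fraction` are hypothetical helper names.

```python
def pt_utilisation(hourly_demand, reserved):
    """Fraction of reserved PT capacity actually used over the period."""
    served = sum(min(d, reserved) for d in hourly_demand)
    return served / (reserved * len(hourly_demand))


def spilled_fraction(hourly_demand, reserved):
    """Fraction of total demand that exceeds the reservation and goes to OD."""
    excess = sum(max(d - reserved, 0.0) for d in hourly_demand)
    return excess / sum(hourly_demand)


# Hypothetical day: 8 peak hours at 100 units, 16 off-peak hours at 10 units.
demand = [100.0] * 8 + [10.0] * 16

# Reservation sized for the peak vs. sized for the off-peak floor.
peak_sized = (pt_utilisation(demand, 100.0), spilled_fraction(demand, 100.0))
floor_sized = (pt_utilisation(demand, 10.0), spilled_fraction(demand, 10.0))
```

Under these assumed numbers, a peak-sized reservation is 40% utilised and spills nothing, while a floor-sized reservation is fully utilised but sends 75% of demand to OD. Real sizing sits between the two and depends on actual PT and OD pricing, which this sketch deliberately leaves out.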

Alternative paths (not Slack's)

  • OD-first → add PT as load grows — start cheap, reserve capacity for features whose load reaches predictable-PT scale.
  • PT-only forever — if the workload portfolio is uniformly latency-sensitive, hybrid adds no value.

Trade-offs

| Compared to… | Wins | Loses |
|---|---|---|
| PT-only | Eliminates off-peak idle cost on bursty workloads; breaks commitment lock-in for OD-served features | Operational complexity of managing two capacity tiers + spillover logic |
| OD-only | Latency consistency for snappy features; predictable-cost floor | Less responsive to total demand drops; pays PT cost during low-utilisation periods |
| Per-feature single-tier | Simpler routing | Either pays PT idle cost on bursty features or accepts OD variability for snappy ones |
| Multi-cloud LLM serving | Concentration-risk reduction; per-feature model binding | Larger operational tax; can be composed with PT+OD-spillover inside each cloud |

What spillover is NOT

  • Not a hot/cold cache tier — both PT and OD serve every request type; the choice is per-request based on capacity, not per-data-class.
  • Not a queue depth absorber — spillover routes the request to a different capacity tier; it doesn't queue inside PT.
  • Not a reliability fallback — a failed PT call doesn't automatically spill over to OD; degradation is handled by the circuit breaker + model-fallback hierarchy.
  • Not multi-cloud — spillover happens within one cloud's PT and OD tiers. Cross-cloud routing is the multi-cloud LLM serving pattern, which composes around spillover.
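
The distinction between capacity spillover and reliability fallback can be made concrete with a small sketch. Everything here is hypothetical: the two exception classes, `call_with_spillover`, and the `call_pt` / `call_od` / `on_failure` callables are illustrative stand-ins, and how a real client signals throttling versus failure depends on the provider SDK.

```python
class PTCapacityExceeded(Exception):
    """Hypothetical signal that PT rejected the request for lack of capacity."""


class PTCallFailed(Exception):
    """Hypothetical signal that the PT call itself failed (timeout, 5xx, ...)."""


def call_with_spillover(request, call_pt, call_od, on_failure):
    """Spill to OD only on capacity exhaustion; delegate real failures elsewhere."""
    try:
        return call_pt(request)
    except PTCapacityExceeded:
        # Capacity ceiling reached: same request, different capacity tier.
        return call_od(request)
    except PTCallFailed as exc:
        # Not a spillover case: hand off to the circuit breaker / fallback path.
        return on_failure(request, exc)
```

Keeping the two paths separate means a degraded model is not silently retried on OD; that decision stays with whatever failure-handling layer `on_failure` represents.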

Composition with other patterns

Risks and mitigations

  • OD pool saturation when spillover fires industry-wide — Slack's concentration-risk reframing: when many customers spill over simultaneously, the shared OD pool degrades. Mitigation: multi-cloud routing as the next layer.
  • Cost surprise from prolonged spillover — when bursty traffic becomes structural, PT capacity is under-sized and spillover stops being the exception. Mitigation: monitor the spillover ratio and resize PT.
  • Latency regression on spillover — OD endpoints have shared-resource variability vs PT's dedicated nature; users see latency change when spillover fires. Mitigation: keep spillover threshold high so it's a safety net, not the steady-state.
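
Monitoring the spillover ratio, as suggested in the mitigations above, could look roughly like the following. This is a sketch under assumptions: `SpilloverMonitor`, the window size, and the 5% alert threshold are hypothetical choices, not values from the source.

```python
from collections import deque


class SpilloverMonitor:
    """Tracks the share of recent PT-default requests that spilled to OD."""

    def __init__(self, window: int = 1000, alert_ratio: float = 0.05):
        self._events = deque(maxlen=window)   # True = spilled, False = served on PT
        self.alert_ratio = alert_ratio

    def record(self, spilled: bool) -> None:
        self._events.append(spilled)

    @property
    def ratio(self) -> float:
        if not self._events:
            return 0.0
        return sum(self._events) / len(self._events)

    def pt_looks_undersized(self) -> bool:
        # Only judge once the window is full, to avoid alerting on a cold start.
        window_full = len(self._events) == self._events.maxlen
        return window_full and self.ratio > self.alert_ratio
```

A sustained `pt_looks_undersized()` signal suggests spillover has become steady-state rather than a safety net, which is the cue to revisit PT sizing.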

Seen in
