CONCEPT Cited by 1 source

LLM over-provisioning cycle

Definition

The LLM over-provisioning cycle is the structural failure mode of static / dedicated LLM serving capacity (Provisioned Throughput on Bedrock or equivalent) when workload demand follows a strong diurnal pattern with global peaks and off-peaks. To meet peak SLAs, the customer reserves capacity at the global maximum — but pays the same cost during periods when actual utilisation is far below the reserved capacity.

The wiki's canonical framing comes from Slack's 2026-05-28 multi-cloud retrospective (Source: sources/2026-05-28-slack-slack-ai-the-path-to-multi-cloud):

"The Over-Provisioning Cycle: Our infrastructure needs are very closely aligned by the global workday traffic patterns. To ensure a snappy experience during the massive US East and West Coast morning surges – when users lean heavily on AI Summaries and Search to catch up on activity – we had to maintain a high baseline of MUs. While we saw steadier, lighter usage during the APAC and EU mornings, we had to provision for that absolute global peak. This meant we were often paying for significant underutilized capacity during the troughs between regional handoffs and over the weekends, creating a persistent efficiency gap."

Structural diagnosis

The cycle has three load-bearing components:

  1. Diurnal demand pattern with strong regional peaks. For global workplace SaaS like Slack, US East/West Coast mornings produce the largest traffic spike of the global day, when users "lean heavily on AI Summaries and Search to catch up on activity."
  2. Latency-sensitive feature SLA that requires the peak capacity to be available the moment the peak begins. This forecloses cron-based scaling that lags the demand curve.
  3. Dedicated-capacity pricing. Provisioned Throughput prices are flat across the contract term regardless of utilisation. The customer pays the global-peak price 24 hours a day, 7 days a week, while actual utilisation rises and falls with the time of day and day of the week.

The combination produces a persistent efficiency gap: the delta between the always-paid peak capacity and the much-lower-than-peak average utilisation.
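The gap can be made concrete with a small calculation. The sketch below is illustrative only: the `efficiency_gap` helper, the `GapReport` type, and the 24-hour demand profile are all hypothetical and are not taken from Slack's post. It sizes reserved capacity to the peak hour, then reports what share of that paid capacity sits idle on average.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class GapReport:
    reserved_units: float    # capacity reserved to cover the peak hour
    mean_utilisation: float  # average demand divided by reserved capacity
    idle_fraction: float     # share of paid capacity that goes unused


def efficiency_gap(hourly_demand: Sequence[float]) -> GapReport:
    """Size capacity to the peak hour and report how much of it sits idle."""
    if not hourly_demand:
        raise ValueError("hourly_demand must not be empty")
    reserved = max(hourly_demand)
    if reserved <= 0:
        raise ValueError("peak demand must be positive")
    mean_util = sum(hourly_demand) / (len(hourly_demand) * reserved)
    return GapReport(reserved, mean_util, 1.0 - mean_util)


# Hypothetical profile in arbitrary capacity units (NOT Slack data):
# 4 peak hours at 100 units, 20 off-peak hours at 25 units.
profile = [100.0] * 4 + [25.0] * 20
report = efficiency_gap(profile)
print(
    f"reserved={report.reserved_units:.0f} units, "
    f"mean utilisation={report.mean_utilisation:.1%}, "
    f"idle={report.idle_fraction:.1%}"
)
```

Under these assumed numbers, a little over a third of the reserved capacity is used on average. The narrower the peak relative to the rest of the day, the larger the idle share becomes.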

Numerical signature (disclosed)

Slack discloses a 10× variance between peak and off-peak hours for features that fit the OD model (e.g. nightly Recap). The exact ratio for PT-served features (channel summaries, AI Search) is not disclosed but described as a "persistent efficiency gap" with "steadier, lighter usage during the APAC and EU mornings" between US-coast regional handoffs.

The 10× figure is the wiki's canonical disclosure of the peak/off-peak amplitude in enterprise LLM serving workloads.

Why this isn't solved by intra-PT scaling

Provisioned Throughput on Bedrock can be scaled up and down, but two structural constraints keep such scaling from closing the efficiency gap:

  • Re-allocation friction — moving MUs between models or features takes operational coordination, not real-time market response.
  • Commitment terms: "Provisioned Throughput often required commitments of one to six months." Capacity bought at contract time is paid for through the term regardless of whether daily off-peak utilisation drops to 10% of peak.
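One way to reason about the commitment constraint is a break-even calculation. The sketch below is a simplification under stated assumptions: the rates are invented, and on-demand usage is modelled as billed per utilised unit-hour, whereas real on-demand pricing is typically per token. None of the function names or figures come from Slack or from any provider's price list.

```python
def committed_cost(units: float, rate_per_unit_hour: float, hours: float) -> float:
    """Flat cost of reserved capacity: every hour is paid for, used or not."""
    return units * rate_per_unit_hour * hours


def on_demand_cost(units: float, rate_per_unit_hour: float, hours: float,
                   utilisation: float) -> float:
    """Pay-per-use cost: only the utilised share of unit-hours is billed."""
    if not 0.0 <= utilisation <= 1.0:
        raise ValueError("utilisation must be between 0 and 1")
    return units * utilisation * rate_per_unit_hour * hours


def break_even_utilisation(committed_rate: float, on_demand_rate: float) -> float:
    """Mean utilisation below which pay-per-use is cheaper than a commitment."""
    if committed_rate <= 0 or on_demand_rate <= 0:
        raise ValueError("rates must be positive")
    return committed_rate / on_demand_rate


# Hypothetical figures: a one-month term approximated as 30 days,
# 100 capacity units, and an on-demand rate at twice the committed rate.
TERM_HOURS = 30 * 24
pt_total = committed_cost(100, 1.0, TERM_HOURS)
od_total = on_demand_cost(100, 2.0, TERM_HOURS, utilisation=0.375)
threshold = break_even_utilisation(1.0, 2.0)
print(f"committed={pt_total:.0f}, on-demand={od_total:.0f}, break-even={threshold:.0%}")
```

With these assumed rates the commitment only pays off when mean utilisation stays above 50%. A workload whose off-peak demand falls to a small fraction of peak may sit well below that threshold for the whole term.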

Resolution: hybrid PT+OD with spillover

Slack's Phase 3 architecture treats the over-provisioning cycle as the primary justification for moving bursty workloads off PT entirely. Verbatim:

"For features with a 10x variance between peak and off-peak hours, the efficiency gains were substantial. […] These challenges led us to our next evolution: finding a way to balance the reliability of provisioned capacity with the economic and technical flexibility of On-Demand scaling."

The PT-with-OD-spillover pattern is the canonical resolution: PT carries the floor for latency-sensitive workloads; OD absorbs the bursty excess; spillover automatically routes overflow.
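The routing rule described above can be sketched as a small admission check. This is a minimal illustration, not Slack's implementation, which the source does not describe at this level: `SpilloverRouter` is a hypothetical class, and it models PT capacity as a cap on concurrent requests, whereas real provisioned capacity is expressed in throughput units.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SpilloverRouter:
    """Fill the provisioned (PT) floor first, then spill to on-demand (OD)."""

    pt_capacity: int  # assumed cap on concurrent PT requests (hypothetical unit)
    pt_in_flight: int = 0
    routed: Dict[str, int] = field(default_factory=lambda: {"PT": 0, "OD": 0})

    def acquire(self) -> str:
        """Pick a tier for a new request and record the decision."""
        if self.pt_in_flight < self.pt_capacity:
            self.pt_in_flight += 1
            tier = "PT"
        else:
            tier = "OD"  # PT floor is full, so overflow goes to on-demand
        self.routed[tier] += 1
        return tier

    def release(self, tier: str) -> None:
        """Free a PT slot when a PT request finishes; OD needs no bookkeeping."""
        if tier == "PT":
            if self.pt_in_flight == 0:
                raise RuntimeError("release called with no PT request in flight")
            self.pt_in_flight -= 1


router = SpilloverRouter(pt_capacity=2)
burst = [router.acquire() for _ in range(5)]  # five requests arrive at once
router.release("PT")                           # one PT request completes
after_release = router.acquire()               # the next request reuses the PT slot
print(burst, after_release, router.routed)
```

Sizing `pt_capacity` to the steady floor rather than the peak is what removes the idle spend: the burst above the floor is billed only while it lasts.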

Composition with neighbouring concepts

| Concept | Relationship |
| --- | --- |
| concepts/diurnal-traffic-pattern | The general traffic shape that produces the over-provisioning cycle. |
| concepts/provisioned-throughput-vs-on-demand-llm | PT-specific failure mode the over-provisioning cycle exposes. |
| concepts/llm-provider-commitment-lock-in | The contract-term constraint compounding the cycle. |
| concepts/over-provisioning (general) | LLM-serving-specialised case of the broader over-provisioning concept. |

Distinguishing from neighbouring failure modes

  • Cold-start over-provisioning (e.g. SageMaker Phase 1): paying for idle GPU instances "to meet peak SLAs". This is the same failure mode, but at the IaaS altitude rather than the managed-LLM-throughput altitude described here.
  • Static-fleet auto-scaler under-fitting: the fleet is sized to the expected average rather than the peak, which produces SLA violations, not idle cost.
  • Provider-side over-provisioning: Bedrock's OD shared pool relies on the provider absorbing over-provisioning across its customer base; the customer doesn't see it directly.

Seen in

  • sources/2026-05-28-slack-slack-ai-the-path-to-multi-cloud — canonical wiki disclosure of the LLM over-provisioning cycle as the structural driver of Slack's Phase 2 → Phase 3 evolution from PT-only to Hybrid PT+OD with spillover. Verbatim 10× peak/off-peak amplitude for the most-favourable workloads.