PATTERN Cited by 1 source

Model fallback hierarchy with circuit breaker

Pattern

For every LLM-powered feature, designate a primary model and one or more backup models in a defined fallback hierarchy. Combine this with an automated circuit breaker that monitors endpoint-level health signals (time to first token (TTFT), p90 latency, 5xx error rate) in real time and reroutes traffic to the next model in the hierarchy when the primary degrades. A partial-open recovery state then gradually ramps traffic back to the recovering endpoint.

The canonical wiki implementation: the fallback + circuit breaker subsystems of Slack's Intelligent Routing Layer (Source: sources/2026-05-28-slack-slack-ai-the-path-to-multi-cloud):

"We developed a model hierarchy for every AI feature, allowing our system to automatically fall back to different models if the primary model reached a degraded state. Some examples of regressions are elevated time to first token latencies, throttling errors, and downward trend in customer feedback. […] If a specific model was underperforming or hitting limits in one region, the platform would reroute the request in real-time to another healthy endpoint. From the customer's perspective, the experience remained seamless; they continued to receive high-quality results without ever knowing a complex failover had occurred behind the scenes."

When to use it

LLM-powered features at production scale — millions of user-visible requests where any single model degradation has customer impact.
Multiple comparable-quality models available — primary + backup pairs require enough catalogue depth that the backup is not catastrophically worse than the primary.
Real-time health signals available — the breaker requires per-endpoint TTFT / p90 / error rate streams.
Provider-side failures are non-zero — even reliable providers have model-level degradations, throttling spikes, and capacity exhaustion events.

When NOT to use it

One-model deployments — the pattern is meaningless without designated backups.
Very low-volume features — the breaker's signal-to-noise ratio is poor at low request counts.
Latency budgets too tight for routing-layer overhead — most LLM workloads have headroom; some real-time inference doesn't.
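The low-volume caveat can be made concrete with a minimum-sample guard. This is a minimal sketch, not anything the source describes: the function name, the 5% threshold, and the 200-request floor are all hypothetical.

```python
def error_rate_trips(errors: int, total: int,
                     threshold: float = 0.05, min_samples: int = 200) -> bool:
    """Return True only when the window holds enough requests to trust the rate.

    threshold and min_samples are hypothetical values chosen for illustration.
    """
    if total < min_samples:
        return False  # too few requests: a single failure would swing the rate
    return errors / total > threshold


# One failure in four requests is a 25% error rate, but it is not evidence.
print(error_rate_trips(errors=1, total=4))      # False
print(error_rate_trips(errors=30, total=400))   # True (7.5% > 5%)
```

With a guard like this, a low-volume feature rarely accumulates enough samples to trip at all, which is why the breaker adds little there.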

Three structural pieces

Per-feature configuration:
  feature: "ai_search"
  models:
    - { provider: A, sku: high-reasoning-v3, role: primary }
    - { provider: B, sku: high-reasoning-v2, role: backup_1 }
    - { provider: A, sku: high-reasoning-v2, role: backup_2 }
  health_thresholds:
    ttft_p90_ms: <undisclosed>
    p90_latency_ms: <undisclosed>
    error_rate_pct: <undisclosed>

           ┌─────────────────────┐
   Request │ Routing Layer       │
   ────────│   1. Pick model     │
           │      from hierarchy │
           │   2. Check breaker  │
           │      state          │
           │   3. Forward / fall │
           │      to next        │
           └─────────────────────┘
                    │
        ┌───────────┼───────────┐
        ▼           ▼           ▼
  ┌──────────┐ ┌──────────┐ ┌──────────┐
  │ Primary  │ │ Backup 1 │ │ Backup 2 │
  │ CLOSED   │ │ CLOSED   │ │ CLOSED   │
  └──────────┘ └──────────┘ └──────────┘
       │
       ├─[degrades: TTFT↑, p90↑, 5xx↑]
       ▼
  ┌──────────┐
  │ Primary  │  reroute to Backup 1
  │ OPEN     │
  └──────────┘
       │
       ├─[cooldown]
       ▼
  ┌──────────┐
  │ Primary  │  trickle of requests
  │ PARTIAL- │  starts; expands as
  │ OPEN     │  health is sustained
  └──────────┘
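The transitions in the diagram can be sketched as a small state machine. This is an illustration under assumptions, not Slack's implementation: the class and method names and the 30-second cooldown are hypothetical, and the traffic ramp during PARTIAL_OPEN is left to a separate component that calls `ramp_complete()`.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"              # healthy: endpoint takes its normal traffic
    OPEN = "open"                  # degraded: traffic goes to the next model
    PARTIAL_OPEN = "partial-open"  # recovering: only a trickle of traffic


class Breaker:
    def __init__(self, cooldown_s: float = 30.0) -> None:
        self.state = State.CLOSED
        self.cooldown_s = cooldown_s  # hypothetical cooldown length
        self.opened_at = 0.0

    def step(self, healthy: bool, now: float) -> State:
        """Advance the state machine by one health evaluation at time `now`."""
        if self.state is State.CLOSED:
            if not healthy:
                self._open(now)
        elif self.state is State.OPEN:
            # No traffic is sent while OPEN, so only the cooldown clock matters.
            if now - self.opened_at >= self.cooldown_s:
                self.state = State.PARTIAL_OPEN
        elif self.state is State.PARTIAL_OPEN:
            if not healthy:
                self._open(now)  # re-trip and restart the cooldown
        return self.state

    def ramp_complete(self) -> None:
        """Called by the ramp logic once the trickle has grown to full traffic."""
        if self.state is State.PARTIAL_OPEN:
            self.state = State.CLOSED

    def _open(self, now: float) -> None:
        self.state = State.OPEN
        self.opened_at = now


b = Breaker(cooldown_s=30.0)
print(b.step(healthy=False, now=0.0).value)   # open
print(b.step(healthy=True, now=10.0).value)   # open (cooldown not elapsed)
print(b.step(healthy=True, now=30.0).value)   # partial-open
b.ramp_complete()
print(b.state.value)                          # closed
```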

The pattern requires:

Per-feature primary + backup designation — every feature has a defined fallback hierarchy in configuration, not in code.
Real-time health monitoring at endpoint level — TTFT, p90 latency, 5xx error rate streams updated continuously.
Automated circuit breaker with partial-open ramp — see concepts/automated-circuit-breaker-with-partial-open-state.
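How the three pieces fit together in the routing layer can be sketched as follows. The hierarchy mirrors the example configuration above; `ModelEndpoint`, `HIERARCHIES`, `pick_endpoint`, and the `accepts_traffic` callback are hypothetical names, and a real router would read breaker state rather than a hard-coded set.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ModelEndpoint:
    provider: str
    sku: str
    role: str  # "primary", "backup_1", ...


# Kept as data, not code: list order defines the fallback hierarchy.
HIERARCHIES: dict[str, list[ModelEndpoint]] = {
    "ai_search": [
        ModelEndpoint("A", "high-reasoning-v3", "primary"),
        ModelEndpoint("B", "high-reasoning-v2", "backup_1"),
        ModelEndpoint("A", "high-reasoning-v2", "backup_2"),
    ],
}


def pick_endpoint(feature: str,
                  accepts_traffic: Callable[[ModelEndpoint], bool]) -> ModelEndpoint:
    """Return the first endpoint in the hierarchy whose breaker admits traffic."""
    for endpoint in HIERARCHIES[feature]:
        if accepts_traffic(endpoint):
            return endpoint
    raise RuntimeError(f"no healthy endpoint for feature {feature!r}")


# Hypothetical breaker lookup: the primary is OPEN, both backups are CLOSED.
open_breakers = {("A", "high-reasoning-v3")}
chosen = pick_endpoint("ai_search",
                       lambda e: (e.provider, e.sku) not in open_breakers)
print(chosen.role)  # backup_1
```

Because each endpoint is checked independently, a backup that trips its own breaker is skipped in the same loop and the request falls through to the next tier.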

What "regression" means (Slack disclosed)

The 2026-05-28 source enumerates three named regression signals:

Elevated TTFT — "elevated time to first token latencies".
Throttling errors — provider rate-limit / quota exhaustion.
Downward trend in customer feedback — Slack treats user feedback as a first-class soft-failure signal alongside hard-failure metrics.

The third is novel: Slack's "redefining the meaning of 'Failure'" framing (verbatim) canonicalises soft failures (p90 spikes, feedback trends) as breaker triggers alongside hard errors.
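The two fast signals can be evaluated per window along these lines. This is a sketch under assumptions: the source discloses neither thresholds nor status codes, so the limits are hypothetical and HTTP 429 is assumed to mark throttling. The customer-feedback trend is omitted because it is not a per-request signal.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    ttft_ms: float  # time to first token for one request
    status: int     # HTTP-style status code (429 assumed to mean throttled)


def p90(values: list[float]) -> float:
    """Nearest-rank 90th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (9 * len(ordered) + 9) // 10  # integer ceil(0.9 * n)
    return ordered[rank - 1]


def regressions(window: list[Sample], ttft_p90_limit_ms: float,
                throttle_rate_limit: float) -> list[str]:
    """Name the regression signals present in one evaluation window."""
    if not window:
        return []  # no traffic, nothing to judge
    found = []
    if p90([s.ttft_ms for s in window]) > ttft_p90_limit_ms:
        found.append("elevated_ttft")
    throttled = sum(1 for s in window if s.status == 429)
    if throttled / len(window) > throttle_rate_limit:
        found.append("throttling")
    return found


# Ten requests: two slow first tokens and one throttled response.
window = [Sample(400, 200)] * 7 + [Sample(2500, 200)] * 2 + [Sample(300, 429)]
print(regressions(window, ttft_p90_limit_ms=1000, throttle_rate_limit=0.05))
# ['elevated_ttft', 'throttling']
```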

Composition with other patterns

patterns/circuit-breaker (classical) — direct refinement; classical pattern protects one dependency, this pattern composes the breaker with a hierarchy of alternatives.
patterns/multi-cloud-llm-serving — composes outside; the fallback hierarchy can include cross-cloud backups.
patterns/api-normalization-layer-cross-provider — composes alongside; the breaker depends on unified health/error vocabulary.
patterns/provisioned-throughput-with-on-demand-spillover — composes orthogonally; spillover handles capacity-ceiling events while fallback handles degradation events.

Trade-offs

Alternative	What the alternative wins	What the alternative loses
No fallback	No per-feature fallback configuration; no fallback paths to test	Provider degradation events are visible to users
Single backup, no breaker	Simpler failover	No automated recovery; manual intervention to restore primary
Manual operator failover	Full human judgment	Slow; on-call burden; depends on detection latency
Multi-cloud LLM serving	Provider redundancy beyond model redundancy	More operational complexity; multi-cloud composes around this pattern

Trade-offs of partial-open recovery

Partial-open ramp prevents thundering-herd-on-recovery (concepts/thundering-herd) at the cost of a longer recovery duration. Slack's framing verbatim:

"This ensures a graceful recovery without overwhelming a stabilizing service."
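The recovery-duration cost can be estimated with simple arithmetic. The numbers below are assumptions for illustration only (1% starting share, doubling after each healthy window, 30-second windows); the source does not disclose Slack's ramp curve.

```python
def ramp_schedule(start: float = 0.01, factor: float = 2.0) -> list[float]:
    """Traffic share sent to the recovering endpoint, one entry per window."""
    fractions = []
    fraction = start
    while fraction < 1.0:
        fractions.append(fraction)
        fraction *= factor
    fractions.append(1.0)  # final step: full traffic, breaker can close
    return fractions


schedule = ramp_schedule()
window_s = 30  # hypothetical evaluation window length in seconds
print(len(schedule), "steps,", (len(schedule) - 1) * window_s, "s before full traffic")
# 8 steps, 210 s before full traffic
```

Under these assumptions the endpoint spends seven windows below full traffic, compared with zero for an immediate switch back. That delay is the price of not flooding an endpoint that has only just stabilised.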

Risks and mitigations

Backup model quality regression — the system fails over but produces lower-quality output. Mitigation: per-feature primary + backup chosen with quality parity in mind; backup is "high-quality", not "any working".
Cascading failover — primary degrades, backup gets flooded, backup degrades too. Mitigation: the breaker on each endpoint is independent; flooded backup trips its own breaker; routing falls to the next-tier backup.
Stale health data — breaker decisions based on outdated signals. Mitigation: high-frequency health signal updates; short evaluation windows.
Customer-feedback trend is a slow signal — the "downward trend in customer feedback" takes minutes-to-hours to detect. Mitigation: composed with TTFT/p90 fast signals; feedback drives long-tail re-evaluation, not real-time rerouting.

Risks and mitigations specific to partial-open

Ramp too fast → re-trips on recovering endpoint. Mitigation: conservative ramp curve; require multiple consecutive sustained-health windows.
Ramp too slow → wastes capacity; needlessly degraded user experience. Mitigation: exponential ramp with health-based gating.
Stuck in partial-open → endpoint never recovers fully. Mitigation: timeout that escalates to OPEN if ramp doesn't reach CLOSED in N minutes.
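The three mitigations can be combined in one ramp controller, sketched below. All names and defaults are hypothetical: the "N minutes" timeout is approximated by a count of evaluation windows, and a window may be reported as `None` when the trickle carried too little traffic to judge.

```python
from typing import Optional


class PartialOpenRamp:
    def __init__(self, start: float = 0.01, factor: float = 2.0,
                 windows_per_step: int = 3, max_windows: int = 40) -> None:
        self.fraction = start              # share of traffic sent to the endpoint
        self.factor = factor               # exponential growth per step
        self.windows_per_step = windows_per_step
        self.max_windows = max_windows     # stand-in for the "N minutes" timeout
        self.streak = 0
        self.windows_seen = 0

    def observe(self, healthy: Optional[bool]) -> str:
        """Feed one evaluation window (None = too little traffic to judge)."""
        self.windows_seen += 1
        if healthy is False:
            return "OPEN"                  # re-trip: the endpoint is not ready
        if healthy:
            self.streak += 1
            if self.streak >= self.windows_per_step:
                # Sustained health: grow the share, capped at full traffic.
                self.streak = 0
                self.fraction = min(1.0, self.fraction * self.factor)
                if self.fraction >= 1.0:
                    return "CLOSED"        # ramp finished
        if self.windows_seen >= self.max_windows:
            return "OPEN"                  # stuck in partial-open: escalate
        return "PARTIAL_OPEN"


ramp = PartialOpenRamp()
states = [ramp.observe(True) for _ in range(21)]
print(states[-1], ramp.fraction)  # CLOSED 1.0
```

The caller is expected to discard the controller once it returns `OPEN` or `CLOSED` and create a fresh one for the next recovery attempt.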

Seen in

sources/2026-05-28-slack-slack-ai-the-path-to-multi-cloud — canonical wiki disclosure of model fallback hierarchy with circuit breaker as Slack's Phase 3+ resilience pattern, generalised at Phase 4 to cross-cloud fallback. Verbatim TTFT + throttling + customer feedback regression triggers; partial-open recovery state with dynamic ramp expansion framing.