PATTERN Cited by 1 source
Model fallback hierarchy with circuit breaker¶
Pattern¶
For every LLM-powered feature, designate a primary model and one or more backup models in a defined fallback hierarchy. Combine this hierarchy with an automated circuit breaker that monitors endpoint-level health signals (time to first token (TTFT), p90 latency, 5xx error rate) in real time and reroutes traffic to the next model in the hierarchy when the primary degrades. The breaker includes a partial-open recovery state that gradually ramps traffic back to the recovering endpoint.
The canonical wiki implementation: the fallback + circuit breaker subsystems of Slack's Intelligent Routing Layer (Source: sources/2026-05-28-slack-slack-ai-the-path-to-multi-cloud):
"We developed a model hierarchy for every AI feature, allowing our system to automatically fall back to different models if the primary model reached a degraded state. Some examples of regressions are elevated time to first token latencies, throttling errors, and downward trend in customer feedback. […] If a specific model was underperforming or hitting limits in one region, the platform would reroute the request in real-time to another healthy endpoint. From the customer's perspective, the experience remained seamless; they continued to receive high-quality results without ever knowing a complex failover had occurred behind the scenes."
When to use it¶
- LLM-powered features at production scale — millions of user-visible requests where any single model degradation has customer impact.
- Multiple comparable-quality models available — primary + backup pairs require enough catalogue depth that the backup is not catastrophically worse than the primary.
- Real-time health signals available — the breaker requires per-endpoint TTFT / p90 / error rate streams.
- Provider-side failures are non-zero — even reliable providers have model-level degradations, throttling spikes, and capacity exhaustion events.
When NOT to use it¶
- One-model deployments — the pattern is meaningless without designated backups.
- Very low-volume features — the breaker's signal-to-noise ratio is poor at low request counts.
- Latency budgets too tight for routing-layer overhead — most LLM workloads have headroom; some real-time inference doesn't.
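The low-volume caveat can be illustrated with a minimum-sample guard. This is a minimal sketch, not something the source describes: the function name, the 5% threshold, and the 200-request floor are all assumed values chosen for illustration.

```python
def error_rate_breached(errors: int, total: int,
                        threshold: float = 0.05,      # assumed 5% threshold
                        min_requests: int = 200) -> bool:  # assumed sample floor
    """Report a breach only when the window holds enough traffic to trust the rate."""
    if total < min_requests:
        # Too few samples: a single failure would dominate the computed rate.
        return False
    return errors / total > threshold


# One failure in ten requests is nominally a 10% error rate, but it is one event.
low_volume = error_rate_breached(errors=1, total=10)
high_volume = error_rate_breached(errors=30, total=400)  # 7.5% over 400 requests
```

With a guard like this, a low-volume feature rarely accumulates enough samples for the breaker to act at all, which is why the pattern fits it poorly.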
Three structural pieces¶
Per-feature configuration:
feature: "ai_search"
models:
  - { provider: A, sku: high-reasoning-v3, role: primary }
  - { provider: B, sku: high-reasoning-v2, role: backup_1 }
  - { provider: A, sku: high-reasoning-v2, role: backup_2 }
health_thresholds:
  ttft_p90_ms: <undisclosed>
  p90_latency_ms: <undisclosed>
  error_rate_pct: <undisclosed>
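One way a routing layer might consume such a configuration is to turn the `role` field into an explicit ordering. The sketch below assumes the role naming shown in the configuration (`primary`, `backup_N`); the `ModelEntry` type and both helper functions are hypothetical names, not part of Slack's disclosed system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelEntry:
    provider: str
    sku: str
    role: str  # "primary", "backup_1", "backup_2", ...


def role_rank(role: str) -> int:
    """Map a role to its position: primary is 0, backup_N is N."""
    if role == "primary":
        return 0
    prefix, _, n = role.partition("_")
    if prefix != "backup" or not n.isdigit() or int(n) < 1:
        raise ValueError(f"unknown role: {role!r}")
    return int(n)


def ordered_hierarchy(models: list[ModelEntry]) -> list[ModelEntry]:
    """Return models in fallback order, insisting on exactly one primary."""
    if sum(m.role == "primary" for m in models) != 1:
        raise ValueError("hierarchy needs exactly one primary")
    return sorted(models, key=lambda m: role_rank(m.role))


# The ai_search entries from the configuration, deliberately out of order.
AI_SEARCH = [
    ModelEntry("A", "high-reasoning-v2", "backup_2"),
    ModelEntry("A", "high-reasoning-v3", "primary"),
    ModelEntry("B", "high-reasoning-v2", "backup_1"),
]
fallback_order = [(m.provider, m.sku) for m in ordered_hierarchy(AI_SEARCH)]
```

Validating the hierarchy at load time keeps ordering mistakes out of the request path.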
        ┌─────────────────────┐
Request │ Routing Layer       │
────────│ 1. Pick model       │
        │    from hierarchy   │
        │ 2. Check breaker    │
        │    state            │
        │ 3. Forward / fall   │
        │    to next          │
        └─────────────────────┘
                   │
       ┌───────────┼───────────┐
       ▼           ▼           ▼
 ┌──────────┐ ┌──────────┐ ┌──────────┐
 │ Primary  │ │ Backup 1 │ │ Backup 2 │
 │ CLOSED   │ │ CLOSED   │ │ CLOSED   │
 └──────────┘ └──────────┘ └──────────┘
       │
       ├─[degrades: TTFT↑, p90↑, 5xx↑]
       ▼
 ┌──────────┐
 │ Primary  │  reroute to Backup 1
 │ OPEN     │
 └──────────┘
       │
       ├─[cooldown]
       ▼
 ┌──────────┐
 │ Primary  │  trickle of requests
 │ PARTIAL- │  starts; expands as
 │ OPEN     │  health is sustained
 └──────────┘
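The state transitions in the diagram can be sketched as a small state machine. This is a simplified illustration under stated assumptions: the `Breaker` class, the 30-second cooldown, and the injected `now` timestamp are hypothetical, and the gradual ramp inside PARTIAL-OPEN is collapsed here to a single healthy check.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    PARTIAL_OPEN = "partial_open"


class Breaker:
    def __init__(self, cooldown_s: float = 30.0):  # assumed cooldown
        self.state = State.CLOSED
        self.cooldown_s = cooldown_s
        self.opened_at = 0.0

    def on_health(self, healthy: bool, now: float) -> State:
        """Advance the breaker given one health verdict at time `now` (seconds)."""
        if self.state is State.CLOSED:
            if not healthy:
                self.state, self.opened_at = State.OPEN, now
        elif self.state is State.OPEN:
            # No traffic flows while OPEN, so only elapsed time matters here.
            if now - self.opened_at >= self.cooldown_s:
                self.state = State.PARTIAL_OPEN
        elif self.state is State.PARTIAL_OPEN:
            if healthy:
                self.state = State.CLOSED  # simplification: no gradual ramp
            else:
                self.state, self.opened_at = State.OPEN, now  # re-trip
        return self.state


breaker = Breaker(cooldown_s=30.0)
trace = [
    breaker.on_health(True, now=0.0),    # stays CLOSED
    breaker.on_health(False, now=1.0),   # degrades: OPEN
    breaker.on_health(True, now=10.0),   # still cooling down: OPEN
    breaker.on_health(True, now=31.0),   # cooldown elapsed: PARTIAL_OPEN
    breaker.on_health(True, now=32.0),   # healthy trickle: CLOSED
]
```

Passing the clock in as an argument keeps the sketch deterministic and easy to test; a production breaker would more likely read a monotonic clock itself.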
The pattern requires:
- Per-feature primary + backup designation — every feature has a defined fallback hierarchy in configuration, not in code.
- Real-time health monitoring at endpoint level — TTFT, p90 latency, 5xx error rate streams updated continuously.
- Automated circuit breaker with partial-open ramp — see concepts/automated-circuit-breaker-with-partial-open-state.
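At request time, the three requirements above might combine roughly as follows. This is a sketch only: `pick_model`, the string state names, and the per-model `ramp_fraction` map are assumptions made for illustration, and the random source is injected so the behaviour can be checked deterministically.

```python
import random
from typing import Callable


def pick_model(hierarchy: list[str],
               states: dict[str, str],
               ramp_fraction: dict[str, float],
               rng: Callable[[], float] = random.random) -> str:
    """Walk the hierarchy in order; return the first endpoint that admits the request."""
    for model in hierarchy:
        state = states.get(model, "closed")
        if state == "closed":
            return model
        if state == "partial_open" and rng() < ramp_fraction.get(model, 0.0):
            return model  # this request is part of the recovery trickle
        # "open", or not admitted by the ramp: fall through to the next tier
    raise RuntimeError("no model in the hierarchy is accepting traffic")


hierarchy = ["primary", "backup_1", "backup_2"]

# Primary is recovering at 10%; a draw of 0.5 is not admitted, so backup_1 serves.
choice_a = pick_model(hierarchy, {"primary": "partial_open"}, {"primary": 0.1},
                      rng=lambda: 0.5)

# Primary and backup_1 are both open, so the request lands on backup_2.
choice_b = pick_model(hierarchy, {"primary": "open", "backup_1": "open"}, {})
```

Because each endpoint's state is looked up independently, a backup that trips its own breaker is skipped in the same pass as a degraded primary.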
What "regression" means (Slack disclosed)¶
The 2026-05-28 source enumerates three named regression signals:
- Elevated TTFT — "elevated time to first token latencies".
- Throttling errors — provider rate-limit / quota exhaustion.
- Downward trend in customer feedback — Slack signals it treats user feedback as a first-class soft-failure signal, not just hard-fail metrics.
The third is the unusual one: under Slack's "redefining the meaning of 'Failure'" framing, soft failures (p90 spikes, feedback trends) count as breaker triggers alongside hard errors.
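A health evaluator that reports which of the three named signals fired could look like the sketch below. Slack does not disclose its thresholds, so every limit here is an invented placeholder, and the `Window` fields (including how the feedback trend is measured) are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Window:
    ttft_p90_ms: float      # p90 time to first token over the window
    throttle_rate: float    # fraction of requests rejected as throttled
    feedback_trend: float   # assumed: change in positive-feedback ratio vs. baseline


def regressions(w: Window, *,
                ttft_limit_ms: float = 1500.0,   # placeholder threshold
                throttle_limit: float = 0.02,    # placeholder threshold
                feedback_drop: float = -0.10     # placeholder threshold
                ) -> list[str]:
    """Return the names of the regression signals that fired, in a fixed order."""
    fired = []
    if w.ttft_p90_ms > ttft_limit_ms:
        fired.append("elevated_ttft")
    if w.throttle_rate > throttle_limit:
        fired.append("throttling")
    if w.feedback_trend < feedback_drop:
        fired.append("feedback_downtrend")
    return fired


healthy = regressions(Window(ttft_p90_ms=800.0, throttle_rate=0.0, feedback_trend=0.01))
soft_only = regressions(Window(ttft_p90_ms=900.0, throttle_rate=0.0, feedback_trend=-0.25))
```

Returning the list of fired signals, not a single boolean, lets the caller treat a feedback-only regression differently from a hard latency or throttling breach.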
Composition with other patterns¶
- patterns/circuit-breaker (classical) — direct refinement; classical pattern protects one dependency, this pattern composes the breaker with a hierarchy of alternatives.
- patterns/multi-cloud-llm-serving — composes outside; the fallback hierarchy can include cross-cloud backups.
- patterns/api-normalization-layer-cross-provider — composes alongside; the breaker depends on unified health/error vocabulary.
- patterns/provisioned-throughput-with-on-demand-spillover — composes orthogonally; spillover handles capacity-ceiling events while fallback handles degradation events.
Trade-offs¶
| Alternative | What the alternative offers | What the alternative gives up |
|---|---|---|
| No fallback | No per-feature fallback configuration; no fallback paths to test | Provider degradation events reach users directly |
| Single backup, no breaker | Simpler failover | No automated recovery; manual intervention to restore primary |
| Manual operator failover | Full human judgment | Slow; on-call burden; depends on detection latency |
| Multi-cloud LLM serving | Provider redundancy beyond model redundancy | More operational complexity; multi-cloud composes around this pattern |
Trade-offs of partial-open recovery¶
Partial-open ramp prevents thundering-herd-on-recovery (concepts/thundering-herd) at the cost of a longer recovery duration. Slack's framing verbatim:
"This ensures a graceful recovery without overwhelming a stabilizing service."
Risks and mitigations¶
- Backup model quality regression — the system fails over but produces lower-quality output. Mitigation: per-feature primary + backup chosen with quality parity in mind; backup is "high-quality", not "any working".
- Cascading failover — primary degrades, backup gets flooded, backup degrades too. Mitigation: the breaker on each endpoint is independent; flooded backup trips its own breaker; routing falls to the next-tier backup.
- Stale health data — breaker decisions based on outdated signals. Mitigation: high-frequency health signal updates; short evaluation windows.
- Customer-feedback trend is a slow signal — the "downward trend in customer feedback" takes minutes-to-hours to detect. Mitigation: composed with TTFT/p90 fast signals; feedback drives long-tail re-evaluation, not real-time rerouting.
Risks and mitigations specific to partial-open¶
- Ramp too fast → re-trips on recovering endpoint. Mitigation: conservative ramp curve; require multiple consecutive sustained-health windows.
- Ramp too slow → wastes capacity; needlessly degraded user experience. Mitigation: exponential ramp with health-based gating.
- Stuck in partial-open → endpoint never recovers fully. Mitigation: timeout that escalates to OPEN if ramp doesn't reach CLOSED in N minutes.
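The three mitigations above can be combined in one ramp controller, sketched here under stated assumptions: the `PartialOpenRamp` class and all its parameters are hypothetical, and the "N minutes" timeout is approximated by a cap on the number of evaluation windows.

```python
class PartialOpenRamp:
    def __init__(self, start: float = 0.05, factor: float = 2.0,
                 windows_per_step: int = 3, max_windows: int = 40):
        # All defaults are illustrative; the source discloses no ramp parameters.
        self.fraction = start                  # share of traffic sent to the endpoint
        self.factor = factor                   # exponential growth per step
        self.windows_per_step = windows_per_step
        self.max_windows = max_windows         # stand-in for the "N minutes" timeout
        self.healthy_streak = 0
        self.windows_seen = 0

    def on_window(self, healthy: bool) -> str:
        """Consume one evaluation window; return 'partial_open', 'closed', or 'open'."""
        self.windows_seen += 1
        if not healthy or self.windows_seen > self.max_windows:
            return "open"  # re-trip, or stuck too long in partial-open
        self.healthy_streak += 1
        if self.healthy_streak >= self.windows_per_step:
            # Health was sustained for a full step: expand the traffic share.
            self.healthy_streak = 0
            self.fraction = min(1.0, self.fraction * self.factor)
            if self.fraction >= 1.0:
                return "closed"
        return "partial_open"


ramp = PartialOpenRamp(start=0.25, factor=2.0, windows_per_step=2, max_windows=10)
outcomes = [ramp.on_window(True) for _ in range(4)]  # 0.25 -> 0.5 -> 1.0
```

Requiring several consecutive healthy windows per step guards against ramping too fast, the exponential factor keeps the ramp from dragging, and the window cap covers the stuck case.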
Seen in¶
- sources/2026-05-28-slack-slack-ai-the-path-to-multi-cloud — canonical wiki disclosure of model fallback hierarchy with circuit breaker as Slack's Phase 3+ resilience pattern, generalised at Phase 4 to cross-cloud fallback. Verbatim TTFT + throttling + customer feedback regression triggers; partial-open recovery state with dynamic ramp expansion framing.
Related¶
- concepts/automated-circuit-breaker-with-partial-open-state
- concepts/model-to-feature-binding
- concepts/concentration-risk-single-cloud-llm
- concepts/multi-cloud-llm-serving
- systems/slack-intelligent-routing-layer
- systems/slack-ai
- patterns/circuit-breaker
- patterns/multi-cloud-llm-serving
- patterns/api-normalization-layer-cross-provider