PATTERN Cited by 1 source

Model fallback hierarchy with circuit breaker

Pattern

For every LLM-powered feature, designate a primary model and one or more backup models in a defined fallback hierarchy. Combine with an automated circuit breaker that monitors endpoint-level health signals (TTFT, p90 latency, 5xx error rate) in real time and reroutes traffic to the next model in the hierarchy when the primary degrades — with a partial-open recovery state that gradually ramps traffic back to the recovering endpoint.

The canonical wiki implementation: the fallback + circuit breaker subsystems of Slack's Intelligent Routing Layer (Source: sources/2026-05-28-slack-slack-ai-the-path-to-multi-cloud):

"We developed a model hierarchy for every AI feature, allowing our system to automatically fall back to different models if the primary model reached a degraded state. Some examples of regressions are elevated time to first token latencies, throttling errors, and downward trend in customer feedback. […] If a specific model was underperforming or hitting limits in one region, the platform would reroute the request in real-time to another healthy endpoint. From the customer's perspective, the experience remained seamless; they continued to receive high-quality results without ever knowing a complex failover had occurred behind the scenes."

When to use it

  • LLM-powered features at production scale — millions of user-visible requests where any single model degradation has customer impact.
  • Multiple comparable-quality models available — primary + backup pairs require enough catalogue depth that the backup is not catastrophically worse than the primary.
  • Real-time health signals available — the breaker requires per-endpoint TTFT / p90 / error rate streams.
  • Provider-side failures are non-zero — even reliable providers have model-level degradations, throttling spikes, and capacity exhaustion events.

When NOT to use it

  • One-model deployments — the pattern is meaningless without designated backups.
  • Very low-volume features — the breaker's signal-to-noise ratio is poor at low request counts.
  • Latency budgets too tight for routing-layer overhead — most LLM workloads have headroom; some real-time inference doesn't.

Three structural pieces

Per-feature configuration:
  feature: "ai_search"
  models:
    - { provider: A, sku: high-reasoning-v3, role: primary }
    - { provider: B, sku: high-reasoning-v2, role: backup_1 }
    - { provider: A, sku: high-reasoning-v2, role: backup_2 }
  health_thresholds:
    ttft_p90_ms: <undisclosed>
    p90_latency_ms: <undisclosed>
    error_rate_pct: <undisclosed>

           ┌─────────────────────┐
   Request │ Routing Layer       │
   ────────│   1. Pick model     │
           │      from hierarchy │
           │   2. Check breaker  │
           │      state          │
           │   3. Forward / fall │
           │      to next        │
           └─────────────────────┘
        ┌───────────┼───────────┐
        ▼           ▼           ▼
  ┌──────────┐ ┌──────────┐ ┌──────────┐
  │ Primary  │ │ Backup 1 │ │ Backup 2 │
  │ CLOSED   │ │ CLOSED   │ │ CLOSED   │
  └──────────┘ └──────────┘ └──────────┘
       ├─[degrades: TTFT↑, p90↑, 5xx↑]
  ┌──────────┐
  │ Primary  │  reroute to Backup 1
  │ OPEN     │
  └──────────┘
       ├─[cooldown]
  ┌──────────┐
  │ Primary  │  trickle of requests
  │ PARTIAL- │  starts; expands as
  │ OPEN     │  health is sustained
  └──────────┘
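
The state transitions in the diagram can be sketched as a small per-endpoint state machine. This is a minimal illustration, not Slack's implementation: the `Breaker` class, the `cooldown_s` and `healthy_windows_to_close` parameters, and their defaults are all hypothetical, and the clock is injected only so the behaviour is easy to exercise.

```python
import enum
import time
from typing import Callable


class State(enum.Enum):
    CLOSED = "closed"              # all traffic allowed
    OPEN = "open"                  # no traffic; requests fall to the next model
    PARTIAL_OPEN = "partial_open"  # a trickle of traffic probes recovery


class Breaker:
    """Three-state breaker for a single endpoint (hypothetical sketch)."""

    def __init__(self, cooldown_s: float = 30.0, healthy_windows_to_close: int = 3,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.cooldown_s = cooldown_s                              # assumed value
        self.healthy_windows_to_close = healthy_windows_to_close  # assumed value
        self.clock = clock
        self.state = State.CLOSED
        self.opened_at = 0.0
        self.healthy_streak = 0

    def _trip(self) -> None:
        self.state = State.OPEN
        self.opened_at = self.clock()
        self.healthy_streak = 0

    def current_state(self) -> State:
        """Return the state, moving OPEN to PARTIAL_OPEN once the cooldown ends."""
        if self.state is State.OPEN and self.clock() - self.opened_at >= self.cooldown_s:
            self.state = State.PARTIAL_OPEN
            self.healthy_streak = 0
        return self.state

    def record_window(self, healthy: bool) -> None:
        """Feed the verdict of one health-evaluation window for this endpoint."""
        state = self.current_state()
        if state is State.OPEN:
            return  # no traffic is flowing, so there is nothing to evaluate
        if not healthy:
            self._trip()  # CLOSED or PARTIAL_OPEN both fall back to OPEN
            return
        if state is State.PARTIAL_OPEN:
            self.healthy_streak += 1
            if self.healthy_streak >= self.healthy_windows_to_close:
                self.state = State.CLOSED
```

A degraded window in PARTIAL_OPEN sends the endpoint straight back to OPEN and restarts the cooldown, which is one plausible reading of "expands as health is sustained"; the source does not spell out the exact rule.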

The pattern requires:

  1. Per-feature primary + backup designation — every feature has a defined fallback hierarchy in configuration, not in code.
  2. Real-time health monitoring at endpoint level — TTFT, p90 latency, 5xx error rate streams updated continuously.
  3. Automated circuit breaker with partial-open ramp — see concepts/automated-circuit-breaker-with-partial-open-state.
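
As a rough sketch of how the first requirement feeds routing, the snippet below walks a feature's hierarchy in order and returns the first endpoint whose breaker allows traffic. The `ModelEndpoint` type, the `HIERARCHIES` dict, and `pick_endpoint` are hypothetical names; the dict stands in for configuration that a real system would load from outside the code.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ModelEndpoint:
    provider: str
    sku: str
    role: str  # "primary", "backup_1", ...


# Stand-in for loaded per-feature configuration; list order is the fallback order.
HIERARCHIES: dict[str, list[ModelEndpoint]] = {
    "ai_search": [
        ModelEndpoint("A", "high-reasoning-v3", "primary"),
        ModelEndpoint("B", "high-reasoning-v2", "backup_1"),
        ModelEndpoint("A", "high-reasoning-v2", "backup_2"),
    ],
}


def pick_endpoint(feature: str,
                  allows_traffic: Callable[[ModelEndpoint], bool]) -> Optional[ModelEndpoint]:
    """Return the first endpoint in the hierarchy whose breaker allows traffic."""
    for endpoint in HIERARCHIES[feature]:
        if allows_traffic(endpoint):
            return endpoint
    return None  # every endpoint is tripped; the caller decides how to degrade


# Example: the primary's breaker is open, so the request goes to backup_1.
chosen = pick_endpoint("ai_search", lambda e: e.role != "primary")
```

Passing the breaker check in as a callable keeps the hierarchy lookup independent of how health is measured.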

What "regression" means (Slack disclosed)

The 2026-05-28 source enumerates three named regression signals:

  • Elevated TTFT: "elevated time to first token latencies".
  • Throttling errors — provider rate-limit / quota exhaustion.
  • Downward trend in customer feedback — Slack signals it treats user feedback as a first-class soft-failure signal, not just hard-fail metrics.

The third is novel — Slack's verbatim "redefining the meaning of 'Failure'" framing canonicalises soft failures (p90 spikes, feedback trends) as breaker triggers alongside hard errors.
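
The two fast signals can be evaluated per window roughly as follows. This is a sketch under stated assumptions: the threshold values are invented for illustration (the source does not disclose Slack's), throttling is assumed to surface as HTTP 429, and the `Sample`, `Thresholds`, and `regression_signals` names are hypothetical. The customer-feedback trend is left out because it is a slow signal that needs a longer horizon than one window.

```python
import statistics
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    ttft_ms: float  # time to first token for one request
    status: int     # HTTP status returned by the provider


@dataclass(frozen=True)
class Thresholds:
    # Illustrative values only; the real thresholds are undisclosed.
    ttft_p90_ms: float = 2000.0
    throttle_rate: float = 0.05
    min_samples: int = 20


def p90(values: Sequence[float]) -> float:
    """90th percentile: quantiles(n=10) yields nine cut points, the last is p90."""
    return statistics.quantiles(values, n=10)[-1]


def regression_signals(window: Sequence[Sample],
                       limits: Thresholds = Thresholds()) -> list[str]:
    """Name the fast regression signals present in one evaluation window."""
    if len(window) < limits.min_samples:
        return []  # too few requests to judge; avoids tripping on noise
    signals = []
    if p90([s.ttft_ms for s in window]) > limits.ttft_p90_ms:
        signals.append("elevated_ttft")
    throttled = sum(1 for s in window if s.status == 429)  # assumed throttling code
    if throttled / len(window) > limits.throttle_rate:
        signals.append("throttling")
    return signals
```

The `min_samples` guard reflects the low-volume caveat: with only a handful of requests in a window, a single slow response would dominate the percentile.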

Composition with other patterns

Trade-offs

| Compared to… | Wins | Loses |
| --- | --- | --- |
| No fallback | Provider degradation events become invisible to users | Configuration complexity per feature; testing all fallback paths |
| Single backup, no breaker | Simpler failover | No automated recovery; manual intervention to restore primary |
| Manual operator failover | Full human judgment | Slow; on-call burden; depends on detection latency |
| Multi-cloud LLM serving | Provider redundancy beyond model redundancy | More operational complexity; multi-cloud composes around this pattern |

Trade-offs of partial-open recovery

Partial-open ramp prevents thundering-herd-on-recovery (concepts/thundering-herd) at the cost of a longer recovery duration. Slack's framing verbatim:

"This ensures a graceful recovery without overwhelming a stabilizing service."

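One simple way to send only a trickle of requests to a recovering endpoint is a deterministic admission gate. The sketch below is an assumption about mechanism, not something the source describes: `TrickleGate` and its `percent` parameter are hypothetical, and integer credits are used so the admitted share is exact rather than subject to floating-point drift.

```python
class TrickleGate:
    """Admit a fixed percentage of requests to a recovering endpoint (hypothetical)."""

    def __init__(self, percent: int) -> None:
        if not 0 <= percent <= 100:
            raise ValueError("percent must be within [0, 100]")
        self.percent = percent
        self._credit = 0

    def admit(self) -> bool:
        """Return True if this request should go to the recovering endpoint."""
        self._credit += self.percent
        if self._credit >= 100:
            self._credit -= 100
            return True
        return False  # keep this request on the current fallback


# Example: at 10 percent, exactly 10 of every 100 requests probe the endpoint.
gate = TrickleGate(10)
admitted = sum(gate.admit() for _ in range(100))
```

Raising `percent` step by step as health holds gives the gradual expansion; the remaining requests stay on the fallback, so the stabilizing service never sees the full load at once.
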
Risks and mitigations

  • Backup model quality regression — the system fails over but produces lower-quality output. Mitigation: per-feature primary + backup chosen with quality parity in mind; backup is "high-quality", not "any working".
  • Cascading failover — primary degrades, backup gets flooded, backup degrades too. Mitigation: the breaker on each endpoint is independent; flooded backup trips its own breaker; routing falls to the next-tier backup.
  • Stale health data — breaker decisions based on outdated signals. Mitigation: high-frequency health signal updates; short evaluation windows.
  • Customer-feedback trend is a slow signal — the "downward trend in customer feedback" takes minutes-to-hours to detect. Mitigation: composed with TTFT/p90 fast signals; feedback drives long-tail re-evaluation, not real-time rerouting.
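
The cascading-failover and stale-data mitigations can be combined in one routing check, sketched below. Each endpoint is judged only on its own snapshot, and a snapshot older than `max_age_s` is not trusted. The names `HealthSnapshot`, `usable`, and `first_usable`, the 5-second default, and the choice to treat stale or missing data as unusable are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HealthSnapshot:
    healthy: bool
    observed_at: float  # seconds, on the same clock as `now`


def usable(snapshot: Optional[HealthSnapshot], now: float, max_age_s: float = 5.0) -> bool:
    """True only when the endpoint has a fresh snapshot that reports healthy."""
    if snapshot is None:
        return False
    if now - snapshot.observed_at > max_age_s:
        return False  # stale data: do not trust an old "healthy" reading
    return snapshot.healthy


def first_usable(order: list[str], snapshots: dict[str, HealthSnapshot],
                 now: float) -> Optional[str]:
    """Walk the hierarchy; each endpoint is judged only on its own snapshot."""
    for name in order:
        if usable(snapshots.get(name), now):
            return name
    return None


# Example: primary is degraded and backup_1's data is stale, so backup_2 is chosen.
snapshots = {
    "primary": HealthSnapshot(healthy=False, observed_at=99.0),
    "backup_1": HealthSnapshot(healthy=True, observed_at=50.0),
    "backup_2": HealthSnapshot(healthy=True, observed_at=98.0),
}
choice = first_usable(["primary", "backup_1", "backup_2"], snapshots, now=100.0)
```

Treating stale data as unusable is the cautious choice, but it means an outage of the monitoring pipeline itself would mark every endpoint unusable, so a real system would need a separate policy for that case.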

Risks and mitigations specific to partial-open

  • Ramp too fast → re-trips on recovering endpoint. Mitigation: conservative ramp curve; require multiple consecutive sustained-health windows.
  • Ramp too slow → wastes capacity; needlessly degraded user experience. Mitigation: exponential ramp with health-based gating.
  • Stuck in partial-open → endpoint never recovers fully. Mitigation: timeout that escalates to OPEN if ramp doesn't reach CLOSED in N minutes.
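
The three mitigations above fit into one small ramp controller, sketched here under assumptions: the traffic share doubles only after a run of consecutive healthy windows, any degraded window re-trips to OPEN, and a cap on the number of windows stands in for the "N minutes" timeout. `RampController` and all of its parameters and defaults are hypothetical.

```python
class RampController:
    """Traffic ramp for an endpoint in PARTIAL-OPEN (hypothetical sketch)."""

    def __init__(self, start_percent: int = 1, windows_per_step: int = 2,
                 max_windows: int = 30) -> None:
        self.percent = start_percent              # share of traffic currently admitted
        self.windows_per_step = windows_per_step  # healthy windows required per step
        self.max_windows = max_windows            # stand-in for the N-minute timeout
        self.windows_seen = 0
        self.healthy_streak = 0

    def on_window(self, healthy: bool) -> str:
        """Consume one health window; return 'OPEN', 'PARTIAL_OPEN', or 'CLOSED'."""
        self.windows_seen += 1
        if not healthy:
            return "OPEN"  # re-trip: the endpoint is not ready for this load
        self.healthy_streak += 1
        if self.healthy_streak >= self.windows_per_step:
            self.healthy_streak = 0
            self.percent = min(100, self.percent * 2)  # exponential step, health-gated
        if self.percent >= 100:
            return "CLOSED"
        if self.windows_seen >= self.max_windows:
            return "OPEN"  # stuck in partial-open: escalate instead of lingering
        return "PARTIAL_OPEN"
```

A larger `windows_per_step` makes the ramp more conservative, while a larger `start_percent` shortens recovery; the right balance depends on how quickly the endpoint's health signals settle.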

Seen in

  • sources/2026-05-28-slack-slack-ai-the-path-to-multi-cloud — canonical wiki disclosure of model fallback hierarchy with circuit breaker as Slack's Phase 3+ resilience pattern, generalised at Phase 4 to cross-cloud fallback. Verbatim TTFT + throttling + customer feedback regression triggers; partial-open recovery state with dynamic ramp expansion framing.