
PATTERN Cited by 1 source

Automatic provider failover

Definition

Automatic provider failover is the pattern of routing an LLM / inference call to a second provider when the first provider becomes unavailable or returns a failure, without any application-level retry code, exploiting the property that the same model is often available on multiple providers. The failover decision, health tracking, and retry are owned by an intermediary (the AI gateway), not by the application.

The problem it solves

Traditional application-level retry logic for LLM calls:

try {
  return await anthropic.messages.create({...});
} catch (err) {
  // per-provider error classification
  if (isAnthropicOutage(err)) {
    // re-shape the same logical call into OpenAI's API format and retry
    return await openai.chat.completions.create({...translated(args)});
  }
  throw err;
}

This code:

  • has to duplicate per-provider SDK imports,
  • has to translate arguments between provider API shapes (Anthropic tool-calling format ≠ OpenAI tool-calling format ≠ Gemini tool-calling format),
  • has to classify errors per provider,
  • has to maintain per-provider rate-limit / circuit-breaker state,
  • is present in every call site,
  • compounds with the other concerns (auth, logging, spend attribution) unless the application is carefully architected.

For agent workloads — where a single user task may chain 10+ inference calls — this retry complexity sits inside every step and any missed case becomes a cascade of downstream failures. The 2026-04-16 Cloudflare post names this cascade explicitly as the motivation: "An agent might chain ten calls together to complete a single task ... One failed request isn't a retry, but suddenly a cascade of downstream failures."

Mechanism

  • The application calls a single gateway endpoint (via a unified binding or a gateway URL).
  • The gateway maintains a health view of each upstream provider.
  • On failure from provider A, the gateway translates the request to provider B's API shape (when the same model is available there), retries, and streams the response back to the caller.
  • The caller sees a single response; the failover event is a gateway log line, not an application error.
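The mechanism above can be sketched as a single gateway-side loop. This is an illustration, not any real gateway's implementation: the `provider` objects (with `translate`/`send`), the `health` view, and the eligibility rules are all assumed names standing in for the gateway's internals.

```javascript
// Only some errors justify trying another provider; a 400 (bad request)
// would fail identically everywhere, while a 503 or timeout likely won't.
function isFailoverEligible(err) {
  return err.status === 503 || err.status === 429 || err.code === "ETIMEDOUT";
}

// Walk the providers that surface the same model, skipping known-degraded
// ones, translating the logical request into each provider's API shape.
async function callWithFailover(request, providers, health) {
  let lastErr = new Error("no available provider for this model");
  for (const provider of providers) {
    if (!health.isAvailable(provider.name)) continue; // skip degraded upstreams
    try {
      // The gateway owns the per-provider request-shape mapping.
      return await provider.send(provider.translate(request));
    } catch (err) {
      health.recordFailure(provider.name, err);
      if (!isFailoverEligible(err)) throw err; // non-retryable: surface it
      lastErr = err; // retryable: fall through to the next provider
    }
  }
  throw lastErr;
}
```

From the caller's side this is one request and one response; the loop's internal hops surface only as gateway log lines.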

The 2026-04-16 post's framing:

"Through AI Gateway, if you're calling a model that's available on multiple providers and one provider goes down, we'll automatically route to another available provider without you having to write any failover logic of your own."

This is explicitly framed as an agent-motivated feature: "Every step in an agent workflow depends on the steps before it. Reliable inference is crucial for agents because one call failing can affect the entire downstream chain."

Prerequisites

  1. A unified catalog with the same model surfaced from multiple providers (concepts/unified-model-catalog). Failover is only meaningful when "equivalent" is well-defined — identical model, close-enough model, or a chain of models of descending quality.
  2. Per-provider health view. The gateway has to know which providers are degraded, at what latency, at what error rate — continuously.
  3. Argument translation. The gateway owns the per-provider request-shape mapping (prompt format, tool-calling schema, stop-sequences, streaming protocol) so it can replay the same logical call against a different provider without asking the caller to re-format.
  4. Error-classification policy. Not every provider error should trigger failover — a 400 (bad request) on provider A will also fail on provider B, but a 503 (overloaded) or a timeout probably won't. The gateway enforces the classification so the application doesn't.
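Prerequisite 2, the per-provider health view, can be sketched as a sliding-window error-rate tracker. The window length and threshold below are illustrative assumptions; a production health view would also weigh latency and recency, as the prerequisite notes.

```javascript
// Sliding-window health view: a provider is "available" while its recent
// error rate stays under a threshold. Parameters are illustrative.
class ProviderHealth {
  constructor(windowMs = 60_000, maxErrorRate = 0.5) {
    this.windowMs = windowMs;
    this.maxErrorRate = maxErrorRate;
    this.samples = new Map(); // provider name -> [{ ts, ok }]
  }
  record(provider, ok, now = Date.now()) {
    const list = this.samples.get(provider) ?? [];
    list.push({ ts: now, ok });
    // Drop samples that have aged out of the window.
    this.samples.set(provider, list.filter(s => now - s.ts < this.windowMs));
  }
  isAvailable(provider, now = Date.now()) {
    const list = (this.samples.get(provider) ?? [])
      .filter(s => now - s.ts < this.windowMs);
    if (list.length === 0) return true; // no recent data: assume healthy
    const errors = list.filter(s => !s.ok).length;
    return errors / list.length < this.maxErrorRate;
  }
}
```

Because the view decays with the window, a provider that was degraded ten minutes ago is automatically eligible again without explicit reset logic.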

Trade-offs

  • Model equivalence is approximate. The same weights can behave differently per provider due to quantisation differences, system-prompt defaults, or tokeniser subtleties. An automatic failover quietly swaps one for the other — callers may see response-quality drift without noticing.
  • Feature asymmetry. Anthropic's cache_control, OpenAI's json_schema, Gemini's safety_settings — features available on provider A may not exist on provider B. Failover has to decide whether to strip those features or return a failover-not-applicable error (which the application then has to handle — partially re-introducing the problem the pattern was meant to solve).
  • Double-billing risk. If the first provider actually returned tokens before failing, naive retry pays twice. The pattern composes well with buffered-resumable-inference-stream to manage this — the gateway owns the stream, so partial successes can be quarantined.
  • Observability trap. The application sees no failures, so it doesn't alert on provider-outage frequency — the gateway's dashboard is now the only place an operator can see provider-reliability drift. Teams that over-trust the gateway can miss systemic upstream degradation.
  • Failover latency amplification. A failover round-trip adds latency (detection budget + retry). For interactive agent use cases where TTFT is the feel metric (concepts/time-to-first-token), an aggressive failover budget trades some TTFT for resilience. The 2026-04-16 post doesn't disclose the budget.

Seen in

  • sources/2026-04-16-cloudflare-ai-platform-an-inference-layer-designed-for-agents — canonical wiki instance. AI Gateway automatically routes to an available provider when the configured one fails, with the agent-chain-amplification argument as the motivation. Failover is framed as the gateway owning application-level retry logic; related Cloudflare AI Gateway fallbacks feature (fallbacks: [provider1, provider2, ...] configuration) is the configured-order mechanism the automatic-failover pattern extends when multiple providers share a model.

  • sources/2026-04-20-cloudflare-orchestrating-ai-code-review-at-scale — CI-embedded instance with a Hystrix-style circuit breaker plus per-family failback chains. Cloudflare's AI Code Review tracks each model tier's health independently (three states: CLOSED / OPEN / HALF_OPEN; 2-min cooldown on OPEN; one probe request on HALF_OPEN to prevent stampedes). Failback walks a same-family chain (opus-4-7 → opus-4-6; sonnet-4-6 → sonnet-4-5) and never crosses families. An error classifier decides shouldFailback: retryable APIError (429, 503) → yes; ProviderAuthError / ContextOverflowError / MessageAbortedError / structured-output errors → no (a different model won't fix them). A coordinator-level second failback tier scans child-process stderr for "overloaded" / "503", hot-swaps opencode.json's review_coordinator.model on disk, and restarts. Sibling to the gateway-level failover, but operating at child-process granularity.
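The three-state breaker described in this instance can be sketched as follows. The states, the 2-minute cooldown, and the single-probe rule come from the source; the failure threshold and method names are illustrative assumptions.

```javascript
// Hystrix-style breaker: CLOSED (normal) -> OPEN (reject for a cooldown)
// -> HALF_OPEN (one probe at a time, preventing stampedes).
class CircuitBreaker {
  constructor({ failureThreshold = 3, cooldownMs = 120_000 } = {}) {
    this.state = "CLOSED";
    this.failures = 0;
    this.openedAt = 0;
    this.probeInFlight = false;
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs; // 2-min cooldown, per the source
  }
  allowRequest(now = Date.now()) {
    if (this.state === "CLOSED") return true;
    if (this.state === "OPEN") {
      if (now - this.openedAt < this.cooldownMs) return false;
      this.state = "HALF_OPEN"; // cooldown elapsed: permit a probe
    }
    // HALF_OPEN: exactly one probe request at a time.
    if (this.probeInFlight) return false;
    this.probeInFlight = true;
    return true;
  }
  onSuccess() {
    this.state = "CLOSED";
    this.failures = 0;
    this.probeInFlight = false;
  }
  onFailure(now = Date.now()) {
    this.probeInFlight = false;
    this.failures += 1;
    if (this.state === "HALF_OPEN" || this.failures >= this.failureThreshold) {
      this.state = "OPEN"; // re-open (or open) and restart the cooldown
      this.openedAt = now;
      this.failures = 0;
    }
  }
}
```

One breaker per model tier, combined with the error classifier above, reproduces the per-tier health tracking this instance describes: the classifier decides whether a failure should count against the tier at all, and the breaker decides when the tier stops receiving traffic.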

Contrast with sibling patterns
