Cloudflare AI Gateway

AI Gateway is Cloudflare's LLM proxy tier that sits between an application and any of the supported AI providers (Anthropic, OpenAI, Google, Workers AI, and others). It gives developers centralised visibility, per-request logging, cost tracking, rate limiting, and — configured as a drop-in — the ability to change the LLM model or provider without redeploying application code (see patterns/ai-gateway-provider-abstraction).

Core capabilities

  • Unified catalog + unified binding (2026-04-16). All models — first-party Workers AI @cf/…, 70+ third-party models across 12+ providers (Anthropic, OpenAI, Google, Alibaba Cloud, AssemblyAI, Bytedance, InWorld, MiniMax, Pixverse, Recraft, Runway, Vidu), and BYO Cog containers — are callable through one env.AI.run(model_string, ...) binding. Provider selector lives inside the model string; switching providers is a one-line edit. REST API for non-Workers callers promised in the weeks after launch (concepts/unified-model-catalog, patterns/unified-inference-binding).
  • Multimodal, not just text. Image, video, and speech models surface alongside text LLMs in the same catalog.
  • Provider abstraction. Application points at the gateway endpoint (e.g. ANTHROPIC_BASE_URL=<gateway>); swapping models/providers happens in gateway config, not application deploys.
  • Bring-Your-Own-Key (BYOK). Instead of shipping provider secrets with every request, Secrets Store integration lets Cloudflare inject the key server-side on behalf of the caller (concepts/byok-bring-your-own-key).
  • Unified Billing. Alternative to BYOK: top up a Cloudflare account with credits and Cloudflare pays providers directly, deducting from the credit balance — no provider-secrets management at all. Per-request custom metadata (metadata: { teamId, userId, ... }) enables spend attribution by user / tenant / workflow.
  • Automatic provider failover (2026-04-16). When a model is available on multiple providers and one goes down, the gateway silently routes to another — the application never writes retry-on-outage logic (patterns/automatic-provider-failover). Extends the configured fallback chain feature into an automatic cross-provider retry primitive.
  • Buffered resumable streaming (2026-04-16). Streaming inference responses are buffered gateway-side, independently of the caller's lifetime. If an Agents SDK agent is interrupted mid-inference, it reconnects and retrieves the buffered response — no re-inference, no double-billing. Paired with Agents SDK checkpointing, "the end user never notices" a mid-turn crash (concepts/resilient-inference-stream, patterns/buffered-resumable-inference-stream).
  • BYO-model via Cog containers (2026-04-16). Customers build Replicate Cog containers (cog.yaml + predict.py:Predictor) and push them to Workers AI; the gateway surfaces them alongside first-party and third-party models in the same catalog. Currently Enterprise + design-partner access (patterns/byo-model-via-container).
  • Colo-with-inference latency. Cloudflare's 330-city network means the gateway sits close to both users and inference endpoints; for @cf/… models the entire call path stays inside Cloudflare (no public-Internet hop), which is the load-bearing first-token-latency argument for agent workloads (concepts/time-to-first-token).
  • Provider / model fallbacks. Ordered list of providers/models tried on failure; the application's retry logic becomes a gateway config change.
  • Observability. Request/response logs, spend by model, latency distribution, error rates — the same view across heterogeneous upstream providers.
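The unified-binding capability above can be sketched as follows. This is a hedged illustration, not the exact Workers AI SDK types: the `Ai` interface and the mock binding stand in for the real `env.AI` the Workers runtime injects, and the third-party model string is a placeholder. What it shows is the shape of the claim: the provider selector lives inside the model string, so switching providers is a one-string edit at an unchanged call site.

```typescript
// Illustrative interface for the unified binding; the real env.AI typing
// comes from the Workers runtime and is richer than this.
interface Ai {
  run(model: string, input: { prompt: string }): Promise<{ response: string }>;
}

// Stand-in for env.AI so the sketch is self-contained outside a Worker.
const mockAI: Ai = {
  async run(model, input) {
    return { response: `[${model}] echo: ${input.prompt}` };
  },
};

// The application's single call site: model choice is just data.
async function ask(ai: Ai, model: string, prompt: string): Promise<string> {
  const { response } = await ai.run(model, { prompt });
  return response;
}

// Same call site, different providers — only the model string changes:
//   ask(env.AI, "@cf/meta/llama-3.1-8b-instruct", "hi")  // first-party Workers AI
//   ask(env.AI, "anthropic/<model-id>", "hi")            // third-party, id illustrative
```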
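The provider-abstraction bullet amounts to pointing the application's base URL at the gateway. A minimal sketch, assuming AI Gateway's documented `https://gateway.ai.cloudflare.com/v1/{account_id}/{gateway}/{provider}` URL layout; the account id and gateway name below are placeholders:

```typescript
// Build the per-provider gateway endpoint the application points at
// (e.g. via ANTHROPIC_BASE_URL) instead of the provider's own host.
function gatewayBaseUrl(accountId: string, gateway: string, provider: string): string {
  return `https://gateway.ai.cloudflare.com/v1/${accountId}/${gateway}/${provider}`;
}

// gatewayBaseUrl("abc123", "my-gateway", "anthropic")
// → the value to put in ANTHROPIC_BASE_URL; swapping providers or models
//   then happens in gateway config, with no application redeploy.
```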

Seen in

  • sources/2026-04-16-cloudflare-ai-platform-an-inference-layer-designed-for-agents — canonical unified-catalog launch. Same env.AI.run(...) binding previously scoped to Workers AI @cf/… models now calls any of 70+ models across 12+ providers — one line of code to switch, one set of credits to pay. Post framing: "Today, we're making Cloudflare into a unified inference layer: one API to access any AI model from any provider, built to be fast and reliable." Same post introduces automatic provider failover (gateway silently routes to a second provider when the first goes down — patterns/automatic-provider-failover), buffered resumable streaming (stream survives agent disconnects — concepts/resilient-inference-stream, patterns/buffered-resumable-inference-stream), and BYO-model via Cog containers (patterns/byo-model-via-container) pushed to Workers AI. Strategic context: Replicate team joined the Cloudflare AI Platform team, bringing multimodal-model-hosting DNA (image, video, speech) that explains the catalog expansion from text-LLM-dominated to multimodal. Named new providers: Alibaba Cloud, AssemblyAI, Bytedance, Google, InWorld, MiniMax, OpenAI, Pixverse, Recraft, Runway, Vidu.
  • sources/2026-01-29-cloudflare-moltworker-self-hosted-ai-agent — canonical example of the zero-code-change provider swap. Porting Moltbot to run on Workers required only setting ANTHROPIC_BASE_URL to the AI Gateway endpoint; Moltbot's code was unchanged, and the upgrade path to BYOK or Unified Billing is a gateway-config operation.
  • sources/2026-04-20-cloudflare-internal-ai-engineering-stack — AI Gateway as Cloudflare's internal platform-layer choke point: every LLM call from internal tooling (OpenCode, Windsurf, AI Code Reviewer, Dynamic Workers) flows through a Worker that validates a Zero Trust Access JWT, tags the request with an anonymous per-user UUID, and forwards to AI Gateway. Reported: 20.18M req/month, 241.37B tokens, 91% frontier labs / 9% Workers AI.
  • sources/2026-04-16-cloudflare-ai-search-the-search-primitive-for-your-agents — AI Gateway is called out as a still-separately-billed companion during AI Search's open beta alongside Workers AI: "Workers AI and AI Gateway usage will continue to be billed separately." Sits alongside AI Search as the LLM-proxy tier in the agent stack (retrieval is AI Search; inference is Workers AI through AI Gateway).
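The per-request tagging that powers spend attribution (the `metadata: { teamId, userId, ... }` of Unified Billing, and the per-user UUID in the internal-stack source) can be sketched as a header on the proxied request. This assumes AI Gateway's `cf-aig-metadata` request header carrying a JSON object; the field names are illustrative:

```typescript
// Attach custom metadata to an outgoing gateway request so spend can be
// attributed by user / tenant / workflow in the gateway's logs.
function withMetadata(init: RequestInit, metadata: Record<string, string>): RequestInit {
  const headers = new Headers(init.headers);
  // cf-aig-metadata is read by the gateway and surfaced per-request.
  headers.set("cf-aig-metadata", JSON.stringify(metadata));
  return { ...init, headers };
}

// Usage: fetch(gatewayUrl, withMetadata({ method: "POST", body }, { userId: "u-123" }))
```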