Cloudflare AI Gateway¶
AI Gateway is Cloudflare's LLM proxy tier that sits between an application and any of the supported AI providers (Anthropic, OpenAI, Google, Workers AI, and others). It gives developers centralised visibility, per-request logging, cost tracking, rate limiting, and — configured as a drop-in — the ability to change the LLM model or provider without redeploying application code (see patterns/ai-gateway-provider-abstraction).
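The drop-in posture amounts to a base-URL swap. A minimal sketch, assuming the common per-provider gateway URL shape `https://gateway.ai.cloudflare.com/v1/{accountId}/{gatewayId}/{provider}` (the account and gateway IDs below are placeholders):

```typescript
// Build the per-provider gateway base URL; the application's provider SDK
// keeps its normal request paths and only the base URL changes.
function gatewayBaseUrl(accountId: string, gatewayId: string, provider: string): string {
  return `https://gateway.ai.cloudflare.com/v1/${accountId}/${gatewayId}/${provider}`;
}

// e.g. point ANTHROPIC_BASE_URL at this value instead of api.anthropic.com;
// the application code that reads the variable is untouched.
const anthropicViaGateway = gatewayBaseUrl("my-account-id", "my-gateway", "anthropic");
```

Switching the upstream model or provider then happens in gateway config (or the final path segment), not in an application deploy.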
Core capabilities¶
- Unified catalog + unified binding (2026-04-16). All models — first-party Workers AI `@cf/…`, 70+ third-party models across 12+ providers (Anthropic, OpenAI, Google, Alibaba Cloud, AssemblyAI, Bytedance, InWorld, MiniMax, Pixverse, Recraft, Runway, Vidu), and BYO Cog containers — are callable through one `env.AI.run(model_string, ...)` binding. The provider selector lives inside the model string; switching providers is a one-line edit. REST API for non-Workers callers promised in the weeks after launch (concepts/unified-model-catalog, patterns/unified-inference-binding).
- Multimodal, not just text. Image, video, and speech models surface alongside text LLMs in the same catalog.
- Provider abstraction. Application points at the gateway endpoint (e.g. `ANTHROPIC_BASE_URL=<gateway>`); swapping models/providers happens in gateway config, not application deploys.
- Bring-Your-Own-Key (BYOK). Instead of shipping provider secrets with every request, Secrets Store integration lets Cloudflare inject the key server-side on behalf of the caller (concepts/byok-bring-your-own-key).
- Unified Billing. Alternative to BYOK: top up a Cloudflare account with credits and Cloudflare pays providers directly, deducting from the credit balance — no provider-secrets management at all. Per-request custom metadata (`metadata: { teamId, userId, ... }`) enables spend attribution by user / tenant / workflow.
- Automatic provider failover (2026-04-16). When a model is available on multiple providers and one goes down, the gateway silently routes to another — the application never writes retry-on-outage logic (patterns/automatic-provider-failover). Extends the configured fallback-chain feature into an automatic cross-provider retry primitive.
- Buffered resumable streaming (2026-04-16). Streaming inference responses are buffered gateway-side, independently of the caller's lifetime. If an Agents SDK agent is interrupted mid-inference, it reconnects and retrieves the buffered response — no re-inference, no double-billing. Paired with Agents SDK checkpointing, "the end user never notices" a mid-turn crash (concepts/resilient-inference-stream, patterns/buffered-resumable-inference-stream).
- BYO-model via Cog containers (2026-04-16). Customers build Replicate Cog containers (`cog.yaml` + `predict.py:Predictor`) and push them to Workers AI; the gateway surfaces them alongside first-party and third-party models in the same catalog. Currently Enterprise + design-partner access (patterns/byo-model-via-container).
- Colo-with-inference latency. Cloudflare's 330-city network means the gateway sits close to both users and inference endpoints; for `@cf/…` models the entire call path stays inside Cloudflare (no public-Internet hop), which is the load-bearing first-token-latency argument for agent workloads (concepts/time-to-first-token).
- Provider / model fallbacks. Ordered list of providers/models tried on failure; the application's retry logic becomes a gateway config change.
- Observability. Request/response logs, spend by model, latency distribution, error rates — the same view across heterogeneous upstream providers.
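The one-line provider switch through the unified binding can be sketched against a stub of the binding surface. The interface shape and model strings below are illustrative assumptions, not the binding's exact published type:

```typescript
// Stub of the unified binding surface: one run() call for every model.
interface AIBinding {
  run(model: string, input: Record<string, unknown>): Promise<unknown>;
}

// The provider selector lives inside the model string; moving from a
// first-party Workers AI model to a third-party provider is an edit to
// this constant only.
const MODEL = "@cf/meta/llama-3.1-8b-instruct"; // or e.g. "anthropic/claude-sonnet-4"

async function ask(ai: AIBinding, prompt: string): Promise<unknown> {
  // Call site is provider-agnostic: same binding, same signature.
  return ai.run(MODEL, { prompt });
}
```

In a Worker the `ai` argument would be the real `env.AI` binding; the stub exists only so the call shape is visible.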
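The fallback and spend-attribution capabilities meet in the request the caller hands the gateway: an ordered chain of provider steps, each tagged with per-request metadata. A hedged sketch; the step fields and the `cf-aig-metadata` header usage here are assumptions for illustration, not the gateway's documented schema:

```typescript
// One step per upstream, tried in order; the gateway moves to the next
// step when the previous one fails.
interface ProviderStep {
  provider: string;
  endpoint: string;
  headers: Record<string, string>;
  query: Record<string, unknown>;
}

function fallbackChain(prompt: string, teamId: string): ProviderStep[] {
  // Metadata rides along with the request so spend can be attributed
  // per team / user / workflow in gateway logs.
  const metadata = JSON.stringify({ teamId });
  return [
    {
      provider: "anthropic",
      endpoint: "v1/messages",
      headers: { "cf-aig-metadata": metadata },
      query: { model: "claude-sonnet", max_tokens: 256, messages: [{ role: "user", content: prompt }] },
    },
    {
      provider: "workers-ai",
      endpoint: "@cf/meta/llama-3.1-8b-instruct",
      headers: { "cf-aig-metadata": metadata },
      query: { prompt },
    },
  ];
}
```

Reordering or extending the chain is a config-shaped change; the application's retry logic never grows.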
Seen in¶
- sources/2026-04-16-cloudflare-ai-platform-an-inference-layer-designed-for-agents — canonical unified-catalog launch. Same `env.AI.run(...)` binding previously scoped to Workers AI `@cf/…` models now calls any of 70+ models across 12+ providers — one line of code to switch, one set of credits to pay. Post framing: "Today, we're making Cloudflare into a unified inference layer: one API to access any AI model from any provider, built to be fast and reliable." Same post introduces automatic provider failover (gateway silently routes to a second provider when the first goes down — patterns/automatic-provider-failover), buffered resumable streaming (stream survives agent disconnects — concepts/resilient-inference-stream, patterns/buffered-resumable-inference-stream), and BYO-model via Cog containers (patterns/byo-model-via-container) pushed to Workers AI. Strategic context: the Replicate team joined the Cloudflare AI Platform team, bringing multimodal-model-hosting DNA (image, video, speech) that explains the catalog expansion from text-LLM-dominated to multimodal. Named new providers: Alibaba Cloud, AssemblyAI, Bytedance, Google, InWorld, MiniMax, OpenAI, Pixverse, Recraft, Runway, Vidu.
- sources/2026-01-29-cloudflare-moltworker-self-hosted-ai-agent — canonical example of the zero-code-change provider swap. Porting Moltbot to run on Workers required only setting `ANTHROPIC_BASE_URL` to the AI Gateway endpoint; Moltbot's code was unchanged, and the upgrade path to BYOK or Unified Billing is a gateway-config operation.
- sources/2026-04-20-cloudflare-internal-ai-engineering-stack — AI Gateway as Cloudflare's internal platform-layer choke point: every LLM call from internal tooling (OpenCode, Windsurf, AI Code Reviewer, Dynamic Workers) flows through a Worker that validates a Zero Trust Access JWT, tags the request with an anonymous per-user UUID, and forwards to AI Gateway. Reported: 20.18M req/month, 241.37B tokens, 91% frontier labs / 9% Workers AI.
- sources/2026-04-16-cloudflare-ai-search-the-search-primitive-for-your-agents — AI Gateway is called out as a still-separately-billed companion during AI Search's open beta alongside Workers AI: "Workers AI and AI Gateway usage will continue to be billed separately." Sits alongside AI Search as the LLM-proxy tier in the agent stack (retrieval is AI Search; inference is Workers AI through AI Gateway).
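The internal choke-point's anonymous per-user tagging can be sketched as a stable salted hash of the Access JWT subject. The salt handling and UUID-like formatting are assumptions for illustration, not Cloudflare's actual scheme:

```typescript
import { createHash } from "node:crypto";

// Derive a stable, anonymous, UUID-shaped tag from the JWT subject: the
// same caller always maps to the same tag (so usage aggregates per user),
// but the tag itself does not reveal the identity.
function anonymousUserTag(jwtSubject: string, salt: string): string {
  const h = createHash("sha256").update(`${salt}:${jwtSubject}`).digest("hex");
  return [h.slice(0, 8), h.slice(8, 12), h.slice(12, 16), h.slice(16, 20), h.slice(20, 32)].join("-");
}
```

The choke-point Worker would attach this tag as request metadata before forwarding to AI Gateway, which is what makes the per-user spend numbers above reportable without storing identities.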
Related¶
- systems/cloudflare-workers — the compute tier calling AI Gateway.
- systems/workers-ai — first-party inference tier whose `@cf/…` catalog now sits alongside third-party providers in the unified catalog.
- systems/cloudflare-agents-sdk — the beneficiary of the resumable-streaming guarantee.
- systems/replicate-cog — BYO-model container format.
- systems/cloudflare-zero-trust-access — gates calling applications in Cloudflare's internal dogfood architecture.
- concepts/byok-bring-your-own-key — the secrets posture AI Gateway supports.
- concepts/unified-model-catalog — the product-surface property the 2026-04-16 launch productises.
- concepts/resilient-inference-stream — the stream-lifetime-decoupling property.
- patterns/ai-gateway-provider-abstraction — the swap-providers-without-redeploying umbrella pattern.
- patterns/unified-inference-binding — the SDK-surface sharpening.
- patterns/automatic-provider-failover — the cross-provider reliability primitive.
- patterns/buffered-resumable-inference-stream — the caller-side reliability primitive.
- patterns/byo-model-via-container — the BYO substrate pattern.
- companies/cloudflare — operator.