

Workers AI

Overview

Workers AI is Cloudflare's managed inference platform for LLMs, embeddings, and rerankers on the developer platform — models are served from the @cf/… namespace and bound to a Worker via the ai binding:

// wrangler.jsonc
{ "ai": { "binding": "AI" } }

Models include:

  • LLMs — @cf/moonshotai/kimi-k2.5 (systems/kimi-k2-5), the chat model in the 2026-04-16 AI Search support-agent example and the canonical extra-large-model-serving example in the 2026-04-16 high-performance-LLMs post.
  • Rerankers — @cf/baai/bge-reranker-base, the default cross-encoder reranker option for AI Search hybrid retrieval.
  • Embeddings, image generation, speech, … — the broader model catalog.
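A minimal sketch of calling a catalog model through the binding. The `Ai` interface below is a local stand-in for the real type in @cloudflare/workers-types, and the chat input shape is illustrative; only the model ID and the binding config come from this note:

```typescript
// Local stand-in for the Workers AI binding type (assumption: the real
// definition lives in @cloudflare/workers-types).
interface Ai {
  run(model: string, input: unknown): Promise<unknown>;
}

interface Env {
  AI: Ai; // bound via { "ai": { "binding": "AI" } } in wrangler.jsonc
}

// Chat completion against the @cf/… catalog namespace, using the model
// named in this note.
async function ask(env: Env, prompt: string): Promise<unknown> {
  return env.AI.run("@cf/moonshotai/kimi-k2.5", {
    messages: [{ role: "user", content: prompt }],
  });
}
```

Inside a deployed Worker, `env.AI` is injected by the runtime from the wrangler.jsonc binding shown above.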

Role in the stack

  • For the Agents SDK: invoked via createWorkersAI({ binding: this.env.AI }) + workersai("@cf/moonshotai/kimi-k2.5") passed to the Vercel ai SDK's streamText.
  • For AI Search: the reranker stage runs a Workers AI model when reranking: true is set on an instance. Embeddings for the vector half are also Workers-AI-served (model not named in the 2026-04-16 post).

Billing: Workers AI and AI Gateway usage are billed separately from the AI Search product itself, even during AI Search's open beta.

High-performance serving stack (2026-04-16 post)

The "Building the foundation for running extra-large language models" post (2026-04-16) discloses the internal serving architecture behind Workers AI for extra-large models like Kimi K2.5 (>1T parameters, ~560 GB weights). Five load-bearing pieces:

1. Prefill/Decode disaggregation

Separate inference servers for the compute-bound prefill stage (processes input tokens, populates the KV cache) and the memory-bound decode stage (generates output tokens). A non-trivial load balancer handles:

  • Two-hop routing (prefill → decode).
  • KV-transfer metadata passing across stages.
  • SSE response rewrite (decode's stream augmented with prefill-side cached-token counts before reaching the client).
  • Token-aware admission — tracks in-flight tokens per stage pool separately.

Measured effect: p90 TTFT dropped; p90 inter-token latency: ~100 ms with high variance → 20-30 ms (3× improvement), on the same number of GPUs while request volume increased. See concepts/prefill-decode-disaggregation, patterns/disaggregated-inference-stages. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
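The routing behaviour above can be sketched as a toy two-hop router. Pool structure, field names, and the token limit are illustrative stand-ins, not Cloudflare's internals:

```typescript
interface Replica { id: string; inFlightTokens: number; }

class Pool {
  constructor(public replicas: Replica[], private maxTokens: number) {}
  // Token-aware admission: pick the least-loaded replica, reject if the
  // request would push it past the per-replica token budget.
  admit(tokens: number): Replica | null {
    const r = [...this.replicas].sort((a, b) => a.inFlightTokens - b.inFlightTokens)[0];
    if (!r || r.inFlightTokens + tokens > this.maxTokens) return null;
    r.inFlightTokens += tokens;
    return r;
  }
}

interface Route { prefill: string; decode: string; kvTransfer: { from: string; to: string } }

// Hop 1: a prefill replica (compute-bound) builds the KV cache.
// Hop 2: a decode replica (memory-bound) streams tokens, with KV-transfer
// metadata naming the source and destination of the cache handoff.
function route(prefillPool: Pool, decodePool: Pool, inputTokens: number): Route | null {
  const p = prefillPool.admit(inputTokens);
  if (!p) return null;
  const d = decodePool.admit(inputTokens); // decode holds the cache while generating
  if (!d) return null;
  return { prefill: p.id, decode: d.id, kvTransfer: { from: p.id, to: d.id } };
}
```

Tracking in-flight tokens per pool separately is what lets the balancer admit against the prefill and decode bottlenecks independently.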

2. Session-affinity prompt caching (x-session-affinity header)

Clients signal per-session opaque tokens; Workers AI routes continuation requests back to the replica with hot KV cache for the conversation prefix. Incentivised by discounted cached-token pricing. Integrated into agent harnesses via PRs like OpenCode #20744.

Measured effect: peak input-cache-hit ratio 60% → 80% after heavy internal users adopted the header. "A small difference in prompt caching from our users can sum to a factor of additional GPUs needed to run a model." See concepts/session-affinity-prompt-caching, patterns/session-affinity-header. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
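Both halves of the pattern can be sketched: the client attaching an opaque per-session token, and a toy stable token→replica mapping on the routing side. Only the x-session-affinity header name comes from the post; the helper names and the FNV-1a hash are illustrative:

```typescript
// Client side: reuse one opaque token per conversation so continuation
// requests can be routed back to the replica with the hot KV cache.
function withSessionAffinity(init: RequestInit, sessionToken: string): RequestInit {
  return {
    ...init,
    headers: { ...(init.headers as Record<string, string>), "x-session-affinity": sessionToken },
  };
}

// Toy server-side counterpart: a stable token -> replica mapping
// (FNV-1a hash, illustrative only).
function pickReplica(sessionToken: string, replicas: string[]): string {
  let h = 0x811c9dc5;
  for (const ch of sessionToken) {
    h ^= ch.charCodeAt(0);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return replicas[h % replicas.length];
}
```

The point of the opaque token is that the client decides session identity; the server only needs the mapping to be stable for the lifetime of the cached prefix.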

3. Cluster-wide shared KV cache over RDMA (Mooncake Transfer Engine + Mooncake Store + LMCache / SGLang HiCache)

For models that span multiple GPUs (required for extra-large models), KV cache is shared cluster-wide via RDMA over NVLink + NVMe-oF. Mooncake Store extends cache onto NVMe storage (longer session residency). LMCache or SGLang HiCache is the software layer that exposes the shared cache to the serving engine.

Consequence: "This eliminates the need for session aware routing within a cluster" — cross-cluster, x-session-affinity still matters. See concepts/rdma-kv-transfer, concepts/multi-gpu-serving, patterns/kv-aware-routing. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

4. Speculative decoding with NVIDIA EAGLE-3

Drafter model nvidia/Kimi-K2.5-Thinking-Eagle3 drafts N future tokens; Kimi K2.5 verifies them in one parallel forward pass. "In agentic use cases, speculative decoding really shines because of the volume of tool calls and structured outputs that models need to generate. A tool call is largely predictable — you know there will be a name, description, and it's wrapped in a JSON envelope." See concepts/speculative-decoding, systems/eagle-3. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
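A toy draft-and-verify step shows why structurally predictable output (like a JSON tool-call envelope) accelerates so well: the drafter's proposals are accepted for as long as the target agrees, and even a disagreement yields one corrected token. Both "models" here are stand-in functions, not real inference calls:

```typescript
type NextToken = (context: string[]) => string;

// One speculative step: the cheap drafter proposes n tokens autoregressively,
// then the target verifies them (conceptually in one parallel forward pass)
// and keeps the longest agreeing prefix plus one corrected token.
function speculativeStep(
  context: string[],
  draft: NextToken,
  target: NextToken,
  n: number,
): string[] {
  const proposed: string[] = [];
  for (let i = 0; i < n; i++) {
    proposed.push(draft([...context, ...proposed]));
  }
  const accepted: string[] = [];
  for (const tok of proposed) {
    const want = target([...context, ...accepted]);
    if (want !== tok) { accepted.push(want); break; } // corrected token "for free"
    accepted.push(tok);
  }
  return accepted;
}
```

On tool-call output the drafter agrees with the target for long runs of punctuation and schema keys, so most steps emit several tokens for one target pass.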

5. Proprietary Rust inference engine: Infire

Cloudflare's own engine (Rust, originally Birthday Week 2025) runs on both prefill and decode tiers. Multi-GPU via pipeline, tensor, and expert parallelism. Much lower activation-memory overhead than vLLM — can fit Llama 4 Scout on 2× H200 with >56 GiB KV room (~1.2M tokens), Kimi K2.5 on 8× H100 (not H200) with >30 GiB KV room. Sub-20s cold boot. Up to +20% tokens/sec vs baseline on unconstrained systems. "In both cases you would have trouble even booting vLLM in the first place." See systems/infire. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

Unified catalog + BYO model (2026-04-16 AI Platform post)

The 2026-04-16 "Cloudflare's AI Platform: an inference layer designed for agents" post repositions Workers AI from "catalog of @cf/… open-source models" to "one binding for any model from any provider, plus your own":

  • Same env.AI.run() binding calls third-party providers. env.AI.run('anthropic/claude-opus-4-6', { input: ... }, { gateway: { id: 'default' } }) is a one-line change from @cf/... — see patterns/unified-inference-binding. Calls pass through AI Gateway transparently.
  • 70+ models across 12+ providers on the catalog — named: Alibaba Cloud, AssemblyAI, Bytedance, Google, InWorld, MiniMax, OpenAI, Pixverse, Recraft, Runway, Vidu — including image, video, and speech models (not just text LLMs). Canonical concepts/unified-model-catalog instance.
  • A REST API for non-Workers callers is committed for the coming weeks.
  • BYO-model via Replicate Cog containers. Customer writes cog.yaml + predict.py:Predictor + cog build, pushes to Workers AI, and the model surfaces in the same AI Gateway catalog — see patterns/byo-model-via-container. Current scope: Enterprise + design-partner access; roadmap: customer-facing push APIs, wrangler commands, and faster cold starts via GPU snapshotting.
  • Strategic context: the Replicate team has joined the Cloudflare AI Platform team ("we don't even consider ourselves separate teams anymore"); Replicate models are being brought onto AI Gateway and the hosted models are being replatformed onto Cloudflare infrastructure — this is what explains the catalog expansion from text-LLM-dominated to multimodal.
  • Network-topology argument: "When you call these Cloudflare-hosted models through AI Gateway, there's no extra hop over the public Internet since your code and inference run on the same global network." The @cf/… path remains the fastest TTFT path for latency-critical agent workloads.
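The one-line swap from the first bullet can be sketched as follows. The `Ai` interface is a local stand-in for the binding type; the model IDs and gateway id come from the post, while the wrapper function and its input shape are illustrative:

```typescript
// Stand-in for the Workers AI binding type, with the optional gateway
// options bag used for third-party providers.
interface Ai {
  run(model: string, input: unknown, options?: { gateway?: { id: string } }): Promise<unknown>;
}

// Same binding, two providers: only the model ID (and the gateway options
// for the third-party path) change between the branches.
async function infer(ai: Ai, useClaude: boolean, prompt: string): Promise<unknown> {
  const input = { input: prompt };
  return useClaude
    ? ai.run("anthropic/claude-opus-4-6", input, { gateway: { id: "default" } }) // via AI Gateway
    : ai.run("@cf/moonshotai/kimi-k2.5", input); // Cloudflare-hosted @cf/… path
}
```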

Agentic-workload tuning posture

The serving stack's tuning knobs are explicitly optimised for agentic traffic shape: large system prompt + tool descriptions + MCP server metadata + growing conversation history → input-heavy with long reusable prefixes. The two things that matter most:

  1. Fast input-token processing — PD disaggregation + cluster-wide KV sharing + session affinity address this.
  2. Fast tool-call generation — EAGLE-3 speculative decoding shines here because tool-call output is structurally predictable (JSON envelope, known schema).

"After our public model launch, our input/output patterns changed drastically again. We took the time to analyze our new usage patterns and then tuned our configuration to fit our customer's use cases." Kimi K2.5 was made 3× faster post-launch by retuning configuration for the observed traffic shape, not by adding hardware. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

Weight-compression lever: Unweight (2026-04-17)

Cloudflare adds Unweight as the weight-side VRAM-reduction lever on Workers AI: Huffman coding on the redundant BF16 exponent byte paired with a custom reconstructive matmul kernel that feeds tensor cores directly from SMEM. Llama-3.1-8B results: ~22% model-size reduction (~3 GB VRAM saved per instance); bit-exact lossless; H100-only at launch. Complements Infire's activation-memory discipline — Unweight on weights, Infire on activations, savings additive into KV-cache headroom + serving density. Open-source kernels ship as systems/unweight-kernels. (Source: sources/2026-04-17-cloudflare-unweight-how-we-compressed-an-llm-22-percent-without-sacrificing-quality)
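A rough illustration of why the BF16 high byte (sign + top exponent bits) is so compressible: for weights drawn from a typical near-zero distribution it takes only a handful of distinct values, so its entropy sits well below 8 bits and Huffman coding shrinks it. This is an entropy estimate only, not Cloudflare's Unweight kernel:

```typescript
// BF16 is the top 16 bits of a Float32, so the BF16 high byte equals the
// Float32 high byte (sign bit + top 7 exponent bits).
function bf16HighByte(x: number): number {
  const buf = new DataView(new ArrayBuffer(4));
  buf.setFloat32(0, x); // big-endian by default
  return buf.getUint8(0);
}

// Shannon entropy (bits/byte) of a byte stream — an upper-bound proxy for
// how far an entropy coder like Huffman can compress it.
function entropyBits(bytes: number[]): number {
  const counts = new Map<number, number>();
  for (const b of bytes) counts.set(b, (counts.get(b) ?? 0) + 1);
  let h = 0;
  for (const c of counts.values()) {
    const p = c / bytes.length;
    h -= p * Math.log2(p);
  }
  return h;
}
```

For weights concentrated in a narrow magnitude band, the exponent byte clusters into roughly a dozen values, which is the redundancy the Huffman stage exploits.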
