

Workers AI

Overview

Workers AI is Cloudflare's managed inference platform for LLMs, embeddings, and rerankers on the developer platform — models are served from the @cf/… namespace and bound to a Worker via the ai binding:

// wrangler.jsonc
{ "ai": { "binding": "AI" } }

Models include:

  • LLMs — @cf/moonshotai/kimi-k2.5 (systems/kimi-k2-5), the chat model in the 2026-04-16 AI Search support-agent example and the canonical extra-large-model-serving example in the 2026-04-16 high-performance-LLMs post.
  • Rerankers — @cf/baai/bge-reranker-base, the default cross-encoder reranker option for AI Search hybrid retrieval.
  • Embeddings, image generation, speech, … — the broader model catalog.
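A minimal sketch of calling a catalog model through the binding. The `Ai` interface below is a local stand-in for the real type in @cloudflare/workers-types, and the chat input shape is illustrative; only the model ID and the binding config come from this note:

```typescript
// Local stand-in for the Workers AI binding type (assumption: the real
// definition lives in @cloudflare/workers-types).
interface Ai {
  run(model: string, input: unknown): Promise<unknown>;
}

interface Env {
  AI: Ai; // bound via { "ai": { "binding": "AI" } } in wrangler.jsonc
}

// Chat completion against the @cf/… catalog namespace, using the model
// named in this note.
async function ask(env: Env, prompt: string): Promise<unknown> {
  return env.AI.run("@cf/moonshotai/kimi-k2.5", {
    messages: [{ role: "user", content: prompt }],
  });
}
```

Inside a deployed Worker, `env.AI` is injected by the runtime from the wrangler.jsonc binding shown above.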

Role in the stack

  • For the Agents SDK: invoked via createWorkersAI({ binding: this.env.AI }) + workersai("@cf/moonshotai/kimi-k2.5") passed to the Vercel ai SDK's streamText.
  • For AI Search: the reranker stage runs a Workers AI model when reranking: true is set on an instance. Embeddings for the vector half are also Workers-AI-served (model not named in the 2026-04-16 post).

Billing: Workers AI and AI Gateway usage are billed separately from the AI Search product itself, even during AI Search's open beta.

High-performance serving stack (2026-04-16 post)

The "Building the foundation for running extra-large language models" post (2026-04-16) discloses the internal serving architecture behind Workers AI for extra-large models like Kimi K2.5 (>1T parameters, ~560 GB weights). Five load-bearing pieces:

1. Prefill/Decode disaggregation

Separate inference servers for the compute-bound prefill stage (processes input tokens, populates the KV cache) and the memory-bound decode stage (generates output tokens). A non-trivial load balancer handles:

  • Two-hop routing (prefill → decode).
  • KV-transfer metadata passing across stages.
  • SSE response rewrite (decode's stream augmented with prefill-side cached-token counts before reaching the client).
  • Token-aware admission — tracks in-flight tokens per stage pool separately.

Measured effect: p90 TTFT dropped; p90 inter-token latency: ~100 ms with high variance → 20-30 ms (3× improvement), on the same number of GPUs while request volume increased. See concepts/prefill-decode-disaggregation, patterns/disaggregated-inference-stages. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
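The routing behaviour above can be sketched as a toy two-hop router. Pool structure, field names, and the token limit are illustrative stand-ins, not Cloudflare's internals:

```typescript
interface Replica { id: string; inFlightTokens: number; }

class Pool {
  constructor(public replicas: Replica[], private maxTokens: number) {}
  // Token-aware admission: pick the least-loaded replica, reject if the
  // request would push it past the per-replica token budget.
  admit(tokens: number): Replica | null {
    const r = [...this.replicas].sort((a, b) => a.inFlightTokens - b.inFlightTokens)[0];
    if (!r || r.inFlightTokens + tokens > this.maxTokens) return null;
    r.inFlightTokens += tokens;
    return r;
  }
}

interface Route { prefill: string; decode: string; kvTransfer: { from: string; to: string } }

// Hop 1: a prefill replica (compute-bound) builds the KV cache.
// Hop 2: a decode replica (memory-bound) streams tokens, with KV-transfer
// metadata naming the source and destination of the cache handoff.
function route(prefillPool: Pool, decodePool: Pool, inputTokens: number): Route | null {
  const p = prefillPool.admit(inputTokens);
  if (!p) return null;
  const d = decodePool.admit(inputTokens); // decode holds the cache while generating
  if (!d) return null;
  return { prefill: p.id, decode: d.id, kvTransfer: { from: p.id, to: d.id } };
}
```

Tracking in-flight tokens per pool separately is what lets the balancer admit against the prefill and decode bottlenecks independently.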

2. Session-affinity prompt caching (x-session-affinity header)

Clients signal per-session opaque tokens; Workers AI routes continuation requests back to the replica with hot KV cache for the conversation prefix. Incentivised by discounted cached-token pricing. Integrated into agent harnesses via PRs like OpenCode #20744.

Measured effect: peak input-cache-hit ratio 60% → 80% after heavy internal users adopted the header. "A small difference in prompt caching from our users can sum to a factor of additional GPUs needed to run a model." See concepts/session-affinity-prompt-caching, patterns/session-affinity-header. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
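Both halves of the pattern can be sketched: the client attaching an opaque per-session token, and a toy stable token→replica mapping on the routing side. Only the x-session-affinity header name comes from the post; the helper names and the FNV-1a hash are illustrative:

```typescript
// Client side: reuse one opaque token per conversation so continuation
// requests can be routed back to the replica with the hot KV cache.
function withSessionAffinity(init: RequestInit, sessionToken: string): RequestInit {
  return {
    ...init,
    headers: { ...(init.headers as Record<string, string>), "x-session-affinity": sessionToken },
  };
}

// Toy server-side counterpart: a stable token -> replica mapping
// (FNV-1a hash, illustrative only).
function pickReplica(sessionToken: string, replicas: string[]): string {
  let h = 0x811c9dc5;
  for (const ch of sessionToken) {
    h ^= ch.charCodeAt(0);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return replicas[h % replicas.length];
}
```

The point of the opaque token is that the client decides session identity; the server only needs the mapping to be stable for the lifetime of the cached prefix.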

3. Cluster-wide shared KV cache over RDMA (Mooncake Transfer Engine + Mooncake Store + LMCache / SGLang HiCache)

For models that span multiple GPUs (required for extra-large models), KV cache is shared cluster-wide via RDMA over NVLink + NVMe-oF. Mooncake Store extends cache onto NVMe storage (longer session residency). LMCache or SGLang HiCache is the software layer that exposes the shared cache to the serving engine.

Consequence: "This eliminates the need for session aware routing within a cluster" — cross-cluster, x-session-affinity still matters. See concepts/rdma-kv-transfer, concepts/multi-gpu-serving, patterns/kv-aware-routing. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

4. Speculative decoding with NVIDIA EAGLE-3

Drafter model nvidia/Kimi-K2.5-Thinking-Eagle3 drafts N future tokens; Kimi K2.5 verifies them in one parallel forward pass. "In agentic use cases, speculative decoding really shines because of the volume of tool calls and structured outputs that models need to generate. A tool call is largely predictable — you know there will be a name, description, and it's wrapped in a JSON envelope." See concepts/speculative-decoding, systems/eagle-3. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
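A toy draft-and-verify step shows why structurally predictable output (like a JSON tool-call envelope) accelerates so well: the drafter's proposals are accepted for as long as the target agrees, and even a disagreement yields one corrected token. Both "models" here are stand-in functions, not real inference calls:

```typescript
type NextToken = (context: string[]) => string;

// One speculative step: the cheap drafter proposes n tokens autoregressively,
// then the target verifies them (conceptually in one parallel forward pass)
// and keeps the longest agreeing prefix plus one corrected token.
function speculativeStep(
  context: string[],
  draft: NextToken,
  target: NextToken,
  n: number,
): string[] {
  const proposed: string[] = [];
  for (let i = 0; i < n; i++) {
    proposed.push(draft([...context, ...proposed]));
  }
  const accepted: string[] = [];
  for (const tok of proposed) {
    const want = target([...context, ...accepted]);
    if (want !== tok) { accepted.push(want); break; } // corrected token "for free"
    accepted.push(tok);
  }
  return accepted;
}
```

On tool-call output the drafter agrees with the target for long runs of punctuation and schema keys, so most steps emit several tokens for one target pass.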

5. Proprietary Rust inference engine: Infire

Cloudflare's own engine (Rust, originally Birthday Week 2025) runs on both prefill and decode tiers. Multi-GPU via pipeline, tensor, and expert parallelism. Much lower activation-memory overhead than vLLM — can fit Llama 4 Scout on 2× H200 with >56 GiB KV room (~1.2M tokens), Kimi K2.5 on 8× H100 (not H200) with >30 GiB KV room. Sub-20s cold boot. Up to +20% tokens/sec vs baseline on unconstrained systems. "In both cases you would have trouble even booting vLLM in the first place." See systems/infire. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

Unified catalog + BYO model (2026-04-16 AI Platform post)

The 2026-04-16 "Cloudflare's AI Platform: an inference layer designed for agents" post repositions Workers AI from "catalog of @cf/… open-source models" to "one binding for any model from any provider, plus your own":

  • Same env.AI.run() binding calls third-party providers. env.AI.run('anthropic/claude-opus-4-6', { input: ... }, { gateway: { id: 'default' } }) is a one-line change from @cf/... — see patterns/unified-inference-binding. Calls pass through AI Gateway transparently.
  • 70+ models across 12+ providers on the catalog — named: Alibaba Cloud, AssemblyAI, Bytedance, Google, InWorld, MiniMax, OpenAI, Pixverse, Recraft, Runway, Vidu — including image, video, and speech models (not just text LLMs). Canonical concepts/unified-model-catalog instance.
  • A REST API for non-Workers callers is committed for the coming weeks.
  • BYO-model via Replicate Cog containers. Customer writes cog.yaml + predict.py:Predictor + cog build, pushes to Workers AI, and the model surfaces in the same AI Gateway catalog — see patterns/byo-model-via-container. Current scope: Enterprise + design-partner access; roadmap: customer-facing push APIs, wrangler commands, and faster cold starts via GPU snapshotting.
  • Strategic context: the Replicate team has joined the Cloudflare AI Platform team ("we don't even consider ourselves separate teams anymore"); Replicate models are being brought onto AI Gateway and the hosted models are being replatformed onto Cloudflare infrastructure — this is what explains the catalog expansion from text-LLM-dominated to multimodal.
  • Network-topology argument: "When you call these Cloudflare-hosted models through AI Gateway, there's no extra hop over the public Internet since your code and inference run on the same global network." The @cf/… path remains the fastest TTFT path for latency-critical agent workloads.
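The one-line swap from the first bullet can be sketched as follows. The `Ai` interface is a local stand-in for the binding type; the model IDs and gateway id come from the post, while the wrapper function and its input shape are illustrative:

```typescript
// Stand-in for the Workers AI binding type, with the optional gateway
// options bag used for third-party providers.
interface Ai {
  run(model: string, input: unknown, options?: { gateway?: { id: string } }): Promise<unknown>;
}

// Same binding, two providers: only the model ID (and the gateway options
// for the third-party path) change between the branches.
async function infer(ai: Ai, useClaude: boolean, prompt: string): Promise<unknown> {
  const input = { input: prompt };
  return useClaude
    ? ai.run("anthropic/claude-opus-4-6", input, { gateway: { id: "default" } }) // via AI Gateway
    : ai.run("@cf/moonshotai/kimi-k2.5", input); // Cloudflare-hosted @cf/… path
}
```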

Agentic-workload tuning posture

The serving stack's tuning knobs are explicitly optimised for agentic traffic shape: large system prompt + tool descriptions + MCP server metadata + growing conversation history → input-heavy with long reusable prefixes. The two things that matter most:

  1. Fast input-token processing — PD disaggregation + cluster-wide KV sharing + session affinity address this.
  2. Fast tool-call generation — EAGLE-3 speculative decoding shines here because tool-call output is structurally predictable (JSON envelope, known schema).

"After our public model launch, our input/output patterns changed drastically again. We took the time to analyze our new usage patterns and then tuned our configuration to fit our customer's use cases." Kimi K2.5 was made 3× faster post-launch by retuning configuration for the observed traffic shape, not by adding hardware. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

Weight-compression lever: Unweight (2026-04-17)

Cloudflare adds Unweight as the weight-side VRAM-reduction lever on Workers AI: Huffman coding on the redundant BF16 exponent byte paired with a custom reconstructive matmul kernel that feeds tensor cores directly from SMEM. Llama-3.1-8B results: ~22% model-size reduction (~3 GB VRAM saved per instance); bit-exact lossless; H100-only at launch. Complements Infire's activation-memory discipline — Unweight on weights, Infire on activations, savings additive into KV-cache headroom + serving density. Open-source kernels ship as systems/unweight-kernels. (Source: sources/2026-04-17-cloudflare-unweight-how-we-compressed-an-llm-22-percent-without-sacrificing-quality)
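A rough illustration of why the BF16 high byte (sign + top exponent bits) is so compressible: for weights drawn from a typical near-zero distribution it takes only a handful of distinct values, so its entropy sits well below 8 bits and Huffman coding shrinks it. This is an entropy estimate only, not Cloudflare's Unweight kernel:

```typescript
// BF16 is the top 16 bits of a Float32, so the BF16 high byte equals the
// Float32 high byte (sign bit + top 7 exponent bits).
function bf16HighByte(x: number): number {
  const buf = new DataView(new ArrayBuffer(4));
  buf.setFloat32(0, x); // big-endian by default
  return buf.getUint8(0);
}

// Shannon entropy (bits/byte) of a byte stream — an upper-bound proxy for
// how far an entropy coder like Huffman can compress it.
function entropyBits(bytes: number[]): number {
  const counts = new Map<number, number>();
  for (const b of bytes) counts.set(b, (counts.get(b) ?? 0) + 1);
  let h = 0;
  for (const c of counts.values()) {
    const p = c / bytes.length;
    h -= p * Math.log2(p);
  }
  return h;
}
```

For weights concentrated in a narrow magnitude band, the exponent byte clusters into roughly a dozen values, which is the redundancy the Huffman stage exploits.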
