
Building the foundation for running extra-large language models

Summary

Cloudflare's 2026-04-16 deep-dive on how Workers AI serves extra-large LLMs like Kimi K2.5 (~1T params, ~560 GB of weights). Four load-bearing pieces: (1) prefill/decode (PD) disaggregation — separate inference servers for the compute-bound prefill stage and the memory-bound decode stage, with a token-aware load balancer that streams SSE responses while transferring KV cache between stages; (2) an x-session-affinity header for client-signalled session affinity, plus cluster-wide KV cache sharing via Moonshot's Mooncake Transfer Engine and Mooncake Store over NVLink and NVMe-oF RDMA — extending the cache from GPU VRAM onto NVMe, lifting peak input-cache-hit ratio from 60% to 80%, and eliminating session-aware routing within a cluster; (3) speculative decoding with NVIDIA's EAGLE-3 drafter model on Kimi K2.5 — it shines on agentic workloads because tool calls and JSON-wrapped structured output are highly predictable; (4) Infire — Cloudflare's proprietary Rust inference engine — now with tensor-, pipeline-, and expert-parallel multi-GPU support, lower activation memory than vLLM (Llama 4 Scout on 2× H200 with >56 GiB free for KV = 1.2M tokens; Kimi K2.5 on 8× H100 with >30 GiB free for KV), sub-20 s cold boot, and 20% higher tok/s on unconstrained systems. Workloads are tuned for the agentic traffic shape: large system prompt + tools + MCPs + growing turn history = input-heavy with long reusable prefixes; the emphasis is fast input-token processing + fast tool-call generation.

Key takeaways

  1. Agentic inference is input-heavy — tune the hardware accordingly. "With agents, you send in a large number of input tokens. It starts off with a large system prompt, all the tools, MCPs. With the first user prompt, that context keeps growing. Each new prompt from the user sends a request to the model, which consists of everything that was said before." Workers AI sized its serving stack for fast input-token processing + fast tool calling, not for output-token generation. The generalisation: workload shape (prompt-heavy vs generation-heavy) dictates which GPU resources (compute vs memory) bottleneck, so the configuration choice is not a single global optimum. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
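
The input-heavy shape falls out of the arithmetic: every turn re-sends the full history, so cumulative input tokens grow roughly quadratically in turn count while output tokens grow linearly. A toy model (illustrative numbers, not from the post):

```python
def session_token_totals(system_tokens, turns, user_tokens, reply_tokens):
    """Return (total input tokens, total output tokens) across one agent session."""
    history = system_tokens
    total_in, total_out = 0, 0
    for _ in range(turns):
        history += user_tokens      # new user prompt joins the context
        total_in += history         # the whole history is re-sent as input
        history += reply_tokens     # the model reply joins the context too
        total_out += reply_tokens
    return total_in, total_out

# e.g. a 4,000-token system prompt + tools, 10 turns of 200-token prompts
# and 500-token replies: 73,500 input tokens vs 5,000 output tokens
tin, tout = session_token_totals(4000, 10, 200, 500)
```

Even this modest session is ~15:1 input-to-output, which is why the serving stack is sized for prefill rather than generation.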

  2. Prefill/decode disaggregation is the structural answer to the compute-bound-vs-memory-bound split. "Prefill is usually compute bound, while decode is memory bound. This means that the parts of the GPU that are used in each stage are different, and since prefill is always done before decode, the stages block one another. Ultimately, it means that we are not efficiently utilizing all of our GPU power if we do both prefill and decode on a single machine." Split into separate inference servers: the prefill server processes input tokens and populates the KV cache; the request then goes to the decode server with a reference to transfer that KV cache and begin generation. Lets servers be tuned, scaled, and hardware-matched independently to input-heavy vs output-heavy traffic. Measured effect on Workers AI: p90 TTFT dropped, and p90 intertoken latency went from ~100 ms with high variance to 20-30 ms — a 3× improvement "using the same quantity of GPUs" with request volume simultaneously increasing. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
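
A minimal sketch of the split (hypothetical interfaces, not Cloudflare's code): prefill does one compute-bound pass over all input tokens and hands back an opaque KV-cache reference; decode consumes that reference and runs the memory-bound generation loop on a different server.

```python
from dataclasses import dataclass

@dataclass
class KVCacheRef:
    """Opaque handle the decode server uses to fetch the prefilled KV cache."""
    node: str
    cache_id: str
    cached_tokens: int

def prefill(input_tokens: list, node: str = "prefill-0") -> KVCacheRef:
    # Compute-bound pass over all input tokens; populates the KV cache
    # and returns a reference instead of the cache itself.
    cache_id = f"kv-{hash(tuple(input_tokens)) & 0xFFFF:04x}"
    return KVCacheRef(node=node, cache_id=cache_id, cached_tokens=len(input_tokens))

def decode(ref: KVCacheRef, max_new_tokens: int):
    # Memory-bound loop: the real server would transfer the KV cache via the
    # reference, then emit one token per step (stubbed here as a counter).
    for i in range(max_new_tokens):
        yield f"tok{i}"

ref = prefill(list(range(1000)))       # prefill stage, on its own server
output = list(decode(ref, 4))          # decode stage, on a different server
```

The point of the reference-passing design is that the two stages can live on independently scaled, independently tuned hardware pools.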

  3. The PD disaggregation load balancer is non-trivial and does two things stock load balancers don't. "This architecture requires a relatively complex load balancer to achieve. Beyond just routing the requests as described above, it must rewrite the responses (including streaming SSE) of the decode server to include information from the prefill server such as cached tokens. To complicate matters, different inference servers require different information to initiate the KV cache transfer. We extended this to implement token-aware load balancing, in which there is a pool of prefill and decode endpoints, and the load balancer estimates how many prefill or decode tokens are in-flight to each endpoint in the pool and attempts to spread this load evenly." Two structural properties: response rewrite across the prefill/decode boundary (prefill's cached-token metadata has to be re-injected into the decode server's SSE stream), and token-count admission control (not request count — see concepts/token-count-based-batching sibling). (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
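
The token-aware half of that balancer can be sketched in a few lines (hypothetical code, assuming per-request token counts are known at admission): route by estimated in-flight *tokens*, not request count.

```python
class TokenAwareBalancer:
    """Spread load across a pool by estimated in-flight token count."""

    def __init__(self, endpoints):
        self.in_flight = {ep: 0 for ep in endpoints}

    def pick(self, request_tokens: int) -> str:
        # Least-loaded by tokens, not by number of open requests.
        ep = min(self.in_flight, key=self.in_flight.get)
        self.in_flight[ep] += request_tokens
        return ep

    def complete(self, ep: str, request_tokens: int) -> None:
        self.in_flight[ep] -= request_tokens

lb = TokenAwareBalancer(["prefill-0", "prefill-1"])
a = lb.pick(100_000)   # one huge prompt lands on prefill-0...
b = lb.pick(500)       # ...so both small requests go to prefill-1,
c = lb.pick(500)       # even though prefill-0 has fewer requests open
```

A request-count balancer would have alternated and queued a small request behind the 100k-token prefill; counting tokens avoids that head-of-line blocking.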

  4. Client-signalled session affinity is how to not re-prefill a growing conversation. "Since agentic use cases usually have long contexts, we optimize for efficient prompt caching in order to not recompute input tensors on every turn. We leverage a header called x-session-affinity in order to help requests route to the right region that previously had the computed input tensors." Clients (agent harnesses, e.g. OpenCode PR #20744) send a per-session opaque token on every turn; Workers AI routes the request back to the region/replica that pre-filled the prefix on the last turn, so the full conversation history is already in KV cache. "While we have KV-aware routing internally, we also rely on clients sending the x-session-affinity in order to be explicit about prompt caching. We incentivize the use of the header by offering discounted cached tokens." See concepts/session-affinity-prompt-caching, patterns/session-affinity-header. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
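
Client-side, the contract is simple (sketch with a hypothetical endpoint; only the header name comes from the post): mint one opaque token per conversation and send the same value on every turn, so the request routes back to the replica already holding the prefilled prefix.

```python
import uuid

# One token per conversation, minted once and reused on every turn.
session_token = uuid.uuid4().hex

def headers_for_turn(api_key: str) -> dict:
    return {
        "Authorization": f"Bearer {api_key}",
        "x-session-affinity": session_token,   # identical across turns
        "Content-Type": "application/json",
    }

turn_1 = headers_for_turn("my-key")
turn_2 = headers_for_turn("my-key")   # same affinity value as turn_1
```

The value is opaque to the client; all that matters is that it is stable for the session and distinct across sessions.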

  5. Peak input-cache-hit ratio 60% → 80% by onboarding heavy internal users to session affinity. "We worked with our heaviest internal users to adopt this header. The result was an increase in input token cache hit ratios from 60% to 80% during peak times. This significantly increases the request throughput that we can handle, while offering better performance for interactive or time-sensitive sessions like OpenCode or AI code reviews." The generalisation: small client-side changes in prefix-caching discipline compound into a factor-of-additional-GPUs difference for the operator. "A small difference in prompt caching from our users can sum to a factor of additional GPUs needed to run a model." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
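
The "factor of additional GPUs" claim is back-of-envelope checkable: prefill compute scales with the tokens that *miss* the cache, so moving the hit ratio from 60% to 80% halves prefill work.

```python
def prefill_work(input_tokens: int, hit_ratio: float) -> float:
    """Tokens that must actually be prefilled (cache misses)."""
    return input_tokens * (1.0 - hit_ratio)

before = prefill_work(1_000_000, 0.60)   # ~400,000 tokens prefilled
after = prefill_work(1_000_000, 0.80)    # ~200,000 tokens prefilled
speedup = before / after                  # ~2x less prefill compute
```

Every 20 points of hit ratio in this regime is roughly a halving of prefill load, which is why small client-side caching discipline compounds into whole GPUs for the operator.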

  6. Cluster-wide shared KV cache over RDMA eliminates the need for same-cluster session-aware routing. When a model instance spans multiple GPUs (required because KV cache alone exceeds a single GPU's VRAM on long contexts), KV state has to live across GPUs and they have to talk directly. Workers AI uses Moonshot's Mooncake Transfer Engine: "a high-performance data transfer framework. It works with different Remote Direct Memory Access (RDMA) protocols such as NVLink and NVMe over Fabric, which enables direct memory-to-memory data transfer without involving the CPU." Paired with LMCache or SGLang HiCache (see systems/lmcache, systems/sglang), the cache is shared cluster-wide, allowing "a prefill node to identify and re-use a cache from a previous request that was originally pre-filled on a different node. This eliminates the need for session aware routing within a cluster and allows us to load balance the traffic much more evenly." Mooncake Store extends the cache from GPU VRAM onto NVMe storage — longer session residency, higher cache-hit ratio, more traffic absorbed at lower GPU count. See concepts/rdma-kv-transfer. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
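
The cross-node reuse reduces to a shared prefix index (hypothetical data structures; the real systems key on token-block hashes and fetch blocks over RDMA): any prefill node hashes the incoming prefix, looks it up cluster-wide, and pulls the KV blocks from whichever node or NVMe tier holds them.

```python
import hashlib

class SharedKVIndex:
    """Cluster-wide map from prefix hash to where its KV blocks live."""

    def __init__(self):
        self.index = {}   # prefix hash -> location ("node-a:gpu0", "nvme", ...)

    @staticmethod
    def prefix_key(tokens: list) -> str:
        return hashlib.sha256(repr(tokens).encode()).hexdigest()

    def publish(self, tokens, location):
        # Called by the node that just prefilled this prefix.
        self.index[self.prefix_key(tokens)] = location

    def lookup(self, tokens):
        # Called by any node before prefilling; None means a cache miss.
        return self.index.get(self.prefix_key(tokens))

idx = SharedKVIndex()
idx.publish([1, 2, 3], "node-a:gpu0")   # prefilled on node A...
hit = idx.lookup([1, 2, 3])             # ...found from node B, no
miss = idx.lookup([1, 2, 3, 4])         # session-aware routing needed
```

Because the lookup is global, the balancer no longer has to steer a session back to one node; it only has to steer it into the cluster.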

  7. Agentic use cases are unusually well-suited to speculative decoding because tool calls are structurally predictable. "In agentic use cases, speculative decoding really shines because of the volume of tool calls and structured outputs that models need to generate. A tool call is largely predictable — you know there will be a name, description, and it's wrapped in a JSON envelope." Workers AI uses NVIDIA's EAGLE-3 (nvidia/Kimi-K2.5-Thinking-Eagle3) as the drafter model for Kimi K2.5. Principal tuning lever: N = number of future tokens to draft per verify-pass — see speculative decoding, patterns/draft-verify-inference. Sibling to speculative cascades at the rejection-rule layer. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
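
A toy draft-and-verify loop (not EAGLE-3 itself) shows why predictable spans pay off: the drafter proposes N tokens, the target verifies them in one pass, and the longest accepted prefix is kept — so a boilerplate JSON envelope lands many tokens per expensive target-model pass.

```python
def speculative_step(draft_fn, verify_fn, context, n_draft):
    """One round: draft N tokens, keep the longest target-approved prefix."""
    accepted = []
    for tok in draft_fn(context, n_draft):
        if verify_fn(context + accepted, tok):   # does the target agree?
            accepted.append(tok)
        else:
            break                                # first rejection ends the run
    return accepted

# Toy models: the drafter perfectly guesses a fixed tool-call envelope,
# standing in for the "largely predictable" structure of real tool calls.
TARGET = list('{"name": "get_weather"')
draft = lambda ctx, n: TARGET[len(ctx):len(ctx) + n]
verify = lambda ctx, tok: TARGET[len(ctx)] == tok

out, passes = [], 0
while len(out) < len(TARGET):
    out += speculative_step(draft, verify, out, n_draft=4)
    passes += 1
# 22 characters generated in 6 verify passes instead of 22 decode steps
```

With an unpredictable drafter the accepted runs shrink toward 1 and the speedup evaporates, which is the N-tuning trade-off the notes point at.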

  8. Infire — Cloudflare's proprietary Rust inference engine — is the company's posture on owning the serving stack. Originally announced at Birthday Week 2025 for edge / distributed inference; this post extends it with multi-GPU support in three modes: pipeline, tensor, and expert — typically combined. "For pipeline parallelism, Infire attempts to properly load balance all stages of the pipeline, in order to prevent the GPUs of one stage from starving while other stages are executing. On the other hand, for tensor parallelism, Infire optimizes for reducing cross-GPU communication, making it as fast as possible." Activation-memory footprint is lower than vLLM's: Llama 4 Scout on 2× H200 with >56 GiB remaining for KV cache (~1.2M-token capacity); Kimi K2.5 on 8× H100 (not H200) with >30 GiB remaining for KV. "In both cases you would have trouble even booting vLLM in the first place." Cold boot under 20 s even for the largest models; disk speed is the binding constraint. Headline throughput: up to 20% higher tok/s on unconstrained systems vs the baseline, and ability to run latest models on lower-end hardware where baseline tools can't fit. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
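
The Llama 4 Scout numbers imply a per-token KV footprint (derived, not stated in the post — and since both figures are quoted as lower/upper bounds, this is only a rough estimate):

```python
# ">56 GiB remaining for KV cache, sufficient for >1.2M tokens" implies
# roughly 56 GiB / 1.2M tokens per token of context.
GIB = 2**30
kv_budget_bytes = 56 * GIB
tokens = 1_200_000
bytes_per_token = kv_budget_bytes / tokens   # ~50 KB of KV per token
kib_per_token = bytes_per_token / 1024       # ~49 KiB per token
```

That per-token figure is what makes activation-memory overhead decisive: every GiB an engine wastes on activations is ~20k tokens of context capacity lost.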

  9. Extra-large models quantified: Kimi K2.5 is >1T parameters = ~560 GB of weights; 8× H100 = 640 GB VRAM is the floor just to hold the weights. "A typical H100 has about 80GB of VRAM and the model weights need to be loaded in GPU memory in order to run. This means that a model like Kimi K2.5 needs at least 8 H100s in order to load the model into memory and run — and that's not even including the extra VRAM you would need for KV Cache, which includes your context window." Naming a concrete floor makes vivid why multi-GPU + cluster-wide KV sharing + low-memory-overhead engine all matter at once — you can't pick one. See concepts/multi-gpu-serving. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
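
The arithmetic behind the quoted floor: ~560 GB of weights fills exactly 7 × 80 GB cards with zero headroom, so 8 H100s (640 GB) is the practical minimum once any KV cache or activations exist at all.

```python
weights_gb = 560                      # Kimi K2.5 weights, per the post
h100_vram_gb = 80

gpus_for_weights = -(-weights_gb // h100_vram_gb)   # ceil division: 7 cards
gpus = 8                                            # the post's stated floor
total_vram_gb = gpus * h100_vram_gb                 # 640 GB
kv_headroom_gb = total_vram_gb - weights_gb         # 80 GB left for KV cache
```

Seven cards hold the weights with literally nothing to spare, which is why the post's floor is eight — and why even that excludes the context-window KV cache.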

  10. Kimi K2.5 was made 3× faster post-launch by retuning for observed traffic shape, not by more hardware. "These models have been the backbone of a lot of the agentic products, harnesses, and tools that we have been launching this week. … After our public model launch, our input/output patterns changed drastically again. We took the time to analyze our new usage patterns and then tuned our configuration to fit our customer's use cases." The configuration knobs are all the dials above (PD balance, session affinity, speculative decoding N, KV-tier split, parallelism axes); the generalisation is that serving-side tuning is a continuous response to traffic-shape drift, not a one-time launch config. Combined with same-GPU-count traffic growth, the 3× figure directly illustrates how PD + caching + EAGLE-3 compound. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

Architecture / numbers

  • Model: Kimi K2.5 (Moonshot AI, open-weight); >1T parameters; ~560 GB of weights.
  • Minimum hardware for weights alone: 8× H100 (80 GB VRAM each) = 640 GB; additional VRAM required for KV cache depending on context length.
  • Post-PD-disaggregation p90 metrics (same GPU count, higher request volume): TTFT reduced (dropped, graph only — no absolute number); p90 intertoken latency: ~100 ms (high variance) → 20-30 ms = 3× improvement.
  • Post-launch Kimi K2.5 speed-up: 3× faster attributed to configuration retuning for observed traffic shape.
  • Input-token cache-hit ratio at peak after x-session-affinity rollout: 60% → 80%.
  • Infire — Llama 4 Scout: 2× H200 GPUs, >56 GiB remaining for KV cache, sufficient for >1.2M tokens.
  • Infire — Kimi K2.5: 8× H100 (not H200), >30 GiB remaining for KV cache.
  • Infire cold boot: <20 s for even the largest models; bound by drive speed.
  • Infire throughput: +20% tok/s on unconstrained systems vs baseline.
  • Drafter model for Kimi K2.5 speculative decoding: NVIDIA EAGLE-3 (nvidia/Kimi-K2.5-Thinking-Eagle3).
  • KV transfer substrate: Mooncake Transfer Engine (Moonshot AI) over NVLink + NVMe-over-Fabric RDMA; paired with LMCache or SGLang HiCache for cluster-wide prefix reuse; Mooncake Store extends cache onto NVMe.
  • Infire parallelism modes: pipeline, tensor, expert (typically combined).

Caveats / not in post

  • Absolute p90 TTFT numbers are not disclosed — only "p90 TTFT dropped" as a graph claim. Only intertoken latency has a concrete before/after (~100 ms → 20-30 ms).
  • Actual GPU counts and PD split ratios not disclosed. The "same quantity of GPUs" claim is relative, not absolute; no stated prefill:decode node ratio.
  • EAGLE-3 acceptance rate not disclosed. "Shines on agentic workloads" is qualitative; no measured tokens-per-verify-pass numbers.
  • vLLM comparison is sided. Infire-vs-vLLM framing comes from Cloudflare; no third-party benchmark methodology.
  • No pricing / cached-token discount rate disclosed — only the mechanism (discounted cached tokens as incentive for x-session-affinity adoption).
  • "20% higher tokens-per-second" lacks baseline definition (vs vLLM? vs earlier Infire?) and workload definition.
  • Agentic "3× faster" has no baseline/workload specification. It's a traffic-shape-retuning claim, not a pure engineering win.
  • Mooncake Store NVMe-tier cache-hit ratio not disclosed. Only qualitative "extends cache residency".
  • Boot time "bounded by drive speed" is un-benchmarked. No storage-substrate details (local NVMe? networked? per-node cache pre-warm?).
  • Multi-model / per-model configuration matrices not given. How the same infrastructure serves smaller models (e.g. Llama 4 Scout on 2× H200) vs trillion-parameter Kimi on 8× H100 differs, but the per-model tuning is only sketched.
  • Hiring pitch at the end — post is partially a recruiting piece; does not affect the technical content.

Source

sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models