
Kimi K2.5

Overview

Kimi K2.5 is Moonshot AI's large language model (>1 trillion parameters, ~560 GB of weights), served by Workers AI as @cf/moonshotai/kimi-k2.5. It is the canonical extra-large model on Cloudflare's Developer Platform and drives several of the stack's multi-GPU, KV-sharing, and speculative-decoding decisions.

The 2026-04-16 AI Search launch post uses Kimi K2.5 as the chat LLM in the support-agent worked example:

import { createWorkersAI } from "workers-ai-provider";
import { streamText } from "ai";

const workersai = createWorkersAI({ binding: this.env.AI });
const result = streamText({
  model: workersai("@cf/moonshotai/kimi-k2.5"),
  system: `You are a support agent. Use search_knowledge_base …`,
});

Size and hardware footprint

From the 2026-04-16 high-performance-LLMs post:

"Large language models like Kimi K2.5 are over 1 trillion parameters, which is about 560GB of model weights. A typical H100 has about 80GB of VRAM and the model weights need to be loaded in GPU memory in order to run. This means that a model like Kimi K2.5 needs at least 8 H100s in order to load the model into memory and run — and that's not even including the extra VRAM you would need for KV Cache, which includes your context window."

Minimum hardware to run Kimi K2.5:

  Parameters                          >1T
  Weights                             ~560 GB
  Per-H100 VRAM                       80 GB
  Minimum GPU count (weights only)    8× H100 = 640 GB
  KV cache                            additional VRAM on top, scales with context length
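The GPU-count arithmetic above can be sketched directly. The 560 GB and 80 GB figures come from the post; the per-GPU overhead reservation below is an illustrative assumption, chosen only to show why an exact 560/80 = 7 split is not actually bootable:

```typescript
// Illustrative arithmetic only: 560 GB of weights and 80 GB per H100 are from
// the post; the per-GPU overhead number is an assumption, not a quoted figure.
const weightsGiB = 560;   // Kimi K2.5 weights
const h100VramGiB = 80;   // VRAM per H100
const overheadGiB = 6;    // assumed CUDA context / runtime overhead per GPU

// Weights alone divide evenly: 560 / 80 = 7 GPUs, with zero bytes of headroom.
const weightsOnly = Math.ceil(weightsGiB / h100VramGiB);

// With any per-GPU overhead (before reserving anything for KV cache),
// 7 GPUs no longer fit, which is why the post's floor is 8x H100.
const withOverhead = Math.ceil(weightsGiB / (h100VramGiB - overheadGiB));

console.log(weightsOnly, withOverhead); // 7 8
```

The KV-cache row of the table then eats into whatever headroom the 8-GPU layout leaves, which is the gap Infire's lower overhead exploits.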

Using Infire, Cloudflare fits Kimi K2.5 on 8× H100 (not H200), leaving more than 30 GiB free for KV cache after all overheads; under the same constraints, the post notes, "In both cases you would have trouble even booting vLLM in the first place." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

Serving architecture around Kimi K2.5

Cloudflare's architecture for serving Kimi K2.5 combines multiple primitives (all in Workers AI):

  1. Multi-GPU serving — sharded across 8× H100 using pipeline + tensor parallelism (possibly expert depending on whether Kimi is MoE — post doesn't explicitly say).
  2. PD disaggregation — prefill and decode on separate node tiers.
  3. Mooncake Transfer Engine + Mooncake Store — Moonshot's own KV-transfer + NVMe-tier infrastructure; used cluster-wide to share KV cache across GPUs and nodes over RDMA.
  4. x-session-affinity — client-signalled routing for cross-cluster / cross-region prefix cache reuse.
  5. Speculative decoding with NVIDIA EAGLE-3 — the drafter model nvidia/Kimi-K2.5-Thinking-Eagle3 proposes draft tokens that the target model verifies N at a time per forward pass; shines on agentic tool-call generation.
  6. Infire — Cloudflare's proprietary Rust inference engine runs Kimi K2.5 with lower memory overhead than vLLM.
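The draft-and-verify loop behind item 5 can be sketched generically. The two "models" below are deterministic stubs standing in for the EAGLE-3 drafter and Kimi K2.5; nothing here reflects the real implementations:

```typescript
// Toy speculative decoding: a cheap drafter proposes n tokens, the target
// checks them in one verification pass and keeps the longest matching prefix.
type Model = (context: number[]) => number; // next-token function (stub)

// Stand-in target: always emits previous token + 1 (mod 100).
const target: Model = (ctx) => (ctx[ctx.length - 1] + 1) % 100;
// Stand-in drafter: agrees with the target except when the true next
// token would be 50, where it wrongly proposes 0.
const draft: Model = (ctx) => {
  const next = (ctx[ctx.length - 1] + 1) % 100;
  return next === 50 ? 0 : next;
};

function speculativeStep(ctx: number[], n: number): number[] {
  // 1. Drafter proposes n tokens autoregressively (cheap).
  const proposed: number[] = [];
  const local = ctx.slice();
  for (let i = 0; i < n; i++) {
    const t = draft(local);
    proposed.push(t);
    local.push(t);
  }
  // 2. Target verifies the batch: accept until the first mismatch, then
  //    emit the target's own token there, so progress is always >= 1 token.
  const accepted: number[] = [];
  const verifyCtx = ctx.slice();
  for (const t of proposed) {
    const want = target(verifyCtx);
    if (t !== want) {
      accepted.push(want); // target's correction at the mismatch
      return accepted;
    }
    accepted.push(t);
    verifyCtx.push(t);
  }
  return accepted;
}

// Drafter diverges where the true token is 50: two draft tokens are
// accepted, then the target supplies the correction.
console.log(speculativeStep([47], 4)); // [ 48, 49, 50 ]
```

When the drafter and target agree (the common case on repetitive agentic tool-call output), each verification pass yields up to n tokens instead of one, which is where the speed-up comes from.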

Post-launch speed-up

"We announced that Workers AI is officially entering the arena for hosting large open-source models like Moonshot's Kimi K2.5. Since then, we've made Kimi K2.5 3x faster and have more model additions in-flight."

The 3× post-launch speed-up came from retuning the serving configuration for the observed agentic traffic shape, not from adding hardware. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

Role

Third-party open-weight LLM hosted on Cloudflare's Workers AI inference platform, and the canonical extra-large model in the Cloudflare stack at time of writing.

No model-architecture, parameter-count-breakdown, or training details beyond "over 1 trillion parameters" + "about 560GB of model weights" are disclosed in the raw posts.

Open source lineage

From Moonshot AI; weights released under an open-weight license. The same company developed the Mooncake Transfer Engine + Mooncake Store KV infrastructure (github.com/kvcache-ai/Mooncake) that Cloudflare consumes externally — Moonshot's serving infrastructure ships with the model.
