Kimi K2.5¶
Overview¶
Kimi K2.5 is Moonshot AI's large language model — over 1 trillion parameters (~560 GB of weights) — served by Workers AI as `@cf/moonshotai/kimi-k2.5`. It is the canonical instance of an extra-large model on Cloudflare's Developer Platform and drives several of the stack's multi-GPU, KV-sharing, and speculative-decoding decisions.
The 2026-04-16 AI Search launch post uses Kimi K2.5 as the chat LLM in the support-agent worked example:
```ts
const workersai = createWorkersAI({ binding: this.env.AI });
const result = streamText({
  model: workersai("@cf/moonshotai/kimi-k2.5"),
  system: `You are a support agent. Use search_knowledge_base …`,
  …
});
```
Size and hardware footprint¶
From the 2026-04-16 high-performance-LLMs post:
"Large language models like Kimi K2.5 are over 1 trillion parameters, which is about 560GB of model weights. A typical H100 has about 80GB of VRAM and the model weights need to be loaded in GPU memory in order to run. This means that a model like Kimi K2.5 needs at least 8 H100s in order to load the model into memory and run — and that's not even including the extra VRAM you would need for KV Cache, which includes your context window."
Minimum hardware to run Kimi K2.5:
| Quantity | Value |
|---|---|
| Parameters | >1T |
| Weights | ~560 GB |
| VRAM per H100 | 80 GB |
| Minimum GPU count (weights only) | 8× H100 = 640 GB |
| Additional VRAM needed | KV cache (scales with context length) |
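The arithmetic behind the table can be sketched directly. Figures are from the post; the 40 GiB KV-cache reservation below is an illustrative assumption, not a number from the source. Note that the weights alone divide evenly into 7 cards with zero headroom, so any KV-cache or runtime reservation pushes the count to the post's 8-GPU minimum:

```typescript
// Figures from the post: ~560 GB of weights, 80 GB of VRAM per H100.
const WEIGHTS_GB = 560;
const VRAM_PER_H100_GB = 80;

// GPUs needed to hold the weights plus an optional KV-cache reservation.
function minGpus(weightsGb: number, vramPerGpuGb: number, kvReserveGb = 0): number {
  return Math.ceil((weightsGb + kvReserveGb) / vramPerGpuGb);
}

console.log(minGpus(WEIGHTS_GB, VRAM_PER_H100_GB));     // 7 — weights fit exactly, zero headroom
console.log(minGpus(WEIGHTS_GB, VRAM_PER_H100_GB, 40)); // 8 — any KV reservation forces an 8th card
```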
Using Infire, Cloudflare fits Kimi K2.5 on 8× H100 (not H200) with >30 GiB left free for KV cache after all overheads; per the post, "In both cases you would have trouble even booting vLLM in the first place." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
Serving architecture around Kimi K2.5¶
Cloudflare's architecture for serving Kimi K2.5 combines multiple primitives (all in Workers AI):
- Multi-GPU serving — sharded across 8× H100 using pipeline + tensor parallelism (possibly expert parallelism as well, depending on whether Kimi K2.5 is MoE; the post doesn't explicitly say).
- PD disaggregation — prefill and decode on separate node tiers.
- Mooncake Transfer Engine + Mooncake Store — Moonshot's own KV-transfer + NVMe-tier infrastructure; used cluster-wide to share KV cache across GPUs and nodes over RDMA.
- `x-session-affinity` — client-signalled routing for cross-cluster / cross-region prefix-cache reuse.
- Speculative decoding with NVIDIA EAGLE-3 — the drafter model `nvidia/Kimi-K2.5-Thinking-Eagle3` proposes N draft tokens that the target model verifies per forward pass; shines on agentic tool-call generation.
- Infire — Cloudflare's proprietary Rust inference engine; runs Kimi K2.5 with lower memory overhead than vLLM.
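As a toy illustration of the draft-and-verify idea in the speculative-decoding bullet above — a greedy acceptance rule only, not Cloudflare's or EAGLE-3's actual implementation, which batches verification of all N drafts into a single target forward pass:

```typescript
type NextToken = (context: number[]) => number;

// Greedy draft-and-verify: accept each draft token while it matches what the
// target model would have produced; on the first mismatch, emit the target's
// own token instead and stop. Every accepted draft is a token the target got
// "for free" without generating it autoregressively.
function verifyDrafts(targetNext: NextToken, drafts: number[], context: number[]): number[] {
  const accepted: number[] = [];
  for (const d of drafts) {
    const t = targetNext([...context, ...accepted]);
    if (t === d) {
      accepted.push(d); // draft confirmed by the target
    } else {
      accepted.push(t); // correction token from the target
      break;
    }
  }
  return accepted;
}

// Toy deterministic "target model": next token = context length mod 5.
const target: NextToken = (ctx) => ctx.length % 5;
console.log(verifyDrafts(target, [1, 2, 9], [0])); // [1, 2, 3] — two drafts accepted, then corrected
```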
Post-launch speed-up¶
"We announced that Workers AI is officially entering the arena for hosting large open-source models like Moonshot's Kimi K2.5. Since then, we've made Kimi K2.5 3x faster and have more model additions in-flight."
The 3× post-launch speed-up came from retuning the serving configuration for the observed agentic traffic shape, not from adding hardware. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
Role¶
Third-party open-weight LLM hosted on Cloudflare's Workers AI inference platform — the canonical extra-large model in the Cloudflare stack at the time of writing, used in:
- AI Search support-agent worked example (chat LLM).
- Workers AI high-performance serving deep dive (canonical extra-large model example).
No architecture, parameter-count breakdown, or training details beyond "over 1 trillion parameters" and "about 560GB of model weights" are disclosed in the source posts.
Open source lineage¶
From Moonshot AI; weights are released under an open-weight license. The same company developed the Mooncake Transfer Engine + Mooncake Store KV infrastructure (github.com/kvcache-ai/Mooncake) that Cloudflare consumes externally — in effect, Moonshot's serving infrastructure ships alongside the model.
Seen in¶
- sources/2026-04-16-cloudflare-ai-search-the-search-primitive-for-your-agents — chat LLM in support-agent worked example.
- sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models — canonical extra-large model example in Cloudflare's deep dive on LLM serving infrastructure; all numbers above.
Related¶
- systems/workers-ai — serving platform.
- systems/infire — inference engine.
- systems/mooncake-transfer-engine / systems/mooncake-store — KV substrate (from same origin).
- systems/eagle-3 — speculative-decoding drafter paired with Kimi K2.5.
- systems/cloudflare-agents-sdk — canonical consumer.
- systems/cloudflare-ai-search — paired retrieval primitive in the support-agent worked example.
- concepts/multi-gpu-serving / concepts/speculative-decoding / concepts/prefill-decode-disaggregation
- companies/cloudflare — serving platform's parent org.