
Kimi K2.5

Overview

Kimi K2.5 is Moonshot AI's large language model (>1 trillion parameters, ~560 GB of weights), served by Workers AI as @cf/moonshotai/kimi-k2.5. It is the canonical extra-large model on Cloudflare's Developer Platform and drives several of the stack's multi-GPU, KV-sharing, and speculative-decoding decisions.

The 2026-04-16 AI Search launch post uses Kimi K2.5 as the chat LLM in the support-agent worked example:

import { createWorkersAI } from "workers-ai-provider";
import { streamText } from "ai";

const workersai = createWorkersAI({ binding: this.env.AI });
const result = streamText({
  model: workersai("@cf/moonshotai/kimi-k2.5"),
  system: `You are a support agent. Use search_knowledge_base …`,
});

Size and hardware footprint

From the 2026-04-16 high-performance-LLMs post:

"Large language models like Kimi K2.5 are over 1 trillion parameters, which is about 560GB of model weights. A typical H100 has about 80GB of VRAM and the model weights need to be loaded in GPU memory in order to run. This means that a model like Kimi K2.5 needs at least 8 H100s in order to load the model into memory and run — and that's not even including the extra VRAM you would need for KV Cache, which includes your context window."

Minimum hardware to run Kimi K2.5:

  Parameters                          >1T
  Weights                             ~560 GB
  Per-H100 VRAM                       80 GB
  Minimum GPU count (weights only)    8× H100 = 640 GB
  KV cache                            additional VRAM on top, scales with context length
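The GPU-count arithmetic above can be sketched directly. The 560 GB and 80 GB figures come from the post; the per-GPU overhead reservation below is an illustrative assumption, chosen only to show why an exact 560/80 = 7 split is not actually bootable:

```typescript
// Illustrative arithmetic only: 560 GB of weights and 80 GB per H100 are from
// the post; the per-GPU overhead number is an assumption, not a quoted figure.
const weightsGiB = 560;   // Kimi K2.5 weights
const h100VramGiB = 80;   // VRAM per H100
const overheadGiB = 6;    // assumed CUDA context / runtime overhead per GPU

// Weights alone divide evenly: 560 / 80 = 7 GPUs, with zero bytes of headroom.
const weightsOnly = Math.ceil(weightsGiB / h100VramGiB);

// With any per-GPU overhead (before reserving anything for KV cache),
// 7 GPUs no longer fit, which is why the post's floor is 8x H100.
const withOverhead = Math.ceil(weightsGiB / (h100VramGiB - overheadGiB));

console.log(weightsOnly, withOverhead); // 7 8
```

The KV-cache row of the table then eats into whatever headroom the 8-GPU layout leaves, which is the gap Infire's lower overhead exploits.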

Using Infire, Cloudflare fits Kimi K2.5 on 8× H100 (not H200), leaving more than 30 GiB free for KV cache after all overheads; under the same constraints, the post notes, "In both cases you would have trouble even booting vLLM in the first place." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

Serving architecture around Kimi K2.5

Cloudflare's architecture for serving Kimi K2.5 combines multiple primitives (all in Workers AI):

  1. Multi-GPU serving — sharded across 8× H100 using pipeline + tensor parallelism (possibly expert depending on whether Kimi is MoE — post doesn't explicitly say).
  2. PD disaggregation — prefill and decode on separate node tiers.
  3. Mooncake Transfer Engine + Mooncake Store — Moonshot's own KV-transfer + NVMe-tier infrastructure; used cluster-wide to share KV cache across GPUs and nodes over RDMA.
  4. x-session-affinity — client-signalled routing for cross-cluster / cross-region prefix cache reuse.
  5. Speculative decoding with NVIDIA EAGLE-3 — the drafter model nvidia/Kimi-K2.5-Thinking-Eagle3 proposes draft tokens that the target model verifies N at a time per forward pass; shines on agentic tool-call generation.
  6. Infire — Cloudflare's proprietary Rust inference engine runs Kimi K2.5 with lower memory overhead than vLLM.
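The draft-and-verify loop behind item 5 can be sketched generically. The two "models" below are deterministic stubs standing in for the EAGLE-3 drafter and Kimi K2.5; nothing here reflects the real implementations:

```typescript
// Toy speculative decoding: a cheap drafter proposes n tokens, the target
// checks them in one verification pass and keeps the longest matching prefix.
type Model = (context: number[]) => number; // next-token function (stub)

// Stand-in target: always emits previous token + 1 (mod 100).
const target: Model = (ctx) => (ctx[ctx.length - 1] + 1) % 100;
// Stand-in drafter: agrees with the target except when the true next
// token would be 50, where it wrongly proposes 0.
const draft: Model = (ctx) => {
  const next = (ctx[ctx.length - 1] + 1) % 100;
  return next === 50 ? 0 : next;
};

function speculativeStep(ctx: number[], n: number): number[] {
  // 1. Drafter proposes n tokens autoregressively (cheap).
  const proposed: number[] = [];
  const local = ctx.slice();
  for (let i = 0; i < n; i++) {
    const t = draft(local);
    proposed.push(t);
    local.push(t);
  }
  // 2. Target verifies the batch: accept until the first mismatch, then
  //    emit the target's own token there, so progress is always >= 1 token.
  const accepted: number[] = [];
  const verifyCtx = ctx.slice();
  for (const t of proposed) {
    const want = target(verifyCtx);
    if (t !== want) {
      accepted.push(want); // target's correction at the mismatch
      return accepted;
    }
    accepted.push(t);
    verifyCtx.push(t);
  }
  return accepted;
}

// Drafter diverges where the true token is 50: two draft tokens are
// accepted, then the target supplies the correction.
console.log(speculativeStep([47], 4)); // [ 48, 49, 50 ]
```

When the drafter and target agree (the common case on repetitive agentic tool-call output), each verification pass yields up to n tokens instead of one, which is where the speed-up comes from.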

Post-launch speed-up

"We announced that Workers AI is officially entering the arena for hosting large open-source models like Moonshot's Kimi K2.5. Since then, we've made Kimi K2.5 3x faster and have more model additions in-flight."

The 3× post-launch speed-up came from retuning the serving configuration for the observed agentic traffic shape, not from adding hardware. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

Role

Third-party open-weight LLM hosted on Cloudflare's Workers AI inference platform, and the canonical extra-large model in the Cloudflare stack at time of writing.

No model-architecture, parameter-count-breakdown, or training details beyond "over 1 trillion parameters" + "about 560GB of model weights" are disclosed in the raw posts.

Open source lineage

From Moonshot AI; weights released under an open-weight license. The same company developed the Mooncake Transfer Engine + Mooncake Store KV infrastructure (github.com/kvcache-ai/Mooncake) that Cloudflare consumes externally — Moonshot's serving infrastructure ships with the model.
