Mooncake Transfer Engine¶
Overview¶
Mooncake Transfer Engine (github.com/kvcache-ai/Mooncake) is an open-source high-performance data-transfer framework from Moonshot AI for moving KV-cache blocks between GPUs and nodes in multi-GPU LLM-serving clusters. "It works with different Remote Direct Memory Access (RDMA) protocols such as NVLink and NVMe over Fabric, which enables direct memory-to-memory data transfer without involving the CPU." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
Problem it solves¶
When a single LLM instance spans multiple GPUs (forced when the model weights alone exceed per-GPU VRAM — e.g. Kimi K2.5 at ~560 GB served on 8× H100), the KV cache is distributed across those GPUs, and KV tensors must be exchanged as generation proceeds and as requests move between replicas (e.g. for PD disaggregation, cluster-wide prefix reuse, or migration to warmer nodes).
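The per-request KV-cache footprint being moved follows from standard transformer arithmetic. A minimal sketch — the layer/head counts below are illustrative defaults, not Kimi K2.5's actual configuration:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    """Per-request KV-cache size for a standard transformer:
    2 tensors (K and V) per layer, each of shape [seq_len, kv_heads, head_dim]."""
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

# Hypothetical config: 60 layers, 8 KV heads (GQA), head_dim 128, fp16.
per_token = kv_cache_bytes(layers=60, kv_heads=8, head_dim=128,
                           seq_len=1, dtype_bytes=2)
print(per_token)  # 245760 bytes (~240 KB) per token of context
```

At that rate a 100k-token context is tens of gigabytes of KV state — which is why moving it over the network efficiently matters at all.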
Traditional socket-based transfer requires:
- Source GPU → host DRAM (CPU copy),
- Host DRAM → NIC → network → NIC → destination host DRAM (CPU copy),
- Host DRAM → destination GPU (CPU copy).
Three CPU-mediated hops through host DRAM. Mooncake's Transfer Engine replaces this with RDMA, so transfers go direct GPU-memory ↔ GPU-memory without CPU involvement.
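As a toy illustration of the contrast (this is not Mooncake's API — just a model counting CPU-mediated copies on each path):

```python
# Each entry is (copy step, whether the CPU/host DRAM is on the data path).
SOCKET_PATH = [
    ("src GPU -> src host DRAM", True),
    ("src host DRAM -> NIC -> network -> NIC -> dst host DRAM", True),
    ("dst host DRAM -> dst GPU", True),
]
RDMA_PATH = [
    ("src GPU -> NIC -> network -> NIC -> dst GPU", False),  # GPUDirect-style
]

def cpu_copies(path) -> int:
    """Count the hops that stage data through the CPU / host DRAM."""
    return sum(1 for _, cpu_involved in path if cpu_involved)

print(cpu_copies(SOCKET_PATH))  # 3
print(cpu_copies(RDMA_PATH))    # 0
```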
Transport layers¶
The transport backends named in the Cloudflare post:
- NVLink — NVIDIA's high-bandwidth GPU↔GPU interconnect (intra-node); hundreds of GB/s bi-directional; several generations (NVLink-C2C, NVSwitch fabric).
- NVMe over Fabrics (NVMe-oF) — remote block protocol using RDMA transport (RoCE, iWARP, InfiniBand); extends block-storage semantics over the network.
The substrate is protocol-agnostic: "It works with different Remote Direct Memory Access (RDMA) protocols". Choice depends on topology (intra-node = NVLink, inter-node = NVMe-oF / InfiniBand / RoCE).
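A hedged sketch of what a topology-based transport choice could look like — the policy, type, and names here are illustrative assumptions; the post does not describe Mooncake's actual selection logic:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Endpoint:
    """Hypothetical handle for a GPU in the cluster."""
    node: str
    gpu: int

def pick_transport(src: Endpoint, dst: Endpoint) -> str:
    """Illustrative policy: NVLink for GPU pairs on the same node,
    an RDMA fabric (InfiniBand / RoCE / NVMe-oF) across nodes."""
    return "nvlink" if src.node == dst.node else "rdma"

print(pick_transport(Endpoint("n1", 0), Endpoint("n1", 3)))  # nvlink
print(pick_transport(Endpoint("n1", 0), Endpoint("n2", 1)))  # rdma
```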
Role in Cloudflare's Workers AI stack¶
"To achieve this for Kimi, we leveraged Moonshot AI's Mooncake Transfer Engine and Mooncake Store." Specifically:
- Moves KV cache between Infire replicas in a cluster when sessions migrate or when prefill → decode hand-off happens.
- Paired with LMCache or SGLang HiCache for cluster-wide prefix reuse, enabling "a prefill node to identify and re-use a cache from a previous request that was originally pre-filled on a different node."
- Enables even load balancing within a cluster, since same-cluster traffic no longer needs session-affinity hints: "This eliminates the need for session aware routing within a cluster and allows us to load balance the traffic much more evenly." (Session affinity via `x-session-affinity` still matters across regions / clusters.)
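A toy sketch of the resulting routing split: sticky cluster selection across regions, round-robin within a cluster. The `x-session-affinity` header name is from the source; the `Router` class and cluster names are illustrative assumptions:

```python
import itertools

class Router:
    """Hypothetical router: x-session-affinity picks the cluster (sticky
    across regions), but within the chosen cluster traffic is balanced
    evenly, since any replica can fetch the session's KV cache over Mooncake."""
    def __init__(self, clusters: dict):
        self.clusters = clusters
        # Round-robin iterator over the replicas of each cluster.
        self._rr = {name: itertools.cycle(replicas)
                    for name, replicas in clusters.items()}

    def route(self, headers: dict) -> str:
        affinity = headers.get("x-session-affinity")
        cluster = affinity if affinity in self.clusters else next(iter(self.clusters))
        return next(self._rr[cluster])

r = Router({"ams": ["ams-0", "ams-1"], "sfo": ["sfo-0"]})
print(r.route({"x-session-affinity": "ams"}))  # ams-0
print(r.route({"x-session-affinity": "ams"}))  # ams-1  (no per-session pinning)
```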
Companion: Mooncake Store¶
The sibling component in the Mooncake stack (kvcache-ai/Mooncake) extends the KV cache beyond GPU VRAM to NVMe, functioning as a cold tier for idle sessions and long-lived prefix caches. "Mooncake Store also allows us to extend the cache beyond GPU VRAM, and leverage NVMe storage. This extends the time that sessions remain in cache, improving our cache hit ratio and allowing us to handle more traffic and offer better performance to users." See systems/mooncake-store.
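Conceptually this is a demote-on-evict tiered cache: VRAM evictions fall to NVMe instead of being discarded, so an idle session's next request hits the cold tier rather than forcing a re-prefill. A minimal sketch under that assumption (not Mooncake Store's actual API):

```python
from collections import OrderedDict

class TieredKVCache:
    """Toy two-tier KV cache: a small 'VRAM' LRU tier that demotes
    evicted entries to a larger 'NVMe' tier instead of dropping them."""
    def __init__(self, vram_slots: int):
        self.vram = OrderedDict()   # hot tier, LRU order
        self.nvme = {}              # cold tier (unbounded in this sketch)
        self.vram_slots = vram_slots

    def put(self, key, blocks):
        self.vram[key] = blocks
        self.vram.move_to_end(key)
        while len(self.vram) > self.vram_slots:
            cold_key, cold_blocks = self.vram.popitem(last=False)
            self.nvme[cold_key] = cold_blocks    # demote, don't discard

    def get(self, key):
        if key in self.vram:
            self.vram.move_to_end(key)
            return self.vram[key]
        if key in self.nvme:                     # cold hit: promote to VRAM
            self.put(key, self.nvme.pop(key))
            return self.vram[key]
        return None                              # true miss -> re-run prefill

cache = TieredKVCache(vram_slots=2)
for k in ("s1", "s2", "s3"):
    cache.put(k, f"kv-{k}")
print(cache.get("s1"))  # kv-s1 — served from the NVMe tier, not recomputed
```

The design choice this models is the quoted one: extending the cache beyond VRAM lengthens how long sessions stay cached, raising the hit ratio.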
Related primitives¶
- KV cache — what is being transferred.
- RDMA KV transfer — the transport primitive.
- Prefill/decode disaggregation — one of the named use cases for high-bandwidth KV transfer.
- Multi-GPU serving — the architectural regime Mooncake targets.
Open-source origin¶
From Moonshot AI, the developers of Kimi K2.5. Open-source: github.com/kvcache-ai/Mooncake. Published separately from the Kimi weights; Cloudflare consumes it as an external dependency. See Mooncake paper for the original architecture.
Caveats¶
- Post is thin on operational detail. No numbers for Mooncake-transfer throughput, latency, per-hop overhead, or scale.
- Integration point with Infire not disclosed — whether Infire links Mooncake directly or sits behind a side-car.
- Fabric choice per workload not disclosed — NVLink-only, NVMe-oF-only, or mixed.
- Failure modes not discussed — what happens on RDMA fabric partition, link degradation, or a node losing its NVMe tier.
Seen in¶
- sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models — introduced as the KV-transfer substrate for multi-GPU Kimi serving on Workers AI.
Related¶
- systems/mooncake-store — sibling NVMe-tier KV cache in the Mooncake stack.
- systems/kimi-k2-5 — Moonshot AI's model served by Cloudflare on this substrate.
- systems/workers-ai / systems/infire — Cloudflare consumers.
- systems/lmcache / systems/sglang — paired with Mooncake for cluster-wide prefix reuse.
- concepts/kv-cache / concepts/rdma-kv-transfer / concepts/multi-gpu-serving / concepts/prefill-decode-disaggregation
- companies/cloudflare