
RDMA KV transfer

Definition

RDMA KV transfer is the serving-infrastructure primitive for moving LLM KV cache blocks between GPUs (intra-node) or nodes (inter-node) using Remote Direct Memory Access protocols — direct GPU-memory ↔ GPU-memory transfer without CPU involvement.

Canonical wiki instance: Cloudflare Workers AI uses Mooncake Transfer Engine over NVLink and NVMe over Fabrics RDMA protocols to transfer KV state across multi-GPU instances and across cluster nodes. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

Why RDMA

The non-RDMA KV transfer path costs three serialized copy hops plus host DRAM staging:

  1. Source GPU → source host DRAM (copy engine, CPU-mediated control)
  2. Source host DRAM → NIC → network → dest NIC → dest host DRAM
  3. Dest host DRAM → dest GPU (copy engine)
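The cost difference between those three hops and a single NIC-to-NIC hop can be sketched with back-of-envelope arithmetic. The block size and bandwidth figures below are round illustrative assumptions, not measurements from the Cloudflare post:

```python
# Illustrative timing of the staged (CPU-mediated) path vs a direct RDMA
# path for one KV-cache block. All numbers are assumed for illustration.

KV_BLOCK_BYTES = 200 * 1024**2   # assume a ~200 MB KV block

PCIE_GBPS = 25e9                 # ~PCIe Gen4 x16 effective bandwidth (assumed)
NET_GBPS = 25e9                  # ~200 Gb network effective bandwidth (assumed)

def staged_seconds(nbytes: float) -> float:
    """Three serialized hops: GPU->host DRAM, DRAM->network->DRAM, DRAM->GPU."""
    return nbytes / PCIE_GBPS + nbytes / NET_GBPS + nbytes / PCIE_GBPS

def rdma_seconds(nbytes: float) -> float:
    """One NIC-to-NIC hop, source GPU memory to destination GPU memory."""
    return nbytes / NET_GBPS

print(f"staged: {staged_seconds(KV_BLOCK_BYTES)*1e3:.1f} ms")  # three hops
print(f"rdma:   {rdma_seconds(KV_BLOCK_BYTES)*1e3:.1f} ms")    # one hop
```

With PCIe and network bandwidth assumed equal, the staged path is exactly 3× slower; in practice the hops also contend for CPU copy-engine scheduling, which RDMA avoids entirely.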

At LLM serving scale — where KV blocks for a single request can be hundreds of MB and transfers happen per hand-off (prefill→decode migration, session resumption, cluster-wide cache reuse) — that path is prohibitive. RDMA collapses it:

  • GPUDirect RDMA / NVLink — direct GPU↔GPU on the same node (PCIe peer-to-peer or NVSwitch fabric) — bypasses host DRAM and CPU entirely.
  • NVMe-oF (over RDMA) / RoCE / InfiniBand — inter-node GPU↔GPU via a direct NIC-to-NIC path; still no CPU in the data path.

From the source:

"It works with different Remote Direct Memory Access (RDMA) protocols such as NVLink and NVMe over Fabric, which enables direct memory-to-memory data transfer without involving the CPU."

Transport-substrate choices

Transport    Scope                 Physical substrate
NVLink       intra-node GPU↔GPU    NVIDIA proprietary high-bandwidth link
PCIe P2P     intra-node fallback   PCIe switch between GPUs
NVMe-oF      inter-node            RDMA transport (RoCE v1/v2, iWARP, or InfiniBand)
InfiniBand   inter-node            dedicated RDMA fabric
RoCE         inter-node            RDMA over Ethernet

Mooncake's Transfer Engine is transport-agnostic; the per-workload choice depends on cluster topology.
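A topology-driven selection policy might look like the sketch below. The function and its rules are hypothetical — the post does not describe Mooncake's actual selection logic — but they capture the table's scope column:

```python
from typing import Optional

def pick_transport(same_node: bool, has_nvlink: bool,
                   rdma_nic: Optional[str]) -> str:
    """Hypothetical transport selection mirroring the table above.

    rdma_nic names the inter-node fabric if one exists
    ("infiniband", "roce", or "nvme_of"); None means no RDMA path.
    """
    if same_node:
        # Intra-node: prefer the NVLink/NVSwitch fabric, fall back to PCIe P2P.
        return "nvlink" if has_nvlink else "pcie_p2p"
    if rdma_nic in ("infiniband", "roce", "nvme_of"):
        return rdma_nic
    raise RuntimeError("no RDMA path between nodes; staged copy required")

print(pick_transport(same_node=True, has_nvlink=True, rdma_nic=None))
print(pick_transport(same_node=False, has_nvlink=False, rdma_nic="roce"))
```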

Why this matters under PD disaggregation

Disaggregation places prefill and decode on different physical machines. Prefill's output is the KV cache, and decode needs it to produce the first token. The time to transfer KV cache inter-node directly adds to TTFT. RDMA is not an optional optimisation here — it's the only transport that keeps the transfer budget under the TTFT budget at long context lengths.
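The TTFT pressure can be made concrete with a KV-size estimate. The model shape below (32 layers, 8 GQA KV heads, head dim 128, fp16) is an illustrative assumption, not a figure from the post:

```python
# Back-of-envelope TTFT contribution of the prefill->decode KV hand-off.
# All model-shape and bandwidth numbers are illustrative assumptions.

def kv_bytes(tokens: int, layers: int = 32, kv_heads: int = 8,
             head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """K and V tensors across all layers (GQA-style KV head count, fp16)."""
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_elem

def transfer_ms(nbytes: int, gbytes_per_s: float) -> float:
    return nbytes / (gbytes_per_s * 1e9) * 1e3

size = kv_bytes(16_000)                             # a 16k-token context
print(f"KV cache: {size / 1e9:.1f} GB")             # ~2.1 GB
print(f"@ 50 GB/s RDMA:  {transfer_ms(size, 50):.0f} ms")
print(f"@ 5 GB/s staged: {transfer_ms(size, 5):.0f} ms")
```

At these assumed rates the hand-off costs ~42 ms over RDMA but ~420 ms staged — the latter alone can exceed a typical interactive TTFT target, and the gap widens linearly with context length.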

Why this matters under multi-GPU model-sharding

Large models (e.g. Kimi K2.5 at ~560 GB on 8× H100) span multiple GPUs by necessity. Each attention layer's K/V tensor lives split across GPUs, and attention output requires exchanging activations / KV slices per layer. The underlying primitive is NVLink-class RDMA; a CPU-mediated transfer would blow the per-token latency budget. See concepts/multi-gpu-serving, concepts/tensor-parallelism.
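The per-layer exchange volume explains why: under tensor parallelism each layer typically runs two all-reduces over the hidden-state activation per token. The shape numbers below are illustrative assumptions, not Kimi K2.5's actual dimensions:

```python
# Rough per-token inter-GPU traffic under tensor parallelism.
# layers/hidden/tp values are illustrative assumptions.

def per_token_comm_bytes(layers: int, hidden: int, tp: int,
                         bytes_per_elem: int = 2) -> int:
    # A ring all-reduce moves ~2*(tp-1)/tp of the buffer per GPU;
    # two all-reduces per layer (after attention and after the MLP).
    per_allreduce = 2 * (tp - 1) / tp * hidden * bytes_per_elem
    return int(2 * layers * per_allreduce)

print(per_token_comm_bytes(layers=80, hidden=8192, tp=8))  # ~4.6 MB / token
```

A few MB per token is trivial over NVLink (hundreds of GB/s) but, multiplied by decode rates of tens of tokens per second per sequence across a batch, would saturate any CPU-staged path.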

Companion tiering primitive

Mooncake Store extends the RDMA-addressable cache onto NVMe storage — so a KV block can live on local NVMe on any cluster node, and be RDMA-fetched directly from there to a decoding GPU without CPU staging. "Mooncake Store also allows us to extend the cache beyond GPU VRAM, and leverage NVMe storage. This extends the time that sessions remain in cache."
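The tiering behavior can be sketched as a two-level lookup. The class and method names below are invented for illustration — they are not the Mooncake Store API — and real NVMe hits would be RDMA reads straight into the decoding GPU rather than Python dict moves:

```python
# Hypothetical sketch of tiered KV-block lookup in the spirit of
# Mooncake Store: check GPU VRAM first, fall back to an RDMA-addressable
# NVMe tier. Names are invented; this is not the Mooncake API.
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class TieredKVStore:
    vram: Dict[str, bytes] = field(default_factory=dict)  # hot tier
    nvme: Dict[str, bytes] = field(default_factory=dict)  # capacity tier

    def put(self, block_id: str, data: bytes, hot: bool = True) -> None:
        (self.vram if hot else self.nvme)[block_id] = data

    def get(self, block_id: str) -> Tuple[bytes, str]:
        """Return (data, tier-it-was-found-in)."""
        if block_id in self.vram:
            return self.vram[block_id], "vram"
        if block_id in self.nvme:
            # Promote on access so a resumed session stays hot in VRAM.
            data = self.nvme.pop(block_id)
            self.vram[block_id] = data
            return data, "nvme"
        raise KeyError(block_id)
```

The NVMe tier is what "extends the time that sessions remain in cache": a block evicted from VRAM survives on local NVMe anywhere in the cluster and remains one RDMA fetch away.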

Caveats

  • The Cloudflare post discloses no bandwidth / latency numbers for their KV transfers.
  • Fabric topology per region not disclosed — which clusters use NVLink-only, which use NVMe-oF, which mix.
  • Failure modes not discussed — RDMA fabric partitions, link-degradation scenarios.
  • Costs of the RDMA substrate (high-bandwidth NICs, InfiniBand switches, cable plants) are load-bearing on cluster economics but not addressed in the post.
