CONCEPT

Multi-GPU serving (LLM)

Definition

Multi-GPU serving is the LLM-inference regime in which a single model instance spans multiple GPUs because the model's weights plus working-set memory exceed a single GPU's VRAM. The regime splits weights and computation across GPUs using one or more of tensor, pipeline, or (for MoE) expert parallelism, and requires a high-bandwidth interconnect (NVLink intra-node, InfiniBand / RoCE inter-node) to carry the per-step inter-GPU communication.

Canonical wiki instance: Cloudflare Workers AI, 2026-04-16, on Kimi K2.5. At >1T parameters the model needs ~560 GB for weights alone; 8× H100 (80 GB VRAM each) supply 640 GB total, leaving <80 GB cluster-wide for KV cache under naive management. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

The forcing-function

From the source:

"A typical H100 has about 80GB of VRAM and the model weights need to be loaded in GPU memory in order to run. This means that a model like Kimi K2.5 needs at least 8 H100s in order to load the model into memory and run — and that's not even including the extra VRAM you would need for KV Cache, which includes your context window."

Three memory consumers compete for VRAM:

  1. Model weights — fixed per model, large.
  2. KV cache — scales with requests × context length × per-token KV size (set by layer count and attention dimensions, not total parameters); can easily exceed weights at long contexts.
  3. Activations / internal state — engine-dependent; the consumer Infire optimises aggressively relative to baseline engines.

All three have to fit. The memory equation per-GPU looks like:

VRAM_per_GPU ≥ (weights_total / G) + (KV_total / G) + activations + slack

Where G is the multi-GPU split degree. Multi-GPU serving is forced whenever the G = 1 total exceeds a single GPU's VRAM.
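The inequality can be turned around to find the smallest workable G. A minimal sketch, with illustrative numbers that are my assumptions rather than figures from the post:

```python
import math

def min_gpu_count(weights_gib: float, kv_gib: float,
                  activations_gib: float, slack_gib: float,
                  vram_per_gpu_gib: float) -> int:
    """Smallest split degree G with weights/G + KV/G + activations + slack <= VRAM."""
    fixed = activations_gib + slack_gib          # per-GPU terms that do not shard
    if fixed >= vram_per_gpu_gib:
        raise ValueError("per-GPU overhead alone exceeds VRAM")
    sharded = weights_gib + kv_gib               # terms divided by G
    return max(1, math.ceil(sharded / (vram_per_gpu_gib - fixed)))

# ~560 GiB of weights on 80 GiB H100s with an assumed ~8 GiB per-GPU overhead
print(min_gpu_count(560, 0, 4, 4, 80))  # -> 8
```

With zero overhead, 560 GiB fits in exactly 7 × 80 GiB; any realistic per-GPU allowance pushes the answer to the 8 H100s the source states.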

Three parallelism axes and their composition

Axis                 | What's split       | Comm pattern                  | Typical placement
Tensor parallelism   | weight matrices    | all-reduce per layer          | intra-node (NVLink)
Pipeline parallelism | transformer layers | point-to-point between stages | inter-node
Expert parallelism   | MoE experts        | all-to-all per MoE layer      | intra- or inter-node

The axes compose. A common shape: tensor-parallel within a node (tight NVLink domain, low-latency all-reduce) + pipeline-parallel across nodes (looser InfiniBand domain, point-to-point suffices) + expert-parallel across the cluster for MoE layers.
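The intra-node/inter-node split above can be sketched as a rank-placement rule. A hypothetical sketch (the names are mine, not Cloudflare's): keep each tensor-parallel group inside one node's NVLink domain and spread pipeline stages across nodes.

```python
def placement(gpus_per_node: int, tp: int, pp: int) -> dict:
    """Map global rank -> (pipeline_stage, tp_rank), with TP groups intra-node."""
    assert gpus_per_node % tp == 0, "a TP group must fit in one NVLink domain"
    # Consecutive ranks are packed node by node, so the tp ranks of one
    # pipeline stage (one all-reduce group) land on the same node.
    return {rank: (rank // tp, rank % tp) for rank in range(tp * pp)}
```

For example, placement(8, 8, 2) puts ranks 0-7 (stage 0) on the first node and ranks 8-15 (stage 1) on the second, so all-reduces stay on NVLink and only stage-boundary activations cross InfiniBand.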

Cloudflare's default:

"For most models, utilizing both pipeline parallelism and tensor parallelism in tandem provides the best balance of throughput and latency." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

KV cache coherence across GPUs

With the model sharded, the KV cache is sharded along the same axes — each GPU holds its slice. For any operation that crosses the GPU boundary (session migration, PD disaggregation hand-off, cluster-wide prefix reuse), KV slices have to move coherently. The substrate is RDMA-based transfer (Mooncake Transfer Engine in Cloudflare's stack).
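An illustrative sketch of the coherent-move requirement, not Cloudflare's API: when source and destination instances share the same shard layout, a session's KV cache moves slice-for-slice, GPU to GPU; an RDMA substrate (a Mooncake-style transfer engine) would carry each pair.

```python
def kv_transfer_plan(src_gpus: list, dst_gpus: list) -> list:
    """Pair KV slices: the slice held on src_gpus[i] lands on dst_gpus[i]."""
    if len(src_gpus) != len(dst_gpus):
        raise ValueError("layout mismatch: reshard before transferring")
    return [(src, dst, f"kv_slice_{i}")
            for i, (src, dst) in enumerate(zip(src_gpus, dst_gpus))]
```

The error branch is the interesting case: a hand-off between instances with different split degrees cannot be a straight copy and needs a reshard step first.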

Activation-memory overhead is the serving engine's differentiator

Beyond the model weights + KV cache, inference engines consume activation memory (intermediate tensors, attention scratch, communication buffers, etc.). On long contexts with many concurrent requests, this can run into tens of GB. The engine's memory-overhead discipline is load-bearing: more-efficient engines free more VRAM for KV cache, which directly buys more concurrent sessions or longer context.

Cloudflare's Infire-vs-vLLM framing in the post:

  • Infire: Llama 4 Scout on 2× H200, >56 GiB for KV (~1.2M-token capacity); Kimi K2.5 on 8× H100, >30 GiB for KV.
  • vLLM: "In both cases you would have trouble even booting vLLM in the first place."

(Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
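A back-of-envelope for turning freed VRAM into token capacity. The per-token formula (K and V tensors × layers × KV heads × head dim × dtype bytes) is the standard grouped-query-attention accounting; the example dimensions below are assumptions, not published Llama 4 Scout or Kimi K2.5 configs.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Per-token KV footprint: one K and one V tensor per layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def token_capacity(free_kv_bytes: int, layers: int, kv_heads: int,
                   head_dim: int, dtype_bytes: int = 2) -> int:
    """How many cached tokens fit in the VRAM left over for KV."""
    return free_kv_bytes // kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes)
```

For a hypothetical 48-layer model with 8 KV heads of dim 128 in fp16, a token costs 192 KiB, so each GiB freed by a leaner engine buys roughly 5,400 more tokens of cached context.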

Boot time

Multi-GPU serving engines have non-trivial cold-start: weights must be loaded onto every GPU, sharded, and the parallelism structure established before the first token. Infire claims <20 s even for the largest models, bounded by drive speed.
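A rough model of the drive-speed bound: best-case cold start is total weight bytes over aggregate read bandwidth, assuming host-to-GPU copies overlap the reads. The 30 GiB/s figure below is an assumed aggregate NVMe rate, not one quoted in the post.

```python
def load_time_s(weights_gib: float, drive_gib_per_s: float) -> float:
    """Lower bound on cold-start: read time for all weight bytes."""
    return weights_gib / drive_gib_per_s

print(round(load_time_s(560, 30), 1))  # ~560 GiB at 30 GiB/s -> 18.7 s
```

At that assumed rate a Kimi-K2.5-scale weight set reads in under 20 s, consistent in spirit with the claimed bound.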

Design considerations

  • Parallelism-degree choice per model per hardware.
  • Interconnect topology — NVLink-rich topologies vs InfiniBand-rich topologies dictate axis-to-placement choice.
  • Multi-tenant behaviour — one multi-GPU model instance burns its whole GPU set even for a single request; autoscaling granularity is the multi-GPU group, not the GPU.
  • Failure-mode behaviour — loss of one GPU in a multi-GPU instance typically takes the instance down.
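A hypothetical heuristic for the first bullet, matching the placement convention above: fix TP at the node width (so all-reduces stay on NVLink) and grow PP across nodes until the shards fit. The 8 GiB per-GPU overhead is an assumed allowance, not a measured figure.

```python
def choose_degrees(weights_gib: float, vram_per_gpu_gib: float,
                   gpus_per_node: int, overhead_gib: float = 8.0):
    """Pick (tp, pp): TP = node width, PP grown until per-GPU shards fit."""
    tp, pp = gpus_per_node, 1
    while weights_gib / (tp * pp) + overhead_gib > vram_per_gpu_gib:
        pp += 1
    return tp, pp
```

Under these assumptions choose_degrees(560, 80, 8) returns (8, 1): Kimi-scale weights fit in one 8-GPU node, while a 1,200 GiB model would need (8, 3), i.e. three pipeline stages across nodes.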

Caveats

  • Cloudflare post focuses on the happy path — no discussion of partial-GPU-failure recovery, hot-swap, or multi-tenancy on shared multi-GPU instances.
  • Per-model parallelism matrices not fully disclosed.
