CONCEPT
Multi-GPU serving (LLM)¶
Definition¶
Multi-GPU serving is the LLM-inference regime in which a single model instance spans multiple GPUs because the model's weights + working-set memory exceed a single GPU's VRAM. The shape splits weights and computation across GPUs using one or more of tensor, pipeline, or (for MoE) expert parallelism, and requires a high-bandwidth interconnect (NVLink intra-node, InfiniBand / RoCE inter-node) for per-step inter-GPU communication.
Canonical wiki instance: Cloudflare Workers AI 2026-04-16 on Kimi K2.5. At >1T parameters the weights alone are ~560 GB, so 8× H100 (80 GB VRAM each, 640 GB total) is the practical minimum, leaving <80 GB cluster-wide for KV cache under naive management. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
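The headline sizing can be sanity-checked with trivial arithmetic. The GB figures below come from the post; the observation about 7 vs 8 GPUs is an inference from them, not a quote.

```python
# Sanity-check of the Kimi K2.5 sizing from the Cloudflare post.
H100_VRAM_GB = 80
WEIGHTS_GB = 560      # stated weight footprint for the >1T-param model
NUM_GPUS = 8

total_vram = NUM_GPUS * H100_VRAM_GB   # 640 GB across the group
headroom = total_vram - WEIGHTS_GB     # 80 GB on paper

# 7 GPUs (7 * 80 = 560 GB) would fit the weights with zero slack, so 8 is
# the practical minimum; even then the nominal 80 GB of slack shrinks once
# CUDA contexts, communication buffers, and activations claim their share,
# hence "<80 GB cluster-wide" for KV cache.
print(total_vram, headroom)  # 640 80
```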
The forcing-function¶
From the source:
"A typical H100 has about 80GB of VRAM and the model weights need to be loaded in GPU memory in order to run. This means that a model like Kimi K2.5 needs at least 8 H100s in order to load the model into memory and run — and that's not even including the extra VRAM you would need for KV Cache, which includes your context window."
Three memory consumers compete for VRAM:
- Model weights — fixed per model, large.
- KV cache — scales with (concurrent requests × context length × per-token KV size, set by layer count, KV-head count, and head dimension); can easily exceed weights at long contexts.
- Activations / internal state — engine-dependent; the component Infire optimises aggressively relative to baseline engines.
All three have to fit. The per-GPU memory constraint looks like:

$$\frac{M_{\text{weights}} + M_{\text{KV}} + M_{\text{act}}}{G} \;\le\; \text{VRAM}_{\text{GPU}}$$

where $G$ is the multi-GPU split degree. Multi-GPU serving shows up whenever the numerator exceeds a single GPU's VRAM, i.e. the constraint fails at $G = 1$.
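As a minimal sketch, the constraint can be written as a fit-check. The function name and the uniform-split assumption are illustrative; real engines only approximate a clean split (weights shard cleanly, activations and buffers are partly replicated per GPU).

```python
def fits_on(gpus: int, vram_per_gpu_gb: float,
            weights_gb: float, kv_gb: float, act_gb: float) -> bool:
    """Per-GPU memory check: (weights + KV + activations) / G must fit.

    Assumes the three consumers split uniformly across the G GPUs."""
    return (weights_gb + kv_gb + act_gb) / gpus <= vram_per_gpu_gb

# Kimi K2.5-style numbers: weights alone rule out small G on 80 GB GPUs.
print(fits_on(1, 80, 560, 30, 10))  # False: 600 GB on one 80 GB GPU
print(fits_on(8, 80, 560, 30, 10))  # True: 75 GB per GPU fits
```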
Three parallelism axes and their composition¶
| Axis | What's split | Comm pattern | Typical placement |
|---|---|---|---|
| Tensor parallelism | weight matrices | all-reduce per layer | intra-node (NVLink) |
| Pipeline parallelism | transformer layers | point-to-point between stages | inter-node |
| Expert parallelism | MoE experts | all-to-all per MoE layer | intra- or inter-node |
The axes compose. A common shape: tensor-parallel within a node (tight NVLink domain, low-latency all-reduce) + pipeline-parallel across nodes (looser InfiniBand domain, point-to-point suffices) + expert-parallel across the cluster for MoE layers.
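The composition rule above can be sketched as a tiny planner: cap tensor parallelism at the NVLink-domain (node) size, then stage the pipeline across nodes. The function and its return keys are hypothetical, though several serving stacks expose equivalent knobs (e.g. tensor- and pipeline-parallel sizes).

```python
# Hypothetical layout planner, assuming homogeneous nodes.
def plan_layout(num_gpus: int, gpus_per_node: int) -> dict:
    tp = min(num_gpus, gpus_per_node)  # keep per-layer all-reduce on NVLink
    assert num_gpus % tp == 0, "group size must divide evenly"
    pp = num_gpus // tp                # point-to-point stages across nodes
    return {"tensor_parallel": tp, "pipeline_parallel": pp}

# Two 8-GPU nodes: TP=8 within each node, PP=2 across them.
print(plan_layout(16, 8))  # {'tensor_parallel': 8, 'pipeline_parallel': 2}
```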
Cloudflare's default:
"For most models, utilizing both pipeline parallelism and tensor parallelism in tandem provides the best balance of throughput and latency." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
KV cache coherence across GPUs¶
With the model sharded, the KV cache is sharded along the same axes — each GPU holds its slice. For any operation that crosses the GPU boundary (session migration, PD disaggregation hand-off, cluster-wide prefix reuse), KV slices have to move coherently. The substrate is RDMA-based transfer (Mooncake Transfer Engine in Cloudflare's stack).
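A back-of-envelope calculation shows why these transfers are non-trivial. Model dimensions below are illustrative, not Kimi K2.5's; the sharding rule (tensor parallelism splits the KV heads across GPUs) is the standard arrangement, assumed here rather than taken from the post.

```python
# Size of one GPU's KV slice that must move on a session migration.
def kv_slice_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, dtype_bytes: int, tp_degree: int) -> int:
    # K and V per token per layer; TP shards the KV heads, so each GPU
    # holds kv_heads / tp_degree of them.
    per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
    return tokens * per_token // tp_degree

# 128k-token session, 48 layers, 8 KV heads, head_dim 128, fp16, TP=8:
size = kv_slice_bytes(128_000, 48, 8, 128, 2, 8)
print(size / 2**30)  # ~2.93 GiB per GPU, moved over RDMA
```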
Activation-memory-overhead is the serving engine's differentiator¶
Beyond the model weights + KV cache, inference engines consume activation memory (intermediate tensors, attention scratch, communication buffers, etc.). On long contexts with many concurrent requests, this can run into tens of GB. The engine's memory-overhead discipline is load-bearing: more-efficient engines free more VRAM for KV cache, which directly buys more concurrent sessions or longer context.
Cloudflare's Infire-vs-vLLM framing in the post:
- Infire: Llama 4 Scout on 2× H200, >56 GiB for KV (~1.2M-token capacity); Kimi K2.5 on 8× H100, >30 GiB for KV.
- vLLM: "In both cases you would have trouble even booting vLLM in the first place."
(Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
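The KV-budget-to-token-capacity relationship is simple division. The bytes-per-token figure below is an assumption (it depends on layers, KV heads, head dimension, dtype, and any KV quantization), derived here only so the post's Llama 4 Scout numbers (>56 GiB, ~1.2M tokens) are self-consistent.

```python
# How many tokens of KV cache fit in a given VRAM budget.
def kv_token_capacity(budget_gib: float, bytes_per_token: int) -> int:
    return int(budget_gib * 2**30) // bytes_per_token

# Implied per-token cost from the post's figures: ~50 KB/token.
bytes_per_token = int(56 * 2**30 / 1_200_000)
print(kv_token_capacity(56, bytes_per_token))  # ~1.2M tokens
```

The same division explains the trade-off in the text: every GiB the engine does not spend on activation overhead buys tens of thousands of tokens of extra context or concurrency.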
Boot time¶
Multi-GPU serving engines have non-trivial cold starts: weights must be loaded onto every GPU, sharded, and the parallelism structure established. Infire claims <20 s even for the largest models, bounded by drive speed.
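Why drive speed is the bound: if each GPU's shard streams from storage in parallel, load time is roughly weight bytes over aggregate read bandwidth. The 30 GB/s figure is an assumption (a modern NVMe array), not from the post; it only shows the <20 s claim is plausible at Kimi K2.5 scale.

```python
# Lower bound on cold-start time from storage bandwidth alone
# (ignores sharding setup and communicator initialization).
def load_time_s(weights_gb: float, read_gb_per_s: float) -> float:
    return weights_gb / read_gb_per_s

print(load_time_s(560, 30))  # ~18.7 s for a ~560 GB model
```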
Design considerations¶
- Parallelism-degree choice per model per hardware.
- Interconnect topology — NVLink-rich topologies vs InfiniBand-rich topologies dictate axis-to-placement choice.
- Multi-tenant behaviour — one multi-GPU model instance burns its whole GPU set even for a single request; autoscaling granularity is the multi-GPU group, not the GPU.
- Failure-mode behaviour — loss of one GPU in a multi-GPU instance typically takes the instance down.
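The failure-mode point compounds with group size. Assuming independent GPU failures (an assumption; correlated node-level failures are common in practice), an instance that dies on any single GPU loss is up with probability p^G:

```python
# Availability of a G-GPU instance under independent per-GPU health p.
def instance_availability(p_gpu: float, group_size: int) -> float:
    return p_gpu ** group_size

# Illustrative: 99.9%-healthy GPUs, 8-GPU group.
print(round(instance_availability(0.999, 8), 6))  # 0.992028
```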
Caveats¶
- Cloudflare post focuses on the happy path — no discussion of partial-GPU-failure recovery, hot-swap, or multi-tenancy on shared multi-GPU instances.
- Per-model parallelism matrices not fully disclosed.
Seen in¶
- sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models — Kimi K2.5 on 8× H100 as the canonical example; Infire's multi-GPU support and parallelism composition discussed at overview level.