CONCEPT Cited by 2 sources
Tensor parallelism¶
Definition¶
Tensor parallelism is a multi-GPU model-sharding strategy in which individual weight matrices (tensors) within each transformer layer are split across multiple GPUs (each GPU holds a slice of every matrix), and the forward pass becomes a coordinated computation requiring all-reduce / all-gather communication between GPUs at every layer.
Contrast with pipeline parallelism (different layers on different GPUs) and expert parallelism (different experts on different GPUs in MoE models).
Canonical wiki reference: Cloudflare Infire supports tensor parallelism alongside pipeline + expert parallelism for serving large models like Kimi K2.5. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
Why split weight matrices across GPUs¶
Under tensor parallelism, a linear layer y = Wx with W ∈ R^{d_out × d_in} is decomposed across G GPUs so each GPU computes a slice in parallel:
- column-split (W = [W_1 | W_2 | ...]) partitions the input dimension; each GPU holds its slice of x and produces a partial sum, combined with an all-reduce on activations
- row-split (W = [W_1; W_2; ...]) partitions the output dimension; each GPU produces a slice of y, combined with an all-gather on activations
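The two splits can be checked numerically. A minimal sketch with NumPy arrays standing in for per-GPU shards; the shapes and G = 2 are illustrative assumptions, not from the source:

```python
# Simulating the two tensor-parallel splits of y = W x on G "GPUs",
# with NumPy arrays standing in for device shards.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, G = 8, 6, 2
W = rng.standard_normal((d_out, d_in))
x = rng.standard_normal(d_in)
y_ref = W @ x  # single-GPU reference

# Column-split: W = [W_1 | W_2] partitions the input dim; each shard sees
# its slice of x and produces a partial sum -> combine with an all-reduce.
W_cols = np.hsplit(W, G)
x_parts = np.split(x, G)
partials = [Wi @ xi for Wi, xi in zip(W_cols, x_parts)]
y_allreduce = np.sum(partials, axis=0)  # all-reduce = elementwise sum

# Row-split: W = [W_1; W_2] partitions the output dim; each shard computes
# a slice of y -> combine with an all-gather (concatenation).
W_rows = np.vsplit(W, G)
slices = [Wi @ x for Wi in W_rows]
y_allgather = np.concatenate(slices)  # all-gather = concat along output dim

assert np.allclose(y_ref, y_allreduce)
assert np.allclose(y_ref, y_allgather)
```

Both recombinations reproduce the unsharded result, which is why the collective choice is purely a function of how the matrix was split.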
This fits two classes of serving constraint:
- Model too large for one GPU's VRAM. A 560 GB model doesn't fit on an 80 GB GPU; the weights must be split.
- Latency-critical decode steps on long contexts. Tensor parallelism parallelises within a forward pass, so per-step latency can drop (at the cost of communication overhead).
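The memory constraint is simple arithmetic. A sketch using the source's illustrative figures (560 GB model, 80 GB GPUs); rounding the degree up to an 8-GPU node is an assumption, not stated in the post:

```python
# Minimum tensor-parallel degree for the memory-bound case in the text.
import math

model_gb, gpu_gb, gpus_per_node = 560, 80, 8
min_shards = math.ceil(model_gb / gpu_gb)    # GPUs needed just for weights
tp_degree = gpus_per_node                    # in practice: fill the NVLink domain
headroom_gb = tp_degree * gpu_gb - model_gb  # left for KV cache + activations
```

Seven GPUs hold the weights with zero headroom; using the full 8-GPU node leaves 80 GB across the domain for KV cache and activations.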
The load-bearing communication pattern¶
Every layer requires a communication collective. The cost of that collective (per-step) is the dominant overhead of tensor parallelism:
- Intra-node — NVLink / NVSwitch, hundreds of GB/s, communication well hidden in compute.
- Inter-node — InfiniBand / RoCE, tens of GB/s, harder to hide.
Tensor parallelism is usually kept intra-node in practice — inter-node tensor-parallel spans are rare because the all-reduce latency per layer dominates.
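A back-of-envelope sketch of why this holds. The model shape, TP degree, "2 all-reduces per layer" (after attention, after MLP), bandwidths, and per-collective latencies are generic assumptions, not figures from the source:

```python
# Estimating per-decode-step all-reduce cost under tensor parallelism,
# to show why TP is kept inside the NVLink domain.
hidden, layers, G = 8192, 80, 8
dtype_bytes = 2                          # bf16 activations
payload = hidden * dtype_bytes           # one token's activation vector
per_layer = 2 * payload                  # two all-reduces per layer (assumed)
# Ring all-reduce: each GPU sends/receives 2*(G-1)/G of the payload.
wire_bytes = layers * per_layer * 2 * (G - 1) / G

nvlink_bps, ib_bps = 450e9, 50e9         # rough per-link bandwidths (assumed)
print(f"per-step traffic: {wire_bytes/1e6:.2f} MB/GPU")
print(f"NVLink: {wire_bytes/nvlink_bps*1e6:.1f} us, IB: {wire_bytes/ib_bps*1e6:.1f} us")

# At these tiny payloads the per-collective latency floor dominates:
latency_us = {"nvlink": 5, "ib": 20}     # rough per-collective latencies (assumed)
collectives = layers * 2
floor_ms = {k: v * collectives / 1e3 for k, v in latency_us.items()}
print(f"latency floor per step: {floor_ms}")
```

The bandwidth term is microseconds either way; it is the milliseconds of accumulated per-collective latency on the inter-node fabric that makes cross-node TP unattractive.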
From the source:
"For tensor parallelism, Infire optimizes for reducing cross-GPU communication, making it as fast as possible." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
Relationship to the other parallelism axes¶
| Axis | What's split | Communication per forward pass | Typical placement |
|---|---|---|---|
| Tensor parallelism | weight matrices | all-reduce per layer | intra-node (NVLink) |
| Pipeline parallelism | transformer layers | point-to-point between stages | across nodes |
| Expert parallelism | MoE experts | all-to-all for routing | intra- or inter-node |
The axes compose. Common pattern: tensor-parallel within a node (tight NVLink domain), pipeline-parallel across nodes (loose InfiniBand domain), expert-parallel over the cluster for MoE.
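The composition above can be sketched as arithmetic over a cluster shape. The 60-node × 8-GPU figure is from the eBay source below; the PP degree here is assumed for illustration (actual degrees are not disclosed):

```python
# How the parallelism axes compose into a cluster shape: TP inside the
# NVLink domain, PP across nodes, DP filling the remainder.
nodes, gpus_per_node = 60, 8
total = nodes * gpus_per_node        # 480 GPUs

tp = gpus_per_node                   # tensor parallel: inside the NVLink domain
pp = 4                               # pipeline parallel: across nodes (assumed)
dp = total // (tp * pp)              # data parallel fills the remainder
assert tp * pp * dp == total
```

The constraint tp × pp × dp = total is what makes the degrees design variables rather than free choices: raising one axis lowers the budget for the others.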
Cloudflare's posture in the source:
"For most models, utilizing both pipeline parallelism and tensor parallelism in tandem provides the best balance of throughput and latency." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
Effect on KV cache¶
The KV cache is also per-head / per-layer, so it's naturally sharded by the same tensor-parallel axis — each GPU stores its slice of K/V tensors. Cross-GPU KV-transfer for session migration or PD disaggregation must move all slices coherently; see concepts/rdma-kv-transfer.
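A sketch of the per-GPU KV-cache footprint under tensor-parallel sharding; the model shape (GQA with 8 KV heads) and context length are assumed examples, not from either source:

```python
# Per-token KV bytes = 2 (K and V) * layers * kv_heads * head_dim * dtype_bytes;
# under TP each GPU holds kv_heads / G of the heads, so the cache shards
# along the same axis as the weights.
layers, kv_heads, head_dim, G = 80, 8, 128, 8
dtype_bytes = 2                                        # bf16 KV cache
per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
per_token_per_gpu = per_token // G                     # heads split across GPUs

ctx = 128_000                                          # long-context session
per_gpu_gb = ctx * per_token_per_gpu / 1e9
print(f"{per_token_per_gpu} B/token/GPU -> {per_gpu_gb:.2f} GB at {ctx} tokens")
```

Migrating such a session means moving G separate slices, one per GPU, which is why the transfer has to be coherent across the whole tensor-parallel group.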
Design considerations¶
- Communication collective choice (NCCL, custom all-reduce kernels) is load-bearing for latency.
- Sequence-parallel extensions reduce memory and allow longer contexts at the cost of more communication.
- Degree (G, the number of GPUs across which tensors are split) is a design variable: larger G frees more memory per GPU but adds communication overhead.
- Interaction with speculative decoding: the drafter is typically co-located with its target, but its parallelism degree may differ.
Caveats¶
- Cloudflare's post is shallow on tensor-parallel internals — stated at the "we do it, it reduces comm" level.
- Per-model TP degree not disclosed.
- Communication algorithm choice not disclosed (NCCL? custom?).
- Kernel / collective fusion (overlap of comm with compute) not discussed.
Seen in¶
- sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models — Infire supports tensor parallelism; pairs with pipeline parallelism as the default shape.
- sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development — training-side TP: eBay's e-Llama continued-pretraining uses TP as one axis of concepts/3d-parallelism|3D parallelism (DP × TP × PP) under Megatron-LM on 480 H100s; the NVLink intra-node domain (60 nodes × 8 GPUs) is the natural TP scope. Specific TP degree not disclosed.
Related¶
- concepts/pipeline-parallelism / concepts/expert-parallelism / concepts/data-parallelism / concepts/3d-parallelism / concepts/multi-gpu-serving
- concepts/kv-cache / concepts/rdma-kv-transfer
- systems/infire / systems/vllm / systems/workers-ai / systems/megatron-lm / systems/e-llama
- companies/cloudflare / companies/ebay