CONCEPT Cited by 2 sources
Tensor parallelism¶
Definition¶
Tensor parallelism is a multi-GPU model-sharding strategy in which individual weight matrices (tensors) within each transformer layer are split across multiple GPUs (each GPU holds a slice of every matrix), and the forward pass becomes a coordinated computation requiring all-reduce / all-gather communication between GPUs at every layer.
Contrast with pipeline parallelism (different layers on different GPUs) and expert parallelism (different experts on different GPUs in MoE models).
Canonical wiki reference: Cloudflare Infire supports tensor parallelism alongside pipeline + expert parallelism for serving large models like Kimi K2.5. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
Why split weight matrices across GPUs¶
Under tensor parallelism, a linear layer y = Wx with W ∈ R^{d_out × d_in} is decomposed across G GPUs so each GPU computes a slice in parallel:
- column-split (W = [W_1 | W_2 | ...]) partitions the input dimension; each GPU holds its slice of x and produces a partial sum, combined with an all-reduce on activations
- row-split (W = [W_1; W_2; ...]) partitions the output dimension; each GPU produces a slice of y, combined with an all-gather on activations
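The two splits can be checked numerically. A minimal sketch with NumPy arrays standing in for per-GPU shards; the shapes and G = 2 are illustrative assumptions, not from the source:

```python
# Simulating the two tensor-parallel splits of y = W x on G "GPUs",
# with NumPy arrays standing in for device shards.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, G = 8, 6, 2
W = rng.standard_normal((d_out, d_in))
x = rng.standard_normal(d_in)
y_ref = W @ x  # single-GPU reference

# Column-split: W = [W_1 | W_2] partitions the input dim; each shard sees
# its slice of x and produces a partial sum -> combine with an all-reduce.
W_cols = np.hsplit(W, G)
x_parts = np.split(x, G)
partials = [Wi @ xi for Wi, xi in zip(W_cols, x_parts)]
y_allreduce = np.sum(partials, axis=0)  # all-reduce = elementwise sum

# Row-split: W = [W_1; W_2] partitions the output dim; each shard computes
# a slice of y -> combine with an all-gather (concatenation).
W_rows = np.vsplit(W, G)
slices = [Wi @ x for Wi in W_rows]
y_allgather = np.concatenate(slices)  # all-gather = concat along output dim

assert np.allclose(y_ref, y_allreduce)
assert np.allclose(y_ref, y_allgather)
```

Both recombinations reproduce the unsharded result, which is why the collective choice is purely a function of how the matrix was split.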
This fits two classes of serving constraint:
- Model too large for one GPU's VRAM. A 560 GB model doesn't fit on an 80 GB GPU; the weights must be split.
- Latency-critical decode steps on long contexts. Tensor parallelism parallelises within a forward pass, so per-step latency can drop (at the cost of communication overhead).
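The memory constraint is simple arithmetic. A sketch using the source's illustrative figures (560 GB model, 80 GB GPUs); rounding the degree up to an 8-GPU node is an assumption, not stated in the post:

```python
# Minimum tensor-parallel degree for the memory-bound case in the text.
import math

model_gb, gpu_gb, gpus_per_node = 560, 80, 8
min_shards = math.ceil(model_gb / gpu_gb)    # GPUs needed just for weights
tp_degree = gpus_per_node                    # in practice: fill the NVLink domain
headroom_gb = tp_degree * gpu_gb - model_gb  # left for KV cache + activations
```

Seven GPUs hold the weights with zero headroom; using the full 8-GPU node leaves 80 GB across the domain for KV cache and activations.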
The load-bearing communication pattern¶
Every layer requires a communication collective. The cost of that collective (per-step) is the dominant overhead of tensor parallelism:
- Intra-node — NVLink / NVSwitch, hundreds of GB/s, communication well hidden in compute.
- Inter-node — InfiniBand / RoCE, tens of GB/s, harder to hide.
Tensor parallelism is usually kept intra-node in practice — inter-node tensor-parallel spans are rare because the all-reduce latency per layer dominates.
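A back-of-envelope sketch of why this holds. The model shape, TP degree, "2 all-reduces per layer" (after attention, after MLP), bandwidths, and per-collective latencies are generic assumptions, not figures from the source:

```python
# Estimating per-decode-step all-reduce cost under tensor parallelism,
# to show why TP is kept inside the NVLink domain.
hidden, layers, G = 8192, 80, 8
dtype_bytes = 2                          # bf16 activations
payload = hidden * dtype_bytes           # one token's activation vector
per_layer = 2 * payload                  # two all-reduces per layer (assumed)
# Ring all-reduce: each GPU sends/receives 2*(G-1)/G of the payload.
wire_bytes = layers * per_layer * 2 * (G - 1) / G

nvlink_bps, ib_bps = 450e9, 50e9         # rough per-link bandwidths (assumed)
print(f"per-step traffic: {wire_bytes/1e6:.2f} MB/GPU")
print(f"NVLink: {wire_bytes/nvlink_bps*1e6:.1f} us, IB: {wire_bytes/ib_bps*1e6:.1f} us")

# At these tiny payloads the per-collective latency floor dominates:
latency_us = {"nvlink": 5, "ib": 20}     # rough per-collective latencies (assumed)
collectives = layers * 2
floor_ms = {k: v * collectives / 1e3 for k, v in latency_us.items()}
print(f"latency floor per step: {floor_ms}")
```

The bandwidth term is microseconds either way; it is the milliseconds of accumulated per-collective latency on the inter-node fabric that makes cross-node TP unattractive.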
From the source:
"For tensor parallelism, Infire optimizes for reducing cross-GPU communication, making it as fast as possible." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
Relationship to the other parallelism axes¶
| Axis | What's split | Communication per forward pass | Typical placement |
|---|---|---|---|
| Tensor parallelism | weight matrices | all-reduce per layer | intra-node (NVLink) |
| Pipeline parallelism | transformer layers | point-to-point between stages | across nodes |
| Expert parallelism | MoE experts | all-to-all for routing | intra- or inter-node |
The axes compose. Common pattern: tensor-parallel within a node (tight NVLink domain), pipeline-parallel across nodes (loose InfiniBand domain), expert-parallel over the cluster for MoE.
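The composition above can be sketched as arithmetic over a cluster shape. The 60-node × 8-GPU figure is from the eBay source below; the PP degree here is assumed for illustration (actual degrees are not disclosed):

```python
# How the parallelism axes compose into a cluster shape: TP inside the
# NVLink domain, PP across nodes, DP filling the remainder.
nodes, gpus_per_node = 60, 8
total = nodes * gpus_per_node        # 480 GPUs

tp = gpus_per_node                   # tensor parallel: inside the NVLink domain
pp = 4                               # pipeline parallel: across nodes (assumed)
dp = total // (tp * pp)              # data parallel fills the remainder
assert tp * pp * dp == total
```

The constraint tp × pp × dp = total is what makes the degrees design variables rather than free choices: raising one axis lowers the budget for the others.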
Cloudflare's posture in the source:
"For most models, utilizing both pipeline parallelism and tensor parallelism in tandem provides the best balance of throughput and latency." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
Effect on KV cache¶
The KV cache is also per-head / per-layer, so it's naturally sharded by the same tensor-parallel axis — each GPU stores its slice of K/V tensors. Cross-GPU KV-transfer for session migration or PD disaggregation must move all slices coherently; see concepts/rdma-kv-transfer.
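A sketch of the per-GPU KV-cache footprint under tensor-parallel sharding; the model shape (GQA with 8 KV heads) and context length are assumed examples, not from either source:

```python
# Per-token KV bytes = 2 (K and V) * layers * kv_heads * head_dim * dtype_bytes;
# under TP each GPU holds kv_heads / G of the heads, so the cache shards
# along the same axis as the weights.
layers, kv_heads, head_dim, G = 80, 8, 128, 8
dtype_bytes = 2                                        # bf16 KV cache
per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
per_token_per_gpu = per_token // G                     # heads split across GPUs

ctx = 128_000                                          # long-context session
per_gpu_gb = ctx * per_token_per_gpu / 1e9
print(f"{per_token_per_gpu} B/token/GPU -> {per_gpu_gb:.2f} GB at {ctx} tokens")
```

Migrating such a session means moving G separate slices, one per GPU, which is why the transfer has to be coherent across the whole tensor-parallel group.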
Design considerations¶
- Communication collective choice (NCCL, custom all-reduce kernels) is load-bearing for latency.
- Sequence-parallel extensions reduce memory and allow longer contexts at the cost of more communication.
- Degree (G, the number of GPUs across which tensors are split) is a design variable: larger G frees more memory per GPU but adds communication overhead.
- Interaction with speculative decoding: the drafter is typically co-located with its target, but its parallelism degree may differ.
Caveats¶
- Cloudflare's post is shallow on tensor-parallel internals — stated at the "we do it, it reduces comm" level.
- Per-model TP degree not disclosed.
- Communication algorithm choice not disclosed (NCCL? custom?).
- Kernel / collective fusion (overlap of comm with compute) not discussed.
Seen in¶
- sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models — Infire supports tensor parallelism; pairs with pipeline parallelism as the default shape.
- sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development — training-side TP: eBay's e-Llama continued-pretraining uses TP as one axis of concepts/3d-parallelism|3D parallelism (DP × TP × PP) under Megatron-LM on 480 H100s; the NVLink intra-node domain (60 nodes × 8 GPUs) is the natural TP scope. Specific TP degree not disclosed.
Related¶
- concepts/pipeline-parallelism / concepts/expert-parallelism / concepts/data-parallelism / concepts/3d-parallelism / concepts/multi-gpu-serving
- concepts/kv-cache / concepts/rdma-kv-transfer
- systems/infire / systems/vllm / systems/workers-ai / systems/megatron-lm / systems/e-llama
- companies/cloudflare / companies/ebay