CONCEPT Cited by 2 sources

Tensor parallelism

Definition

Tensor parallelism is a multi-GPU model-sharding strategy in which individual weight matrices (tensors) within each transformer layer are split across multiple GPUs — each GPU holds a slice of every matrix — so the forward pass becomes a coordinated computation with per-layer all-reduce / all-gather communication between GPUs.

Contrast with pipeline parallelism (different layers on different GPUs) and expert parallelism (different experts on different GPUs in MoE models).

Canonical wiki reference: Cloudflare Infire supports tensor parallelism alongside pipeline + expert parallelism for serving large models like Kimi K2.5. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

Why split weight matrices across GPUs

Under tensor parallelism, a linear layer y = Wx with W ∈ R^{d_out × d_in} is decomposed — column-split (W = [W_1 | W_2 | ...]) or row-split (W = [W_1; W_2; ...]) — across G GPUs so each GPU computes a slice in parallel. The slices are combined via:

  • all-reduce on activations (for partial-sum results)
  • all-gather on activations (for split-output results)
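The two sharding schemes and their collectives can be sketched in a single process, with the collectives simulated by a sum and a concatenate. Shapes and the degree G are illustrative assumptions, not from the source:

```python
import numpy as np

rng = np.random.default_rng(0)
G = 4                              # tensor-parallel degree (simulated GPUs)
d_in, d_out = 8, 12
W = rng.standard_normal((d_out, d_in))
x = rng.standard_normal(d_in)

# Column split W = [W_1 | W_2 | ...]: each shard sees a slice of x and
# produces a full-size partial sum; combining is an all-reduce (sum).
W_cols = np.split(W, G, axis=1)
x_parts = np.split(x, G)
y_reduce = sum(Wi @ xi for Wi, xi in zip(W_cols, x_parts))  # all-reduce

# Row split W = [W_1; W_2; ...]: each shard produces a slice of the
# output; combining is an all-gather (concatenate).
W_rows = np.split(W, G, axis=0)
y_gather = np.concatenate([Wi @ x for Wi in W_rows])        # all-gather

assert np.allclose(y_reduce, W @ x)
assert np.allclose(y_gather, W @ x)
```

In a real implementation each shard lives on a different GPU and the sum/concatenate is an NCCL-style collective; the arithmetic identity is the same.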

This fits two classes of serving constraint:

  1. Model too large for one GPU's VRAM. A 560 GB model doesn't fit on an 80 GB GPU; the weights must be split.
  2. Latency-critical decode steps on long contexts. Tensor parallelism parallelises within a forward pass, so per-step latency can drop (at the cost of communication overhead).
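The VRAM constraint in point 1 reduces to simple sizing arithmetic. A minimal sketch, assuming a 20% headroom for KV cache, activations, and runtime state (the headroom figure is an assumption, not from the source):

```python
import math

model_bytes = 560e9        # total weight footprint (~560 GB, per the example above)
vram_bytes = 80e9          # per-GPU VRAM (an 80 GB GPU)
overhead = 0.2             # assumed fraction reserved for KV cache / activations

usable = vram_bytes * (1 - overhead)
min_gpus = math.ceil(model_bytes / usable)
print(min_gpus)  # 9
```

In practice the degree is then rounded up to a convenient tensor-parallel × pipeline-parallel product (e.g. 16) rather than an odd count like 9.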

The load-bearing communication pattern

Every layer requires a communication collective. The cost of that collective (per-step) is the dominant overhead of tensor parallelism:

  • Intra-node — NVLink / NVSwitch, hundreds of GB/s, communication well hidden in compute.
  • Inter-node — InfiniBand / RoCE, tens of GB/s, harder to hide.

Tensor parallelism is usually kept intra-node in practice — inter-node tensor-parallel spans are rare because the all-reduce latency per layer dominates.
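A back-of-envelope model makes the intra-node preference concrete. Using the standard ring all-reduce cost (each rank moves about 2(G−1)/G of the data over its link) with illustrative, assumed bandwidths and activation sizes:

```python
def allreduce_seconds(n_bytes: float, bw_bytes_per_s: float, g: int) -> float:
    # Ring all-reduce moves ~2*(g-1)/g of the buffer over each link.
    return 2 * (g - 1) / g * n_bytes / bw_bytes_per_s

hidden = 8192                       # assumed hidden size
batch_tokens = 64                   # assumed tokens in flight per step
n_bytes = hidden * batch_tokens * 2 # fp16 activations

nvlink = 450e9                      # intra-node: hundreds of GB/s
ib = 50e9                           # inter-node: tens of GB/s

t_intra = allreduce_seconds(n_bytes, nvlink, g=8)
t_inter = allreduce_seconds(n_bytes, ib, g=8)
print(round(t_inter / t_intra))     # 9
```

The per-collective slowdown is roughly the bandwidth ratio, and it recurs at every layer of every decode step, which is why inter-node tensor-parallel spans are rare.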

From the source:

"For tensor parallelism, Infire optimizes for reducing cross-GPU communication, making it as fast as possible." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

Relationship to the other parallelism axes

  Axis                  What's split        Communication per forward pass  Typical placement
  Tensor parallelism    weight matrices     all-reduce per layer            intra-node (NVLink)
  Pipeline parallelism  transformer layers  point-to-point between stages   across nodes
  Expert parallelism    MoE experts         all-to-all for routing          intra- or inter-node

The axes compose. Common pattern: tensor-parallel within a node (tight NVLink domain), pipeline-parallel across nodes (loose InfiniBand domain), expert-parallel over the cluster for MoE.
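One common way to express that composition is a 2-D device mesh: tensor-parallel ranks within a node, pipeline stages across nodes. A minimal sketch of the mapping convention (the layout is a common pattern, not Cloudflare's disclosed scheme):

```python
TP = 8    # tensor-parallel degree, kept inside one NVLink node
PP = 4    # pipeline stages, assumed one per node

mesh = {}
for gpu in range(TP * PP):
    pp_rank, tp_rank = divmod(gpu, TP)   # node index, slot within node
    mesh[gpu] = {"pp": pp_rank, "tp": tp_rank}

# GPUs sharing a pp rank form one all-reduce group (intra-node, NVLink);
# consecutive pp ranks exchange activations point-to-point (inter-node).
print(mesh[0], mesh[8])
```

Expert-parallel groups for MoE would be layered over this mesh as a third axis spanning the cluster.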

Cloudflare's posture in the source:

"For most models, utilizing both pipeline parallelism and tensor parallelism in tandem provides the best balance of throughput and latency." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

Effect on KV cache

The KV cache is also per-head / per-layer, so it's naturally sharded by the same tensor-parallel axis — each GPU stores its slice of K/V tensors. Cross-GPU KV-transfer for session migration or PD disaggregation must move all slices coherently; see concepts/rdma-kv-transfer.
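Because attention heads are the natural tensor-parallel split for attention, the KV cache shards along the head axis for free. A sketch with illustrative shapes:

```python
import numpy as np

G = 4                      # tensor-parallel degree
n_heads, seq_len, head_dim = 32, 128, 64
k_cache = np.zeros((n_heads, seq_len, head_dim), dtype=np.float16)

# Each rank stores only its contiguous slice of heads.
shards = np.split(k_cache, G, axis=0)
assert all(s.shape == (n_heads // G, seq_len, head_dim) for s in shards)

# Migrating a session (or PD disaggregation) must move every rank's
# slice coherently before decode can resume on the destination:
full = np.concatenate(shards, axis=0)
assert full.shape == k_cache.shape
```

No extra communication is needed during normal decode: each rank attends only over the heads it owns.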

Design considerations

  • Communication collective choice (NCCL, custom all-reduce) is load-bearing for latency.
  • Sequence-parallel extensions reduce activation memory and allow longer contexts at the cost of extra communication.
  • The degree (G = how many GPUs the tensors are split across) is a design variable: larger G frees more memory per GPU but adds communication overhead.
  • Interaction with speculative decoding: the drafter is typically co-located with its target model, but the two may use different parallelism degrees.
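The degree trade-off in the list above can be made concrete: weight memory per GPU shrinks as 1/G while ring all-reduce traffic per layer grows with (G−1)/G. A sketch with the 560 GB figure from earlier (the traffic model is the standard ring bound, an assumption rather than a disclosed detail):

```python
def per_gpu_weights_gb(total_gb: float, g: int) -> float:
    # Weight memory divides evenly across the tensor-parallel group.
    return total_gb / g

def ring_traffic_factor(g: int) -> float:
    # Fraction of the activation buffer each rank sends per all-reduce.
    return 2 * (g - 1) / g

for g in (2, 4, 8):
    print(g, per_gpu_weights_gb(560, g), round(ring_traffic_factor(g), 2))
```

Memory savings fall off as 1/G while the traffic factor saturates near 2, so raising G past what is needed to fit the model buys little and costs latency.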

Caveats

  • Cloudflare's post is shallow on tensor-parallel internals — stated at the "we do it, it reduces comm" level.
  • Per-model TP degree not disclosed.
  • Communication algorithm choice not disclosed (NCCL? custom?).
  • Kernel / collective fusion (overlap of comm with compute) not discussed.
