
CONCEPT Cited by 2 sources

Pipeline parallelism

Definition

Pipeline parallelism is a multi-GPU model-sharding strategy in which different transformer layers live on different GPUs — with N layers and G GPUs, GPU 0 holds layers 1 through N/G, GPU 1 holds layers N/G+1 through 2N/G, and so on — and activations flow through the pipeline like a conveyor belt. A forward pass is a staged computation: GPU 0 runs its layers, hands the activations to GPU 1, and so on down the stack.
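The conveyor-belt structure can be sketched in a few lines. This is illustrative only: the stage boundaries and the hand-off between stages stand in for real GPU-to-GPU transfers (e.g. NCCL point-to-point sends), and the toy "layers" just add 1 so the data flow is easy to trace.

```python
def make_stage(layers):
    """A pipeline stage owns a contiguous slice of the layer stack."""
    def forward(activations):
        for layer in layers:
            activations = layer(activations)
        return activations
    return forward

# Toy layers: each adds 1, so the final output counts the layers traversed.
layers = [lambda x: x + 1 for _ in range(8)]

# Partition 8 layers over 4 stages (2 layers per stage).
num_stages = 4
per_stage = len(layers) // num_stages
stages = [make_stage(layers[i * per_stage:(i + 1) * per_stage])
          for i in range(num_stages)]

# Activations flow through the stages like a conveyor belt.
x = 0
for stage in stages:
    x = stage(x)  # in real PP this boundary is a send/recv between GPUs

print(x)  # 8 layers, each adding 1 -> 8
```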

Contrast with tensor parallelism (split a tensor across GPUs) and expert parallelism (different MoE experts on different GPUs).

Canonical wiki reference: Cloudflare Infire supports pipeline parallelism alongside tensor + expert parallelism. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

Why split by layer

  • Memory capacity. A 560 GB model doesn't fit in one GPU's 80 GB VRAM; split the layer stack evenly and each GPU holds only a fraction.
  • Bandwidth economy vs tensor parallelism. Pipeline stages communicate only at stage boundaries (activation tensors between adjacent stages), not every layer. This makes pipeline parallelism far more tolerant of slower inter-node links (InfiniBand / RoCE) than tensor parallelism.

The pipeline-balance problem

Pipeline parallelism's central pain:

  • If the layers-per-GPU partition is uneven, the slow stage bottlenecks the whole pipeline and faster stages starve waiting on upstream.
  • During startup / drain, stages at the front fill while stages at the back idle (the classic pipeline bubble). With p stages and m microbatches, steady-state utilisation is bounded by m / (m + p − 1) — the bubble occupies the other (p − 1) / (m + p − 1) of the schedule.
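The classic GPipe-style bound on utilisation is easy to compute directly (a sketch; p stages, m microbatches, all stages assumed equal cost):

```python
def pipeline_utilisation(num_stages: int, num_microbatches: int) -> float:
    """GPipe-style bound: m useful slots out of (m + p - 1) schedule slots."""
    p, m = num_stages, num_microbatches
    return m / (m + p - 1)

# The bubble shrinks as microbatch count grows relative to stage count.
print(pipeline_utilisation(4, 1))   # 1/4 = 0.25 (one microbatch: mostly bubble)
print(pipeline_utilisation(4, 16))  # 16/19, roughly 0.84
```

This is why small per-request batches (the interactive-inference case below) hurt: m stays small, so the bound stays far from 1.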

Cloudflare names this explicitly:

"For pipeline parallelism, Infire attempts to properly load balance all stages of the pipeline, in order to prevent the GPUs of one stage from starving while other stages are executing." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

Inference vs training

Pipeline parallelism was first popularised in training (GPipe, PipeDream, Megatron-LM's virtual pipeline) where large micro-batches amortise the bubble cost. In inference — especially interactive agent inference with small per-request batches — the bubble problem is more acute: it's harder to fill the pipeline.

This is the structural reason tensor parallelism is often preferred for inference at smaller model scales; pipeline parallelism's appeal grows once the model outgrows a single node's tensor-parallel-friendly topology (NVLink) and must span multiple nodes.

Relationship to the other parallelism axes

| Axis | What's split | Communication per forward pass | Typical placement |
| --- | --- | --- | --- |
| Pipeline parallelism | transformer layers | point-to-point between adjacent stages | across nodes (tolerant of InfiniBand / RoCE) |
| Tensor parallelism | weight matrices | all-reduce per layer | intra-node (NVLink) |
| Expert parallelism | MoE experts | all-to-all for routing | intra- or inter-node |

Axes compose: common pattern is tensor-parallel within each pipeline stage (intra-node NVLink), pipeline-parallel across stages (inter-node).
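One way to picture the composition is as a rank layout: consecutive ranks share a tensor-parallel group inside a node, and groups of tp_size ranks form successive pipeline stages. The ordering below is an assumption for illustration — real frameworks (e.g. Megatron-LM) define their own process-group layouts.

```python
def rank_to_coords(rank: int, tp_size: int):
    """Map a global GPU rank to (pipeline_stage, tensor_parallel_rank)."""
    return rank // tp_size, rank % tp_size

# 16 GPUs = 2 nodes of 8: tp=8 inside each NVLink island, pp=2 across nodes.
tp_size, world_size = 8, 16
layout = [rank_to_coords(r, tp_size) for r in range(world_size)]

print(layout[0])   # (0, 0): stage 0, tensor rank 0
print(layout[7])   # (0, 7): last GPU of the first stage (same node)
print(layout[8])   # (1, 0): first GPU of the second stage (next node)
```

The cheap point-to-point traffic (stage 0 → stage 1) crosses the node boundary; the chatty all-reduce traffic stays within each 8-GPU NVLink island.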

Cloudflare's posture:

"For most models, utilizing both pipeline parallelism and tensor parallelism in tandem provides the best balance of throughput and latency." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

Effect on KV cache

Each GPU holds KV state only for the layers it owns, so the per-GPU KV cache footprint is roughly (KV_total_per_request / G_pipeline_stages) × requests_in_flight. Pipeline parallelism naturally distributes KV memory pressure — one of the reasons it pairs well with tensor parallelism to maximise KV-cache headroom (Infire's reason for getting Kimi K2.5 onto 8× H100 with >30 GiB free for KV; see systems/infire).
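A back-of-envelope version of that footprint formula, with purely hypothetical model parameters (these are not Infire's or Kimi K2.5's numbers):

```python
def kv_bytes_per_gpu(layers, kv_heads, head_dim, seq_len, requests,
                     pp_stages, bytes_per_elem=2):
    """Each pipeline stage caches K and V only for the layers it owns."""
    layers_per_stage = layers / pp_stages
    # Factor of 2: one K tensor and one V tensor per layer.
    per_request = (2 * layers_per_stage * kv_heads * head_dim
                   * seq_len * bytes_per_elem)
    return per_request * requests

# Hypothetical model: 80 layers, 8 KV heads, head_dim 128, fp16 cache,
# 8K context, 32 concurrent requests, 8 pipeline stages.
gib = kv_bytes_per_gpu(layers=80, kv_heads=8, head_dim=128, seq_len=8192,
                       requests=32, pp_stages=8) / 2**30
print(f"{gib:.1f} GiB per GPU")  # -> 10.0 GiB per GPU
```

With pp_stages=1 the same workload would need 80 GiB of KV on a single GPU; spreading the layer stack divides that by the stage count.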

Design considerations

  • Partition strategy — equal-layer-count vs equal-compute-cost-per-partition.
  • Scheduling — 1F1B (one-forward-one-backward) vs interleaved vs GPipe; choice changes bubble time.
  • Microbatching — splitting each batch into smaller chunks that pipeline simultaneously reduces bubble time.
  • Continuous batching interaction — inference-side PP is usually combined with continuous batching so the pipeline never fully drains.
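The first bullet (equal layer count vs equal compute cost) can be sketched with a simple greedy heuristic: walk the layer stack and cut a stage once it accumulates the average cost per stage. Costs here are made-up; a real partitioner would profile per-layer latency, and production systems may use smarter algorithms (e.g. dynamic programming).

```python
def partition_by_cost(costs, num_stages):
    """Greedy contiguous partition aiming for equal compute per stage."""
    target = sum(costs) / num_stages
    stages, current, acc = [], [], 0.0
    for i, c in enumerate(costs):
        current.append(i)
        acc += c
        # Cut here once we hit the target, as long as enough layers
        # remain to give every later stage at least one layer.
        remaining_stages = num_stages - len(stages) - 1
        if (acc >= target and remaining_stages > 0
                and len(costs) - i - 1 >= remaining_stages):
            stages.append(current)
            current, acc = [], 0.0
    stages.append(current)
    return stages

# First/last layers often cost more (embeddings, LM head): uneven costs.
costs = [3.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 3.0]
print(partition_by_cost(costs, 4))  # -> [[0], [1, 2, 3], [4, 5, 6], [7]]
```

An equal-layer-count split ([0,1], [2,3], [4,5], [6,7]) would put 4.0 units of work on the first and last stages and 2.0 on the middle ones; the cost-aware split gives every stage 3.0, so no stage starves waiting on a slow neighbour.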

Caveats

  • Cloudflare's post is shallow on PP internals — stated at the "we load-balance stages" level.
  • Partitioning algorithm, scheduling algorithm, microbatch size not disclosed.
  • Per-model PP degree not disclosed.
  • Pipeline-bubble-time numbers not disclosed.
