CONCEPT Cited by 2 sources
Pipeline parallelism¶
Definition¶
Pipeline parallelism is a multi-GPU model-sharding strategy in which different transformer layers live on different GPUs (with N layers and G GPUs, GPU 0 holds layers 1 through N/G, GPU 1 holds layers N/G+1 through 2N/G, etc.) and activations flow through the pipeline like a conveyor belt. A forward pass is a staged computation: GPU 0 runs its layers, hands the activations to GPU 1, which runs its layers, and so on through the last stage.
Contrast with tensor parallelism (each layer's weight matrices split across GPUs) and expert parallelism (different MoE experts on different GPUs).
Canonical wiki reference: Cloudflare Infire supports pipeline parallelism alongside tensor + expert parallelism. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
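The conveyor-belt structure can be sketched in a few lines. This is a toy illustration with made-up names (`split_layers`, `pipeline_forward`) and plain Python functions standing in for transformer blocks; a real deployment replaces the inner loop boundary with a point-to-point GPU transfer.

```python
# Toy sketch of a pipeline-parallel forward pass. Each "GPU" owns a
# contiguous slice of the layer stack and hands its output activations
# to the next stage, like a conveyor belt.

def split_layers(layers, num_stages):
    """Partition the layer list into num_stages contiguous slices."""
    per_stage = len(layers) // num_stages
    return [layers[i * per_stage:(i + 1) * per_stage] for i in range(num_stages)]

def pipeline_forward(stages, x):
    """Run activations through each stage in order. The stage boundary is
    the only point of inter-GPU communication in a real deployment."""
    for stage_layers in stages:
        for layer in stage_layers:
            x = layer(x)  # in reality: a transformer block on that GPU
        # here a real system would send `x` over NVLink / InfiniBand
    return x

# 8 toy "layers" (each adds 1), split across 4 stages of 2 layers each.
layers = [lambda v: v + 1 for _ in range(8)]
stages = split_layers(layers, num_stages=4)
print(pipeline_forward(stages, 0))  # → 8
```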
Why split by layer¶
- Memory capacity. A 560 GB model doesn't fit in one GPU's 80 GB VRAM; split the layer stack evenly and each GPU holds only a fraction.
- Bandwidth economy vs tensor parallelism. Pipeline stages communicate only at stage boundaries (activation tensors between adjacent stages), not every layer. This makes pipeline parallelism far more tolerant of slower inter-node links (InfiniBand / RoCE) than tensor parallelism.
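The memory-capacity argument is simple division; a quick check on the 560 GB example above (figures are illustrative and ignore activation/KV headroom, which is why the comparison is strict):

```python
# Back-of-envelope check: weights split evenly by layer across G GPUs
# must fit under each GPU's VRAM budget.

def weights_per_gpu_gb(model_gb, num_gpus):
    return model_gb / num_gpus

model_gb, vram_gb = 560, 80
for g in (4, 8, 16):
    share = weights_per_gpu_gb(model_gb, g)
    fits = share < vram_gb  # strictly less: activations and KV need room too
    print(f"{g} GPUs -> {share:.0f} GB/GPU, fits in {vram_gb} GB: {fits}")
```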
The pipeline-balance problem¶
Pipeline parallelism's central pain:
- If the layers-per-GPU partition is uneven, the slow stage bottlenecks the whole pipeline and faster stages starve waiting on upstream.
- During startup / drain, stages at the front fill while stages at the back idle (the classic pipeline bubble). Steady-state utilisation is bounded by 1 - (bubble time / total time); under the standard GPipe analysis with p stages and m equal-cost microbatches, this works out to m / (m + p - 1).
Cloudflare names this explicitly:
"For pipeline parallelism, Infire attempts to properly load balance all stages of the pipeline, in order to prevent the GPUs of one stage from starving while other stages are executing." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
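The bubble bound above can be made concrete. This is the standard GPipe-style accounting (not anything the Cloudflare post discloses): with p stages and m equal-cost microbatches, the fill/drain bubble occupies p - 1 of the m + p - 1 total time slots.

```python
# GPipe-style bubble accounting for a pipeline with p stages and m
# microbatches of equal cost.

def bubble_fraction(p, m):
    """Fraction of slots lost to pipeline fill/drain."""
    return (p - 1) / (m + p - 1)

def utilisation(p, m):
    """Steady-state utilisation bound, = m / (m + p - 1)."""
    return 1 - bubble_fraction(p, m)

# Small m (interactive inference) means a large bubble; large m amortises it.
for m in (1, 4, 32):
    print(f"p=4, m={m}: utilisation bound = {utilisation(4, m):.2f}")
```

Note how m = 1 yields only 25% utilisation at p = 4, which is exactly the inference-side pain described in the next section.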
Inference vs training¶
Pipeline parallelism was first popularised in training (GPipe, PipeDream, Megatron-LM's virtual pipeline) where large micro-batches amortise the bubble cost. In inference — especially interactive agent inference with small per-request batches — the bubble problem is more acute: it's harder to fill the pipeline.
This is the structural reason inference-side tensor parallelism is often preferred at small model scales; pipeline parallelism's appeal grows as the model exceeds tensor-parallel-friendly topologies (i.e. multi-node).
Relationship to the other parallelism axes¶
| Axis | What's split | Communication per forward pass | Typical placement |
|---|---|---|---|
| Pipeline parallelism | transformer layers | point-to-point between adjacent stages | across nodes (tolerant of InfiniBand / RoCE) |
| Tensor parallelism | weight matrices | all-reduce per layer | intra-node (NVLink) |
| Expert parallelism | MoE experts | all-to-all for routing | intra- or inter-node |
Axes compose: common pattern is tensor-parallel within each pipeline stage (intra-node NVLink), pipeline-parallel across stages (inter-node).
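One common way to express that composition is a rank layout where each contiguous block of tensor-parallel ranks forms one pipeline stage on one node. This is a hypothetical mapping for illustration, not Infire's disclosed scheme:

```python
# Hypothetical rank layout composing the two axes: tensor-parallel groups
# packed within a node (fast NVLink), pipeline stages spanning nodes.

def placement(rank, tp_size):
    """Map a global rank to (pipeline_stage, tp_rank), assuming each
    tp_size-sized contiguous block of ranks is one stage on one node."""
    return rank // tp_size, rank % tp_size

# 16 GPUs as 2 nodes x 8 GPUs: TP degree 8 within a node, PP degree 2 across.
for rank in (0, 7, 8, 15):
    stage, tp_rank = placement(rank, tp_size=8)
    print(f"rank {rank}: stage {stage}, tp_rank {tp_rank}")
```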
Cloudflare's posture:
"For most models, utilizing both pipeline parallelism and tensor parallelism in tandem provides the best balance of throughput and latency." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
Effect on KV cache¶
Each GPU holds KV state for only the layers it owns, so the per-GPU KV cache footprint is roughly (per-request KV across all layers / number of pipeline stages) × (requests in flight). Pipeline parallelism naturally distributes KV memory pressure, which is one of the reasons it pairs well with tensor parallelism to maximise KV-cache headroom (Infire's reason for getting Kimi K2.5 onto 8× H100 with >30 GiB free for KV; see systems/infire).
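The per-stage KV footprint follows from standard cache arithmetic: 2 tensors (K and V) × KV heads × head dim × dtype bytes per token per layer, with the layer count divided by the number of stages. The model shapes below are hypothetical, not Infire's actual configuration:

```python
# Sketch of the per-GPU KV-cache footprint under pipeline parallelism.
# Each stage stores K and V only for its own layers, so the per-request
# cost divides by the number of stages.

def kv_bytes_per_gpu(num_layers, num_stages, kv_heads, head_dim,
                     seq_len, requests_in_flight, dtype_bytes=2):
    layers_here = num_layers // num_stages         # layers owned by this stage
    per_token = 2 * kv_heads * head_dim * dtype_bytes  # K and V, one layer
    return layers_here * per_token * seq_len * requests_in_flight

# Hypothetical 80-layer model, fp16 KV, 8 KV heads of dim 128:
gib = kv_bytes_per_gpu(num_layers=80, num_stages=4, kv_heads=8, head_dim=128,
                       seq_len=8192, requests_in_flight=32) / 2**30
print(f"{gib:.1f} GiB of KV per GPU")  # → 20.0 GiB of KV per GPU
```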
Design considerations¶
- Partition strategy — equal-layer-count vs equal-compute-cost-per-partition.
- Scheduling — 1F1B (one-forward-one-backward) vs interleaved vs GPipe; choice changes bubble time.
- Microbatching — splitting each batch into smaller chunks that pipeline simultaneously reduces bubble time.
- Continuous batching interaction — inference-side PP is usually combined with continuous batching so the pipeline never fully drains.
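One simple instance of the "equal-compute-cost" partition strategy above is a greedy cut: walk the layer list and start a new stage once the running cost reaches the per-stage target. This is an illustration only; Infire's actual partitioning algorithm is not disclosed.

```python
# Greedy cost-aware layer partitioning: split layers into contiguous
# stages of roughly equal total cost.

def partition_by_cost(costs, num_stages):
    """Return num_stages contiguous (start, end) index ranges over `costs`
    with roughly equal total cost per range."""
    target = sum(costs) / num_stages
    stages, start, running = [], 0, 0.0
    for i, c in enumerate(costs):
        running += c
        # Cut here if we've hit the target, but leave at least one layer
        # for each remaining stage.
        remaining_layers = len(costs) - (i + 1)
        remaining_stages = num_stages - len(stages) - 1
        if running >= target and remaining_layers >= remaining_stages:
            stages.append((start, i + 1))
            start, running = i + 1, 0.0
            if len(stages) == num_stages - 1:
                break
    stages.append((start, len(costs)))
    return stages

# Toy costs: a cheap first layer, a heavy last layer (e.g. LM head).
costs = [1, 2, 2, 2, 2, 2, 2, 5]
print(partition_by_cost(costs, 3))  # → [(0, 4), (4, 7), (7, 8)]
```

An equal-layer-count split of the same stack would put cost 5 alone against cost 7 elsewhere; the cost-aware cut yields stage totals of 7, 6 and 5.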
Caveats¶
- Cloudflare's post is shallow on PP internals; it stays at the "we load-balance stages" level.
- Partitioning algorithm, scheduling algorithm, microbatch size not disclosed.
- Per-model PP degree not disclosed.
- Pipeline-bubble-time numbers not disclosed.
Seen in¶
- sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models — Infire supports pipeline parallelism, with load-balancing-across-stages named as a key implementation concern.
- sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development — training-side PP: eBay's e-Llama continued-pretraining uses PP as one axis of concepts/3d-parallelism (DP × TP × PP) under Megatron-LM, spanning groups of nodes across InfiniBand (the slower inter-node fabric), the canonical hardware-to-axis mapping where PP is the natural fit for inter-node communication. Specific PP degree not disclosed.
Related¶
- concepts/tensor-parallelism / concepts/expert-parallelism / concepts/data-parallelism / concepts/3d-parallelism / concepts/multi-gpu-serving
- concepts/kv-cache
- systems/infire / systems/vllm / systems/workers-ai / systems/megatron-lm / systems/e-llama / systems/infiniband
- companies/cloudflare / companies/ebay