
CONCEPT Cited by 1 source

Expert parallelism (MoE serving)

Definition

Expert parallelism is a multi-GPU model-sharding strategy specific to Mixture-of-Experts (MoE) models: different experts (sub-networks activated by a learned router per-token) live on different GPUs. Each token's path through the network is determined by the router; the activation is sent to whichever GPU hosts its selected expert(s), evaluated, and the output is gathered back.
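The dispatch step can be sketched in a few lines. This is an illustrative numpy simulation (not any framework's real API), assuming 8 experts spread evenly over 4 GPUs with top-2 routing:

```python
import numpy as np

# Illustrative sketch: top-K routing and token-to-GPU dispatch.
# Sizes are assumptions for the example: 8 experts, 4 GPUs, top-2.
N_EXPERTS, N_GPUS, TOP_K = 8, 4, 2
rng = np.random.default_rng(0)

def route(router_logits: np.ndarray) -> np.ndarray:
    """Pick the top-K expert indices per token from router logits."""
    # Sort descending, keep the first K columns -> shape (tokens, K).
    return np.argsort(-router_logits, axis=1)[:, :TOP_K]

tokens = 6
logits = rng.normal(size=(tokens, N_EXPERTS))
experts = route(logits)                  # (6, 2): expert ids per token

# Static placement: experts 0-1 on GPU 0, 2-3 on GPU 1, etc.
gpu_of_expert = np.arange(N_EXPERTS) // (N_EXPERTS // N_GPUS)
dest_gpus = gpu_of_expert[experts]       # which GPU each activation visits
```

Each row of `dest_gpus` names the devices that token's activation is sent to; the expert outputs are then gathered back and combined (typically weighted by the router's probabilities).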

Contrast with tensor parallelism (split individual weight matrices within every layer) and pipeline parallelism (different layers on different GPUs). Expert parallelism is MoE-specific — it's meaningful only because MoE layers have distinct independently-trainable parameter blocks (the experts).

Canonical wiki reference: Cloudflare Infire supports expert parallelism alongside pipeline + tensor. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

Why MoE needs its own parallelism axis

In MoE models (e.g. Mixtral, DeepSeek-V3, likely Kimi K2.5), each feed-forward layer is replaced with N experts; a router picks K experts per token (typically K ∈ {1, 2}). Experts are independently parameterised — they don't share weights — so:

  • Total parameter count of an MoE layer = N × expert-size. For large N this is a huge fraction of model weights.
  • Only K of the N experts run per token, so per-token compute touches only a K/N fraction of the layer's parameters.

This memory-vs-compute mismatch is the defining MoE property. Expert parallelism spreads the experts across devices (roughly N/G experts per GPU at expert-parallel degree G), so aggregate memory scales with GPU count without linearly scaling per-token compute. Serving a large MoE model generally requires some form of expert parallelism, because a single GPU cannot hold all the experts.
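The mismatch is easy to see with back-of-envelope numbers. The figures below are illustrative round numbers in the spirit of Mixtral-style 8-expert / top-2 models, not exact published parameter counts:

```python
# Back-of-envelope memory-vs-compute mismatch (illustrative numbers,
# not exact figures for any specific model).
n_experts, top_k = 8, 2
expert_params = 7e9                            # assumed params per expert

total_moe_params = n_experts * expert_params   # must all sit in GPU memory
active_moe_params = top_k * expert_params      # actually run per token

# Memory holds 8 experts' weights, but each token only computes with 2:
# the active fraction is K/N = 2/8 = 0.25.
assert active_moe_params / total_moe_params == top_k / n_experts
```

Memory demand tracks `total_moe_params`; FLOPs per token track `active_moe_params`. Expert parallelism shards the former across GPUs while leaving the latter unchanged.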

The communication pattern: all-to-all

Each token's activation is dispatched to the GPU(s) that hold its selected expert(s) — a routing decision that depends on the token. The aggregate communication across a batch is an all-to-all exchange: each GPU sends a subset of its tokens to each other GPU, based on the router's per-token choices.

All-to-all is more demanding than the all-reduce/all-gather collectives used by tensor parallelism: every GPU may exchange tokens with every other GPU, total traffic grows with batch size, top-K, and hidden dimension, and throughput is sensitive to imbalance in expert selection (hot experts on some GPUs, cold on others).
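The exchange can be visualised as a send matrix. A minimal numpy sketch (illustrative names; top-1 routing for simplicity, top-K just multiplies the traffic):

```python
import numpy as np

# Sketch of the all-to-all dispatch pattern: count how many tokens each
# GPU ships to each other GPU in one routed step. All names are
# illustrative, not a real collective API.
rng = np.random.default_rng(1)
n_gpus, tokens_per_gpu = 4, 16

# Each GPU's router assigns every local token a destination GPU.
dest = rng.integers(0, n_gpus, size=(n_gpus, tokens_per_gpu))

# send_matrix[i, j] = tokens GPU i sends to GPU j this step.
send_matrix = np.stack([np.bincount(dest[g], minlength=n_gpus)
                        for g in range(n_gpus)])
```

Row i is GPU i's send counts and column j is GPU j's receive counts; hot experts show up as heavy columns, which is exactly the imbalance hazard described below.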

The load-imbalance hazard

If the router over-selects a small number of experts (unbalanced load), the GPUs hosting those experts become bottlenecks while the cold-expert GPUs idle. Training typically adds an auxiliary "load-balancing loss" that pushes the router toward even expert usage; at inference time, mitigation depends on how imbalanced production traffic is relative to the training distribution.
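One common form of the auxiliary term is the Switch-Transformer-style loss, N · Σᵢ fᵢPᵢ, where fᵢ is the fraction of tokens whose top-1 expert is i and Pᵢ is the mean router probability for expert i; it evaluates to 1.0 for a perfectly balanced, confident router and grows toward N as routing collapses onto one expert. A numpy sketch (names are illustrative):

```python
import numpy as np

# Switch-Transformer-style load-balancing loss: N * sum_i f_i * P_i,
# with f_i the fraction of tokens top-1-routed to expert i and P_i the
# mean router probability mass on expert i.
def load_balance_loss(router_logits: np.ndarray) -> float:
    n_experts = router_logits.shape[1]
    probs = np.exp(router_logits - router_logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)        # softmax per token
    top1 = probs.argmax(axis=1)
    f = np.bincount(top1, minlength=n_experts) / len(top1)
    p = probs.mean(axis=0)
    return float(n_experts * np.sum(f * p))

# Balanced, confident routing: token t strongly prefers expert t % 4.
balanced = np.zeros((1000, 4))
balanced[np.arange(1000), np.arange(1000) % 4] = 10.0

# Collapsed routing: every token strongly prefers expert 0.
skewed = np.zeros((1000, 4))
skewed[:, 0] = 10.0
```

`load_balance_loss(balanced)` is close to 1.0 while `load_balance_loss(skewed)` approaches 4.0, so gradient descent on this term pushes the router toward even usage.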

Composition with other parallelism axes

Expert parallelism composes with tensor + pipeline:

  • Within each expert: tensor-parallel that expert's weight matrices across GPUs (if the expert is itself large).
  • Across pipeline stages: pipeline-parallel across transformer blocks; expert-parallel within each MoE layer.
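One way to picture the composition is as a flattened device grid, pipeline × expert × tensor. The layout below is an illustrative convention, not Cloudflare's disclosed scheme:

```python
# Illustrative 3-axis device layout: PP pipeline stages, EP expert
# groups per stage, TP tensor shards per expert group. Not any
# particular framework's rank-assignment scheme.
PP, EP, TP = 2, 4, 2            # 2 stages x 4 expert groups x 2-way tensor

def gpu_id(stage: int, expert_group: int, shard: int) -> int:
    """Flatten (pipeline, expert, tensor) coordinates to a device rank."""
    return (stage * EP + expert_group) * TP + shard

n_gpus = PP * EP * TP            # 16 devices total in this layout
```

Every (stage, expert group, tensor shard) triple maps to a unique rank, so each MoE layer's all-to-all runs among the EP × TP devices of its own pipeline stage.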

Cloudflare's posture: pipeline + tensor parallelism is the baseline, expert parallelism is supported for MoE models:

"Since we initially launched Infire, we had to add support for multi-GPU, letting the inference engine run across multiple GPUs in either pipeline-parallel or tensor-parallel modes with expert-parallelism supported as well." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

Design considerations

  • Expert placement — how experts map to GPUs (can be static or dynamically rebalanced).
  • Capacity factor — a small slack in how many tokens each expert can process per step, to avoid dropping tokens under imbalance.
  • All-to-all implementation — NCCL, custom kernels, fused with compute; this is the dominant overhead.
  • Routing algorithm — top-K, noisy top-K, soft routing.
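The capacity-factor arithmetic from the list above is simple to state precisely. The formula is the standard MoE one; the 1.25 slack value is an illustrative choice, not a figure from the source:

```python
import math

# Standard MoE expert-capacity formula: each expert accepts at most
# `capacity` tokens per step; tokens beyond that are dropped or
# re-routed. The 1.25 capacity factor is an illustrative value.
def expert_capacity(tokens: int, n_experts: int, top_k: int,
                    capacity_factor: float = 1.25) -> int:
    return math.ceil(tokens * top_k / n_experts * capacity_factor)

# 1024 tokens, 8 experts, top-2: perfectly even load is 256 tokens per
# expert; a 1.25 factor leaves 64 tokens of headroom for imbalance.
capacity = expert_capacity(1024, 8, 2)   # -> 320
```

A factor of 1.0 drops tokens under any imbalance; larger factors waste compute on padding when load is even, which is the trade-off the "capacity factor" bullet refers to.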

Caveats

  • Cloudflare post mentions expert parallelism only in passing — just as an enumeration item alongside pipeline + tensor.
  • No expert-placement algorithm disclosed.
  • No all-to-all implementation detail disclosed.
  • Kimi K2.5's architecture (MoE or dense) is not explicitly disclosed in the post — the mention of expert parallelism is generic and may or may not apply to Kimi specifically.
