
CONCEPT Cited by 2 sources

Mixture of Experts (MoE / MMoE)

Definition

Mixture of Experts (MoE) is a neural network architecture pattern in which a set of specialist subnetworks ("experts") process inputs and a gating network routes each input (per token, per example, or per task) to a subset of experts, whose outputs are then combined. MMoE (Multi-gate Mixture of Experts), the multi-task variant introduced by Ma et al. (Google, 2018) and widely used in recsys, gives each task its own gate. Each gate learns an independent soft routing over the experts, letting each task develop its own pattern of expert usage while the experts themselves remain shared.

         (input features)
     ┌──────────┼──────────┐
     ▼          ▼          ▼
  Expert 1   Expert 2   Expert N
     │          │          │
     └──────────┼──────────┘
           (expert outputs)
       ┌────────┼────────┐
       ▼        ▼        ▼
   gate A   gate B   gate C
       │        │        │
    Task A   Task B   Task C
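
The diagram above can be sketched as a forward pass. This is a minimal NumPy illustration that follows the Ma et al. formulation, in which each task gate is a softmax over a linear projection of the input features and the task output is a tower applied to the gate-weighted sum of expert outputs. The single-layer ReLU experts, the linear sigmoid towers, all dimensions, and every name (mmoe_forward, expert_w, gate_w, tower_w) are assumptions made for illustration; none of them come from the sources cited on this page.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mmoe_forward(x, expert_w, gate_w, tower_w):
    """Dense MMoE forward pass (illustrative shapes, hypothetical names).

    x:        (batch, d_in)
    expert_w: (n_experts, d_in, d_hidden)  one linear+ReLU expert each
    gate_w:   (n_tasks, d_in, n_experts)   one linear gate per task
    tower_w:  (n_tasks, d_hidden)          one linear head per task
    """
    # Every expert processes every example (dense, no top-k selection).
    expert_out = np.maximum(0.0, np.einsum("bd,edh->beh", x, expert_w))
    # Each task gate produces a softmax distribution over the experts.
    gates = softmax(np.einsum("bd,tde->bte", x, gate_w), axis=-1)
    # Each task mixes the shared expert outputs with its own weights.
    mixed = np.einsum("bte,beh->bth", gates, expert_out)
    # Per-task tower: a linear layer followed by a sigmoid (e.g. a CTR head).
    logits = np.einsum("bth,th->bt", mixed, tower_w)
    probs = 1.0 / (1.0 + np.exp(-logits))
    return probs, gates

rng = np.random.default_rng(0)
batch, d_in, d_hidden, n_experts, n_tasks = 4, 8, 5, 3, 2
x = rng.normal(size=(batch, d_in))
expert_w = rng.normal(size=(n_experts, d_in, d_hidden))
gate_w = rng.normal(size=(n_tasks, d_in, n_experts))
tower_w = rng.normal(size=(n_tasks, d_hidden))

probs, gates = mmoe_forward(x, expert_w, gate_w, tower_w)
print(probs.shape)   # one prediction per (example, task)
print(gates[0])      # per-task expert weights for the first example
```

Because the gates are functions of the input, two tasks can weight the same experts differently for the same example, which is the "own pattern of expert usage" described in the definition.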

MoE variants differ in:

  • Routing granularity — per-token (LLMs: Switch Transformer, Mixtral), per-example (recsys), per-task (MMoE).
  • Sparsity — top-1, top-k, or dense (all experts used with soft weights).
  • Gate architecture — softmax, noisy top-k, learned vs heuristic.
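
The sparsity options above differ only in how gate logits become expert weights. A small sketch, assuming a plain softmax gate and a mask-then-softmax top-k gate (the noise term of noisy top-k gating is omitted, and the function names and logit values are hypothetical):

```python
import numpy as np

def dense_gate(logits):
    # Soft weights over all experts: every expert gets a non-zero share.
    z = logits - logits.max()
    w = np.exp(z)
    return w / w.sum()

def top_k_gate(logits, k):
    # Keep the k largest logits, mask the rest to -inf, then softmax,
    # so masked experts receive exactly zero weight and can be skipped.
    keep = np.argsort(logits)[-k:]
    masked = np.full_like(logits, -np.inf)
    masked[keep] = logits[keep]
    return dense_gate(masked)

logits = np.array([2.0, 0.5, 1.0, -1.0])  # made-up gate logits for 4 experts
dense = dense_gate(logits)
top1 = top_k_gate(logits, 1)
top2 = top_k_gate(logits, 2)
print(dense, top1, top2)
```

With top-1 or top-k gating only the selected experts need to run, which is where the compute saving of sparse MoE comes from; a dense gate saves nothing at inference because every expert contributes.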

Canonical wiki instance — Pinterest ads engagement model

Pinterest uses MMoE as one of the key architectural elements in the shared trunk of its unified ads engagement model (Source: sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces):

"We incorporated key architectural elements from each surface such as MMoE [1] and long user sequences [2]."

Footnote [1] references Pinterest's prior post Multi-gate-Mixture-of-Experts (MMoE) model architecture and knowledge distillation in Ads Engagement modeling development — not ingested on the wiki.

The load-bearing Pinterest finding: MMoE only delivered consistent gains when integrated into the unified model trained on multi-surface data. Applying MMoE to a single surface in isolation produced "no consistent gains or unfavorable cost-gain trade-off". The interpretation: MMoE's expert specialisation needs heterogeneous inputs to justify its cost — surface-specific training data is too homogeneous to exploit the expert routing effectively, but combined multi-surface data gives the experts meaningful sub-populations to specialise over.

Why MMoE in recsys

  • Expert specialisation without task interference. Classical MTL with a shared bottom suffers when tasks have conflicting gradients. MMoE gives each task its own gate-weighted combination of expert outputs, letting each task "pick" the share of common capacity that helps it.
  • Soft specialisation. Each task learns a soft-routing gate — not a hard task-to-expert mapping. Experts can be shared flexibly across tasks.
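  • Cost scales sub-linearly with the number of tasks. Experts are shared, so expert parameters total ≈ N × expert_size regardless of task count, and each additional task adds only a gate and a task tower. Per-example compute is not reduced, though: each task uses all experts with learned weights, so routing is typically dense, not sparse, at the task gate.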
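
The parameter arithmetic behind the cost bullet can be made concrete. The sketch below assumes single-layer experts with bias, bias-free linear gates, and a one-unit linear tower per task; the layer sizes are invented for illustration and are not Pinterest's (the post does not disclose them).

```python
def mmoe_params(n_experts, n_tasks, d_in, d_hidden):
    # Hypothetical layer shapes, chosen only to show how the count scales.
    expert = d_in * d_hidden + d_hidden   # linear layer + bias, shared by all tasks
    gate = d_in * n_experts               # linear gate without bias, one per task
    tower = d_hidden + 1                  # linear head + bias, one per task
    return n_experts * expert + n_tasks * (gate + tower)

def marginal_task_cost(n_experts, n_tasks, d_in, d_hidden):
    # Extra parameters needed to add one more task to an existing model.
    return (mmoe_params(n_experts, n_tasks + 1, d_in, d_hidden)
            - mmoe_params(n_experts, n_tasks, d_in, d_hidden))

total = mmoe_params(n_experts=8, n_tasks=2, d_in=256, d_hidden=128)
extra = marginal_task_cost(n_experts=8, n_tasks=2, d_in=256, d_hidden=128)
print(total, extra, extra / total)
```

Under these assumed sizes a third task adds under 1% to the parameter count, because the expert block dominates and is not duplicated per task.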

MMoE vs the LLM MoE variants

  • Per-task gates (MMoE) — recsys/ranking default. Gates select expert combinations per task.
  • Per-token gates (Switch, GShard, Mixtral) — LLM default. Gates select top-k experts per input token, typically sparse.

Pinterest's MMoE is the per-task variant, aligned with the multi-task ads-ranking use case (HF CTR + SR CTR as separate tasks).

Caveats

  • Pinterest doesn't disclose expert count, expert capacity, gate architecture, dense vs sparse routing, or the knowledge-distillation mechanism hinted at in the referenced prior post's title.
  • MMoE in isolation didn't work per the Pinterest post — the unified-model context (multi-surface data + long sequences + surface-specific tower trees + surface-specific calibration) is load-bearing.
  • Distinct from LLM-era MoE. The sparse per-token MoE used in Mixtral / Switch Transformer is a different deployment shape; the cost analysis and failure modes don't transfer directly.

Seen in

Frontier-LLM MoE landscape (Redpanda 2026-01-13)

By end-2025, MoE is the dominant frontier-LLM architectural shape. Peter Corless's Redpanda post enumerates the industry:

  • OpenAI GPT-4: "leaked by George Hotz (geohotz) in 2023 that OpenAI's GPT-4 was actually not a single 1.76 trillion parameter model, but 8 × 220 billion parameter models running in parallel." Per-token top-k routing (specifics undisclosed). GPT-5 / GPT-5.1 presumed same shape.
  • Google Gemini: "has been an MoE since 1.5" (Feb 2024).
  • xAI Grok: "has been an MoE since Grok-1."
  • Anthropic Claude: the named dense-transformer holdout. "Anthropic Claude remains a single model, known as a Dense Transformer."

Not covered in the Corless post: DeepSeek-MoE, Mistral's Mixtral, Qwen-MoE, Meta's Llama-MoE variants — other widely-deployed MoE LLMs that belong in the same category but aren't named.

Shared constraint across dense and MoE frontier models: both are offline-batch pre-trained. The Corless post's thesis is that this — not the dense-vs-sparse architectural choice — is the load-bearing limitation for the next wave of AI capability (Source: sources/2026-01-13-redpanda-the-convergence-of-ai-and-data-streaming-part-1-the-coming-brick-walls).
