CONCEPT Cited by 2 sources
Mixture of Experts (MoE / MMoE)¶
Definition¶
Mixture of Experts (MoE) is a neural-network architecture pattern in which a set of specialist subnetworks ("experts") process the input and a gating network routes each input (per token, per example, or per task) to a subset of experts, whose outputs are then combined. MMoE (Multi-gate Mixture of Experts) is the multi-task variant introduced by Ma et al. (Google, 2018) and widely used in recommender systems: each task has its own gate that learns an independent soft routing over the experts, so every task develops its own pattern of expert usage while the experts themselves remain shared.
           (input features)
                   │
        ┌──────────┼──────────┐
        ▼          ▼          ▼
    Expert 1   Expert 2   Expert N
        │          │          │
        └──────────┼──────────┘
                   │
           (expert outputs)
                   │
          ┌────────┼────────┐
          ▼        ▼        ▼
       gate A   gate B   gate C
          │        │        │
       Task A   Task B   Task C
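A minimal sketch of a dense MMoE forward pass for a single example, in dependency-free Python. The single-layer ReLU experts, the linear towers, the layer sizes, and every function name are illustrative assumptions, not details taken from Ma et al. or from Pinterest. Each task's gate computes its softmax weights from the input features and applies them to the shared expert outputs.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(weights, x):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def relu(v):
    return [max(0.0, a) for a in v]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def mmoe_forward(x, expert_weights, gate_weights, tower_weights):
    """Dense MMoE forward pass for one example (hypothetical minimal layout).

    expert_weights: one (hidden x in_dim) matrix per expert, shared by all tasks.
    gate_weights:   one (num_experts x in_dim) matrix per task.
    tower_weights:  one hidden-sized vector per task (a linear tower, for brevity).
    Returns (per-task scores, per-task gate weights).
    """
    # Every expert sees the same input features.
    expert_outputs = [relu(matvec(w, x)) for w in expert_weights]
    hidden_dim = len(expert_outputs[0])
    scores, gates = [], []
    for gate_w, tower_w in zip(gate_weights, tower_weights):
        # Each task's gate reads the input and yields soft weights over experts.
        g = softmax(matvec(gate_w, x))
        # Task-specific mixture: weighted sum of the shared expert outputs.
        mixed = [sum(g[i] * expert_outputs[i][j] for i in range(len(g)))
                 for j in range(hidden_dim)]
        scores.append(sum(t * m for t, m in zip(tower_w, mixed)))
        gates.append(g)
    return scores, gates

# Hypothetical sizes: 4 input features, 3 experts of width 3, 2 tasks.
rng = random.Random(0)
in_dim, hidden, num_experts, num_tasks = 4, 3, 3, 2
experts = [random_matrix(hidden, in_dim, rng) for _ in range(num_experts)]
gate_ws = [random_matrix(num_experts, in_dim, rng) for _ in range(num_tasks)]
towers = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(num_tasks)]

scores, gates = mmoe_forward([1.0, 0.5, -0.5, 2.0], experts, gate_ws, towers)
```

Here `gates[0]` and `gates[1]` are two different weightings over the same three expert outputs, which is the "per-task soft routing over shared experts" property in miniature.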
MoE variants differ in:
- Routing granularity: per-token (LLMs: Switch Transformer, Mixtral), per-example (recsys), or per-task (MMoE, where each task owns a gate that is still evaluated on every example).
- Sparsity: top-1, top-k, or dense (all experts used, with soft weights).
- Gate architecture: plain softmax or noisy top-k; learned or heuristic.
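The sparsity options above can be sketched with a single gate function, assuming the gate has already produced one logit per expert. The name `top_k_gate` and the example logits are hypothetical, and the noise term used by noisy top-k gating is deliberately left out to keep the sketch short.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_gate(logits, k):
    """Keep the k largest logits, softmax over them, and zero out the rest.

    k == len(logits) reproduces the dense softmax gate; k == 1 is top-1 routing.
    """
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k largest logits (ties broken by lower index).
    keep = sorted(range(len(logits)), key=lambda i: (-logits[i], i))[:k]
    kept_weights = softmax([logits[i] for i in keep])
    weights = [0.0] * len(logits)
    for i, w in zip(keep, kept_weights):
        weights[i] = w
    return weights

logits = [2.0, 0.5, 1.0, -1.0]   # hypothetical gate logits for 4 experts
dense = top_k_gate(logits, k=4)  # every expert gets a nonzero weight
top2 = top_k_gate(logits, k=2)   # only experts 0 and 2 are active
top1 = top_k_gate(logits, k=1)   # all weight on expert 0
```

With a sparse gate, only the experts with nonzero weight need to be evaluated, which is where the compute saving of top-k routing comes from. A dense gate gives up that saving in exchange for a smooth mixture over all experts.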
Canonical wiki instance — Pinterest ads engagement model¶
Pinterest uses MMoE as one of the key architectural elements in the shared trunk of its unified ads engagement model (Source: sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces):
"We incorporated key architectural elements from each surface such as MMoE [1] and long user sequences [2]."
Footnote [1] references Pinterest's prior post Multi-gate-Mixture-of-Experts (MMoE) model architecture and knowledge distillation in Ads Engagement modeling development — not ingested on the wiki.
The load-bearing Pinterest finding: MMoE only delivered consistent gains when integrated into the unified model trained on multi-surface data. Applying MMoE to a single surface in isolation produced "no consistent gains or unfavorable cost-gain trade-off". The interpretation: MMoE's expert specialisation needs heterogeneous inputs to justify its cost — surface-specific training data is too homogeneous to exploit the expert routing effectively, but combined multi-surface data gives the experts meaningful sub-populations to specialise over.
Why MMoE in recsys¶
- Expert specialisation without task interference. Classical MTL with shared bottom layers suffers when tasks have conflicting gradients. MMoE routes tasks to task-weighted combinations of experts, letting each task "pick" the subset of shared capacity that helps it.
- Soft specialisation. Each task learns a soft-routing gate — not a hard task-to-expert mapping. Experts can be shared flexibly across tasks.
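- Cost scales sub-linearly in the number of tasks. The N experts are shared, so expert parameters stay at roughly N × expert_size however many tasks are added; each extra task contributes only its own gate (and task tower). The task gates are typically dense rather than sparse, so every task still evaluates all N experts with learned weights.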
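The cost argument in the last bullet can be made concrete with a back-of-the-envelope parameter count. This is a sketch under stated assumptions: single-layer experts, linear softmax gates, scalar linear towers, and made-up sizes (256 input features, 8 experts of width 128). None of these figures come from Pinterest, which does not disclose its configuration.

```python
def dense_layer_params(in_dim, out_dim):
    """Weights plus biases of one fully connected layer."""
    return in_dim * out_dim + out_dim

def mmoe_params(in_dim, expert_hidden, num_experts, num_tasks):
    """Shared experts, plus one gate and one scalar tower per task."""
    experts = num_experts * dense_layer_params(in_dim, expert_hidden)
    gates = num_tasks * dense_layer_params(in_dim, num_experts)
    towers = num_tasks * dense_layer_params(expert_hidden, 1)
    return experts + gates + towers

def separate_models_params(in_dim, expert_hidden, num_experts, num_tasks):
    """Baseline: every task gets a private copy of the experts plus a tower."""
    per_task = (num_experts * dense_layer_params(in_dim, expert_hidden)
                + dense_layer_params(expert_hidden, 1))
    return num_tasks * per_task

# Hypothetical sizes, chosen only for illustration.
IN_DIM, HIDDEN, EXPERTS = 256, 128, 8
for tasks in (1, 2, 4, 8):
    shared = mmoe_params(IN_DIM, HIDDEN, EXPERTS, tasks)
    private = separate_models_params(IN_DIM, HIDDEN, EXPERTS, tasks)
    print(f"tasks={tasks}: MMoE={shared:,} separate={private:,}")
```

Under these assumptions each additional task adds 2,185 parameters to the MMoE (one gate and one tower), while the separate-models baseline adds a full private set of experts. Compute is a different story: with dense gates, all 8 experts run for every example regardless of the task count.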
MMoE vs the LLM MoE variants¶
- Per-task gates (MMoE) — recsys/ranking default. Gates select expert combinations per task.
- Per-token gates (Switch, GShard, Mixtral) — LLM default. Gates select top-k experts per input token, typically sparse.
Pinterest's MMoE is the per-task variant, aligned with the multi-task ads-ranking use case (HF CTR + SR CTR as separate tasks).
Caveats¶
- Pinterest doesn't disclose expert count, expert capacity, gate architecture, dense vs sparse routing, or the knowledge-distillation mechanism hinted at in the referenced prior post's title.
- MMoE in isolation didn't work per the Pinterest post — the unified-model context (multi-surface data + long sequences + surface-specific tower trees + surface-specific calibration) is load-bearing.
- Distinct from LLM-era MoE. The sparse per-token MoE under Mixtral / Switch Transformer is a different deployment shape; the cost analysis and failure modes don't transfer directly.
Seen in¶
- 2026-03-03 Pinterest — Unifying Ads Engagement Modeling (sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces) — canonical: MMoE in the shared trunk of a unified multi-surface ads ranking model; didn't pay off in isolation, did pay off in the unified context.
- 2026-01-13 Redpanda, The convergence of AI and data streaming, Part 1 (sources/2026-01-13-redpanda-the-convergence-of-ai-and-data-streaming-part-1-the-coming-brick-walls): canonical for the frontier-LLM MoE landscape at end-2025. It names GPT-4 (8 × 220B, per the 2023 George Hotz leak rather than an OpenAI disclosure), Gemini (MoE since 1.5, Feb 2024), and Grok (MoE since Grok-1) as MoE models, with Anthropic Claude as the Dense Transformer holdout. All of this is framed by the batch-training boundary: dense or MoE, all frontier LLMs today are offline-batch trained.
Frontier-LLM MoE landscape (Redpanda 2026-01-13)¶
By end-2025, MoE is the dominant frontier-LLM architectural shape. Peter Corless's Redpanda post enumerates the industry:
- OpenAI GPT-4 — "leaked by George Hotz (geohotz) in 2023 that OpenAI's GPT-4 was actually not a single 1.76 trillion parameter model, but 8 × 220 billion parameter models running in parallel." Per-token top-k routing (specifics undisclosed). GPT-5 / GPT-5.1 presumed same shape.
- Google Gemini — "has been an MoE since 1.5" (Feb 2024).
- xAI Grok — "has been an MoE since Grok-1."
- Anthropic Claude — the named dense-transformer holdout: "Anthropic Claude remains a single model, known as a Dense Transformer."
Not covered in the Corless post: DeepSeek-MoE, Mistral's Mixtral, Qwen-MoE, Meta's Llama-MoE variants — other widely-deployed MoE LLMs that belong in the same category but aren't named.
Shared constraint across dense and MoE frontier models: both are offline-batch pre-trained. The Corless post's thesis is that this — not the dense-vs-sparse architectural choice — is the load-bearing limitation for the next wave of AI capability (Source: sources/2026-01-13-redpanda-the-convergence-of-ai-and-data-streaming-part-1-the-coming-brick-walls).
Related¶
- concepts/multi-task-learning — MMoE is a specific MTL architecture.
- concepts/multi-task-multi-label-ranking — the broader MTML ranking framing MMoE fits into.
- concepts/dense-transformer — the frontier-LLM foil architectural shape (Claude).
- concepts/frontier-model-batch-training-boundary — the shared training-shape limitation of dense and MoE frontier LLMs.
- systems/transformer — the architecture primitive.
- systems/pinterest-ads-engagement-model
- companies/pinterest
- companies/redpanda — canonical source of the frontier-LLM MoE-landscape disclosure.