CONCEPT Cited by 2 sources

Mixture of Experts (MoE / MMoE)¶

Definition¶

Mixture of Experts (MoE) is a neural network architecture pattern in which a set of specialist subnetworks ("experts") process inputs and a gating network routes each unit of work (a token, an example, or a task) to a subset of experts, whose outputs are then combined. MMoE (Multi-gate Mixture of Experts), the multi-task variant introduced by Ma et al. (Google, 2018) and common in recsys, gives each task its own gate. Each gate learns an independent soft routing over the experts, so every task develops its own pattern of expert usage while the experts themselves remain shared.

         (input features)
                │
     ┌──────────┼──────────┐
     ▼          ▼          ▼
  Expert 1   Expert 2   Expert N
     │          │          │
     └──────────┼──────────┘
                │
           (expert outputs)
                │
       ┌────────┼────────┐
       ▼        ▼        ▼
   gate A   gate B   gate C
       │        │        │
    Task A   Task B   Task C
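
The diagram can be read as the following forward pass. This is a minimal sketch rather than any production implementation: the dimensions, the random weights, the single-layer experts, and the names `mmoe_forward`, `expert_weights`, `gate_weights`, and `tower_weights` are all illustrative assumptions. Following Ma et al., each gate here computes its softmax from the input features and applies it to the shared expert outputs.

```python
import math
import random

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, x):
    """Multiply a (rows x len(x)) weight matrix by the vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def relu(v):
    return [max(0.0, a) for a in v]

def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def mmoe_forward(x, expert_weights, gate_weights, tower_weights):
    # Every expert sees the same input; no expert is skipped (dense).
    expert_outputs = [relu(linear(w, x)) for w in expert_weights]
    hidden = len(expert_outputs[0])
    predictions, mixtures = {}, {}
    for task, gate_w in gate_weights.items():
        # One gate per task: a softmax over experts computed from the input.
        mix = softmax(linear(gate_w, x))
        # Task-specific weighted sum of the shared expert outputs.
        combined = [
            sum(m * out[j] for m, out in zip(mix, expert_outputs))
            for j in range(hidden)
        ]
        # Hypothetical task tower: one logit squashed to a probability.
        logit = sum(w * c for w, c in zip(tower_weights[task], combined))
        predictions[task] = 1.0 / (1.0 + math.exp(-logit))
        mixtures[task] = mix
    return predictions, mixtures

# Illustrative sizes and weights only; nothing here comes from a real model.
rng = random.Random(0)
IN_DIM, HIDDEN, NUM_EXPERTS = 4, 3, 3
TASKS = ["task_a", "task_b", "task_c"]
expert_weights = [random_matrix(rng, HIDDEN, IN_DIM) for _ in range(NUM_EXPERTS)]
gate_weights = {t: random_matrix(rng, NUM_EXPERTS, IN_DIM) for t in TASKS}
tower_weights = {t: [rng.uniform(-1.0, 1.0) for _ in range(HIDDEN)] for t in TASKS}

x = [0.5, -1.0, 2.0, 0.1]
predictions, mixtures = mmoe_forward(x, expert_weights, gate_weights, tower_weights)
for t in TASKS:
    print(t, round(predictions[t], 3), [round(m, 3) for m in mixtures[t]])
```

Each task prints a different expert mixture for the same input, which is the "own pattern of expert usage" described above.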

MoE variants differ in:

Routing granularity — per-token (LLMs: Switch Transformer, Mixtral), per-example (recsys), per-task (MMoE).
Sparsity — top-1, top-k, or dense (all experts used with soft weights).
Gate architecture — softmax, noisy top-k, learned vs heuristic.
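
The sparsity axis can be made concrete with a small sketch. It assumes one common top-k formulation (keep the k largest gate logits, softmax over those, zero the rest) and omits the noise term of noisy top-k gating; the function names and logits are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dense_gate(logits):
    # Dense: every expert receives a non-zero soft weight.
    return softmax(logits)

def top_k_gate(logits, k):
    # Sparse: keep the k largest logits, renormalise over them, zero the rest.
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    kept_weights = softmax([logits[i] for i in keep])
    weights = [0.0] * len(logits)
    for i, w in zip(keep, kept_weights):
        weights[i] = w
    return weights

logits = [2.0, 0.5, 1.0, -1.0]  # illustrative gate logits for 4 experts
dense = dense_gate(logits)
top2 = top_k_gate(logits, 2)
top1 = top_k_gate(logits, 1)
print("dense:", [round(w, 3) for w in dense])
print("top-2:", [round(w, 3) for w in top2])
print("top-1:", top1)
```

With a sparse gate, experts whose weight is zero never need to run, which is where the compute saving comes from. A dense gate saves nothing at inference time because every expert contributes.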

Canonical wiki instance — Pinterest ads engagement model¶

Pinterest uses MMoE as one of the key architectural elements in the shared trunk of its unified ads engagement model (Source: sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces):

"We incorporated key architectural elements from each surface such as MMoE [1] and long user sequences [2]."

Footnote [1] references Pinterest's prior post, "Multi-gate-Mixture-of-Experts (MMoE) model architecture and knowledge distillation in Ads Engagement modeling development", which is not ingested on the wiki.

The load-bearing Pinterest finding: MMoE delivered consistent gains only when integrated into the unified model trained on multi-surface data. Applied to a single surface in isolation, it produced "no consistent gains or unfavorable cost-gain trade-off". A plausible interpretation: MMoE's expert specialisation needs heterogeneous inputs to justify its cost. Surface-specific training data is too homogeneous for the expert routing to exploit, whereas combined multi-surface data gives the experts meaningful sub-populations to specialise over.

Why MMoE in recsys¶

Expert specialisation without task interference. Classical MTL with shared bottom layers suffers when tasks have conflicting gradients. MMoE routes tasks to task-weighted combinations of experts, letting each task "pick" the subset of shared capacity that helps it.
Soft specialisation. Each task learns a soft-routing gate — not a hard task-to-expert mapping. Experts can be shared flexibly across tasks.
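Cost scales sub-linearly in the number of tasks. The experts are shared (expert parameters ≈ N × expert_size, independent of the task count), so adding a task adds only a gate and a task tower. The saving is in parameters, not per-example compute: each task's gate is typically dense, so every example still runs through all N experts.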
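
A back-of-the-envelope parameter count illustrates the cost point. All layer sizes below are made-up assumptions (single-layer experts, a linear gate per task, a two-layer tower per task); they are not figures from Pinterest or from Ma et al.

```python
def layer_params(in_dim, out_dim):
    # Fully connected layer: weight matrix plus bias vector.
    return in_dim * out_dim + out_dim

def mmoe_params(num_tasks, num_experts, in_dim, expert_dim, tower_dim):
    experts = num_experts * layer_params(in_dim, expert_dim)  # shared by all tasks
    gates = num_tasks * layer_params(in_dim, num_experts)     # one gate per task
    towers = num_tasks * (
        layer_params(expert_dim, tower_dim) + layer_params(tower_dim, 1)
    )
    return experts + gates + towers

# Hypothetical sizes, chosen only to make the arithmetic concrete.
IN_DIM, EXPERT_DIM, TOWER_DIM, NUM_EXPERTS = 512, 256, 64, 8

shared_experts = NUM_EXPERTS * layer_params(IN_DIM, EXPERT_DIM)
two_tasks = mmoe_params(2, NUM_EXPERTS, IN_DIM, EXPERT_DIM, TOWER_DIM)
three_tasks = mmoe_params(3, NUM_EXPERTS, IN_DIM, EXPERT_DIM, TOWER_DIM)
marginal = three_tasks - two_tasks  # cost of adding one more task

# Dense gating: expert multiply-adds per example grow linearly with N.
expert_macs_per_example = NUM_EXPERTS * IN_DIM * EXPERT_DIM

print("shared expert params:", shared_experts)
print("params added per extra task:", marginal)
print("expert multiply-adds per example:", expert_macs_per_example)
```

Under these assumed sizes the shared experts hold about a million parameters while each additional task adds roughly twenty thousand, so parameter cost grows slowly with the task count even though per-example expert compute does not shrink.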

MMoE vs the LLM MoE variants¶

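Per-task gates (MMoE): the recsys/ranking default. There is one gate per task, and each gate computes soft weights over all experts from the input, so the expert mix differs from task to task.
Per-token gates (Switch, GShard, Mixtral): the LLM default. A router selects the top-k experts for each input token, so only a sparse subset of experts runs per token.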

Pinterest's MMoE is the per-task variant, aligned with the multi-task ads-ranking use case (HF CTR + SR CTR as separate tasks).

Caveats¶

Pinterest doesn't disclose expert count, expert capacity, gate architecture, dense vs sparse routing, or the knowledge-distillation mechanism hinted at in the referenced prior post's title.
MMoE in isolation didn't work per the Pinterest post — the unified-model context (multi-surface data + long sequences + surface-specific tower trees + surface-specific calibration) is load-bearing.
Distinct from LLM-era MoE. The sparse per-token MoE used in Mixtral / Switch Transformer is a different deployment shape; the cost analysis and failure modes don't transfer directly.

Seen in¶

2026-03-03 Pinterest — Unifying Ads Engagement Modeling (sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces) — canonical: MMoE in the shared trunk of a unified multi-surface ads ranking model; didn't pay off in isolation, did pay off in the unified context.
2026-01-13 Redpanda — The convergence of AI and data streaming, Part 1 (sources/2026-01-13-redpanda-the-convergence-of-ai-and-data-streaming-part-1-the-coming-brick-walls) — canonical for the frontier-LLM MoE landscape at end-2025: concrete parameter-count disclosures for GPT-4 (8 × 220B per the 2023 George Hotz leak), Gemini (MoE since 1.5, Feb 2024), Grok (MoE since Grok-1), with Anthropic Claude as the Dense Transformer holdout. Framed in the context of the batch-training boundary — dense or MoE, all frontier LLMs today are offline-batch trained.

Frontier-LLM MoE landscape (Redpanda 2026-01-13)¶

By end-2025, MoE is the dominant frontier-LLM architectural shape. Peter Corless's Redpanda post enumerates the industry:

OpenAI GPT-4 — "leaked by George Hotz (geohotz) in 2023 that OpenAI's GPT-4 was actually not a single 1.76 trillion parameter model, but 8 × 220 billion parameter models running in parallel." Per-token top-k routing (specifics undisclosed). GPT-5 / GPT-5.1 presumed same shape.
Google Gemini — "has been an MoE since 1.5" (Feb 2024).
xAI Grok — "has been an MoE since Grok-1."
Anthropic Claude — the named dense-transformer holdout: "Anthropic Claude remains a single model, known as a Dense Transformer."
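
The arithmetic behind the quoted GPT-4 figure can be checked directly. The sketch below takes the leak's numbers at face value and treats each "expert" as a full 220B-parameter model; the `active_params` helper and the values of k are hypothetical, since the routing specifics are undisclosed, and real MoE layers usually share attention weights across experts, so this is only a rough upper-level picture.

```python
NUM_EXPERTS = 8
PARAMS_PER_EXPERT = 220 * 10**9  # 220 billion, as quoted in the post

total_params = NUM_EXPERTS * PARAMS_PER_EXPERT
print("total parameters (trillions):", total_params / 10**12)

def active_params(k, params_per_expert=PARAMS_PER_EXPERT):
    # Parameters touched per token if k experts are selected (k is an assumption).
    return k * params_per_expert

for k in (1, 2):  # hypothetical routing widths, not disclosed values
    print(f"top-{k}: {active_params(k) // 10**9}B active parameters per token")
```

The total matches the 1.76 trillion figure in the quote, and the gap between total and active parameters is the reason sparse routing is attractive at this scale.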

Not covered in the Corless post: DeepSeek-MoE, Mistral's Mixtral, Qwen-MoE, Meta's Llama-MoE variants — other widely-deployed MoE LLMs that belong in the same category but aren't named.

Shared constraint across dense and MoE frontier models: both are offline-batch pre-trained. The Corless post's thesis is that this — not the dense-vs-sparse architectural choice — is the load-bearing limitation for the next wave of AI capability (Source: sources/2026-01-13-redpanda-the-convergence-of-ai-and-data-streaming-part-1-the-coming-brick-walls).

Related¶

concepts/multi-task-learning — MMoE is a specific MTL architecture.
concepts/multi-task-multi-label-ranking — the broader MTML ranking framing MMoE fits into.
concepts/dense-transformer — the frontier-LLM foil architectural shape (Claude).
concepts/frontier-model-batch-training-boundary — the shared training-shape limitation of dense and MoE frontier LLMs.
systems/transformer — the architecture primitive.
systems/pinterest-ads-engagement-model
companies/pinterest
companies/redpanda — canonical source of the frontier-LLM MoE-landscape disclosure.