CONCEPT Cited by 1 source
Mixture of Experts (MoE / MMoE)¶
Definition¶
Mixture of Experts (MoE) is a neural-network architecture pattern in which a set of specialist subnetworks ("experts") processes inputs, and a gating network routes each input (or each example, or each task) to a subset of experts whose outputs are combined. Under MMoE (Multi-gate Mixture of Experts) — the recsys-specific variant introduced by Ma et al. (Google, 2018) — per-task gates learn independent soft routing over the experts, letting each task develop its own pattern of expert usage while the experts themselves remain shared.
```
       (input features)
               │
    ┌──────────┼──────────┐
    ▼          ▼          ▼
Expert 1   Expert 2   Expert N
    │          │          │
    └──────────┼──────────┘
               │
       (expert outputs)
               │
      ┌────────┼────────┐
      ▼        ▼        ▼
   gate A   gate B   gate C
      │        │        │
   Task A   Task B   Task C
```
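The forward pass sketched in the diagram is a few lines of numpy. This is an illustrative toy, not Pinterest's implementation: all layer sizes, the single-linear-layer experts, and the random weights are made-up assumptions.

```python
# Minimal dense MMoE forward pass: shared experts, one softmax gate per task.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_expert, n_experts, n_tasks = 8, 4, 3, 2

# Shared experts — each expert is a single linear layer here for brevity.
W_experts = rng.normal(size=(n_experts, d_in, d_expert))
# One gate per task, mapping input features to one weight per expert.
W_gates = rng.normal(size=(n_tasks, d_in, n_experts))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mmoe_forward(x):
    # x: (batch, d_in) → all experts run once, shared across tasks.
    expert_out = np.einsum("bi,eio->beo", x, W_experts)  # (batch, n_experts, d_expert)
    task_reps = []
    for t in range(n_tasks):
        g = softmax(x @ W_gates[t])                      # (batch, n_experts), rows sum to 1
        task_reps.append(np.einsum("be,beo->bo", g, expert_out))
    return task_reps  # one (batch, d_expert) representation per task tower

x = rng.normal(size=(5, d_in))
reps = mmoe_forward(x)
```

Note the key property: the experts run once per input, and each task pays only for its own gate plus a weighted sum of the shared expert outputs.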
MoE variants differ in:
- Routing granularity — per-token (LLMs: Switch Transformer, Mixtral), per-example (recsys), per-task (MMoE).
- Sparsity — top-1, top-k, or dense (all experts used with soft weights).
- Gate architecture — softmax, noisy top-k, learned vs heuristic.
Canonical wiki instance — Pinterest ads engagement model¶
Pinterest uses MMoE as one of the key architectural elements in the shared trunk of its unified ads engagement model (Source: sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces):
"We incorporated key architectural elements from each surface such as MMoE [1] and long user sequences [2]."
Footnote [1] references Pinterest's prior post Multi-gate-Mixture-of-Experts (MMoE) model architecture and knowledge distillation in Ads Engagement modeling development — not ingested on the wiki.
The load-bearing Pinterest finding: MMoE only delivered consistent gains when integrated into the unified model trained on multi-surface data. Applying MMoE to a single surface in isolation produced "no consistent gains or unfavorable cost-gain trade-off". The interpretation: MMoE's expert specialisation needs heterogeneous inputs to justify its cost — surface-specific training data is too homogeneous to exploit the expert routing effectively, but combined multi-surface data gives the experts meaningful sub-populations to specialise over.
Why MMoE in recsys¶
- Expert specialisation without task interference. Classical MTL with shared bottom layers suffers when tasks have conflicting gradients. MMoE routes tasks to task-weighted combinations of experts, letting each task "pick" the subset of shared capacity that helps it.
- Soft specialisation. Each task learns a soft-routing gate — not a hard task-to-expert mapping. Experts can be shared flexibly across tasks.
- Cost scales sub-linearly with the number of tasks. Experts are shared (total expert parameters ≈ N × expert_size) and adding a task adds only a small gate; each task typically uses all experts with learned weights — dense, not sparse, at the task gate.
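A back-of-envelope parameter count makes the sub-linear task scaling concrete. All sizes here are hypothetical, chosen only to show the ratio between shared-expert cost and per-task gate cost:

```python
# Hypothetical sizes: 4 experts, 256-dim input, 128-dim expert output,
# experts modelled as single linear layers (biases ignored).
d_in, d_expert, n_experts = 256, 128, 4

expert_params = n_experts * d_in * d_expert   # shared across all tasks: 131072
gate_params_per_task = d_in * n_experts       # one softmax gate per task: 1024

def mmoe_params(n_tasks):
    return expert_params + n_tasks * gate_params_per_task

# Going from 2 to 3 tasks adds only the ~1k gate parameters,
# on top of the ~131k shared expert parameters.
print(mmoe_params(2), mmoe_params(3) - mmoe_params(2))  # → 133120 1024
```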
MMoE vs the LLM MoE variants¶
- Per-task gates (MMoE) — recsys/ranking default. Gates select expert combinations per task.
- Per-token gates (Switch, GShard, Mixtral) — LLM default. Gates select top-k experts per input token, typically sparse.
Pinterest's MMoE is the per-task variant, aligned with the multi-task ads-ranking use case (HF CTR + SR CTR as separate tasks).
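For contrast with the dense per-task gate above, the per-token LLM variant keeps only the k largest gate logits per token and renormalises, so each token activates k of N experts. This is a sketch of the top-k mechanism in general, not of any specific model's router:

```python
# Per-token top-k sparse gating: mask all but the k largest logits per token,
# then softmax over the survivors.
import numpy as np

rng = np.random.default_rng(1)
n_tokens, n_experts, k = 6, 8, 2

logits = rng.normal(size=(n_tokens, n_experts))
topk_idx = np.argsort(logits, axis=-1)[:, -k:]  # indices of top-k experts per token

# Keep top-k logits, send the rest to -inf so softmax zeroes them out.
sparse = np.full_like(logits, -np.inf)
np.put_along_axis(sparse, topk_idx,
                  np.take_along_axis(logits, topk_idx, axis=-1), axis=-1)

weights = np.exp(sparse - sparse.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Exactly k non-zero routing weights per token; the other experts are skipped,
# which is where the compute savings of sparse MoE come from.
assert (weights > 0).sum(axis=-1).tolist() == [k] * n_tokens
```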
Caveats¶
- Pinterest doesn't disclose expert count, expert capacity, gate architecture, dense vs sparse routing, or the knowledge-distillation mechanism hinted at in the referenced prior post's title.
- MMoE in isolation didn't work per the Pinterest post — the unified-model context (multi-surface data + long sequences + surface-specific tower trees + surface-specific calibration) is load-bearing.
- Distinct from LLM-era MoE. The sparse per-token MoE of Mixtral / Switch Transformer is a different deployment shape; the cost analysis and failure modes don't transfer directly.
Seen in¶
- 2026-03-03 Pinterest — Unifying Ads Engagement Modeling (sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces) — canonical: MMoE in the shared trunk of a unified multi-surface ads ranking model; didn't pay off in isolation, did pay off in the unified context.
Related¶
- concepts/multi-task-learning — MMoE is a specific MTL architecture.
- concepts/multi-task-multi-label-ranking — the broader MTML ranking framing MMoE fits into.
- systems/pinterest-ads-engagement-model
- companies/pinterest