CONCEPT Cited by 1 source

Mixture of Experts (MoE / MMoE)

Definition

Mixture of Experts (MoE) is a neural network architecture pattern in which a set of specialist subnetworks ("experts") process inputs, and a gating network routes each input (or each example, or each task) to a subset of experts whose outputs are combined. Under MMoE (Multi-gate Mixture of Experts) — the multi-task variant introduced by Ma et al. (Google, KDD 2018) and now a recsys staple — per-task gates learn independent soft routing over the shared experts, letting each task develop its own pattern of expert usage while the experts themselves remain shared.

         (input features)
     ┌──────────┼──────────┐
     ▼          ▼          ▼
  Expert 1   Expert 2   Expert N
     │          │          │
     └──────────┼──────────┘
           (expert outputs)
       ┌────────┼────────┐
       ▼        ▼        ▼
   gate A   gate B   gate C
       │        │        │
    Task A   Task B   Task C
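The diagram's routing can be sketched in a few lines of NumPy. This is a toy sketch — single-linear-layer experts and made-up dimensions, not any production architecture — but it shows the defining MMoE move: one shared pool of experts, one softmax gate per task.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_in, d_expert, n_experts, n_tasks, batch = 16, 8, 4, 3, 5

# Shared experts: each reduced to a single linear map for brevity.
W_experts = rng.normal(size=(n_experts, d_in, d_expert))
# One gate per task — the "multi-gate" in MMoE.
W_gates = rng.normal(size=(n_tasks, d_in, n_experts))

x = rng.normal(size=(batch, d_in))

# Dense routing: every expert processes every input.
expert_out = np.einsum('bi,eio->beo', x, W_experts)   # (batch, n_experts, d_expert)

task_inputs = []
for t in range(n_tasks):
    gate = softmax(x @ W_gates[t])                    # (batch, n_experts), rows sum to 1
    # Each task's tower consumes its own gated mixture of the shared experts.
    task_inputs.append(np.einsum('be,beo->bo', gate, expert_out))
```

Each entry of `task_inputs` then feeds that task's tower; the gates are what let Task A and Task B weight the same experts differently.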

MoE variants differ in:

  • Routing granularity — per-token (LLMs: Switch Transformer, Mixtral), per-example (recsys), per-task (MMoE).
  • Sparsity — top-1, top-k, or dense (all experts used with soft weights).
  • Gate architecture — softmax, noisy top-k, learned vs heuristic.

Canonical wiki instance — Pinterest ads engagement model

Pinterest uses MMoE as one of the key architectural elements in the shared trunk of its unified ads engagement model (Source: sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces):

"We incorporated key architectural elements from each surface such as MMoE [1] and long user sequences [2]."

Footnote [1] references Pinterest's prior post Multi-gate-Mixture-of-Experts (MMoE) model architecture and knowledge distillation in Ads Engagement modeling development — not ingested on the wiki.

The load-bearing Pinterest finding: MMoE only delivered consistent gains when integrated into the unified model trained on multi-surface data. Applying MMoE to a single surface in isolation produced "no consistent gains or unfavorable cost-gain trade-off". The interpretation: MMoE's expert specialisation needs heterogeneous inputs to justify its cost — surface-specific training data is too homogeneous to exploit the expert routing effectively, but combined multi-surface data gives the experts meaningful sub-populations to specialise over.

Why MMoE in recsys

  • Expert specialisation without task interference. Classical MTL with shared bottom layers suffers when tasks have conflicting gradients. MMoE routes tasks to task-weighted combinations of experts, letting each task "pick" the subset of shared capacity that helps it.
  • Soft specialisation. Each task learns a soft-routing gate — not a hard task-to-expert mapping. Experts can be shared flexibly across tasks.
  • Cost scales sub-linearly in task count. Experts are shared (total expert parameters ≈ N × expert_size), and adding a task adds only a small gate plus a tower — not another full network. Each task typically uses all experts with learned soft weights: dense at the task gate, not sparse.
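The cost argument in the last bullet is easy to make concrete. A sketch with illustrative sizes (invented for this example, not Pinterest's):

```python
# Toy parameter accounting: MMoE vs one separate network per task.
d_in, d_expert, n_experts, n_tasks = 256, 128, 8, 3

shared_experts = n_experts * d_in * d_expert   # paid once, shared by all tasks
per_task_gate  = d_in * n_experts              # tiny: one softmax gate per task

mmoe_total     = shared_experts + n_tasks * per_task_gate
separate_total = n_tasks * shared_experts      # replicating the expert capacity per task

# Adding a task to MMoE costs one more gate (2,048 params here),
# not another full expert pool (262,144 params here).
print(mmoe_total, separate_total)
```

(Tower parameters are omitted on both sides since every formulation pays for them per task.)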

MMoE vs the LLM MoE variants

  • Per-task gates (MMoE) — recsys/ranking default. Gates select expert combinations per task.
  • Per-token gates (Switch, GShard, Mixtral) — LLM default. Gates select top-k experts per input token, typically sparse.
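For contrast with the dense per-task gates above, the per-token sparse variant can be sketched as a top-k gate: keep the k largest gate logits per token, renormalise, and zero the rest. A minimal NumPy sketch (the real Switch/Mixtral routers add noise, load-balancing losses, and capacity limits, all omitted here):

```python
import numpy as np

def top_k_route(logits, k):
    """Per-token sparse routing: softmax over only the top-k logits per row."""
    idx = np.argsort(logits, axis=-1)[:, -k:]        # indices of the k largest logits
    mask = np.zeros_like(logits)
    np.put_along_axis(mask, idx, 1.0, axis=-1)
    masked = np.where(mask > 0, logits, -np.inf)     # -inf -> weight 0 after softmax
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
tokens, n_experts, k = 6, 8, 2
weights = top_k_route(rng.normal(size=(tokens, n_experts)), k)
# Exactly k experts receive nonzero weight per token; the other N - k are skipped,
# which is where the compute savings of sparse LLM MoE come from.
```

Per-task MMoE instead computes a dense softmax over all experts, once per task rather than once per token.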

Pinterest's MMoE is the per-task variant, aligned with the multi-task ads-ranking use case (HF CTR + SR CTR as separate tasks).

Caveats

  • Pinterest doesn't disclose expert count, expert capacity, gate architecture, dense vs sparse routing, or the knowledge-distillation mechanism hinted at in the referenced prior post's title.
  • MMoE in isolation didn't work per the Pinterest post — the unified-model context (multi-surface data + long sequences + surface-specific tower trees + surface-specific calibration) is load-bearing.
  • Distinct from LLM-era MoE. Sparse per-token MoE as in Mixtral / Switch Transformer is a different deployment shape; its cost analysis and failure modes don't transfer directly.
