CONCEPT
Muon optimizer¶
Definition¶
Muon is a neural-network optimiser popularised in 2024 by Keller Jordan and collaborators (introduced in a blog post; scaled-up results later appeared in arXiv:2502.16982), designed specifically for the hidden-layer weight matrices of Transformers and similar architectures. Its defining step is a matrix-orthogonalisation update — project the raw gradient / momentum buffer onto the nearest orthogonal matrix via a Newton-Schulz iteration before applying it to the weight. The resulting updates are better conditioned than those of vanilla SGD + momentum, and empirically reach matching or better loss than AdamW on comparable training budgets for Transformer language models.
Muon is not a drop-in replacement for every parameter class — it is specifically designed for hidden parameters (weight matrices of linear / attention / MLP layers). Other parameter classes (embeddings, layer-norm gains, biases) usually stay on AdamW. The practical recipe is Muon for hidden; AdamW for the rest.
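The hidden-vs-rest split can be sketched as a simple parameter-routing rule. This is a minimal illustration — the parameter names and the `is_hidden` heuristic below are hypothetical, not Netflix's actual MediaFM configuration:

```python
def split_param_groups(named_shapes):
    """Route 2-D hidden-layer matrices to Muon, everything else to AdamW."""
    muon, adamw = [], []
    for name, shape in named_shapes.items():
        is_matrix = len(shape) == 2
        is_embedding = "embed" in name  # embedding tables stay on AdamW
        if is_matrix and not is_embedding:
            muon.append(name)
        else:
            adamw.append(name)
    return muon, adamw

# Hypothetical Transformer block parameters (name -> tensor shape):
params = {
    "embed.weight":       (50257, 768),  # embedding       -> AdamW
    "attn.q_proj.weight": (768, 768),    # hidden matrix   -> Muon
    "mlp.fc1.weight":     (3072, 768),   # hidden matrix   -> Muon
    "ln1.weight":         (768,),        # layer-norm gain -> AdamW
    "mlp.fc1.bias":       (3072,),       # bias            -> AdamW
}
muon_params, adamw_params = split_param_groups(params)
```

In a real training loop, the two name lists would seed two optimiser instances (e.g. two `param_groups` in a PyTorch setup), each stepped every iteration.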
Canonical wiki reference — MediaFM¶
This wiki's canonical reference for Muon is Netflix's MediaFM post (Source: sources/2026-02-23-netflix-mediafm-the-multimodal-ai-foundation-for-media-understanding):
"We optimized the hidden parameters with Muon and the remaining parameters with AdamW. It's worth noting that the switch to Muon resulted in noticeable improvements."
Netflix's flag is terse but directionally strong: Muon is a differential engineering win over pure-AdamW training of the MediaFM encoder — enough that Netflix names it in a blog that otherwise omits most low-level training details. No numerical delta is reported.
Why Muon works on hidden matrices¶
The Newton-Schulz orthogonalisation step:
- Whitens update magnitudes across the singular-value spectrum of the gradient. Vanilla SGD + momentum (and to a lesser extent AdamW) can let the largest singular directions dominate updates, leaving smaller-magnitude directions underfit.
- Applies a kind of second-order preconditioning without the expensive curvature estimation of genuine second-order methods (Newton, K-FAC, Shampoo). The orthogonalisation is cheap — a few matrix multiplies per step per parameter matrix.
- Improves the effective step-size budget. In practice Muon tolerates larger learning rates than AdamW at comparable levels of training stability.
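The orthogonalisation step above can be sketched with the textbook cubic Newton-Schulz iteration. This is a readable approximation — production Muon implementations use a tuned quintic polynomial that converges in about five steps, and apply the result to the momentum buffer rather than the raw gradient:

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=12):
    """Project g onto (approximately) the nearest orthogonal matrix.

    Cubic Newton-Schulz iteration: X <- 1.5*X - 0.5*X @ X.T @ X.
    Scaling by the Frobenius norm puts every singular value in (0, 1],
    inside the iteration's basin of convergence; each step pushes the
    singular values toward 1 while leaving the singular vectors intact.
    """
    x = g / (np.linalg.norm(g) + 1e-7)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

# A deliberately ill-conditioned "update" (singular values ~3.7, 2, 0.27):
g = np.array([[3.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 1.0]])
o = newton_schulz_orthogonalize(g)
# After orthogonalisation every singular value is ~1, so the weakest
# gradient direction moves the weights as much as the dominant one.
```

Because each iteration only rescales singular values (`X = U Σ Vᵀ` maps to `U (1.5Σ − 0.5Σ³) Vᵀ`), the update's *directions* are preserved — only the magnitude imbalance across the spectrum is whitened, which is exactly the conditioning benefit described above.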
When to use Muon¶
- Transformer pre-training or fine-tuning where hidden-layer matrices dominate parameter count.
- Large-batch training — Muon's conditioning benefit compounds with the gradient-noise reduction of large batches.
- Budget-limited runs where reaching a quality bar in fewer steps matters.
When not to use Muon¶
- Small models where AdamW is already cheap and the orthogonalisation adds relative overhead.
- Non-matrix parameters (embeddings, layer-norms, biases) — keep AdamW for those; the orthogonalisation is meaningless on 1-D parameter tensors.
- Deployment environments without a Muon-compatible reference implementation — the Newton-Schulz iteration + parameter-class routing takes some plumbing.
What MediaFM doesn't disclose¶
- The specific Muon implementation / library used.
- Mix ratio of Muon vs AdamW parameter classes (beyond "hidden parameters" / "remaining parameters").
- Learning rate schedule.
- Comparison loss curves or final-metric deltas vs a pure-AdamW run.
Netflix's flag is at the "noticeable improvements" level — useful as a signal that Muon is working on production-scale Transformer training at Netflix, not as a quantitative recipe.
Relationship to adjacent optimisers¶
- AdamW — the incumbent default; Muon's natural comparison baseline.
- Shampoo / K-FAC — full-matrix second-order methods; more expensive than Muon per step, typically better conditioning per step. Muon is cheaper and closes much of the gap.
- Lion — another recent alternative to AdamW with a sign-based update; a different mechanism (no orthogonalisation).
Caveats¶
- Muon is newer than AdamW; hyperparameter intuition is less developed across the community.
- Production-scale reports (like MediaFM's) increase the confidence that Muon generalises beyond its reference papers, but the evidence base is still thinner than AdamW's.
- The "hidden params only" rule creates two optimiser states to checkpoint + configure — a minor operational cost.
Seen in¶
- sources/2026-02-23-netflix-mediafm-the-multimodal-ai-foundation-for-media-understanding — canonical wiki source; Muon named as the hidden-parameter optimiser for MediaFM, with AdamW on the remainder; Netflix flags the switch to Muon as delivering "noticeable improvements" but provides no numerical ablation.
Related¶
- systems/netflix-mediafm — canonical wiki production consumer.