MFU (Model FLOPS Utilization)¶
Definition¶
MFU (Model FLOPS Utilization) is a GPU-training efficiency metric: the ratio of observed model FLOPS achieved to the theoretical peak FLOPS of the hardware. Expressed as a percentage, it captures how close a real training workload gets to the silicon's ceiling, factoring in the model architecture's FLOP count per step and the observed step latency.
MFU is the de-facto efficiency metric tracked alongside loss during LLM training runs — if loss is how well the model is learning, MFU is how efficiently the cluster is achieving that learning.
First canonical wiki reference: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix.
How it's computed (shape)¶
MFU = (model_flops_per_step / step_time) / (num_gpus × peak_flops_per_gpu)

Where:
- model_flops_per_step is computed from the architecture (layer count, hidden dim, FFN dim, attention shape, sequence length, batch size). For a transformer it is roughly 6 × N × D_tokens for the forward+backward pass of an N-parameter model over D_tokens tokens.
- step_time is the observed wall-clock latency of one training step.
- peak_flops_per_gpu is the vendor-published theoretical peak for the precision in use (BF16, FP8, etc.).
- num_gpus × peak_flops_per_gpu is the cluster's aggregate ceiling.
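The ratio above can be sketched in a few lines of Python. The FLOP count uses the 6 × N × D_tokens transformer approximation; every number below (parameter count, tokens per step, step time, GPU count) is an illustrative assumption, not a value from the Netflix post. The peak figure is the published H100 SXM BF16 spec.

```python
def mfu(model_flops_per_step: float, step_time_s: float,
        num_gpus: int, peak_flops_per_gpu: float) -> float:
    """MFU = achieved model FLOPS / aggregate theoretical peak FLOPS."""
    achieved = model_flops_per_step / step_time_s   # FLOPS actually executed per second
    ceiling = num_gpus * peak_flops_per_gpu         # cluster's aggregate hardware ceiling
    return achieved / ceiling

# Illustrative workload (assumptions, not from the post):
# a 7B-parameter dense model seeing 4M tokens per step.
n_params = 7e9
tokens_per_step = 4e6
flops_per_step = 6 * n_params * tokens_per_step  # forward + backward

# 64 GPUs at 989e12 BF16 peak FLOPS (H100 SXM), 3.5 s per step.
print(f"MFU = {mfu(flops_per_step, 3.5, 64, 989e12):.1%}")  # -> MFU = 75.8%
```

Tracking this number per step is what lets a regression (a bad sharding change, a padding-heavy batch) show up immediately as a drop in the reported percentage.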
The subtlety Netflix specifically calls out is that MFU must remain accurate under custom architectures and LoRA: the FLOP accounting has to correctly cover non-standard components (MoE routers, custom projection heads, semantic-ID tokenizer paths) and low-rank adapters, which change the trainable-parameter count but not in proportion to the compute actually executed.
Role in Netflix's framework¶
Per the Compute pillar of Netflix's Post-Training Framework:
"MFU (Model FLOPS Utilization) monitoring that remains accurate under custom architectures and LoRA"
Two reasons this matters:
- Experiment tracking. Model quality (loss) and training efficiency (MFU) are the two first-class metrics surfaced to model developers. Knowing a change improved loss is only useful if you also know it didn't destroy throughput.
- Performance regression detection. When the framework auto-applies vocab padding or introduces new sharding strategies, MFU is the signal that confirms (or denies) the win.
MFU-affecting wins documented in the Netflix post¶
- Async sequence packing: up to 4.7× effective token throughput on the most skewed dataset → directly increases MFU by eliminating padding waste and straggler stalls.
- Vocab padding to multiples of 64: avoids a 3× LM-head slowdown from cuBLAS → CUTLASS fallback → directly preserves MFU on vocab-expanded workloads.
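The vocab-padding fix is simple to express. This sketch (the function name is mine, not Netflix's) rounds the vocabulary up to the next multiple of 64 so the LM-head GEMM keeps a dimension alignment that stays on the fast cuBLAS path:

```python
def pad_vocab(vocab_size: int, multiple: int = 64) -> int:
    """Round vocab_size up to the nearest multiple (64 keeps GEMM dims aligned)."""
    return ((vocab_size + multiple - 1) // multiple) * multiple

# A vocab-expanded model with an awkward size gets padded up:
print(pad_vocab(128_003))  # -> 128064
# Already-aligned sizes are untouched:
print(pad_vocab(128_000))  # -> 128000
```

The padded rows are dead weight in the embedding and LM head, but that small memory cost buys back the 3× LM-head slowdown the post describes.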
Why accuracy-under-LoRA is a non-trivial requirement¶
Standard MFU calculators assume a dense forward/backward over all N model parameters. Under LoRA, the trainable parameters are the low-rank adapters (<< N), but the forward pass still runs through the full N-parameter model. Naive MFU calculators can either:
- Undercount (only count LoRA-param FLOPS) → understates the FLOPS actually executed, so MFU reads artificially low.
- Overcount (treat the full model as trained end-to-end) → overstates the FLOPS executed, so MFU reads artificially high.
Getting MFU right under LoRA means counting forward + LoRA-backward correctly — Netflix calls this out as a framework property it preserves.
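A minimal sketch of the accounting distinction follows. The split used here (2·N·D forward; backward ≈ 2·N·D for activation gradients through the frozen weights plus ~2·N_lora·D for adapter weight gradients) is my own simplification of common FLOP-counting conventions, not Netflix's actual implementation:

```python
def dense_train_flops(n_params: float, n_tokens: float) -> float:
    # Standard approximation: 2*N*D forward + 4*N*D backward = 6*N*D.
    return 6 * n_params * n_tokens

def lora_train_flops(n_params: float, n_lora: float, n_tokens: float) -> float:
    # Forward still touches all N frozen + adapter params: ~2*N*D.
    fwd = 2 * n_params * n_tokens
    # Activation gradients still propagate through the frozen weights (~2*N*D),
    # but weight gradients are computed only for the adapters (~2*N_lora*D).
    bwd = 2 * n_params * n_tokens + 2 * n_lora * n_tokens
    return fwd + bwd

# Illustrative: 7B base model, 40M adapter params, 4M tokens per step.
n, n_lora, d = 7e9, 40e6, 4e6
# Naive "only LoRA params" counting would claim 6 * n_lora * d FLOPs,
# orders of magnitude below what the hardware actually executed.
print(lora_train_flops(n, n_lora, d) / dense_train_flops(n, d))  # ~0.67 of dense
```

Feeding the LoRA-aware count into the MFU formula is what keeps the metric comparable between full-finetune and adapter runs.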