MFU (Model FLOPS Utilization)¶
Definition¶
MFU (Model FLOPS Utilization) is a GPU-training efficiency metric: the ratio of observed model FLOPS achieved to the theoretical peak FLOPS of the hardware. Expressed as a percentage, it captures how close a real training workload gets to the silicon's ceiling, factoring in the model architecture's FLOP count per step and the observed step latency.
MFU is the de-facto efficiency metric tracked alongside loss during LLM training runs — if loss is how well the model is learning, MFU is how efficiently the cluster is achieving that learning.
First canonical wiki reference: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix.
How it's computed (shape)¶
MFU = (model_flops_per_step / step_time) / (num_gpus × peak_flops_per_gpu)

Where:
- model_flops_per_step is computed from the architecture (layer count, hidden dim, FFN dim, attention shape, sequence length, batch size). For a transformer it is roughly 6 × N × D_tokens for the forward+backward pass of an N-parameter model over D_tokens tokens.
- step_time is the observed wall-clock latency of one training step.
- peak_flops_per_gpu is the vendor-published theoretical peak for the precision in use (BF16, FP8, etc.).
- num_gpus × peak_flops_per_gpu is the cluster's aggregate ceiling.
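The ratio above can be sketched in a few lines of Python. The FLOP count uses the 6 × N × D_tokens transformer approximation; every number below (parameter count, tokens per step, step time, GPU count) is an illustrative assumption, not a value from the Netflix post. The peak figure is the published H100 SXM BF16 spec.

```python
def mfu(model_flops_per_step: float, step_time_s: float,
        num_gpus: int, peak_flops_per_gpu: float) -> float:
    """MFU = achieved model FLOPS / aggregate theoretical peak FLOPS."""
    achieved = model_flops_per_step / step_time_s   # FLOPS actually executed per second
    ceiling = num_gpus * peak_flops_per_gpu         # cluster's aggregate hardware ceiling
    return achieved / ceiling

# Illustrative workload (assumptions, not from the post):
# a 7B-parameter dense model seeing 4M tokens per step.
n_params = 7e9
tokens_per_step = 4e6
flops_per_step = 6 * n_params * tokens_per_step  # forward + backward

# 64 GPUs at 989e12 BF16 peak FLOPS (H100 SXM), 3.5 s per step.
print(f"MFU = {mfu(flops_per_step, 3.5, 64, 989e12):.1%}")  # -> MFU = 75.8%
```

Tracking this number per step is what lets a regression (a bad sharding change, a padding-heavy batch) show up immediately as a drop in the reported percentage.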
The subtlety Netflix specifically calls out is that MFU must remain accurate under custom architectures and LoRA: the FLOP accounting has to correctly cover non-standard components (MoE routers, custom projection heads, semantic-ID tokenizer paths) and low-rank adapters, which change the trainable-parameter count but not in proportion to the compute actually executed.
Role in Netflix's framework¶
Per the Compute pillar of Netflix's Post-Training Framework:
"MFU (Model FLOPS Utilization) monitoring that remains accurate under custom architectures and LoRA"
Two reasons this matters:
- Experiment tracking. Model quality (loss) and training efficiency (MFU) are the two first-class metrics surfaced to model developers. Knowing a change improved loss is only useful if you also know it didn't destroy throughput.
- Performance regression detection. When the framework auto-applies vocab padding or introduces new sharding strategies, MFU is the signal that confirms (or denies) the win.
MFU-affecting wins documented in the Netflix post¶
- Async sequence packing: up to 4.7× effective token throughput on the most skewed dataset → directly increases MFU by eliminating padding waste and straggler stalls.
- Vocab padding to multiples of 64: avoids a 3× LM-head slowdown from cuBLAS → CUTLASS fallback → directly preserves MFU on vocab-expanded workloads.
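The vocab-padding fix is simple to express. This sketch (the function name is mine, not Netflix's) rounds the vocabulary up to the next multiple of 64 so the LM-head GEMM keeps a dimension alignment that stays on the fast cuBLAS path:

```python
def pad_vocab(vocab_size: int, multiple: int = 64) -> int:
    """Round vocab_size up to the nearest multiple (64 keeps GEMM dims aligned)."""
    return ((vocab_size + multiple - 1) // multiple) * multiple

# A vocab-expanded model with an awkward size gets padded up:
print(pad_vocab(128_003))  # -> 128064
# Already-aligned sizes are untouched:
print(pad_vocab(128_000))  # -> 128000
```

The padded rows are dead weight in the embedding and LM head, but that small memory cost buys back the 3× LM-head slowdown the post describes.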
Why accuracy-under-LoRA is a non-trivial requirement¶
Standard MFU calculators assume a dense forward/backward over all N model parameters. Under LoRA, the trainable parameters are the low-rank adapters (<< N), but the forward pass still runs through the full N-parameter model. Naive MFU calculators can either:
- Undercount (only count LoRA-param FLOPS) → understates the FLOPS actually executed, so MFU reads artificially low.
- Overcount (treat the full model as trained end-to-end) → overstates the FLOPS executed, so MFU reads artificially high.
Getting MFU right under LoRA means counting forward + LoRA-backward correctly — Netflix calls this out as a framework property it preserves.
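A minimal sketch of the accounting distinction follows. The split used here (2·N·D forward; backward ≈ 2·N·D for activation gradients through the frozen weights plus ~2·N_lora·D for adapter weight gradients) is my own simplification of common FLOP-counting conventions, not Netflix's actual implementation:

```python
def dense_train_flops(n_params: float, n_tokens: float) -> float:
    # Standard approximation: 2*N*D forward + 4*N*D backward = 6*N*D.
    return 6 * n_params * n_tokens

def lora_train_flops(n_params: float, n_lora: float, n_tokens: float) -> float:
    # Forward still touches all N frozen + adapter params: ~2*N*D.
    fwd = 2 * n_params * n_tokens
    # Activation gradients still propagate through the frozen weights (~2*N*D),
    # but weight gradients are computed only for the adapters (~2*N_lora*D).
    bwd = 2 * n_params * n_tokens + 2 * n_lora * n_tokens
    return fwd + bwd

# Illustrative: 7B base model, 40M adapter params, 4M tokens per step.
n, n_lora, d = 7e9, 40e6, 4e6
# Naive "only LoRA params" counting would claim 6 * n_lora * d FLOPs,
# orders of magnitude below what the hardware actually executed.
print(lora_train_flops(n, n_lora, d) / dense_train_flops(n, d))  # ~0.67 of dense
```

Feeding the LoRA-aware count into the MFU formula is what keeps the metric comparable between full-finetune and adapter runs.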