
Model FLOPs Utilization (MFU)

Definition

Model FLOPs Utilization (MFU) is the ratio of the FLOPs a model's training or inference run actually performs to the peak theoretical FLOPs the hardware could deliver over the same wall-clock duration. It is the industry-standard efficiency metric for GPU / TPU workloads, a normalised measure that, unlike raw throughput, can be compared across batch sizes and sequence lengths. It was popularised by Google's 2022 PaLM paper for training and widely adopted as a serving metric thereafter. An MFU of 1.0 means the hardware's compute units are saturated with useful model arithmetic; a low MFU means the GPU is spending its time on memory loads, kernel launches, scheduling, or other non-arithmetic work (Source: sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference).

Formula

Informally:

MFU = (measured FLOPs per second achieved)
      / (hardware peak FLOPs per second)

For transformer inference, the numerator is computed per pass from the model architecture (layers × per-token attention and MLP FLOPs × tokens per pass) divided by the pass's wall-clock time. The denominator is the GPU's published peak FLOPs for the relevant dtype (fp16 / bf16 / fp8).
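The calculation can be sketched in Python using the common ≈2 × parameters FLOPs-per-token approximation for a transformer forward pass; the function, the approximation, and all numbers below are illustrative assumptions, not figures from the source:

```python
def estimate_mfu(n_params, tokens_per_pass, pass_seconds, peak_flops_per_s):
    # Rough per-pass model FLOPs via the common ~2 * params FLOPs/token
    # approximation (attention-score FLOPs omitted for brevity).
    model_flops = 2 * n_params * tokens_per_pass
    achieved_flops_per_s = model_flops / pass_seconds
    return achieved_flops_per_s / peak_flops_per_s

# Illustrative numbers: a 7B-parameter model processing 4096 tokens
# in 150 ms on hardware with a 989 TFLOP/s bf16 peak.
mfu = estimate_mfu(7e9, 4096, 0.15, 989e12)  # ~0.39
```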

Why low MFU signals wasted GPU

When MFU is low, the GPU's compute units are idle: most commonly the workload is memory-bound (waiting for weights or KV cache to load from HBM) or dominated by per-request fixed overheads (kernel launches, scheduling, mask setup, pooling, normalisation). In serving, particularly for short-request workloads, low MFU translates directly into wasted cost per inference: the GPU is provisioned at full price but used at a fraction of its capacity.
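The memory-bound condition can be checked with a roofline-style rule of thumb; this sketch and its constants are illustrative assumptions, not measurements from the source:

```python
def is_memory_bound(flops, bytes_moved, peak_flops_per_s, peak_bytes_per_s):
    # Roofline rule of thumb: a kernel is memory-bound when its
    # arithmetic intensity (FLOPs per byte moved from HBM) falls below
    # the hardware balance point, peak compute / peak bandwidth.
    intensity = flops / bytes_moved
    balance = peak_flops_per_s / peak_bytes_per_s
    return intensity < balance

# A single 32-token query through a 7B-parameter fp16 model streams
# ~14 GB of weights for only ~0.45 TFLOP of arithmetic: deep in the
# memory-bound zone. An 8192-token batch reuses the same weight load.
short_query = is_memory_bound(2 * 7e9 * 32, 1.4e10, 989e12, 3.35e12)   # True
big_batch = is_memory_bound(2 * 7e9 * 8192, 1.4e10, 989e12, 3.35e12)   # False
```

Because the weights must be streamed once per pass regardless of batch size, the arithmetic intensity here is simply proportional to the token count, which is why short requests sit so far below the balance point.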

MFU's shape vs token count

Voyage AI's 2025-12-18 profiling of query embedding inference plots MFU vs token count alongside throughput vs token count. Both scale approximately linearly with token count until reaching the saturation point, at which point MFU plateaus near its hardware-dependent peak. Most query inferences in their serving system live "in the memory-bound zone, far away from the saturation point" — MFU is low, the GPU is underused, and the inefficiency compounds over spiky query traffic.

Why batching raises MFU

Token-count batching combines many short requests into one forward pass. Each request's fixed per-pass overheads are paid once per super-sequence instead of once per request. Each request's actual arithmetic work (attention over its tokens + MLP) happens in the same weight-loaded GPU pass, so the weights are loaded once and used against many sequences' tokens. Arithmetic intensity rises, memory-bandwidth waste shrinks, and MFU rises near-linearly until the saturation point is reached.
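The amortisation argument can be made concrete with a toy cost model; all names and constants below are illustrative assumptions, not measurements from the post:

```python
def batched_pass_mfu(n_requests, tokens_per_req, flops_per_token,
                     fixed_overhead_s, weight_load_s, per_token_s,
                     peak_flops_per_s):
    # One forward pass pays the fixed per-pass overhead (kernel launches,
    # scheduling, mask setup) and one weight load regardless of how many
    # requests are packed in; only the arithmetic term scales with tokens.
    tokens = n_requests * tokens_per_req
    wall_s = fixed_overhead_s + weight_load_s + tokens * per_token_s
    return (tokens * flops_per_token) / wall_s / peak_flops_per_s

# 32-token requests, served one at a time vs. 32 packed into one pass:
solo = batched_pass_mfu(1, 32, 1.4e10, 2e-3, 4e-3, 2e-5, 989e12)
packed = batched_pass_mfu(32, 32, 1.4e10, 2e-3, 4e-3, 2e-5, 989e12)
```

In this model `packed` comes out roughly eight times higher than `solo`, because the 6 ms of fixed cost is paid once for 1,024 tokens instead of once per 32, matching the near-linear MFU growth described above.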

From the Voyage AI post:

"Batching short requests can move the inference from memory-bound to compute-bound. If we choose the saturation point in Figure 3 as the batch size (total token count in the batch), the latency and throughput/MFU can be balanced and optimized."

MFU vs throughput vs latency

  • MFU = fraction of the hardware's peak compute spent on useful model arithmetic.
  • Throughput = requests (or tokens) completed per second.
  • Latency = wall-clock for any single request.

The three move together until the saturation point: raising MFU via batching raises throughput roughly linearly (more work per second) and lowers per-request latency (amortised fixed costs, shared weight loads). Past the saturation point, MFU and throughput hold near their peaks while latency grows linearly with the batch token count. The optimal operating point is therefore the saturation point itself.
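A minimal roofline-style toy model reproduces this shape; the function and every constant below are illustrative assumptions, not Voyage AI's numbers:

```python
def pass_metrics(batch_tokens, weight_load_s, per_token_compute_s,
                 flops_per_token, peak_flops_per_s):
    # A pass takes at least the time to stream the weights from HBM;
    # arithmetic overlaps with the load until the batch is large enough
    # to become compute-bound. Returns (latency_s, tokens_per_s, mfu).
    latency_s = max(weight_load_s, batch_tokens * per_token_compute_s)
    tokens_per_s = batch_tokens / latency_s
    mfu = batch_tokens * flops_per_token / latency_s / peak_flops_per_s
    return latency_s, tokens_per_s, mfu

# Saturation knee at weight_load_s / per_token_compute_s = 250 tokens:
# below it, latency is flat while throughput and MFU grow linearly;
# above it, throughput and MFU plateau while latency grows with tokens.
below = pass_metrics(50, 4e-3, 1.6e-5, 1.4e10, 989e12)
knee = pass_metrics(250, 4e-3, 1.6e-5, 1.4e10, 989e12)
above = pass_metrics(500, 4e-3, 1.6e-5, 1.4e10, 989e12)
```

Running the three cases shows the latency of `below` and `knee` identical while their MFU differs fivefold, and `above` matching `knee`'s MFU at double the latency, which is why the knee is the balanced operating point.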
