CONCEPT Cited by 1 source
Selective FP8 quantization¶
Definition¶
Selective FP8 quantization is a post-training quantization strategy that applies FP8 precision only to model layers with high precision-loss tolerance, leaving precision-sensitive layers at higher precision (typically BF16 or FP16). Layer selection is driven by a micro-benchmark-guided mechanism that measures per-layer quality degradation against a task metric (Source: sources/2026-03-31-meta-adaptive-ranking-model-bending-the-inference-scaling-curve).
It is the operational alternative to naive full-FP8 casts, which buy peak throughput at a quality cost that rank-sensitive domains (ads, search, recsys) cannot absorb.
The problem it solves¶
FP8 on modern accelerators (Hopper / Blackwell / MI300) offers ~2× throughput over BF16 and halves memory pressure. For training or chat-LLM inference, blanket FP8 often works. For ranking models, Meta finds the opposite:
"Large-scale models necessitate reduced precision to maintain high-throughput inference, yet a blanket application of low-precision quantization often degrades the nuance required for complex ads ranking."
Ranking-model quality is extremely sensitive to small score deltas because the final ordering is what matters. A 0.1% NLL regression that would be invisible in a chatbot can collapse CTR / conversion lift. Blanket FP8 is out; doing nothing sacrifices the throughput win.
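The sensitivity claim can be made concrete with a toy example. Everything below is invented for illustration — the weights, features, and rounding step are not Meta's, and the coarse rounding is a crude stand-in for FP8, not real E4M3 — but it shows the mechanism: a weight perturbation far too small to matter for a generative metric flips the ordering of two near-tied candidates.

```python
def quantize(x, step=1 / 16):
    """Crude low-precision rounding; a stand-in for FP8, not real E4M3."""
    return round(x / step) * step

def score(weights, features):
    """Toy linear ranking head: dot product of weights and features."""
    return sum(w * f for w, f in zip(weights, features))

w = [0.33, -0.27, 0.41]        # full-precision weights
wq = [quantize(v) for v in w]  # weights after coarse rounding

ad_a = [1.1, 1.0, 0.6]  # two candidates whose true scores are near-tied
ad_b = [0.6, 1.0, 1.0]

print(score(w, ad_a) > score(w, ad_b))    # True: A outranks B at full precision
print(score(wq, ad_a) > score(wq, ad_b))  # False: rounding error flips the order
```

The absolute score shift is tiny, but because only the ordering is consumed downstream, the regression is total for this pair — which is why ranking models cannot absorb blanket low-precision casts.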
How selective FP8 works¶
- Micro-benchmark per-layer — measure the quality impact of quantizing each layer independently to FP8, holding the rest at base precision.
- Identify high-tolerance layers — those where FP8 degrades the task metric by less than a threshold.
- Deploy FP8 only on the tolerant set — other layers stay at BF16 / FP16.
Meta's wording: "Using a micro-benchmark guided selection mechanism, the system deploys FP8 only in layers with high precision-loss tolerance. This targeted approach unlocks the throughput benefits of modern heterogeneous hardware for our most complex models with negligible impact on recommendation quality."
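The three steps above can be sketched as a selection loop. This is a minimal sketch assuming a lower-is-better metric such as NLL; the layer names, per-layer penalties, and tolerance are invented, since Meta discloses neither the micro-benchmark metric nor the cutoff:

```python
def select_fp8_layers(layer_names, evaluate, tolerance):
    """Return layers whose solo-FP8 trial degrades the metric by < tolerance.

    `evaluate(fp8_layers)` runs the held-out eval with exactly that set of
    layers quantized to FP8 and returns a lower-is-better metric.
    """
    baseline = evaluate(frozenset())  # all layers at base precision
    tolerant = []
    for name in layer_names:
        delta = evaluate(frozenset({name})) - baseline  # quantize one layer
        if delta < tolerance:
            tolerant.append(name)
    return tolerant

# Toy stand-in for the eval harness: invented per-layer NLL penalties.
PENALTY = {"embedding": 0.0001, "attn_0": 0.005, "mlp_0": 0.0002, "head": 0.02}

def toy_eval(fp8_layers):
    return 0.30 + sum(PENALTY[n] for n in fp8_layers)

print(select_fp8_layers(PENALTY, toy_eval, tolerance=0.001))
# → ['embedding', 'mlp_0']
```

Note the sketch scores layers independently; interaction effects between simultaneously quantized layers would need a confirmation run on the final tolerant set, which the post does not detail.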
Why it's post-training (not QAT)¶
The post describes post-training quantization (PTQ): no retraining of the model in FP8 is required. The layer-selection decisions are made on an already-trained model, which makes the technique cheap to deploy: calibrate on a held-out eval set, emit a per-layer precision map, ship.
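The deployable artifact of that calibration step is just a per-layer precision map. The layer names and serialization format below are hypothetical, not Meta's:

```python
import json

# Hypothetical calibration output: each layer mapped to the precision it
# will run at. "fp8_e4m3" marks the tolerant set; the rest stay at BF16.
precision_map = {
    "embedding": "fp8_e4m3",
    "attn_0": "bf16",
    "mlp_0": "fp8_e4m3",
    "head": "bf16",
}

payload = json.dumps(precision_map, indent=2)
print(payload)  # what would be shipped alongside the checkpoint
```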
Quantization-aware training (QAT) is a more invasive alternative that would retrain the model with FP8 noise in the forward pass; Meta's choice of PTQ is deployment-friendly but constrained by how much FP8 error a model that was never trained under it can tolerate.
Relationship to hardware-aware model architecture¶
Selective FP8 is a hardware-aware choice: it exists because modern GPUs ship Tensor Cores with dedicated FP8 paths that are faster than BF16 — the hardware exposes a tier the software can opportunistically hit. On hardware without fast FP8 (older generations), the technique isn't worth the engineering.
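A trivial gate expresses that deployment logic. The capability set is an assumption drawn from public specs (FP8 tensor cores shipped with Hopper-, Blackwell-, and CDNA3-class parts, but not, e.g., A100), not something the post enumerates:

```python
# Assumed set of accelerators with fast FP8 tensor-core paths; extend or
# correct for your actual fleet.
FP8_CAPABLE = {"H100", "H200", "B200", "MI300X"}

def pick_precision_path(device_name: str) -> str:
    """Route to selective FP8 only where the hardware makes it pay off."""
    return "selective_fp8" if device_name in FP8_CAPABLE else "bf16_only"

print(pick_precision_path("H100"))  # → selective_fp8
print(pick_precision_path("A100"))  # → bf16_only (no fast FP8 path)
```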
Seen in¶
- 2026-03-31 Meta — Meta Adaptive Ranking Model — canonical wiki source; describes selective FP8 as one of two model-system co-design levers (the other being graph + kernel specialisation) that together drive MFU to 35% across heterogeneous hardware (sources/2026-03-31-meta-adaptive-ranking-model-bending-the-inference-scaling-curve).
Caveats¶
- Meta does not disclose the micro-benchmark metric (loss delta? downstream recommendation quality? task-specific KPI?) or the cutoff threshold.
- Number / percentage of layers that land on FP8 is not disclosed.
- Hardware mix — which accelerators are running the FP8 path — is not named (H100 / B200 / MI300X all have FP8 support).
- The alternative of QAT is not evaluated in the post.