CONCEPT

Selective FP8 quantization

Definition

Selective FP8 quantization is a post-training quantization strategy that applies FP8 precision only to model layers with high precision-loss tolerance, leaving precision-sensitive layers at higher precision (typically BF16 or FP16). Layer selection is driven by a micro-benchmark-guided mechanism that measures per-layer quality degradation against a task metric (Source: sources/2026-03-31-meta-adaptive-ranking-model-bending-the-inference-scaling-curve).

It is the operational alternative to naive full-FP8 casts, which buy peak throughput at a quality cost that rank-sensitive domains (ads, search, recsys) cannot absorb.

The problem it solves

FP8 on modern accelerators (Hopper / Blackwell / MI300) offers ~2× throughput over BF16 and halves memory pressure. For training or chat-LLM inference, blanket FP8 often works. For ranking models, Meta finds the opposite:

"Large-scale models necessitate reduced precision to maintain high-throughput inference, yet a blanket application of low-precision quantization often degrades the nuance required for complex ads ranking."

Ranking-model quality is extremely sensitive to small score deltas because the final ordering is what matters. A 0.1% NLL regression that would be invisible in a chatbot can collapse CTR / conversion lift. Blanket FP8 is out; doing nothing sacrifices the throughput win.
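A toy illustration of the point above (the scores and noise magnitudes are invented, not from the post): when two candidates' predicted scores differ by less than the quantization-induced error, the final ordering flips even though the per-score regression looks negligible.

```python
# Two ads whose predicted CTRs differ by less than a plausible low-precision
# rounding error. Values are illustrative, not from Meta's post.
scores_bf16 = {"ad_a": 0.03127, "ad_b": 0.03121}  # true ordering: ad_a first

# Simulate a small per-score perturbation from aggressive quantization.
noise = {"ad_a": -0.0004, "ad_b": +0.0004}
scores_fp8 = {k: v + noise[k] for k, v in scores_bf16.items()}

def rank(scores):
    return sorted(scores, key=scores.get, reverse=True)

print(rank(scores_bf16))  # ['ad_a', 'ad_b']
print(rank(scores_fp8))   # ['ad_b', 'ad_a'] -- ordering flipped by a ~1% score delta
```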

How selective FP8 works

  1. Micro-benchmark per layer — measure the quality impact of quantizing each layer independently to FP8, holding the rest at base precision.
  2. Identify high-tolerance layers — those where FP8 degrades the task metric by less than a threshold.
  3. Deploy FP8 only on the tolerant set — other layers stay at BF16 / FP16.
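The three steps above can be sketched as a selection loop. Everything here is an illustrative assumption — the layer names, the metric, the `quantize_one` hook, and the `1e-3` threshold — since Meta publishes neither its benchmark metric nor its cutoff.

```python
from contextlib import contextmanager

def select_fp8_layers(layers, eval_metric, quantize_one, threshold=1e-3):
    """Quantize each layer to FP8 in isolation, measure the task-metric
    regression vs. the base-precision model, and keep FP8 where it is small."""
    baseline = eval_metric()              # e.g. NLL on a held-out calibration set
    tolerant = []
    for name in layers:
        with quantize_one(name):          # FP8 on this layer only
            delta = eval_metric() - baseline
        if delta < threshold:             # high precision-loss tolerance
            tolerant.append(name)
    return tolerant

# Toy harness: per-layer FP8 regressions are invented for illustration.
regressions = {"ffn.0": 2e-4, "attn.0": 5e-3, "ffn.1": 8e-4}
active = set()

@contextmanager
def quantize_one(name):
    active.add(name)
    try:
        yield
    finally:
        active.discard(name)

def eval_metric():
    return 1.0 + sum(regressions[n] for n in active)

print(select_fp8_layers(regressions, eval_metric, quantize_one))
# -> ['ffn.0', 'ffn.1']  (attn.0 regresses past the threshold, stays at BF16)
```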

Meta's wording: "Using a micro-benchmark guided selection mechanism, the system deploys FP8 only in layers with high precision-loss tolerance. This targeted approach unlocks the throughput benefits of modern heterogeneous hardware for our most complex models with negligible impact on recommendation quality."

Why it's post-training (not QAT)

The post describes post-training quantization (PTQ) — no retraining of the model in FP8 is required. The layer-selection decisions are made on an already-trained model, making the technique cheap to deploy: calibrate on a held-out eval set, emit a per-layer precision map, ship.
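A minimal sketch of what the shipped calibration artifact could look like — the map format, layer names, and dtype labels are all assumptions, not disclosed by Meta:

```python
import json

# Hypothetical per-layer precision map emitted after PTQ calibration,
# shipped alongside the already-trained weights.
precision_map = {
    "ffn.0": "fp8_e4m3",   # tolerant layer -> FP8
    "attn.0": "bf16",      # precision-sensitive layer stays at base precision
    "ffn.1": "fp8_e4m3",
}
print(json.dumps(precision_map, indent=2))
```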

Quantization-aware training (QAT) is a more invasive alternative that would retrain the model with FP8 noise in the forward pass; Meta's choice of PTQ is deployment-friendly but constrained by how much FP8 error a model that was never trained under it can tolerate.

Relationship to hardware-aware model architecture

Selective FP8 is a hardware-aware choice: it exists because modern GPUs ship Tensor Cores with dedicated FP8 paths that are faster than BF16 — the hardware exposes a tier the software can opportunistically hit. On hardware without fast FP8 (older generations), the technique isn't worth the engineering.
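In practice this implies a capability gate before enabling the FP8 path. A hedged sketch (the device list is illustrative; real code would query the runtime, e.g. compute capability, rather than hardcode names):

```python
# Which accelerators have native FP8 Tensor Core / matrix-core support.
# Illustrative mapping only -- consult vendor documentation in practice.
FP8_CAPABLE = {
    "H100": True,    # NVIDIA Hopper
    "B200": True,    # NVIDIA Blackwell
    "MI300X": True,  # AMD CDNA3
    "A100": False,   # NVIDIA Ampere: no FP8 Tensor Cores
}

def use_selective_fp8(device_name: str) -> bool:
    # Fall back to full BF16 on hardware without a fast FP8 path.
    return FP8_CAPABLE.get(device_name, False)

print(use_selective_fp8("H100"))  # True
print(use_selective_fp8("A100"))  # False -> not worth the engineering
```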

Caveats

  • Meta does not disclose the micro-benchmark metric (loss delta? downstream recommendation quality? task-specific KPI?) or the cutoff threshold.
  • Number / percentage of layers that land on FP8 is not disclosed.
  • Hardware mix — which accelerators are running the FP8 path — is not named (H100 / B200 / MI300X all have FP8 support).
  • The alternative of QAT is not evaluated in the post.