CONCEPT Cited by 2 sources

Selective FP8 quantization

Definition

Selective FP8 quantization is a post-training quantisation (PTQ) strategy that applies FP8 only to model layers that tolerate the precision loss, leaving precision-sensitive layers at higher precision (typically BF16 or FP16). Layers are chosen by a micro-benchmark-guided selection mechanism that measures per-layer quality degradation against a task metric (Source: sources/2026-03-31-meta-adaptive-ranking-model-bending-the-inference-scaling-curve).

It is the operational alternative to naive full-FP8 casts, which buy peak throughput at the cost of a quality loss that rank-sensitive domains (ads, search, recsys) cannot absorb.

The problem it solves

FP8 on modern accelerators (Hopper / Blackwell / MI300) offers up to roughly 2× the peak matrix-multiply throughput of BF16 and halves the memory footprint of quantised weights. For training or chat-LLM inference, blanket FP8 often works. For ranking models, Meta finds the opposite:

"Large-scale models necessitate reduced precision to maintain high-throughput inference, yet a blanket application of low-precision quantization often degrades the nuance required for complex ads ranking."

Ranking-model quality is extremely sensitive to small score deltas because the final ordering is what matters. A 0.1% NLL regression that would be invisible in a chatbot can collapse CTR / conversion lift. Blanket FP8 is out; doing nothing sacrifices the throughput win.
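
To make the ordering sensitivity concrete, here is a toy sketch (not Meta's model) in which a two-weight linear scorer ranks two candidates. The weights, the features, and the pure-Python `round_to_e4m3` helper are all illustrative assumptions; the helper only simulates an unscaled, saturating round-to-nearest cast to the FP8 E4M3 format.

```python
import math

E4M3_MAX = 448.0  # largest finite value in FP8 E4M3


def round_to_e4m3(x: float) -> float:
    """Simulate a saturating round-to-nearest cast to FP8 E4M3."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)
    # floor(log2(mag)), clamped to the smallest normal exponent (-6)
    exp = max(math.frexp(mag)[1] - 1, -6)
    step = 2.0 ** (exp - 3)  # 3 mantissa bits -> 8 steps per power of two
    return sign * min(round(mag / step) * step, E4M3_MAX)


def score(weights, features):
    return sum(w * f for w, f in zip(weights, features))


weights = [0.30, 0.52]  # hypothetical trained weights
candidates = {"ad_a": [1.0, 0.0], "ad_b": [0.0, 0.6]}


def ranking(ws):
    """Candidate names sorted by descending score."""
    return sorted(candidates, key=lambda name: score(ws, candidates[name]), reverse=True)


full_precision = ranking(weights)
fp8_weights = [round_to_e4m3(w) for w in weights]
quantised = ranking(fp8_weights)

print(full_precision)  # ad_b leads: 0.312 vs 0.300
print(quantised)       # ad_a leads: the weights became 0.3125 and 0.5
```

Both weights move by only a few percent, yet the two candidates swap places. Production FP8 paths apply scaling factors that this sketch omits, so it shows the failure mode, not its real-world magnitude.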

How selective FP8 works

Micro-benchmark per-layer — measure the quality impact of quantising each layer independently to FP8, holding the rest at base precision.
Identify high-tolerance layers — those where FP8 degrades the task metric by less than a threshold.
Deploy FP8 only on the tolerant set — other layers stay at BF16 / FP16.

Meta's wording: "Using a micro-benchmark guided selection mechanism, the system deploys FP8 only in layers with high precision-loss tolerance. This targeted approach unlocks the throughput benefits of modern heterogeneous hardware for our most complex models with negligible impact on recommendation quality."
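
The three steps above can be sketched as a small selection loop. This is a minimal illustration, not Meta's implementation: the `evaluate` callback, the layer names, the loss-like metric, and the `max_delta` threshold are all hypothetical, since the post discloses neither the metric nor the cutoff.

```python
from typing import Callable, Dict, FrozenSet, Sequence


def select_fp8_layers(
    layers: Sequence[str],
    evaluate: Callable[[FrozenSet[str]], float],
    max_delta: float,
) -> Dict[str, str]:
    """Build a per-layer precision map from one-layer-at-a-time benchmarks.

    `evaluate(fp8_layers)` is a hypothetical callback that runs the model with
    exactly those layers in FP8 and returns a loss-like metric (lower is better).
    """
    baseline = evaluate(frozenset())  # every layer at base precision
    precision_map = {}
    for name in layers:
        # Quantise this layer alone and measure the metric regression.
        delta = evaluate(frozenset({name})) - baseline
        precision_map[name] = "fp8" if delta < max_delta else "bf16"
    return precision_map


# Toy stand-in for a real eval harness: each layer adds a fixed loss penalty.
PENALTY = {"embed_proj": 0.0001, "mlp_1": 0.0002, "interaction": 0.004, "final_logit": 0.01}


def toy_evaluate(fp8_layers: FrozenSet[str]) -> float:
    return 0.5 + sum(PENALTY[name] for name in fp8_layers)


precision_map = select_fp8_layers(list(PENALTY), toy_evaluate, max_delta=0.001)
print(precision_map)
```

Measuring layers one at a time does not capture interactions between quantised layers, so a final evaluation of the combined FP8 set is a sensible extra check before deployment.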

Why it's post-training (not QAT)

The post describes post-training quantisation (PTQ) — no retraining of the model in FP8 is required. The layer-selection decisions are made on an already-trained model, making the technique cheap to deploy: calibrate on a held-out eval set, emit a per-layer precision map, ship.
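
One way the "emit a per-layer precision map, ship" step could look is a small JSON artifact that the serving side resolves against the model's layer list. The file layout, the `version` field, and the BF16 fallback for unlisted layers are assumptions made for this sketch; the post does not describe its artifact format.

```python
import json

ALLOWED = {"fp8", "bf16", "fp16"}


def dump_precision_map(precision_map: dict) -> str:
    """Serialise a per-layer precision map, rejecting unknown precisions."""
    unknown = set(precision_map.values()) - ALLOWED
    if unknown:
        raise ValueError(f"unsupported precisions: {sorted(unknown)}")
    return json.dumps({"version": 1, "layers": precision_map}, indent=2, sort_keys=True)


def load_precision_map(text: str, model_layers: list) -> dict:
    """Resolve the artifact against the model's layers."""
    layers = json.loads(text)["layers"]
    # Layers missing from the artifact fall back to the base precision.
    return {name: layers.get(name, "bf16") for name in model_layers}


artifact = dump_precision_map({"mlp_1": "fp8", "final_logit": "bf16"})
resolved = load_precision_map(artifact, ["mlp_1", "interaction", "final_logit"])
print(resolved)
```

Defaulting unlisted layers to the higher precision keeps a stale or partial map from silently quantising a layer that was never benchmarked.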

Quantisation-aware training (QAT) is a more invasive alternative that would retrain the model with FP8 noise in the forward pass. Meta's choice of PTQ is deployment-friendly, but it is limited by how much FP8 error a model that was never trained for FP8 can tolerate.

Relationship to hardware-aware model architecture

Selective FP8 is a hardware-aware choice: it exists because modern GPUs ship Tensor Cores with dedicated FP8 paths that are faster than BF16, so the hardware exposes a tier the software can opportunistically hit. On hardware without fast FP8 (older generations), the technique isn't worth the engineering.
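
A deployment could gate the precision map on the hardware it lands on. The sketch below is an assumption-laden illustration for NVIDIA parts only: it takes a CUDA compute capability tuple (for example, the value returned by `torch.cuda.get_device_capability()`) and treats 8.9 (Ada) as the first capability with FP8 Tensor Cores. Other vendors would need their own check, and `effective_precision_map` is a hypothetical helper name.

```python
from typing import Dict, Tuple

# Assumption: FP8 Tensor Cores are available from compute capability 8.9 (Ada)
# and 9.0 (Hopper) onward; earlier NVIDIA parts have no fast FP8 path.
MIN_FP8_CAPABILITY = (8, 9)


def effective_precision_map(
    precision_map: Dict[str, str],
    capability: Tuple[int, int],
    fallback: str = "bf16",
) -> Dict[str, str]:
    """Downgrade FP8 entries to `fallback` when the device lacks fast FP8."""
    if capability >= MIN_FP8_CAPABILITY:
        return dict(precision_map)
    return {
        name: (fallback if precision == "fp8" else precision)
        for name, precision in precision_map.items()
    }


selected = {"mlp_1": "fp8", "final_logit": "bf16"}
on_hopper = effective_precision_map(selected, (9, 0))
on_ampere = effective_precision_map(selected, (8, 0))
print(on_hopper)
print(on_ampere)
```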

Seen in

2026-03-31 Meta — Meta Adaptive Ranking Model — canonical wiki source; describes selective FP8 as one of two model-system co-design levers (the other being graph + kernel specialisation) that together drive MFU to 35% across heterogeneous hardware (sources/2026-03-31-meta-adaptive-ranking-model-bending-the-inference-scaling-curve).
2026-05-08 Databricks × Superhuman: selective FP8 at 200K-QPS LLM serving with KV-cache quantisation explicitly off. The joint Databricks Model Serving / Superhuman post applies the selective-FP8 discipline to a production LLM serving workload at 200,000+ QPS on H100. Final config: attention projections (Q, K, V, output) and MLP projections all run through the FP8 path; KV-cache quantisation is explicitly disabled. Quote: "weight quantization was where the throughput wins came from and KV-cache quantization introduced its own quality tradeoffs that weren't worth pursuing for this workload." This is the wiki's first canonical KV-cache-quantisation-explicitly-off datum, confirming that the selective-FP8 discipline named by the Meta ranking-model post generalises to the LLM-serving regime, with the KV-cache as the canonical component to leave at higher precision. The framing is workload-specific: the FP8 throughput win is concentrated in the linear-layer paths, not the KV-cache, and different workloads (longer context, decoder-heavy generation) may reach a different conclusion. The FP8 layer set was tuned jointly: MLP projections were quantised from the start; the open question was attention. The Databricks runtime ships a hybrid-precision toggle flag so attention quantisation can be turned on or off per layer group without architectural change, letting both teams measure quality directly, and the experiment landed with "no measurable quality degradation" on Superhuman's internal eval harness. Quantisation granularity was also load-bearing: Databricks' kernels use per-channel scaling rather than the off-the-shelf per-tensor scaling, which "matched or exceeded other open source baselines at the same throughput." Net throughput contribution: up to +30% per-pod QPS, the single largest win of the migration. (Source: sources/2026-05-08-databricks-how-superhuman-and-databricks-built-a-200k-qps-inference-platform-together)
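
The per-channel versus per-tensor point in the Databricks entry can be illustrated with a pure-Python simulation. This is a sketch under stated assumptions, not Databricks' kernels: `round_to_e4m3` approximates a saturating FP8 E4M3 cast, "per-channel" is taken to mean one scale per output row, and the weight values are made up to exaggerate the magnitude gap between channels.

```python
import math

E4M3_MAX = 448.0  # largest finite value in FP8 E4M3


def round_to_e4m3(x: float) -> float:
    """Simulate a saturating round-to-nearest cast to FP8 E4M3."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)
    exp = max(math.frexp(mag)[1] - 1, -6)  # clamp to the smallest normal exponent
    step = 2.0 ** (exp - 3)                # 3 mantissa bits
    return sign * min(round(mag / step) * step, E4M3_MAX)


def quantise_dequantise(row, amax):
    """Scale so `amax` maps to E4M3_MAX, cast to FP8, then scale back."""
    scale = E4M3_MAX / amax
    return [round_to_e4m3(v * scale) / scale for v in row]


def mean_relative_error(original, restored):
    errors = [
        abs(o - r) / abs(o)
        for o_row, r_row in zip(original, restored)
        for o, r in zip(o_row, r_row)
    ]
    return sum(errors) / len(errors)


weights = [
    [0.001, 0.002, -0.0015],  # small-magnitude output channel
    [100.0, -250.0, 400.0],   # large-magnitude output channel
]

# Per-tensor: one scale shared by the whole matrix.
tensor_amax = max(abs(v) for row in weights for v in row)
per_tensor = [quantise_dequantise(row, tensor_amax) for row in weights]

# Per-channel: each output row gets its own scale.
per_channel = [quantise_dequantise(row, max(abs(v) for v in row)) for row in weights]

err_tensor = mean_relative_error(weights, per_tensor)
err_channel = mean_relative_error(weights, per_channel)
print(f"per-tensor mean relative error:  {err_tensor:.3f}")
print(f"per-channel mean relative error: {err_channel:.3f}")
```

With a single shared scale, the small-magnitude channel is pushed into the FP8 subnormal range and loses most of its precision; giving each channel its own scale uses the full FP8 range for every row.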

Caveats

Meta does not disclose the micro-benchmark metric (loss delta? downstream recommendation quality? task-specific KPI?) or the cutoff threshold.
Number / percentage of layers that land on FP8 is not disclosed.
Hardware mix — which accelerators are running the FP8 path — is not named (H100 / B200 / MI300X all have FP8 support).
The alternative of QAT is not evaluated in the post.