Selective mixed-precision quantization¶
Pattern¶
For quality-sensitive inference workloads where blanket low-precision casts (FP8, INT8) degrade task metrics unacceptably:
- Run a micro-benchmark per layer — quantise one layer at a time to the target low precision, hold the rest at base precision, measure task-metric delta.
- Emit a per-layer precision map — layers whose degradation falls below a cutoff → low precision; layers above the cutoff → base precision.
- Deploy the mixed-precision model post-training — no retraining required.
Captures the throughput benefit of low-precision hardware paths (Tensor Core FP8 is ~2× BF16 throughput) on the tolerant subset of the model, while protecting task quality on the sensitive subset (Source: sources/2026-03-31-meta-adaptive-ranking-model-bending-the-inference-scaling-curve).
Problem¶
Modern accelerators (Hopper, Blackwell, MI300) ship with dedicated low-precision paths — FP8 Tensor Cores offer ~2× BF16 throughput and halve memory footprint. Blanket casting a trained model to FP8 is the obvious throughput win. But for quality-sensitive domains — ranking, search, relevance, recommendation — blanket FP8 often produces unacceptable degradation:
"Large-scale models necessitate reduced precision to maintain high-throughput inference, yet a blanket application of low-precision quantization often degrades the nuance required for complex ads ranking."
Ranking models are particularly precision-sensitive because what matters is the ordering of scores, not their absolute values; small score-delta noise can flip critical ordering decisions.
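The ordering sensitivity is easy to demonstrate. In this minimal sketch (the scores and noise values are invented for illustration, not taken from the source), a quantisation error of a few parts in ten thousand leaves every score nearly unchanged in absolute terms yet swaps the top two items:

```python
import numpy as np

scores = np.array([0.5012, 0.5009, 0.3100])  # base-precision ranking scores
noise = np.array([-0.0004, 0.0004, 0.0000])  # tiny per-item quantisation error

assert np.argmax(scores) == 0            # item 0 ranks first at base precision
assert np.argmax(scores + noise) == 1    # sub-0.1% noise flips the top slot
```

A regression metric like MSE would barely register this perturbation, which is why ranking workloads need task-metric (ordering-aware) evaluation when deciding where quantisation is safe.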
The alternative — keeping everything at BF16 — gives up the throughput and memory wins entirely.
Solution¶
Quantise selectively, not uniformly:
Micro-benchmark-guided layer selection¶
For each candidate layer:
- Clone the trained model.
- Quantise that single layer to FP8.
- Measure task-metric delta (log-loss, NDCG, CTR, conversion lift, etc.) on a calibration set.
- If the delta is below a threshold, mark the layer as FP8-tolerant.
Meta's description: "Using a micro-benchmark guided selection mechanism, the system deploys FP8 only in layers with high precision-loss tolerance. This targeted approach unlocks the throughput benefits of modern heterogeneous hardware for our most complex models with negligible impact on recommendation quality."
Post-training deployment¶
Because the technique is post-training quantisation (PTQ) — no model retraining required — it's cheap to deploy and iterate. This is a tradeoff vs. quantisation-aware training (QAT), which would retrain with quantisation noise in the forward pass and often achieves better per-layer tolerance, but at the cost of full model retraining.
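Deployment then reduces to applying the precision map to the frozen weights, with no training step anywhere in the loop. A self-contained sketch (the FP16 round-trip is a stand-in for a hardware FP8 cast, which NumPy does not expose; real systems would use something like NVIDIA Transformer Engine):

```python
import numpy as np

def quantise_layer(w):
    """Stand-in low-precision cast: FP16 round-trip. A real deployment
    would cast to FP8 via hardware or a library such as Transformer Engine."""
    return w.astype(np.float16).astype(np.float32)

def apply_precision_map(layers, plan):
    """PTQ deployment step: cast only the layers the micro-benchmark
    marked tolerant; everything else stays at base precision.
    `plan` maps layer index -> "fp8" or "bf16"."""
    return [quantise_layer(w) if plan.get(i) == "fp8" else w
            for i, w in enumerate(layers)]
```

Because this step is cheap and the weights never change, the precision map can be re-derived and re-applied whenever the calibration set is refreshed, which is what makes iterating on the map practical compared to a QAT retrain.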
Forces¶
- Throughput is valuable (FP8 roughly doubles it on modern accelerators).
- Memory is valuable (FP8 halves model-weight memory).
- Quality is non-negotiable (ranking / recsys / search can't absorb even small metric regressions).
- Retraining is expensive (PTQ is cheap; QAT is not).
Consequences¶
Positive:
- Captures most of the FP8 throughput win while preserving task quality.
- No retraining required; cheap to iterate on the precision map.
- Applies to already-deployed models.
Negative / tradeoffs:
- The per-layer benchmark is a calibration-set-dependent decision — calibration-set drift can cause the precision map to become stale.
- Mixed-precision code paths are more complex than uniform-precision ones; kernel coverage + maintenance cost goes up.
- Requires hardware that supports both precisions efficiently (Hopper/Blackwell/MI300 — older hardware can't benefit).
- PTQ has a tolerance ceiling; some layers that QAT could quantise cleanly will stay at higher precision.
Canonical industrial instance¶
- Meta Adaptive Ranking Model — the post defining the technique for LLM-scale ads ranking. Combined with operator fusion + Grouped GEMM + horizontal fusion (see patterns/model-system-codesign-ranking) to drive MFU to 35% across heterogeneous hardware.
Related patterns¶
- patterns/model-system-codesign-ranking — selective FP8 is one of four co-design levers in that pattern.
Seen in¶
- 2026-03-31 Meta — Meta Adaptive Ranking Model — canonical wiki source (sources/2026-03-31-meta-adaptive-ranking-model-bending-the-inference-scaling-curve).