

Selective mixed-precision quantization

Pattern

For quality-sensitive inference workloads where blanket low-precision casts (FP8, INT8) degrade task metrics unacceptably:

  1. Run a micro-benchmark per layer — quantise one layer at a time to the target low precision, hold the rest at base precision, measure task-metric delta.
  2. Emit a per-layer precision map — layers whose degradation falls below a cutoff → low precision; layers above the cutoff → base precision.
  3. Deploy the mixed-precision model post-training — no retraining required.
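
The emitted precision map (step 2) can be as simple as a dict keyed by layer name. A minimal sketch, where the layer names, metric deltas, and cutoff are all made-up illustrative values, not from the source:

```python
# Hypothetical per-layer task-metric deltas from the micro-benchmark (step 1):
# the degradation measured when only that layer runs at FP8.
metric_delta = {
    "embedding": 0.0042,
    "mlp.0":     0.0001,
    "mlp.1":     0.0002,
    "attn.qkv":  0.0031,
    "head":      0.0009,
}

CUTOFF = 0.001  # acceptable task-metric degradation (assumed value)

# Step 2: layers whose degradation falls below the cutoff go to low
# precision; layers above it stay at base precision.
precision_map = {
    name: ("fp8" if delta < CUTOFF else "bf16")
    for name, delta in metric_delta.items()
}
print(precision_map)
```

Step 3 then deploys this map as-is: no retraining, just a per-layer cast at model-load time.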

Captures the throughput benefit of low-precision hardware paths (Tensor Core FP8 is ~2× BF16 throughput) on the tolerant subset of the model, while protecting task quality on the sensitive subset (Source: sources/2026-03-31-meta-adaptive-ranking-model-bending-the-inference-scaling-curve).

Problem

Modern accelerators (Hopper, Blackwell, MI300) ship with dedicated low-precision paths — FP8 Tensor Cores offer ~2× BF16 throughput and halve memory footprint. Blanket casting a trained model to FP8 is the obvious throughput win. But for quality-sensitive domains — ranking, search, relevance, recommendation — blanket FP8 often produces unacceptable degradation:

"Large-scale models necessitate reduced precision to maintain high-throughput inference, yet a blanket application of low-precision quantization often degrades the nuance required for complex ads ranking."

Ranking models are particularly precision-sensitive because what matters is the ordering of scores, not their absolute values; small score-delta noise can flip critical ordering decisions.
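
A toy illustration of this sensitivity, using a uniform-grid rounding step as a crude stand-in for low-precision casting (real FP8 uses a non-uniform exponent/mantissa grid; the scores are invented):

```python
def quantise(x, step):
    # Round x to the nearest multiple of `step`: a crude uniform-grid
    # stand-in for low-precision rounding.
    return round(x / step) * step

# Illustrative scores: the true ordering ad_a > ad_b is what the ranker
# must preserve; the absolute values barely matter.
scores = {"ad_a": 0.5009, "ad_b": 0.5003}

q = {k: quantise(v, step=0.002) for k, v in scores.items()}
# Absolute error is under 0.1%, but both scores land on the same grid
# point (0.5), so the ordering information is destroyed.
print(q)
```

The absolute scores survive almost unchanged, yet the comparison the ranker depends on no longer resolves.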

The alternative — keeping everything at BF16 — gives up the throughput and memory wins entirely.

Solution

Quantise selectively, not uniformly:

Micro-benchmark-guided layer selection

For each candidate layer:

  1. Clone the trained model.
  2. Quantise that single layer to FP8.
  3. Measure task-metric delta (log-loss, NDCG, CTR, conversion lift, etc.) on a calibration set.
  4. If the delta is below a threshold, mark the layer as FP8-tolerant.

Meta's description: "Using a micro-benchmark guided selection mechanism, the system deploys FP8 only in layers with high precision-loss tolerance. This targeted approach unlocks the throughput benefits of modern heterogeneous hardware for our most complex models with negligible impact on recommendation quality."
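
The four steps can be sketched end to end. Everything below (the `fake_fp8` rounding helper, the linear scorer standing in for the task metric, the layer names and numbers) is an illustrative assumption, not Meta's implementation:

```python
import copy

def fake_fp8(weights):
    # Crude simulation of a low-precision cast; real code would cast the
    # layer to FP8 on the accelerator.
    return [round(w, 1) for w in weights]

def task_metric(model, calib_set):
    # Toy task metric: MSE of an additive scorer on a calibration set.
    # A real pipeline would measure log-loss / NDCG on held-out data.
    def score(x):
        return sum(w * x for layer in model.values() for w in layer)
    return sum((score(x) - y) ** 2 for x, y in calib_set) / len(calib_set)

def build_precision_map(model, calib_set, cutoff):
    base = task_metric(model, calib_set)
    precision_map = {}
    for name in model:                              # one layer at a time
        trial = copy.deepcopy(model)                # step 1: clone
        trial[name] = fake_fp8(trial[name])         # step 2: quantise one layer
        delta = task_metric(trial, calib_set) - base  # step 3: metric delta
        # step 4: below-cutoff layers are marked FP8-tolerant
        precision_map[name] = "fp8" if delta < cutoff else "bf16"
    return precision_map

model = {"sensitive": [0.123, 0.457], "tolerant": [0.1, 0.5]}
calib = [(1.0, 1.18), (2.0, 2.36)]
print(build_precision_map(model, calib, cutoff=0.0005))
```

In this toy run the "tolerant" layer's weights survive the rounding unchanged (zero delta), while the "sensitive" layer's do not, so only the former is mapped to FP8.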

Post-training deployment

Because the technique is post-training quantisation (PTQ) — no model retraining required — it's cheap to deploy and iterate. This is a tradeoff vs. quantisation-aware training (QAT), which would retrain with quantisation noise in the forward pass and often achieves better per-layer tolerance, but at the cost of full model retraining.
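
Applying the map at deployment is a plain post-training transform over the loaded weights. A minimal self-contained sketch, where `fake_fp8` and the example layers are illustrative stand-ins for a real FP8 cast:

```python
def fake_fp8(weights):
    # Stand-in for a real FP8 cast; real code would rewrite the layer to
    # use the accelerator's FP8 kernels.
    return [round(w, 1) for w in weights]

def apply_precision_map(model, precision_map):
    # Post-training: cast FP8-tolerant layers, leave the rest at base
    # precision. No retraining step anywhere in this path.
    return {
        name: (fake_fp8(w) if precision_map.get(name) == "fp8" else w)
        for name, w in model.items()
    }

model = {"mlp": [0.12, 0.48], "head": [0.33, 0.91]}
pmap = {"mlp": "fp8", "head": "bf16"}  # e.g. produced by the micro-benchmark
print(apply_precision_map(model, pmap))
```

Because this runs entirely on already-trained weights, iterating on the cutoff or the precision map only requires re-running the (cheap) benchmark and this cast, never a training job.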

Forces

  • Throughput is valuable (FP8 roughly doubles it on modern accelerators).
  • Memory is valuable (FP8 halves model-weight memory).
  • Quality is non-negotiable (ranking / recsys / search can't absorb even small metric regressions).
  • Retraining is expensive (PTQ is cheap; QAT is not).
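
The first two forces are easy to quantify. A back-of-envelope for a hypothetical 100B-parameter model, with an assumed (purely illustrative) 70/30 FP8/BF16 split:

```python
# Weight-memory back-of-envelope; parameter count and split are assumptions.
params = 100e9
bf16_gb = params * 2 / 1e9   # BF16: 2 bytes per parameter
fp8_gb = params * 1 / 1e9    # FP8: 1 byte per parameter

# If the precision map sends 70% of parameters to FP8:
mixed_gb = 0.7 * fp8_gb + 0.3 * bf16_gb
print(bf16_gb, fp8_gb, mixed_gb)
```

A mixed-precision model captures a large fraction of the memory win; the throughput picture is analogous but depends on which layers dominate compute.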

Consequences

Positive:

  • Captures most of the FP8 throughput win while preserving task quality.
  • No retraining required; cheap to iterate on the precision map.
  • Applies to already-deployed models.

Negative / tradeoffs:

  • The per-layer benchmark is a calibration-set-dependent decision — calibration-set drift can cause the precision map to become stale.
  • Mixed-precision code paths are more complex than uniform-precision ones; kernel coverage and maintenance cost both go up.
  • Requires hardware that supports both precisions efficiently (Hopper/Blackwell/MI300 — older hardware can't benefit).
  • PTQ has a tolerance ceiling; some layers that QAT could quantise cleanly will stay at higher precision.

Canonical industrial instance

Meta's adaptive ads-ranking model, which deploys FP8 only in layers with high precision-loss tolerance (Source: sources/2026-03-31-meta-adaptive-ranking-model-bending-the-inference-scaling-curve).
