

Selective mixed-precision quantization

Pattern

For quality-sensitive inference workloads where blanket low-precision casts (FP8, INT8) degrade task metrics unacceptably:

  1. Run a micro-benchmark per layer — quantise one layer at a time to the target low precision, hold the rest at base precision, measure task-metric delta.
  2. Emit a per-layer precision map — layers whose degradation falls below a cutoff → low precision; layers above the cutoff → base precision.
  3. Deploy the mixed-precision model post-training — no retraining required.
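
The emitted precision map (step 2) can be as simple as a dict keyed by layer name. A minimal sketch, where the layer names, metric deltas, and cutoff are all made-up illustrative values, not from the source:

```python
# Hypothetical per-layer task-metric deltas from the micro-benchmark (step 1):
# the degradation measured when only that layer runs at FP8.
metric_delta = {
    "embedding": 0.0042,
    "mlp.0":     0.0001,
    "mlp.1":     0.0002,
    "attn.qkv":  0.0031,
    "head":      0.0009,
}

CUTOFF = 0.001  # acceptable task-metric degradation (assumed value)

# Step 2: layers whose degradation falls below the cutoff go to low
# precision; layers above it stay at base precision.
precision_map = {
    name: ("fp8" if delta < CUTOFF else "bf16")
    for name, delta in metric_delta.items()
}
print(precision_map)
```

Step 3 then deploys this map as-is: no retraining, just a per-layer cast at model-load time.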

Captures the throughput benefit of low-precision hardware paths (Tensor Core FP8 is ~2× BF16 throughput) on the tolerant subset of the model, while protecting task quality on the sensitive subset (Source: sources/2026-03-31-meta-adaptive-ranking-model-bending-the-inference-scaling-curve).

Problem

Modern accelerators (Hopper, Blackwell, MI300) ship with dedicated low-precision paths — FP8 Tensor Cores offer ~2× BF16 throughput and halve memory footprint. Blanket casting a trained model to FP8 is the obvious throughput win. But for quality-sensitive domains — ranking, search, relevance, recommendation — blanket FP8 often produces unacceptable degradation:

"Large-scale models necessitate reduced precision to maintain high-throughput inference, yet a blanket application of low-precision quantization often degrades the nuance required for complex ads ranking."

Ranking models are particularly precision-sensitive because what matters is the ordering of scores, not their absolute values; small score-delta noise can flip critical ordering decisions.
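
A toy illustration of this sensitivity, using a uniform-grid rounding step as a crude stand-in for low-precision casting (real FP8 uses a non-uniform exponent/mantissa grid; the scores are invented):

```python
def quantise(x, step):
    # Round x to the nearest multiple of `step`: a crude uniform-grid
    # stand-in for low-precision rounding.
    return round(x / step) * step

# Illustrative scores: the true ordering ad_a > ad_b is what the ranker
# must preserve; the absolute values barely matter.
scores = {"ad_a": 0.5009, "ad_b": 0.5003}

q = {k: quantise(v, step=0.002) for k, v in scores.items()}
# Absolute error is under 0.1%, but both scores land on the same grid
# point (0.5), so the ordering information is destroyed.
print(q)
```

The absolute scores survive almost unchanged, yet the comparison the ranker depends on no longer resolves.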

The alternative — keeping everything at BF16 — gives up the throughput and memory wins entirely.

Solution

Quantise selectively, not uniformly:

Micro-benchmark-guided layer selection

For each candidate layer:

  1. Clone the trained model.
  2. Quantise that single layer to FP8.
  3. Measure task-metric delta (log-loss, NDCG, CTR, conversion lift, etc.) on a calibration set.
  4. If the delta is below a threshold, mark the layer as FP8-tolerant.

Meta's description: "Using a micro-benchmark guided selection mechanism, the system deploys FP8 only in layers with high precision-loss tolerance. This targeted approach unlocks the throughput benefits of modern heterogeneous hardware for our most complex models with negligible impact on recommendation quality."
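
The four steps can be sketched end to end. Everything below (the `fake_fp8` rounding helper, the linear scorer standing in for the task metric, the layer names and numbers) is an illustrative assumption, not Meta's implementation:

```python
import copy

def fake_fp8(weights):
    # Crude simulation of a low-precision cast; real code would cast the
    # layer to FP8 on the accelerator.
    return [round(w, 1) for w in weights]

def task_metric(model, calib_set):
    # Toy task metric: MSE of an additive scorer on a calibration set.
    # A real pipeline would measure log-loss / NDCG on held-out data.
    def score(x):
        return sum(w * x for layer in model.values() for w in layer)
    return sum((score(x) - y) ** 2 for x, y in calib_set) / len(calib_set)

def build_precision_map(model, calib_set, cutoff):
    base = task_metric(model, calib_set)
    precision_map = {}
    for name in model:                              # one layer at a time
        trial = copy.deepcopy(model)                # step 1: clone
        trial[name] = fake_fp8(trial[name])         # step 2: quantise one layer
        delta = task_metric(trial, calib_set) - base  # step 3: metric delta
        # step 4: below-cutoff layers are marked FP8-tolerant
        precision_map[name] = "fp8" if delta < cutoff else "bf16"
    return precision_map

model = {"sensitive": [0.123, 0.457], "tolerant": [0.1, 0.5]}
calib = [(1.0, 1.18), (2.0, 2.36)]
print(build_precision_map(model, calib, cutoff=0.0005))
```

In this toy run the "tolerant" layer's weights survive the rounding unchanged (zero delta), while the "sensitive" layer's do not, so only the former is mapped to FP8.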

Post-training deployment

Because the technique is post-training quantisation (PTQ) — no model retraining required — it's cheap to deploy and iterate. This is a tradeoff vs. quantisation-aware training (QAT), which would retrain with quantisation noise in the forward pass and often achieves better per-layer tolerance, but at the cost of full model retraining.
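
Applying the map at deployment is a plain post-training transform over the loaded weights. A minimal self-contained sketch, where `fake_fp8` and the example layers are illustrative stand-ins for a real FP8 cast:

```python
def fake_fp8(weights):
    # Stand-in for a real FP8 cast; real code would rewrite the layer to
    # use the accelerator's FP8 kernels.
    return [round(w, 1) for w in weights]

def apply_precision_map(model, precision_map):
    # Post-training: cast FP8-tolerant layers, leave the rest at base
    # precision. No retraining step anywhere in this path.
    return {
        name: (fake_fp8(w) if precision_map.get(name) == "fp8" else w)
        for name, w in model.items()
    }

model = {"mlp": [0.12, 0.48], "head": [0.33, 0.91]}
pmap = {"mlp": "fp8", "head": "bf16"}  # e.g. produced by the micro-benchmark
print(apply_precision_map(model, pmap))
```

Because this runs entirely on already-trained weights, iterating on the cutoff or the precision map only requires re-running the (cheap) benchmark and this cast, never a training job.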

Forces

  • Throughput is valuable (FP8 roughly doubles it on modern accelerators).
  • Memory is valuable (FP8 halves model-weight memory).
  • Quality is non-negotiable (ranking / recsys / search can't absorb even small metric regressions).
  • Retraining is expensive (PTQ is cheap; QAT is not).
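
The first two forces are easy to quantify. A back-of-envelope for a hypothetical 100B-parameter model, with an assumed (purely illustrative) 70/30 FP8/BF16 split:

```python
# Weight-memory back-of-envelope; parameter count and split are assumptions.
params = 100e9
bf16_gb = params * 2 / 1e9   # BF16: 2 bytes per parameter
fp8_gb = params * 1 / 1e9    # FP8: 1 byte per parameter

# If the precision map sends 70% of parameters to FP8:
mixed_gb = 0.7 * fp8_gb + 0.3 * bf16_gb
print(bf16_gb, fp8_gb, mixed_gb)
```

A mixed-precision model captures a large fraction of the memory win; the throughput picture is analogous but depends on which layers dominate compute.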

Consequences

Positive:

  • Captures most of the FP8 throughput win while preserving task quality.
  • No retraining required; cheap to iterate on the precision map.
  • Applies to already-deployed models.

Negative / tradeoffs:

  • The per-layer benchmark is a calibration-set-dependent decision — calibration-set drift can cause the precision map to become stale.
  • Mixed-precision code paths are more complex than uniform-precision ones; kernel coverage and maintenance cost both go up.
  • Requires hardware that supports both precisions efficiently (Hopper/Blackwell/MI300 — older hardware can't benefit).
  • PTQ has a tolerance ceiling; some layers that QAT could quantise cleanly will stay at higher precision.

Canonical industrial instance

Meta's adaptive ads-ranking model, which deploys FP8 only in layers with high precision-loss tolerance (Source: sources/2026-03-31-meta-adaptive-ranking-model-bending-the-inference-scaling-curve).
