
Hardware-native quantization

Pattern

Push quantization metadata (scales, zero-points, block-scale exponents) into the matrix-unit instruction itself rather than dequantizing quantized operands in software before each MMA. The hardware consumes packed low-bit operands + their associated scaling factors in one fused instruction — no software dequant step.

Canonical instance: MXFP / NVFP4 on NVIDIA Tensor Cores via the block_scale modifier on tcgen05.mma (sm_100) / mma.sync (sm_120) (Source: sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai).

Why

Pre-MXFP quantization paid a software dequantization tax: when activations and weights used different bit-widths, the kernel dequantized the low-bit operand up to the higher precision before the MMA, then ran the MMA at the higher precision. The dequant was pure arithmetic overhead on the MMA path.
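The dequant tax can be made concrete with a toy A16W4 dot product (a pure-Python sketch with hypothetical helper names; real kernels do this per-tile inside the GEMM mainloop on packed tensors):

```python
def quantize_w4(weights, scale):
    """Symmetric 4-bit quantization: integers clamped to [-8, 7]."""
    return [max(-8, min(7, round(w / scale))) for w in weights]

def dot_a16w4(activations, q_weights, scale):
    """A16W4 dot product. Each quantized weight must be dequantized
    (multiplied back by its scale) *before* the multiply-accumulate --
    that extra multiply is pure overhead on the MMA path."""
    acc = 0.0
    for a, qw in zip(activations, q_weights):
        w = qw * scale   # software dequant step, per element
        acc += a * w     # the actual MMA work
    return acc

acts = [0.5, -1.25, 2.0, 0.75]
weights = [0.11, -0.42, 0.30, 0.05]
scale = 0.06
qw = quantize_w4(weights, scale)        # [2, -7, 5, 1]
print(dot_a16w4(acts, qw, scale))
```

In a compute-bound regime, the per-element `qw * scale` competes with the multiply-accumulate for arithmetic throughput, which is exactly why A16W4 could lose to plain 16-bit matmul.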

Consequences in pre-MXFP land:

  • In memory-bound workloads the dequant was "free" (the matrix unit was idle waiting for data), so weight-only quantization won.
  • In compute-bound workloads the dequant actively slowed execution — Dropbox reports A16W4 "performing worse than 16-bit matmul" due to the extra arithmetic (see patterns/weight-only-vs-activation-quantization).

Hardware-native quantization closes this gap: the MMA instruction itself operates on quantized inputs without a pre-pass. Mixed-precision MMAs (e.g. MXFP8 × MXFP4) become one atomic op; activations can use MXFP8/MXFP6/MXFP4 while weights remain MXFP4 without dequant overhead (Source: sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai).
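What the fused instruction computes can be sketched as a block-scaled dot product where the per-block scales fold into the accumulation instead of materializing a dequantized operand (hypothetical function name and signature; on hardware this is a single tcgen05.mma / mma.sync instruction, not a loop):

```python
def block_scaled_mma(qa, sa, qb, sb, block=32):
    """Dot product of two quantized vectors whose contiguous blocks of
    `block` elements each carry a shared scale (sa[i], sb[i]).
    The scale multiply happens once per block, inside the accumulation
    -- no dequantized copy of either operand is ever written out."""
    acc = 0.0
    for i in range(0, len(qa), block):
        partial = sum(x * y for x, y in zip(qa[i:i + block], qb[i:i + block]))
        acc += sa[i // block] * sb[i // block] * partial
    return acc

# Mixed precision falls out for free: qa and qb may hold codes of
# different bit-widths (e.g. MXFP8 activations x MXFP4 weights).
print(block_scaled_mma([2, -1, 3, 4], [0.5, 0.25],
                       [1, 1, 2, -2], [2.0, 1.0], block=2))
```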

Shape of the fix

The OCP MXFP spec standardizes:

  • Block-scale granularity — 32 contiguous elements share a scale, stored alongside the packed operand (see patterns/grouped-linear-quantization)
  • Scale format — E8M0 (8-bit exponent, 0-bit mantissa — literally a power-of-two exponent); NVFP4 instead uses E4M3 FP8 with a finer 16-element group for better numerical stability
  • Symmetric quantization — no zero-point needed; simpler hardware
  • Mixed-precision MMA support — MXFP8 × MXFP4 etc. in one instruction

The GPU loads the packed operands and their per-block scales, performs the scaled MMA, and accumulates — all in one instruction. No dequant-up, no per-element scale-multiply in software.
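A minimal sketch of MX-style block quantization under the spec points above (assumptions: the FP4 grid below is the E2M1 value set; the power-of-two scale-selection heuristic is one reasonable choice, and real implementations vary):

```python
import math

# E2M1 (FP4) representable magnitudes, plus negatives.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_GRID += [-v for v in FP4_GRID if v]

def e8m0_scale(block):
    """Shared E8M0 scale: a pure power of two, chosen here so the
    block's largest magnitude fits the FP4 range (max |value| = 6)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0
    return 2.0 ** math.ceil(math.log2(amax / 6.0))

def mx_quantize(values, block_size=32):
    """32 contiguous elements share one scale, stored alongside the
    packed FP4 codes (symmetric: no zero-point)."""
    out = []
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        s = e8m0_scale(block)
        codes = [min(FP4_GRID, key=lambda g: abs(x / s - g)) for x in block]
        out.append((s, codes))
    return out

def mx_dequantize(blocks):
    return [g * s for s, codes in blocks for g in codes]

blocks = mx_quantize([0.1 * i for i in range(32)])
print(blocks[0][0], mx_dequantize(blocks)[:4])
```

NVFP4 changes two knobs relative to this sketch: the scale is an E4M3 FP8 value rather than a bare power of two, and the group size drops from 32 to 16.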

Trade-offs

  • Accuracy dip at very low bits. E8M0 power-of-two scales cause a noticeable accuracy drop at MXFP4. NVFP4's E4M3 scales and 16-element groups recover most of it; Dropbox's FP4 blog post shows simple post-training adjustments restore most MXFP4 model quality.
  • Not portable across architectures. Kernels compiled for sm_100 don't run on sm_120 — different Tensor Core instruction encodings. Triton recently added cross-device MXFP support on sm_120 (PR 8494), widening the portability envelope.
  • Ecosystem lag. Framework support (PyTorch kernels, inference runtimes, model zoos) is still catching up; mainstream focus is on server-grade B200/B300 GPUs; FP4 models are not yet widely available.
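The E8M0-vs-E4M3 accuracy gap in the first bullet can be seen numerically (toy sketch; an exact float scale stands in for NVFP4's E4M3 scale, which is itself low-precision, so this overstates E4M3 slightly):

```python
import math
import random

FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_GRID += [-v for v in FP4_GRID if v]

def rms_error(block, scale):
    """RMS reconstruction error of the block quantized to the FP4 grid
    under a shared scale."""
    deq = [min(FP4_GRID, key=lambda g: abs(x / scale - g)) * scale
           for x in block]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(block, deq)) / len(block))

random.seed(0)
block = [random.gauss(0.0, 1.0) for _ in range(32)]
amax = max(abs(x) for x in block)

fine = amax / 6.0                          # scale that tracks the block max
e8m0 = 2.0 ** math.ceil(math.log2(fine))   # snapped up to a power of two

# The power-of-two snap can inflate the scale by up to 2x, coarsening
# every quantization step in the block.
print(rms_error(block, e8m0), rms_error(block, fine))
```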

Relationship to other patterns

Seen in

  • sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai — canonical articulation of the pre-MXFP vs MXFP divide as "software-managed scaling and explicit dequantization" vs "moving these operations directly into Tensor Core hardware"; Dropbox positions MXFP/NVFP adoption as an ongoing production-viability trade-off for Dash, gated by runtime + model-availability ecosystem.