PATTERN
Hardware-native quantization¶
Pattern¶
Push quantization metadata (scales, zero-points, block-scale exponents) into the matrix-unit instruction itself rather than dequantizing quantized operands in software before each MMA. The hardware consumes packed low-bit operands + their associated scaling factors in one fused instruction — no software dequant step.
Canonical instance: MXFP / NVFP4 on NVIDIA Tensor Cores via the block_scale modifier on tcgen05.mma (sm_100) / mma.sync (sm_120) (Source: sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai).
Why¶
Pre-MXFP quantization paid a software dequantization tax: when activations and weights used different bit-widths, the kernel dequantized the low-bit operand up to the higher precision before the MMA, then ran the MMA at the higher precision. The dequant was pure arithmetic overhead on the MMA path.
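The dequantization tax can be sketched in a few lines. This is an illustrative stand-in, not a real kernel: the function names and the per-group scale layout are hypothetical, and the "MMA" is an ordinary dot product standing in for the higher-precision matrix op.

```python
# Sketch of the pre-MXFP mixed-precision path (illustrative, not a real kernel).
# The low-bit operand is dequantized to the higher precision in software,
# then the matmul runs entirely at that higher precision.

def dequantize(w_q, scales, group_size=128):
    """Per-group symmetric dequant: w = w_q * scale.
    This step is the pure arithmetic overhead on the MMA path."""
    return [q * scales[i // group_size] for i, q in enumerate(w_q)]

def matmul_row(activations, w_q, scales):
    # Step 1: software dequant-up of the int-coded weights.
    w = dequantize(w_q, scales)
    # Step 2: the actual MMA, at the higher precision.
    return sum(a * wi for a, wi in zip(activations, w))

acts = [1.0, 2.0, 3.0, 4.0]
w_q = [2, -1, 3, 0]   # int4-coded weights
scales = [0.5]        # one scale for this (tiny) group
result = matmul_row(acts, w_q, scales)  # 0.5*(1*2 + 2*-1 + 3*3 + 4*0) = 4.5
```

When the matrix unit is idle waiting on memory, step 1 hides behind the load; when the matrix unit is saturated, step 1 is added latency, which is the A16W4-slower-than-FP16 effect described below.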
Consequences in pre-MXFP land:
- In memory-bound workloads the dequant was "free" (the matrix unit was idle waiting for data), so weight-only quantization won.
- In compute-bound workloads the dequant actively slowed execution — Dropbox reports A16W4 "performing worse than 16-bit matmul" due to the extra arithmetic (see patterns/weight-only-vs-activation-quantization).
Hardware-native quantization closes this gap: the MMA instruction itself operates on quantized inputs without a pre-pass. Mixed-precision MMAs (e.g. MXFP8 × MXFP4) become one atomic op; activations can use MXFP8/MXFP6/MXFP4 while weights remain MXFP4 without dequant overhead (Source: sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai).
Shape of the fix¶
The OCP MXFP spec standardizes:
- Block-scale granularity — 32 contiguous elements share a scale, stored alongside the packed operand (see patterns/grouped-linear-quantization)
- Scale format — E8M0 (8-bit exponent, 0-bit mantissa — literally a power-of-two exponent); NVFP4 instead uses E4M3 FP8 with a finer 16-element group for better numerical stability
- Symmetric quantization — no zero-point needed; simpler hardware
- Mixed-precision MMA support — MXFP8 × MXFP4 etc. in one instruction
The GPU loads the packed operands and their per-block scales, performs the scaled MMA, and accumulates — all in one instruction. No dequant-up, no per-element scale-multiply in software.
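The block-scale scheme above can be sketched in plain Python. Assumptions to note: real E8M0 is an 8-bit biased exponent (here modeled as a Python float constrained to a power of two), real FP4 (E2M1) encoding is replaced by simple rounding, and 6.0 is FP4's maximum representable magnitude.

```python
import math

FP4_MAX = 6.0  # largest magnitude representable in FP4 (E2M1)

def e8m0_scale(block):
    """E8M0-style scale: a power of two chosen so the block's max
    magnitude fits within the target format's range."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0
    return 2.0 ** math.ceil(math.log2(amax / FP4_MAX))

def quantize_block(block):
    """Symmetric block quantization: all elements share one E8M0 scale,
    no zero-point. round() stands in for real FP4 encoding."""
    s = e8m0_scale(block)
    codes = [round(x / s) for x in block]
    return codes, s

def dequantize_block(codes, s):
    return [c * s for c in codes]

# One MXFP-style block: 32 contiguous elements, one shared scale.
block = [0.1 * i for i in range(32)]
codes, s = quantize_block(block)
recon = dequantize_block(codes, s)
```

On hardware, the `codes` and `s` travel together to the matrix unit and the final multiply-by-scale happens inside the MMA, not in a software pass like `dequantize_block`.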
Trade-offs¶
- Accuracy dip at very low bits. E8M0 power-of-two scales cause a noticeable accuracy drop at MXFP4. NVFP4's E4M3 scales and 16-element groups recover most of it; Dropbox's FP4 blog post shows simple post-training adjustments restore most MXFP4 model quality.
- Not portable across architectures. Kernels compiled for sm_100 don't run on sm_120 — different Tensor Core instruction encodings. Triton recently added cross-device MXFP support on sm_120 (PR 8494), widening the portability envelope.
- Ecosystem lag. Framework support (PyTorch kernels, inference runtimes, model zoos) is still catching up; mainstream focus is on server-grade B200/B300 GPUs; FP4 models are not yet widely available.
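The MXFP4-vs-NVFP4 accuracy gap can be illustrated numerically. A minimal sketch, with stated assumptions: E8M0 is modeled as a power-of-two scale over 32-element blocks, the E4M3 scale is approximated as a (nearly) continuous scale over 16-element blocks, and rounding stands in for real FP4 encoding. The function names are hypothetical.

```python
import math
import random

FP4_MAX = 6.0  # largest magnitude representable in FP4 (E2M1)

def pow2_scale(amax):
    # E8M0-style: scale restricted to a power of two (rounded up).
    return 2.0 ** math.ceil(math.log2(amax / FP4_MAX)) if amax else 1.0

def fine_scale(amax):
    # E4M3-style stand-in: near-continuous scale, so amax maps to FP4_MAX exactly.
    return amax / FP4_MAX if amax else 1.0

def block_rmse(xs, group, scale_fn):
    """Quantize-dequantize xs in blocks of `group` and report reconstruction RMSE."""
    err = 0.0
    for i in range(0, len(xs), group):
        blk = xs[i:i + group]
        s = scale_fn(max(abs(x) for x in blk))
        err += sum((x - round(x / s) * s) ** 2 for x in blk)
    return math.sqrt(err / len(xs))

random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(4096)]
mxfp4_like = block_rmse(xs, 32, pow2_scale)  # E8M0 scale, 32-element blocks
nvfp4_like = block_rmse(xs, 16, fine_scale)  # finer scale, 16-element blocks
# The finer, non-power-of-two scales yield lower reconstruction error,
# which is the intuition behind NVFP4's accuracy recovery.
```

The power-of-two constraint rounds every scale up to the next exponent, widening the quantization step; the 16-element groups additionally shrink each block's dynamic range. Both effects push `nvfp4_like` below `mxfp4_like` on typical weight-like distributions.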
Relationship to other patterns¶
- Enables patterns/weight-only-vs-activation-quantization to be a design axis rather than a hardware-forced trade-off — mixed-precision MMA lets A8W4 / A6W4 / A4W4 all run natively.
- Generalizes patterns/grouped-linear-quantization by making grouping the hardware contract, not a software abstraction.
- Instance of concepts/hardware-software-codesign — the OCP spec was co-designed across NVIDIA, AMD, and the model-serving community so that format choices align with matrix-unit capabilities.
Seen in¶
- sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai — canonical articulation of the pre-MXFP vs MXFP divide as "software-managed scaling and explicit dequantization" vs "moving these operations directly into Tensor Core hardware"; Dropbox positions MXFP/NVFP adoption as an ongoing production-viability trade-off for Dash, gated by runtime + model-availability ecosystem.