MXFP microscaling formats¶
Definition¶
MXFP (Microscaling Formats) is an Open Compute Project spec (OCP Microscaling Formats MX Specification v1.0) standardizing a family of low-bit data types with native hardware support on GPU matrix units. MXFP types include MXFP8 / MXFP6 / MXFP4 (floating-point) and MXINT8 (integer), each using symmetric quantization with fixed block size 32 and a shared scaling factor per block in E8M0 format. Tensor Cores operate directly on packed MXFP operands + their block scales in a single fused MMA instruction — no software dequantization step (Source: sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai).
Why it matters¶
MXFP is the architectural break between pre-MXFP quantization (software-managed scaling + explicit dequantization before each MMA) and hardware-native quantization (MMAs consume quantized operands directly). Concretely:
- Pre-MXFP: kernel loads 4-bit weights → unpacks → rescales to FP16 → MMA at FP16. The dequant is pure overhead on the MMA path; in compute-bound regimes it makes A16W4 slower than 16-bit matmul.
- MXFP: kernel issues one block-scaled MMA instruction (`tcgen05.mma` on sm_100, `mma.sync` with `block_scale` on sm_120) that consumes MXFP operands + E8M0 scales and runs the MMA, all fused. The dequant tax is gone.
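To make the contrast concrete, here is a minimal NumPy sketch of the pre-MXFP software path (function and variable names are ours, not from any real kernel): everything before the final matmul is the overhead that a fused block-scaled MMA eliminates.

```python
import numpy as np

# Pre-MXFP A16W4 path (illustrative sketch, names ours): 4-bit weights
# must be unpacked and rescaled to FP16 in software before the MMA.
def dequant_then_matmul(acts_fp16, w_packed_u8, scales, group=128):
    # Unpack two signed 4-bit values per byte (low nibble first).
    lo = (w_packed_u8 & 0x0F).astype(np.int8) - 8
    hi = (w_packed_u8 >> 4).astype(np.int8) - 8
    w_int4 = np.stack([lo, hi], axis=-1).reshape(w_packed_u8.shape[0], -1)
    # Per-group rescale to FP16: this is the "dequant tax" on the MMA path.
    n, k = w_int4.shape
    w_fp16 = (w_int4.reshape(n, k // group, group)
              * scales[:, :, None]).reshape(n, k).astype(np.float16)
    # Only now can the 16-bit matmul (the actual MMA) run.
    return acts_fp16 @ w_fp16.T

# MXFP path: none of the above happens in software; the hardware MMA
# consumes the packed operands + block scales directly.
```

On the MXFP path the unpack and rescale steps simply do not exist as instructions on the critical path.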
Format family¶
Per the OCP spec (Figure 4 in the Dropbox post, sourced from Table 1 of the spec PDF):
| Type | Element format | Block size | Scale format |
|---|---|---|---|
| MXFP8 | FP8 (E5M2 or E4M3) | 32 | E8M0 |
| MXFP6 | FP6 (E3M2 or E2M3) | 32 | E8M0 |
| MXFP4 | FP4 (E2M1) | 32 | E8M0 |
| MXINT8 | INT8 | 32 | E8M0 |
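As an illustration of how a block quantizer under this spec behaves, here is a hedged NumPy sketch of MXFP4 (E2M1 elements, block size 32, power-of-two E8M0-style scale). Function names are ours and rounding details are simplified relative to the spec.

```python
import numpy as np

# Magnitudes representable in the FP4 E2M1 element format.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
E2M1_EMAX = 2  # largest E2M1 value is 1.5 * 2**2 = 6

def quantize_mxfp4_block(x):
    """Quantize one 32-element block to MXFP4 values + a power-of-two
    scale (simplified sketch of the spec's shared-exponent rule)."""
    assert x.size == 32
    amax = float(np.abs(x).max())
    # Power-of-two shared scale: largest element maps into E2M1 range.
    shared_exp = int(np.floor(np.log2(amax))) - E2M1_EMAX if amax > 0 else -127
    scale = 2.0 ** shared_exp
    scaled = x / scale
    # Round each element to the nearest representable E2M1 magnitude.
    idx = np.abs(np.abs(scaled)[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
    q = np.sign(scaled) * E2M1_GRID[idx]
    return q, scale  # dequantized tensor is q * scale
```

Note that the scale is constrained to a power of two throughout; this is exactly the constraint the "Known accuracy limitation" section below discusses.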
Mixed-precision MMA¶
MXFP explicitly supports mixed-precision MMA instructions on some hardware, e.g. MXFP8 × MXFP4. This gives practitioners a new axis: activations can run at MXFP8 / MXFP6 / MXFP4 while weights stay at MXFP4 — previously a pre-MXFP-only option (A16W4) that cost explicit dequant, now native in hardware (Source: sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai).
E8M0: the scale format¶
E8M0 = 8-bit exponent, 0-bit mantissa. Effectively a pure power-of-two scale: values in [2⁻¹²⁷, 2¹²⁷]. Scales are typically chosen so the block's largest magnitude lands at the top of the element format's range (shared exponent ≈ floor(log₂ max|x|) minus the element format's maximum exponent), so scale values are mostly ≤ 1, and extremely small magnitudes are rarely needed (~2⁻¹⁵ typically suffices for near-zero weights). The representable range is far wider than practice requires, raising the theoretical question of whether fewer scale bits would suffice; the spec nonetheless fixes E8M0 for hardware simplicity.
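A minimal sketch of E8M0 encoding and decoding under a bias of 127 (names are ours; the spec also reserves one byte pattern for NaN, which this sketch omits):

```python
import math

E8M0_BIAS = 127  # stored byte = exponent + 127

def e8m0_encode(scale: float) -> int:
    """Encode a power-of-two scale as an E8M0 byte (sketch, names ours)."""
    e = int(math.log2(scale))
    assert 2.0 ** e == scale and -127 <= e <= 127, "not an E8M0 value"
    return e + E8M0_BIAS

def e8m0_decode(byte: int) -> float:
    """Decode an E8M0 byte back to its power-of-two scale."""
    return 2.0 ** (byte - E8M0_BIAS)
```

The assert makes the limitation visible: any scale that is not an exact power of two is simply unrepresentable.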
Known accuracy limitation¶
Constraining scales strictly to powers of two causes a noticeable accuracy drop at MXFP4. Dropbox's FP4 blog post shows that simple post-training adjustments largely mitigate this loss, restoring most of the original model quality (Source: sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai).
NVFP4 — NVIDIA's refinement¶
NVIDIA introduced NVFP4 as an alternative to MXFP4 with two refinements:
- Smaller group size: 16 (vs MXFP's 32). Finer-grained scaling limits outlier impact.
- E4M3 FP8 scale format (vs E8M0). FP8 provides higher precision for the scale itself; and because FP8's minimum representable magnitude is relatively large, a global per-tensor floating-point multiplier normalizes the scaling range for improved numerical stability.
NVFP4 trades metadata overhead for better numerical stability than MXFP4 — useful at very low bit widths.
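The two refinements can be sketched together in NumPy. This is illustrative only, not NVIDIA's exact recipe: per-16-element group scales are kept within E4M3 dynamic range by a single global FP32 multiplier, and all constants and names are ours.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite FP8 E4M3 value
FP4_MAX = 6.0     # largest FP4 E2M1 element value

def nvfp4_scales(x, group=16):
    """Two-level NVFP4-style scaling (illustrative sketch): per-group
    scales normalized into E4M3 range by one global FP32 multiplier."""
    amax = np.abs(x.reshape(-1, group)).max(axis=1)
    # Global multiplier: maps the largest needed group scale to E4M3_MAX.
    global_scale = float(amax.max()) / (FP4_MAX * E4M3_MAX)
    # Per-group scales, now guaranteed to fit E4M3 dynamic range
    # (a real implementation would round these to actual E4M3 values).
    group_scales = amax / (FP4_MAX * global_scale)
    return global_scale, group_scales  # effective scale = product of both
```

Because the group scales carry a mantissa (unlike E8M0), each group's largest element can be mapped almost exactly onto the FP4 grid maximum, which is where the stability gain over MXFP4 comes from.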
Hardware portability caveat¶
Although MXFP4 and NVFP4 are standardized formats, the instruction-level implementation is architecture-specific:
- sm_100 uses `tcgen05.mma` with the `block_scale` modifier
- sm_120 uses `mma.sync` with the `block_scale` modifier
Kernels compiled for one are not portable to the other. Dropbox notes that Triton has recently added MXFP support on sm_120 — enabling cross-device support for low-bit Triton kernels like Dropbox's gemlite (Source: sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai).
Ecosystem status (2026-02)¶
- Hardware support: NVIDIA Blackwell (B200, B300) is the production target; consumer/workstation sm_120 parts emerging.
- Runtime support: mainstream AI stack focused on server-grade B200/B300; Triton cross-device work recent but promising.
- Model availability: FP4 models are not yet widely available — training + publishing FP4-ready models is still an active area; Dropbox notes this as a production-viability bottleneck alongside runtime support.
Relationship to AWQ/HQQ (pre-MXFP)¶
MXFP quantizes both activations and weights using a micro-scaling approach similar in spirit to AWQ / HQQ — linear quantization with grouping (see patterns/grouped-linear-quantization) — but implemented directly in hardware rather than in software. AWQ/HQQ remain the right tool for older GPUs without MXFP support; MXFP is the forward path on Blackwell+.
Seen in¶
- sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai — Dropbox's landscape survey; canonical articulation of the pre-MXFP vs MXFP split, the MX dtype family (MXFP8/6/4 + MXINT8), E8M0 scaling, mixed-precision MMA support, and the portability caveats across sm_100 / sm_120 that Triton is actively closing.