
MXFP microscaling formats

Definition

MXFP (Microscaling Formats) is an Open Compute Project spec (OCP Microscaling Formats MX Specification v1.0) standardizing a family of low-bit data types with native hardware support on GPU matrix units. MXFP types include MXFP8 / MXFP6 / MXFP4 (floating-point) and MXINT8 (integer), each using symmetric quantization with a fixed block size of 32 and a shared per-block scaling factor in E8M0 format. Tensor Cores operate directly on packed MXFP operands and their block scales in a single fused MMA instruction, with no software dequantization step (Source: sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai).

Why it matters

MXFP is the architectural break between pre-MXFP quantization (software-managed scaling + explicit dequantization before each MMA) and hardware-native quantization (MMAs consume quantized operands directly). Concretely:

  • Pre-MXFP: kernel loads 4-bit weights → unpacks → rescales to FP16 → MMA at FP16. The dequant is pure overhead on the MMA path; in compute-bound regimes it makes A16W4 slower than 16-bit matmul.
  • MXFP: kernel issues one block-scaled MMA instruction (tcgen05.mma on sm_100, mma.sync with block_scale on sm_120) that consumes MXFP operands + E8M0 scales + runs the MMA, all fused. The dequant tax is gone.
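The pre-MXFP "dequant tax" can be made concrete with a minimal numpy sketch of the A16W4 path. The nibble packing layout, zero-point of 8, and group size here are illustrative assumptions, not the exact layout any particular kernel uses:

```python
import numpy as np

def dequant_int4(w_packed, scales, group_size=32):
    """Software dequant: unpack two int4 values per byte, rescale to fp16.
    This work runs on every forward pass, before the MMA can start."""
    lo = (w_packed & 0x0F).astype(np.int8) - 8       # low nibble, zero-point 8
    hi = ((w_packed >> 4) & 0x0F).astype(np.int8) - 8  # high nibble
    w = np.empty(w_packed.shape[:-1] + (w_packed.shape[-1] * 2,), np.int8)
    w[..., 0::2] = lo
    w[..., 1::2] = hi
    # one fp16 scale per group of int4 values
    g = w.reshape(*w.shape[:-1], -1, group_size)
    return (g * scales[..., None]).reshape(w.shape).astype(np.float16)

def a16w4_matmul(x_fp16, w_packed, scales, group_size=32):
    # pre-MXFP: explicit dequantize, then a plain 16-bit matmul
    return x_fp16 @ dequant_int4(w_packed, scales, group_size).T
```

On MXFP hardware the entire body of `dequant_int4` disappears: the block-scaled MMA instruction consumes the packed operands and scales directly.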

Format family

Per the OCP spec (Figure 4 in the Dropbox post, sourced from Table 1 of the spec PDF):

Type     Element format       Block size   Scale format
MXFP8    FP8 (E5M2 or E4M3)   32           E8M0
MXFP6    FP6 (E3M2 or E2M3)   32           E8M0
MXFP4    FP4 (E2M1)           32           E8M0
MXINT8   INT8                 32           E8M0
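The MXFP4 element format (E2M1: 1 sign, 2 exponent, 1 mantissa bit) has only 16 code points, which makes its coarseness easy to see. A small sketch enumerating the positive values, assuming exponent bias 1 per the spec's E2M1 definition:

```python
def e2m1_values():
    """Enumerate all non-negative FP4 (E2M1) values."""
    bias = 1
    vals = set()
    for exp in range(4):           # 2 exponent bits
        for man in range(2):       # 1 mantissa bit
            if exp == 0:           # subnormal: no implicit leading 1
                v = (man / 2) * 2 ** (1 - bias)
            else:                  # normal: implicit leading 1
                v = (1 + man / 2) * 2 ** (exp - bias)
            vals.add(v)
    return sorted(vals)

print(e2m1_values())  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

Eight magnitudes per sign, with a maximum of 6.0; everything beyond that must be absorbed by the block's shared E8M0 scale.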

Mixed-precision MMA

MXFP explicitly supports mixed-precision MMA instructions on some hardware, e.g. MXFP8 × MXFP4. This gives practitioners a new axis: activations can run at MXFP8 / MXFP6 / MXFP4 while weights stay at MXFP4. Mixed activation/weight precision was previously only available via the software A16W4 path with its explicit dequant cost; now it is native in hardware (Source: sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai).

E8M0: the scale format

E8M0 = 8-bit exponent, 0-bit mantissa. Effectively a power-of-two scale: values in [2⁻¹²⁷, 2¹²⁷]. Scales are typically chosen as:

scale = weight.amax(axis=1, keepdim=True) / max_val

so scale values are mostly ≤ 1, and extremely small magnitudes are rarely needed (~2⁻¹⁵ is typically sufficient for near-zero weights). The representable range is far wider than practice requires, which raises the question of whether fewer scale bits would suffice; the spec nevertheless fixes E8M0 for hardware simplicity.
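A minimal numpy sketch of per-block E8M0 scale selection, extending the snippet above. Rounding the scale up to the next power of two (so no element clips) and the saturation range are assumptions of this sketch, not mandated by the spec:

```python
import numpy as np

def e8m0_block_scales(weight, block_size=32, max_val=6.0):
    """Per-block power-of-two (E8M0) scales.
    max_val=6.0 is the largest FP4 (E2M1) magnitude."""
    blocks = weight.reshape(-1, block_size)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    raw = amax / max_val
    # E8M0 stores only an exponent: round the scale up to a power of two
    # so the block's largest element still fits after quantization.
    exp = np.ceil(np.log2(np.maximum(raw, 2.0 ** -127)))
    return 2.0 ** np.clip(exp, -127, 127)
```

The `np.clip(..., -127, 127)` line is where the spec's huge [2⁻¹²⁷, 2¹²⁷] range shows up; in practice almost all exponents land in a narrow band near zero.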

Known accuracy limitation

Constraining scales strictly to powers of two causes a noticeable accuracy drop at MXFP4. Dropbox's FP4 blog post shows that simple post-training adjustments largely mitigate this loss, restoring most of the original model quality (Source: sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai).

NVFP4 — NVIDIA's refinement

NVIDIA introduced NVFP4 as an alternative to MXFP4 with two refinements:

  1. Smaller group size: 16 (vs MXFP's 32). Finer-grained scaling limits outlier impact.
  2. E4M3 FP8 scale format (vs E8M0). FP8 gives the scale itself more precision than a bare power of two; because FP8's smallest representable magnitude is relatively large, a global per-tensor floating-point multiplier normalizes the scaling range so the per-group scales stay representable, improving numerical stability.

NVFP4 trades metadata overhead for better numerical stability than MXFP4 — useful at very low bit widths.
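The two-level scaling idea can be sketched in numpy. The normalization constant and the division of labor between the global fp32 multiplier and the per-group scales are assumptions of this sketch (actual FP8 rounding of the group scales is omitted):

```python
import numpy as np

FP8_E4M3_MAX = 448.0   # largest E4M3 magnitude
FP4_MAX = 6.0          # largest E2M1 magnitude

def nvfp4_scales(weight, group_size=16):
    """NVFP4-style two-level scaling: a per-tensor fp32 multiplier
    normalizes the tensor so per-group scales fit in fp8 (E4M3)."""
    amax = np.abs(weight).max()
    global_scale = amax / (FP8_E4M3_MAX * FP4_MAX)      # per-tensor, fp32
    groups = weight.reshape(-1, group_size)
    group_amax = np.abs(groups).max(axis=1)
    group_scales = (group_amax / FP4_MAX) / global_scale  # stored as E4M3
    return global_scale, group_scales
```

Compared with MXFP4, this stores twice as many group scales (group size 16 vs 32) plus one global scalar: the metadata overhead mentioned above.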

Hardware portability caveat

Although MXFP4 and NVFP4 are standardized formats, the instruction-level implementation is architecture-specific:

  • sm_100 uses tcgen05.mma with block_scale modifier
  • sm_120 uses mma.sync with block_scale modifier

Kernels compiled for one are not portable to the other. Dropbox notes that Triton has recently added MXFP support on sm_120 — enabling cross-device support for low-bit Triton kernels like Dropbox's gemlite (Source: sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai).

Ecosystem status (2026-02)

  • Hardware support: NVIDIA Blackwell (B200, B300) is the production target; consumer/workstation sm_120 parts emerging.
  • Runtime support: mainstream AI stack focused on server-grade B200/B300; Triton cross-device work recent but promising.
  • Model availability: FP4 models are not yet widely available — training + publishing FP4-ready models is still an active area; Dropbox notes this as a production-viability bottleneck alongside runtime support.

Relationship to AWQ/HQQ (pre-MXFP)

MXFP quantizes both activations and weights using a micro-scaling approach similar in spirit to AWQ / HQQ — linear quantization with grouping (see patterns/grouped-linear-quantization) — but implemented directly in hardware rather than in software. AWQ/HQQ remain the right tool for older GPUs without MXFP support; MXFP is the forward path on Blackwell+.
