Matrix multiply-accumulate (MMA)¶
Definition¶
Matrix multiply-accumulate (MMA) is the fused primitive C ←
A × B + C at fixed hardware-defined tile sizes, exposed by
NVIDIA Tensor Cores and AMD Matrix Cores via dedicated
instructions. It's the dominant cost in attention-based model
inference — both the linear layers (attention block projections,
MLPs, final output) and the attention mechanism itself reduce to
chains of MMAs (Source:
sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai).
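To make the primitive concrete, here is a minimal software model of one tile-sized MMA. The tile size, names, and plain-Python loop are illustrative only (real Tensor Core tiles are shapes like m16n8k16, executed in hardware, not loops); what it shows is the fused C ← A × B + C semantics and why accumulation lets long reductions run as chains of MMAs:

```python
# Illustrative software model of the MMA primitive C <- A x B + C at a
# fixed tile size. TILE = 4 is hypothetical; hardware tiles are shapes
# like m16n8k16 and execute as a single instruction, not a loop.
TILE = 4

def mma_tile(A, B, C):
    """One fused multiply-accumulate on a TILE x TILE tile: C <- A @ B + C."""
    for i in range(TILE):
        for j in range(TILE):
            acc = C[i][j]  # accumulate into C, don't overwrite it
            for k in range(TILE):
                acc += A[i][k] * B[k][j]
            C[i][j] = acc
    return C

A = [[1.0] * TILE for _ in range(TILE)]
B = [[2.0] * TILE for _ in range(TILE)]
C = [[0.0] * TILE for _ in range(TILE)]

mma_tile(A, B, C)  # every C[i][j] is now 8.0 (four 1*2 products accumulated)
mma_tile(A, B, C)  # chaining a second tile along k: every C[i][j] is 16.0
```

Because C is an in/out accumulator, a large matrix product decomposes into a chain of these tile MMAs along the reduction dimension, which is exactly how the linear layers and attention reduce to MMA chains.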
Hardware contract¶
MMA instructions are the narrow interface between quantization format design and hardware throughput. Format ecosystems succeed or fail on whether MMAs can consume them:
- FP16 / BF16 — long-standing Tensor Core support; the baseline for attention-based model training/inference.
- FP8 — supported on Hopper and later; ~2× FLOPS vs FP16.
- FP4 — supported on Blackwell (sm_100+); ~2× FLOPS vs FP8.
- MXFP / NVFP — hardware-native quantized formats with a `block_scale` modifier on the MMA instruction; see patterns/hardware-native-quantization.
As precision is halved, these cores perform roughly twice the matrix operations per second — the core economic engine behind concepts/low-bit-inference.
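The mechanism behind that doubling can be sketched with back-of-the-envelope arithmetic. The 128-bit datapath width below is a hypothetical stand-in, not a vendor spec; the point is that halving operand width doubles how many elements (and thus multiply-accumulates) fit through a fixed-width datapath per cycle:

```python
# Illustrative arithmetic only: DATAPATH_BITS is a hypothetical fixed
# operand width, not a published hardware parameter.
DATAPATH_BITS = 128

widths = {"FP16": 16, "FP8": 8, "FP4": 4}
elems = {name: DATAPATH_BITS // bits for name, bits in widths.items()}
# elems == {"FP16": 8, "FP8": 16, "FP4": 32}: each format halving doubles
# the elements packed per register, hence roughly 2x MACs per cycle.
```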
Instruction variants across compute capabilities¶
Different NVIDIA compute capabilities expose MMA through different instructions, and kernel portability is constrained:
- `sm_100` (Blackwell datacenter) — `tcgen05.mma` is the block-scaled MMA instruction; used for MXFP/NVFP workloads on B200 / B300.
- `sm_120` — `mma.sync` with the `block_scale` modifier is the analogous form.
Both include `block_scale` to fuse MXFP scaling metadata directly into the matrix operation. Kernels compiled for `sm_100` aren't portable to `sm_120` and vice versa due to instruction-level differences (Source: sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai).
Triton added MXFP support on `sm_120`, improving cross-device portability for low-bit Triton kernels.
Why pre-MXFP quantization paid a tax¶
Under pre-MXFP formats with different bit-widths for activations and weights (e.g. FP16 × 4-bit-int), the kernel had to dequantize the 4-bit weights up to FP16 before issuing the MMA. The MMA still ran at FP16 throughput; the dequant was pure arithmetic overhead on the MMA path. In memory-bound regimes this was a net win (less data moved); in compute-bound regimes it could be a net loss (MMA wasn't narrower, but dequant stole cycles).
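A minimal sketch of that tax, with all names and numbers hypothetical: the 4-bit weights are expanded to full-width floats in software before the dot product, so the (modeled) MMA still consumes FP16-width operands:

```python
# Sketch of the pre-MXFP path. dequantize_int4 and dot_fp16 are
# illustrative stand-ins, not real kernel APIs.

def dequantize_int4(q_weights, scale):
    """Software dequant: extra arithmetic sitting on the MMA critical path."""
    return [q * scale for q in q_weights]

def dot_fp16(a, b):
    """Stands in for the FP16 MMA: both operands are full-width floats."""
    return sum(x * y for x, y in zip(a, b))

activations = [1.0, 2.0, 3.0, 4.0]   # FP16 activations
q_weights = [7, -8, 3, 5]            # int4 values in [-8, 7]
scale = 0.25                         # shared per-group scale

w = dequantize_int4(q_weights, scale)  # the tax: runs before every MMA
y = dot_fp16(activations, w)           # y == 5.0
```

Memory traffic shrank (weights moved as 4-bit values), but the MMA itself got no faster: it still ran at FP16 throughput, with the dequant loop stealing cycles in compute-bound regimes.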
MXFP's architectural fix is to fuse scaling into the MMA so the instruction itself consumes packed low-bit operands plus shared scales, eliminating the software dequant step. That's what patterns/hardware-native-quantization achieves.
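The fused alternative can be modeled the same way (again a software sketch, not real hardware behavior): the operation consumes the packed low-bit operands plus one shared scale per block, and the scales are folded into the accumulation rather than applied in a separate pass:

```python
# Software model of a block-scaled MMA (names and BLOCK size hypothetical;
# MXFP formats use e.g. 32-element blocks).
BLOCK = 4

def block_scaled_dot(q_a, scale_a, q_b, scale_b):
    """Models an MMA with a block_scale modifier: the inner loop runs on
    low-bit integers; the shared scales are fused into the result."""
    acc = sum(qa * qb for qa, qb in zip(q_a, q_b))  # low-bit arithmetic
    return acc * scale_a * scale_b                  # scales applied once

q_act = [4, 8, 12, 16]   # quantized activations (one block), scale 0.25
q_wgt = [7, -8, 3, 5]    # quantized weights (one block), scale 0.25
y = block_scaled_dot(q_act, 0.25, q_wgt, 0.25)  # y == 5.0
```

The result matches dequantize-then-multiply, but the expensive inner loop never touches full-width operands, which is why a hardware MMA built this way runs at the narrower format's throughput.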
Seen in¶
- sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai — MMAs are the workload, and quantization formats live or die on Tensor Core MMA support; canonical source for the pre-MXFP vs MXFP MMA-integration split, instruction-level portability caveats, and attention's use of 8-bit MMA in Flash Attention 3 / Sage Attention.