CONCEPT Cited by 1 source

Matrix multiply-accumulate (MMA)

Definition

Matrix multiply-accumulate (MMA) is the fused primitive C ← A × B + C at fixed hardware-defined tile sizes, exposed by NVIDIA Tensor Cores and AMD Matrix Cores via dedicated instructions. It's the dominant cost in attention-based model inference — both the linear layers (attention block projections, MLPs, final output) and the attention mechanism itself reduce to chains of MMAs (Source: sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai).
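The primitive can be sketched in plain NumPy: a hardware MMA consumes fixed-size operand tiles and accumulates into C, and a full GEMM is just a chain of these tile-level MMAs over the K dimension. (The 16×16 tile below is illustrative; real tile shapes are architecture-defined.)

```python
import numpy as np

def mma_tile(A_tile, B_tile, C_tile):
    """One fused multiply-accumulate on fixed-size tiles: C <- A @ B + C."""
    return A_tile @ B_tile + C_tile

def tiled_matmul(A, B, T=16):
    """A full GEMM expressed as a chain of tile-level MMAs.

    Illustrative only: assumes all dimensions divide the tile size T.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % T == 0 and K % T == 0 and N % T == 0
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, T):
        for j in range(0, N, T):
            for k in range(0, K, T):  # accumulate along K via repeated MMAs
                C[i:i+T, j:j+T] = mma_tile(
                    A[i:i+T, k:k+T], B[k:k+T, j:j+T], C[i:i+T, j:j+T])
    return C
```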

Hardware contract

MMA instructions are the narrow interface between quantization format design and hardware throughput. Format ecosystems succeed or fail on whether MMAs can consume them:

  • FP16 / BF16 — long-standing Tensor Core support; the baseline for attention-based model training/inference.
  • FP8 — supported on Hopper + later; ~2× FLOPS vs FP16.
  • FP4 — supported on Blackwell (sm_100+); ~2× FLOPS vs FP8.
  • MXFP / NVFP — hardware-native quantized formats with block_scale modifier on the MMA instruction; see patterns/hardware-native-quantization.

As precision is halved, these cores perform roughly twice the matrix operations per second — the core economic engine behind concepts/low-bit-inference.

Instruction variants across compute capabilities

Different NVIDIA compute capabilities expose MMA through different instructions, and kernel portability is constrained:

  • sm_100 (Blackwell datacenter) — tcgen05.mma is the block-scaled MMA instruction; used for MXFP/NVFP workloads on B200 / B300.
  • sm_120 — mma.sync with the block_scale modifier is the analogous form.

Both include block_scale to fuse MXFP scaling metadata directly into the matrix operation. Kernels compiled for sm_100 aren't portable to sm_120 and vice versa due to instruction-level differences (Source: sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai). Triton added MXFP support on sm_120, improving cross-device portability for low-bit Triton kernels.
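The portability constraint amounts to a dispatch on compute capability. A minimal sketch (the function name and string labels are illustrative, not a real runtime API; instruction names are the ones named above):

```python
def select_mma_instruction(sm: int) -> str:
    """Pick the block-scaled MMA form for an NVIDIA compute capability.

    Illustrative dispatch only: a kernel built around one form does not
    run on the other architecture.
    """
    if sm >= 120:
        return "mma.sync + block_scale"  # sm_120
    if sm >= 100:
        return "tcgen05.mma"             # sm_100 (B200 / B300)
    raise ValueError(f"no block-scaled MMA on sm_{sm}")
```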

Why pre-MXFP quantization paid a tax

Under pre-MXFP formats with different bit-widths for activations and weights (e.g. FP16 × 4-bit-int), the kernel had to dequantize the 4-bit weights up to FP16 before issuing the MMA. The MMA still ran at FP16 throughput; the dequant was pure arithmetic overhead on the MMA path. In memory-bound regimes this was a net win (less data moved); in compute-bound regimes it could be a net loss (MMA wasn't narrower, but dequant stole cycles).
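The pre-MXFP path can be sketched as quantize-offline, dequantize-in-kernel, then matmul at full width. (Illustrative symmetric per-group int4 quantization; real kernels vary in grouping and scale format.)

```python
import numpy as np

def quantize_int4(W, group=32):
    """Symmetric per-group 4-bit quantization: q in [-7, 7] plus a float scale."""
    Wg = W.reshape(-1, group)
    # epsilon guards all-zero groups
    scale = np.abs(Wg).max(axis=1, keepdims=True) / 7.0 + 1e-12
    q = np.clip(np.round(Wg / scale), -7, 7).astype(np.int8)
    return q, scale

def matmul_with_dequant(X, q, scale, w_shape):
    """Pre-MXFP path: dequantize int4 weights to float *before* the MMA.

    The extra multiply is the software overhead described above; the
    matmul itself still runs at the wide (FP16-like) width.
    """
    W = (q.astype(X.dtype) * scale).reshape(w_shape)  # software dequant step
    return X @ W                                      # MMA at full width
```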

MXFP's architectural fix is to fuse scaling into the MMA so the instruction itself consumes packed low-bit operands plus shared scales, eliminating the software dequant step. That's what patterns/hardware-native-quantization achieves.
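What the fused form computes can be emulated in NumPy. This is a semantic sketch, not the hardware datapath: MX-style blocks along K share one scale (block size 32 per the OCP MX convention), and the block_scale modifier applies those scales inside the MMA rather than in a separate software pass.

```python
import numpy as np

def block_scaled_mma(qA, sA, qB, sB, C, block=32):
    """Emulate C <- (sA * qA) @ (sB * qB) + C.

    qA: (M, K) low-bit codes, sA: (M, K // block) shared scales per K-block;
    qB: (K, N) low-bit codes, sB: (K // block, N) shared scales per K-block.
    Hardware consumes the packed codes and scales directly; the explicit
    broadcast below is only to model the math.
    """
    A = qA.astype(np.float32) * np.repeat(sA, block, axis=1)
    B = qB.astype(np.float32) * np.repeat(sB, block, axis=0)
    return A @ B + C
```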
