CONCEPT Cited by 1 source

Matrix multiply-accumulate (MMA)

Definition

Matrix multiply-accumulate (MMA) is the fused primitive C ← A × B + C at fixed hardware-defined tile sizes, exposed by NVIDIA Tensor Cores and AMD Matrix Cores via dedicated instructions. It's the dominant cost in attention-based model inference — both the linear layers (attention block projections, MLPs, final output) and the attention mechanism itself reduce to chains of MMAs (Source: sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai).
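The primitive can be sketched in plain NumPy: a hardware MMA consumes fixed-size operand tiles and accumulates into C, and a full GEMM is just a chain of these tile-level MMAs over the K dimension. (The 16×16 tile below is illustrative; real tile shapes are architecture-defined.)

```python
import numpy as np

def mma_tile(A_tile, B_tile, C_tile):
    """One fused multiply-accumulate on fixed-size tiles: C <- A @ B + C."""
    return A_tile @ B_tile + C_tile

def tiled_matmul(A, B, T=16):
    """A full GEMM expressed as a chain of tile-level MMAs.

    Illustrative only: assumes all dimensions divide the tile size T.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % T == 0 and K % T == 0 and N % T == 0
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, T):
        for j in range(0, N, T):
            for k in range(0, K, T):  # accumulate along K via repeated MMAs
                C[i:i+T, j:j+T] = mma_tile(
                    A[i:i+T, k:k+T], B[k:k+T, j:j+T], C[i:i+T, j:j+T])
    return C
```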

Hardware contract

MMA instructions are the narrow interface between quantization format design and hardware throughput. Format ecosystems succeed or fail on whether MMAs can consume them:

  • FP16 / BF16 — long-standing Tensor Core support; the baseline for attention-based model training/inference.
  • FP8 — supported on Hopper + later; ~2× FLOPS vs FP16.
  • FP4 — supported on Blackwell (sm_100+); ~2× FLOPS vs FP8.
  • MXFP / NVFP — hardware-native quantized formats with block_scale modifier on the MMA instruction; see patterns/hardware-native-quantization.

As precision is halved, these cores perform roughly twice the matrix operations per second — the core economic engine behind concepts/low-bit-inference.

Instruction variants across compute capabilities

Different NVIDIA compute capabilities expose MMA through different instructions, and kernel portability is constrained:

  • sm_100 (Blackwell datacenter) — tcgen05.mma is the block-scaled MMA instruction; used for MXFP/NVFP workloads on B200 / B300.
  • sm_120 — mma.sync with the block_scale modifier is the analogous form.

Both include block_scale to fuse MXFP scaling metadata directly into the matrix operation. Kernels compiled for sm_100 aren't portable to sm_120 and vice versa due to instruction-level differences (Source: sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai). Triton added MXFP support on sm_120, improving cross-device portability for low-bit Triton kernels.
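The portability constraint amounts to a dispatch on compute capability. A minimal sketch (the function name and string labels are illustrative, not a real runtime API; instruction names are the ones named above):

```python
def select_mma_instruction(sm: int) -> str:
    """Pick the block-scaled MMA form for an NVIDIA compute capability.

    Illustrative dispatch only: a kernel built around one form does not
    run on the other architecture.
    """
    if sm >= 120:
        return "mma.sync + block_scale"  # sm_120
    if sm >= 100:
        return "tcgen05.mma"             # sm_100 (B200 / B300)
    raise ValueError(f"no block-scaled MMA on sm_{sm}")
```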

Why pre-MXFP quantization paid a tax

Under pre-MXFP formats with different bit-widths for activations and weights (e.g. FP16 × 4-bit-int), the kernel had to dequantize the 4-bit weights up to FP16 before issuing the MMA. The MMA still ran at FP16 throughput; the dequant was pure arithmetic overhead on the MMA path. In memory-bound regimes this was a net win (less data moved); in compute-bound regimes it could be a net loss (MMA wasn't narrower, but dequant stole cycles).
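The pre-MXFP path can be sketched as quantize-offline, dequantize-in-kernel, then matmul at full width. (Illustrative symmetric per-group int4 quantization; real kernels vary in grouping and scale format.)

```python
import numpy as np

def quantize_int4(W, group=32):
    """Symmetric per-group 4-bit quantization: q in [-7, 7] plus a float scale."""
    Wg = W.reshape(-1, group)
    # epsilon guards all-zero groups
    scale = np.abs(Wg).max(axis=1, keepdims=True) / 7.0 + 1e-12
    q = np.clip(np.round(Wg / scale), -7, 7).astype(np.int8)
    return q, scale

def matmul_with_dequant(X, q, scale, w_shape):
    """Pre-MXFP path: dequantize int4 weights to float *before* the MMA.

    The extra multiply is the software overhead described above; the
    matmul itself still runs at the wide (FP16-like) width.
    """
    W = (q.astype(X.dtype) * scale).reshape(w_shape)  # software dequant step
    return X @ W                                      # MMA at full width
```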

MXFP's architectural fix is to fuse scaling into the MMA so the instruction itself consumes packed low-bit operands plus shared scales, eliminating the software dequant step. That's what patterns/hardware-native-quantization achieves.
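What the fused form computes can be emulated in NumPy. This is a semantic sketch, not the hardware datapath: MX-style blocks along K share one scale (block size 32 per the OCP MX convention), and the block_scale modifier applies those scales inside the MMA rather than in a separate software pass.

```python
import numpy as np

def block_scaled_mma(qA, sA, qB, sB, C, block=32):
    """Emulate C <- (sA * qA) @ (sB * qB) + C.

    qA: (M, K) low-bit codes, sA: (M, K // block) shared scales per K-block;
    qB: (K, N) low-bit codes, sB: (K // block, N) shared scales per K-block.
    Hardware consumes the packed codes and scales directly; the explicit
    broadcast below is only to model the math.
    """
    A = qA.astype(np.float32) * np.repeat(sA, block, axis=1)
    B = qB.astype(np.float32) * np.repeat(sB, block, axis=0)
    return A @ B + C
```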
