

Fused multiply-add (FMA)

Definition

FMA (fused multiply-add) — computes a × b + c as a single hardware instruction, with a single rounding step applied only to the final result. The IEEE 754-2008 standard specifies FMA as fusedMultiplyAdd(a, b, c). On x86 this is exposed via the FMA3 instruction set (vfmadd231pd, vfmadd213pd, etc.); on ARM via VFPv4 / NEON / SVE equivalents.

Why it matters

Separate multiply and add:

t = a * b    // round #1 (loses precision)
c = t + c    // round #2 (loses precision again)

produces two rounding errors. FMA:

c = a * b + c  // round once, at the end

produces one. For dot products — the inner loop of any matmul, convolution, or norm — accumulating with FMA is both more accurate and faster (one instruction, one cycle on modern cores with pipelined FMA units).
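The one-rounding difference is directly observable from plain Java via Math.fma (standard since Java 9). A minimal sketch — the class name and the chosen operands are illustrative:

```java
public class FmaDemo {
    // x = 1 + 2^-52: its exact square needs 105 bits of significand,
    // so the plain product x*x rounds away the low term eps^2 = 2^-104.
    static final double EPS = Math.ulp(1.0);   // 2^-52
    static final double X = 1.0 + EPS;
    static final double R = X * X;             // rounded product: 1 + 2^-51

    static double naiveResidual() {
        return X * X - R;                      // 0.0 — two roundings, error invisible
    }

    static double fusedResidual() {
        return Math.fma(X, X, -R);             // 2^-104 — exact residual, one rounding
    }

    public static void main(String[] args) {
        System.out.println(naiveResidual());
        System.out.println(fusedResidual());
    }
}
```

This "exact residual" trick is the basis of compensated algorithms such as two-product summation: FMA recovers the error term the separate multiply threw away.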

The throughput win on SIMD

Paired with SIMD, FMA operates on the full lane width:

  • On AVX2, vfmadd231pd does 4 × (a×b + c) per instruction.
  • On AVX-512, vfmadd231pd does 8 × (a×b + c) per instruction.

Canonical dot-product inner loop (conceptually):

acc = zero_vector
for k in chunks of lane_width:
    a = load_vector(A, k)
    b = load_vector(B, k)
    acc = fma(a, b, acc)    // a*b + acc, per-lane, one instruction
dot = horizontal_sum(acc)

Two memory loads and one FMA per chunk — the minimum instruction count, and therefore the maximum arithmetic density, a dot product can achieve.
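A scalar reference version of the loop above, with Math.fma standing in for the per-lane instruction (class name illustrative):

```java
public class ScalarDot {
    // Scalar analog of the vector inner loop: one fused multiply-add per
    // element, so no product is rounded before it reaches the accumulator.
    static double dot(double[] a, double[] b) {
        double acc = 0.0;
        for (int k = 0; k < a.length; k++)
            acc = Math.fma(a[k], b[k], acc);   // a[k]*b[k] + acc, rounded once
        return acc;
    }
}
```

The vector form replaces each scalar iteration with a lane_width-wide chunk and adds one horizontal reduction at the end.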

In the JDK Vector API

The JDK Vector API (jdk.incubator.vector) exposes FMA directly on vector types:

acc = a.fma(b, acc);  // a * b + acc, per-lane, one rounding

The JIT compiles this to the host CPU's native FMA instruction. Netflix Ranker's serendipity-scoring inner loop uses exactly this shape on AVX-512 Ranker hosts (Source: sources/2026-03-03-netflix-optimizing-recommendation-systems-with-jdks-vector-api).
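A fuller sketch of that loop shape, assuming the incubator module (compile and run with --add-modules jdk.incubator.vector); the class name FmaDot is illustrative:

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class FmaDot {
    static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;

    static double dot(double[] a, double[] b) {
        var acc = DoubleVector.zero(SPECIES);
        int i = 0;
        int upper = SPECIES.loopBound(a.length);
        for (; i < upper; i += SPECIES.length()) {
            var va = DoubleVector.fromArray(SPECIES, a, i);
            var vb = DoubleVector.fromArray(SPECIES, b, i);
            acc = va.fma(vb, acc);             // per-lane a*b + acc, one instruction
        }
        double sum = acc.reduceLanes(VectorOperators.ADD);
        for (; i < a.length; i++)              // scalar tail for the remainder
            sum = Math.fma(a[i], b[i], sum);
        return sum;
    }
}
```

SPECIES_PREFERRED picks the widest species the host supports, so the same source compiles to 4-lane FMA on AVX2 and 8-lane FMA on AVX-512.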

When you can't use FMA

  • No FMA3 / VFPv4 instruction set. Older Intel Atom cores and pre-Piledriver AMD chips lack FMA3 (Bulldozer shipped FMA4 only). Pure-software emulation exists but is slower than a separate mul+add.
  • Bit-exact reproducibility required against a non-FMA reference. FMA changes the numeric result (fewer rounding errors → not the same final bits as (a*b) + c). Some regulated pipelines require bitwise match to a historical implementation and have to turn FMA off.
  • Languages that don't expose FMA. JavaScript has no standard Math.fma (it exists only as a proposal, unimplemented in mainstream engines); JavaScript numeric code can't match SIMD+FMA throughput without WebAssembly or equivalent.

Relation to MMA

MMA on GPU Tensor Cores is FMA-at-tile-scale: C ← A × B + C on a fixed n×n tile in one instruction, with products accumulated internally rather than rounded and stored individually. Per-lane SIMD FMA is the CPU analog at narrower granularity — one rounding per (a, b, acc) lane triple. Both primitives are the same arithmetic pattern optimized at different tile sizes.
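The tile-scale pattern can be sketched in scalar Java, with Math.fma standing in for the hardware MMA (this sketch rounds once per fma rather than reproducing tensor-core internal accumulation; names and tile size are illustrative):

```java
public class TileMma {
    // C <- A*B + C on an n x n tile, accumulating every output element
    // with fused multiply-adds -- the arithmetic pattern a tensor-core
    // MMA instruction executes over the whole tile in one shot.
    static void mmaTile(double[][] A, double[][] B, double[][] C, int n) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double acc = C[i][j];
                for (int k = 0; k < n; k++)
                    acc = Math.fma(A[i][k], B[k][j], acc);
                C[i][j] = acc;
            }
    }
}
```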
