Fused multiply-add (FMA)
Definition
FMA — fused multiply-add — computes a × b + c as a
single hardware instruction with a single rounding step
applied only to the final result. The IEEE 754-2008 standard
specifies FMA as fusedMultiplyAdd(a, b, c). On x86 this is
exposed via the FMA3 instruction set (vfmadd231pd,
vfmadd213pd, etc.); on ARM via VFPv4 / NEON / SVE equivalents.
Why it matters
A separate multiply and add produces two rounding errors; FMA produces one. For dot products — the inner loop of any matmul, convolution, or norm — accumulating with FMA is both more accurate and faster: one instruction per multiply-add, with one-per-cycle (or better) throughput on modern cores whose FMA units are pipelined.
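The rounding difference is directly observable with Java's Math.fma, the JDK's scalar FMA primitive (available since Java 9). A minimal sketch: the separately rounded multiply discards the exact product's low-order bits, while the fused form recovers them.

```java
public class FmaRounding {
    public static void main(String[] args) {
        double a = 1.0 + Math.ulp(1.0);     // 1 + 2^-52
        double p = a * a;                   // rounded product: the exact 2^-104 term is lost
        double separate = (a * a) - p;      // multiply rounds first, so the residual is exactly 0.0
        double fused = Math.fma(a, a, -p);  // exact a*a + (-p), one rounding: recovers 2^-104
        System.out.println(separate);       // 0.0
        System.out.println(fused);          // 2^-104 (~4.9e-32)
    }
}
```

The fused result here is the exact rounding error of the multiply — a common trick (error-free transformation) that only works because FMA rounds once.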
The throughput win on SIMD
Paired with SIMD, FMA operates on the full lane width:
- On AVX2, vfmadd231pd does 4 × (a×b + c) per instruction (four doubles in a 256-bit register).
- On AVX-512, vfmadd231pd does 8 × (a×b + c) per instruction (eight doubles in a 512-bit register).
Canonical dot-product inner loop (conceptually):
acc = zero_vector
for k in chunks of lane_width:
    a = load_vector(A, k)
    b = load_vector(B, k)
    acc = fma(a, b, acc)   // a*b + acc, per-lane, one instruction
dot = horizontal_sum(acc)
Two memory loads and one FMA per chunk: no dot-product inner loop can do less work per element.
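A scalar Java rendering of the same loop, one Math.fma per element. Note that the JVM will not fuse a plain a[i] * b[i] + sum on its own, since Java's double semantics require the separately rounded result; the explicit Math.fma call is what permits the fused instruction.

```java
public class ScalarDot {
    // Scalar analog of the vector loop above: same arithmetic pattern,
    // without the lane-width parallelism.
    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum = Math.fma(a[i], b[i], sum); // fused a*b + sum, one rounding per element
        }
        return sum;
    }
}
```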
In the JDK Vector API
The JDK Vector API exposes FMA directly on vector types:
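A minimal sketch of that shape (illustrative; the source does not show Netflix's actual kernel, and the species choice and tail handling here are assumptions):

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class DotFma {
    // Widest vector shape the host CPU supports (e.g. 8 doubles on AVX-512).
    static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;

    static double dot(double[] a, double[] b) {
        DoubleVector acc = DoubleVector.zero(SPECIES);
        int i = 0;
        int upper = SPECIES.loopBound(a.length);
        for (; i < upper; i += SPECIES.length()) {
            DoubleVector va = DoubleVector.fromArray(SPECIES, a, i);
            DoubleVector vb = DoubleVector.fromArray(SPECIES, b, i);
            acc = va.fma(vb, acc);                  // per-lane a*b + acc, one FMA
        }
        double sum = acc.reduceLanes(VectorOperators.ADD); // horizontal sum
        for (; i < a.length; i++) {
            sum = Math.fma(a[i], b[i], sum);        // scalar tail
        }
        return sum;
    }
}
```

The API lives in the incubating jdk.incubator.vector module, so building and running require --add-modules jdk.incubator.vector.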
The JIT compiles this to the host CPU's native FMA instruction. Netflix Ranker's serendipity-scoring inner loop uses exactly this shape on AVX-512 Ranker hosts (Source: sources/2026-03-03-netflix-optimizing-recommendation-systems-with-jdks-vector-api).
When you can't use FMA
- No FMA3 / VFPv4 instruction set. Older Intel Atom cores and pre-Piledriver AMD Bulldozer lack FMA3. Pure-software emulation exists but is slower than a separate mul+add.
- Bit-exact reproducibility required against a non-FMA reference. FMA changes the numeric result (fewer rounding errors → not the same final bits as (a*b) + c). Some regulated pipelines require a bitwise match to a historical implementation and have to turn FMA off.
- Languages that don't expose FMA. JavaScript's Math.fma is non-standard and unimplemented in most engines; JavaScript numeric code can't match SIMD+FMA throughput without WebAssembly or equivalent.
Relation to MMA
MMA on GPU Tensor
Cores is FMA-at-tile-scale: C ← A × B + C on a fixed n×n
tile, one instruction, one rounding. Per-lane SIMD FMA is the
CPU analog at narrower granularity — one rounding per (a, b,
acc) lane triple. Both primitives are the same arithmetic
pattern optimized at different tile sizes.
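As a sketch, the tile-scale pattern degenerates to a triple loop of per-element FMAs on a CPU (illustrative only: a Tensor Core applies the whole tile in one instruction, while this loop spends one FMA, and one rounding, per scalar update):

```java
public class TileMma {
    // C <- A*B + C on an n-by-n tile, one Math.fma per scalar update.
    static void tileMma(double[][] A, double[][] B, double[][] C, int n) {
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                for (int k = 0; k < n; k++) {
                    C[i][j] = Math.fma(A[i][k], B[k][j], C[i][j]);
                }
            }
        }
    }
}
```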
Seen in
- sources/2026-03-03-netflix-optimizing-recommendation-systems-with-jdks-vector-api
— Netflix's Vector API kernel uses a.fma(b, acc) as the core accumulator inside the batched-matmul dot-product inner loop on AVX-512 Ranker hosts. One instruction per lane-width chunk replaced many scalar multiply-adds; this contributed to the 7.5% → ~1% per-operator CPU-share drop on Ranker.
Related
- concepts/simd-vectorization — the host for FMA instructions in production hot paths.
- concepts/matrix-multiplication-accumulate — GPU Tensor Core MMA, the fixed-tile analog.
- concepts/cosine-similarity — FMA is the per-lane accumulator that makes SIMD cosine-similarity fast.
- systems/jdk-vector-api — Java API exposing FMA via Vector.fma(v, acc).
- companies/netflix — canonical wiki adopter via Ranker.