Fused multiply-add (FMA)
Definition
FMA — fused multiply-add — computes a × b + c as a
single hardware instruction with a single rounding step
applied only to the final result. The IEEE 754-2008 standard
specifies FMA as fusedMultiplyAdd(a, b, c). On x86 this is
exposed via the FMA3 instruction set (vfmadd231pd,
vfmadd213pd, etc.); on ARM via VFPv4 / NEON / SVE equivalents.
Why it matters
A separate multiply and add produces two rounding errors; FMA produces one. For dot products — the inner loop of any matmul, convolution, or norm — accumulating with FMA is both more accurate and faster: one instruction per multiply-add, with one-per-cycle (or better) throughput on modern cores whose FMA units are pipelined.
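The rounding difference is directly observable with Java's Math.fma, the JDK's scalar FMA primitive (available since Java 9). A minimal sketch: the separately rounded multiply discards the exact product's low-order bits, while the fused form recovers them.

```java
public class FmaRounding {
    public static void main(String[] args) {
        double a = 1.0 + Math.ulp(1.0);     // 1 + 2^-52
        double p = a * a;                   // rounded product: the exact 2^-104 term is lost
        double separate = (a * a) - p;      // multiply rounds first, so the residual is exactly 0.0
        double fused = Math.fma(a, a, -p);  // exact a*a + (-p), one rounding: recovers 2^-104
        System.out.println(separate);       // 0.0
        System.out.println(fused);          // 2^-104 (~4.9e-32)
    }
}
```

The fused result here is the exact rounding error of the multiply — a common trick (error-free transformation) that only works because FMA rounds once.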
The throughput win on SIMD
Paired with SIMD, FMA operates on the full lane width:
- On AVX2, vfmadd231pd does 4 × (a×b + c) per instruction (four doubles in a 256-bit register).
- On AVX-512, vfmadd231pd does 8 × (a×b + c) per instruction (eight doubles in a 512-bit register).
Canonical dot-product inner loop (conceptually):
acc = zero_vector
for k in chunks of lane_width:
    a = load_vector(A, k)
    b = load_vector(B, k)
    acc = fma(a, b, acc)   // a*b + acc, per-lane, one instruction
dot = horizontal_sum(acc)
Two memory loads and one FMA per chunk: no dot-product inner loop can do less work per element.
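A scalar Java rendering of the same loop, one Math.fma per element. Note that the JVM will not fuse a plain a[i] * b[i] + sum on its own, since Java's double semantics require the separately rounded result; the explicit Math.fma call is what permits the fused instruction.

```java
public class ScalarDot {
    // Scalar analog of the vector loop above: same arithmetic pattern,
    // without the lane-width parallelism.
    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum = Math.fma(a[i], b[i], sum); // fused a*b + sum, one rounding per element
        }
        return sum;
    }
}
```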
In the JDK Vector API
The JDK Vector API exposes FMA directly on vector types:
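A minimal sketch of that shape (illustrative; the source does not show Netflix's actual kernel, and the species choice and tail handling here are assumptions):

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class DotFma {
    // Widest vector shape the host CPU supports (e.g. 8 doubles on AVX-512).
    static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;

    static double dot(double[] a, double[] b) {
        DoubleVector acc = DoubleVector.zero(SPECIES);
        int i = 0;
        int upper = SPECIES.loopBound(a.length);
        for (; i < upper; i += SPECIES.length()) {
            DoubleVector va = DoubleVector.fromArray(SPECIES, a, i);
            DoubleVector vb = DoubleVector.fromArray(SPECIES, b, i);
            acc = va.fma(vb, acc);                  // per-lane a*b + acc, one FMA
        }
        double sum = acc.reduceLanes(VectorOperators.ADD); // horizontal sum
        for (; i < a.length; i++) {
            sum = Math.fma(a[i], b[i], sum);        // scalar tail
        }
        return sum;
    }
}
```

The API lives in the incubating jdk.incubator.vector module, so building and running require --add-modules jdk.incubator.vector.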
The JIT compiles this to the host CPU's native FMA instruction. Netflix Ranker's serendipity-scoring inner loop uses exactly this shape on AVX-512 Ranker hosts (Source: sources/2026-03-03-netflix-optimizing-recommendation-systems-with-jdks-vector-api).
When you can't use FMA
- No FMA3 / VFPv4 instruction set. Older Intel Atom cores and pre-Piledriver AMD Bulldozer lack FMA3. Pure-software emulation exists but is slower than a separate mul+add.
- Bit-exact reproducibility required against a non-FMA reference. FMA changes the numeric result (fewer rounding errors → not the same final bits as (a*b) + c). Some regulated pipelines require a bitwise match to a historical implementation and have to turn FMA off.
- Languages that don't expose FMA. JavaScript's Math.fma is non-standard and unimplemented in most engines; JavaScript numeric code can't match SIMD+FMA throughput without WebAssembly or equivalent.
Relation to MMA
MMA on GPU Tensor
Cores is FMA-at-tile-scale: C ← A × B + C on a fixed n×n
tile, one instruction, one rounding. Per-lane SIMD FMA is the
CPU analog at narrower granularity — one rounding per (a, b,
acc) lane triple. Both primitives are the same arithmetic
pattern optimized at different tile sizes.
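As a sketch, the tile-scale pattern degenerates to a triple loop of per-element FMAs on a CPU (illustrative only: a Tensor Core applies the whole tile in one instruction, while this loop spends one FMA, and one rounding, per scalar update):

```java
public class TileMma {
    // C <- A*B + C on an n-by-n tile, one Math.fma per scalar update.
    static void tileMma(double[][] A, double[][] B, double[][] C, int n) {
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                for (int k = 0; k < n; k++) {
                    C[i][j] = Math.fma(A[i][k], B[k][j], C[i][j]);
                }
            }
        }
    }
}
```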
Seen in
- sources/2026-03-03-netflix-optimizing-recommendation-systems-with-jdks-vector-api
— Netflix's Vector API kernel uses a.fma(b, acc) as the core accumulator inside the batched-matmul dot-product inner loop on AVX-512 Ranker hosts. One instruction per lane-width chunk replaced many scalar multiply-adds; this contributed to the 7.5% → ~1% per-operator CPU-share drop on Ranker.
Related
- concepts/simd-vectorization — the host for FMA instructions in production hot paths.
- concepts/matrix-multiplication-accumulate — GPU Tensor Core MMA, the fixed-tile analog.
- concepts/cosine-similarity — FMA is the per-lane accumulator that makes SIMD cosine-similarity fast.
- systems/jdk-vector-api — Java API exposing FMA via Vector.fma(v, acc).
- companies/netflix — canonical wiki adopter via Ranker.