
CONCEPT

SIMD vectorization

Definition

SIMD (Single Instruction, Multiple Data) is a class of CPU instructions in which one instruction applies the same arithmetic operation, in parallel, to multiple data elements stored contiguously in a vector register. The number of elements per register is the lane count and depends on both register size and element type:

  • SSE: 128-bit registers → 4 × float, 2 × double
  • AVX2: 256-bit registers → 8 × float, 4 × double
  • AVX-512: 512-bit registers → 16 × float, 8 × double
  • NEON (ARM): 128-bit → 4 × float, 2 × double
  • SVE/SVE2 (ARM): variable-length, typically 128–2048-bit
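The fixed-width entries in the table are just register bits divided by element bits; a trivial sketch of the arithmetic (class and method names are illustrative, not from any API):

```java
public class LaneCount {
    // Lanes per register = register width / element width, both in bits.
    static int lanes(int registerBits, int elementBits) {
        return registerBits / elementBits;
    }

    public static void main(String[] args) {
        System.out.println(lanes(256, 32)); // AVX2:    8 x float
        System.out.println(lanes(256, 64)); // AVX2:    4 x double
        System.out.println(lanes(512, 32)); // AVX-512: 16 x float
        System.out.println(lanes(512, 64)); // AVX-512: 8 x double
    }
}
```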

A single vaddpd ymm0, ymm1, ymm2 on AVX2 adds 4 pairs of doubles in one instruction; the same high-level source, recompiled for AVX-512, becomes vaddpd zmm0, zmm1, zmm2 and adds 8 pairs. Per-instruction throughput doubles without source changes.

Why it matters

SIMD is the cheapest form of parallelism a program can reach: no threads, no synchronization, no shared memory. Any loop that applies the same operation to contiguous array elements is a candidate. In ML and numerics workloads (dot products, matrix multiplies, convolutions, softmax, norms), the inner loop is almost always a SIMD opportunity, with per-core throughput scaling roughly with register width.

Three access paths in practice

  1. Compiler auto-vectorization. The compiler pattern-matches a scalar loop and emits SIMD. Works for textbook shapes; fails unpredictably on complex control flow, unclear aliasing, non-contiguous access, or operations it has no vector pattern for. C / C++ / Rust / HotSpot all have auto-vectorizers of varying strength.

  2. Explicit intrinsics or inline assembly. Platform-specific (_mm256_fmadd_pd for AVX2 FMA), gives full control, loses portability.

  3. Portable SIMD APIs. The middle ground: write in terms of abstract lane-counted vectors and let the runtime or compiler map them to the host's widest SIMD. Java's Vector API (jdk.incubator.vector) is this shape — DoubleVector.SPECIES_PREFERRED picks the lane count at runtime; fromArray / fma / reduceLanes compile to host SIMD instructions.
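A minimal sketch of path 3 using the Vector API calls named above (SPECIES_PREFERRED, fromArray, fma, reduceLanes); the loop shape is an assumption, and running it requires the incubator module flag (--add-modules jdk.incubator.vector):

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class VectorDot {
    static final VectorSpecies<Double> S = DoubleVector.SPECIES_PREFERRED;

    // Dot product: vector body with per-lane FMA, then a horizontal
    // reduction, then a scalar tail for the remaining elements.
    static double dot(double[] a, double[] b) {
        DoubleVector acc = DoubleVector.zero(S);
        int i = 0;
        int bound = S.loopBound(a.length);        // largest lane-count multiple <= n
        for (; i < bound; i += S.length()) {
            DoubleVector va = DoubleVector.fromArray(S, a, i);
            DoubleVector vb = DoubleVector.fromArray(S, b, i);
            acc = va.fma(vb, acc);                // per lane: a*b + acc
        }
        double sum = acc.reduceLanes(VectorOperators.ADD); // horizontal step
        for (; i < a.length; i++) sum += a[i] * b[i];      // scalar cleanup
        return sum;
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3, 4, 5};
        double[] b = {5, 4, 3, 2, 1};
        System.out.println(dot(a, b)); // 35.0
    }
}
```

The same bytecode gets 4-wide vectors on AVX2 and 8-wide on AVX-512; only the JIT-emitted machine code differs.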

Canonical wiki instance (Netflix Ranker)

Netflix's Ranker serendipity-scoring hot path uses the JDK Vector API to compute dot products inside a batched matmul. DoubleVector.SPECIES_PREFERRED delivers 4 lanes on AVX2 and 8 lanes on AVX-512 from the same source code; the inner loop uses FMA to accumulate a * b + acc with one vector instruction per lane-count-wide chunk. "What used to be many scalar multiply-adds becomes a smaller number of vector fma() operations plus a reduction — same algorithm, much better use of the CPU's vector units." (Source: sources/2026-03-03-netflix-optimizing-recommendation-systems-with-jdks-vector-api).

Net result at the assembly level: "from loop-unrolled scalar dot products to a vectorized matrix multiply on AVX-512 hardware." Operator CPU share: 7.5% → ~1%; node CPU: ~7% drop.

Preconditions for SIMD to pay off

  • Contiguous memory layout. Pointer-chasing between elements defeats SIMD. double[][] in Java has a pointer per row; flat double[M*D] is row-major contiguous. Netflix's flat-buffer rework (patterns/flat-buffer-threadlocal-reuse) was the enabling move for the Vector API to win on cosine similarity.
  • Aligned (or cheaply unaligned) access. Modern x86 makes the unaligned-access penalty small; older hardware still cares.
  • Lane-width-divisible loop bounds. Tail iterations need a scalar cleanup when N % lane_width != 0.
  • Independent arithmetic. Dependencies across lanes (a[i] = a[i-1] + x) serialize and kill SIMD. Reductions need a horizontal step (reduceLanes in the Vector API).
  • FMA (fused multiply-add): per-lane a*b + c with one rounding step; the load-bearing arithmetic primitive behind dot-product SIMD loops.
  • MMA on GPU Tensor Cores is the fixed-tile-size analog of SIMD FMA on CPU. Both fuse C ← A×B + C; MMA operates on fixed m×n×k tiles per instruction, SIMD-FMA on w-wide lanes per instruction.
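The tail-cleanup, independent-accumulator, and horizontal-reduction preconditions can be sketched without the Vector API in plain Java. Math.fma is the real JDK method (single-rounding a*b + c); the fixed width W and the manual unrolling are this sketch's stand-in for hardware lanes:

```java
public class StripMine {
    static final int W = 4; // illustrative "lane count"; real code queries the hardware

    // Dot product strip-mined by W: W independent accumulators (no
    // cross-"lane" dependency), a horizontal reduction, then a scalar
    // tail for the n % W leftover elements.
    static double dot(double[] a, double[] b) {
        double[] acc = new double[W];
        int i = 0;
        int bound = a.length - a.length % W;          // lane-width-divisible bound
        for (; i < bound; i += W) {
            for (int l = 0; l < W; l++) {
                // Math.fma: a*b + c with one rounding, like a SIMD FMA lane
                acc[l] = Math.fma(a[i + l], b[i + l], acc[l]);
            }
        }
        double sum = 0;
        for (int l = 0; l < W; l++) sum += acc[l];     // horizontal step
        for (; i < a.length; i++) sum += a[i] * b[i];  // scalar cleanup
        return sum;
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3, 4, 5, 6};
        double[] b = {6, 5, 4, 3, 2, 1};
        System.out.println(dot(a, b)); // 56.0
    }
}
```

Note how the serialized form a[i] = a[i-1] + x could not be split into independent accumulators this way; that is exactly the dependency that kills SIMD.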
