CONCEPT
SIMD vectorization¶
Definition¶
SIMD — Single Instruction, Multiple Data — is a CPU instruction class where one instruction performs the same arithmetic operation on multiple data elements in parallel, stored contiguously in a vector register. The number of elements per register is the lane width and depends on both register size and element type:
- SSE: 128-bit registers → 4 × `float`, 2 × `double`
- AVX2: 256-bit registers → 8 × `float`, 4 × `double`
- AVX-512: 512-bit registers → 16 × `float`, 8 × `double`
- NEON (ARM): 128-bit registers → 4 × `float`, 2 × `double`
- SVE/SVE2 (ARM): variable-length, typically 128–2048-bit
A single `vaddpd ymm0, ymm1, ymm2` on AVX2 adds 4 pairs of `double`s in one instruction; the same source on AVX-512 adds 8 pairs via `vaddpd zmm0, zmm1, zmm2`. Throughput doubles without source changes.
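The lane counts in the table above are just register width divided by element width. A trivial helper (class and method names here are illustrative, not from the source) makes that concrete:

```java
public class LaneWidth {
    // Lanes per register = register width in bits / element width in bits.
    static int lanes(int registerBits, int elementBits) {
        return registerBits / elementBits;
    }

    public static void main(String[] args) {
        // AVX2: 256-bit registers, 64-bit doubles -> 4 lanes.
        System.out.println("AVX2 double lanes: " + lanes(256, 64));
        // AVX-512: 512-bit registers, 32-bit floats -> 16 lanes.
        System.out.println("AVX-512 float lanes: " + lanes(512, 32));
    }
}
```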
Why it matters¶
SIMD is the cheapest form of parallelism a program can reach —
no threads, no synchronization, no shared memory. Any loop that
does the same operation on contiguous array elements is a
candidate. In ML + numerics workloads (dot products, matrix
multiplies, convolutions, softmax, norms), the inner loop is
almost always a SIMD opportunity, worth 2×–8× per-core
throughput depending on register width.
Three access paths in practice¶
- Compiler auto-vectorization. The compiler pattern-matches a scalar loop and emits SIMD. Works for textbook shapes; fails unpredictably on complex control flow, unclear aliasing, non-contiguous access, or operations it has no vector instruction for. C / C++ / Rust / HotSpot all have auto-vectorizers of varying strength.
- Explicit intrinsics or inline assembly. Platform-specific (`_mm256_fmadd_pd` for AVX2 FMA); gives full control, loses portability.
- Portable SIMD APIs. The middle ground: write in terms of abstract lane-counted vectors, let the runtime or compiler map to the host's widest SIMD. Java's JDK Vector API is this shape — `DoubleVector.SPECIES_PREFERRED` picks the lane width at runtime; `fromArray` / `fma` / `reduceLanes` compile to host SIMD instructions.
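A minimal sketch of the third path — a dot product written once against `SPECIES_PREFERRED`, which resolves to 4 `double` lanes on AVX2 and 8 on AVX-512. Assumes JDK 16+ with the incubator module enabled (`--add-modules jdk.incubator.vector`); the class name is illustrative, but `SPECIES_PREFERRED`, `fromArray`, `fma`, and `reduceLanes` are the real Vector API calls named above:

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class VectorDot {
    static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;

    static double dot(double[] a, double[] b) {
        int i = 0;
        // Largest multiple of the lane count <= a.length.
        int bound = SPECIES.loopBound(a.length);
        DoubleVector acc = DoubleVector.zero(SPECIES);
        for (; i < bound; i += SPECIES.length()) {
            DoubleVector va = DoubleVector.fromArray(SPECIES, a, i);
            DoubleVector vb = DoubleVector.fromArray(SPECIES, b, i);
            acc = va.fma(vb, acc);            // per-lane a*b + acc, one rounding
        }
        double sum = acc.reduceLanes(VectorOperators.ADD); // horizontal reduction
        for (; i < a.length; i++) {
            sum += a[i] * b[i];               // scalar tail cleanup
        }
        return sum;
    }
}
```

The same source runs unchanged on any lane width; only `SPECIES.length()` and the emitted instructions differ between hosts.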
Canonical wiki instance (Netflix Ranker)¶
Netflix's Ranker serendipity-scoring hot path uses the JDK
Vector API to compute dot products inside a batched matmul.
`DoubleVector.SPECIES_PREFERRED` delivers 4 lanes on AVX2 and 8
lanes on AVX-512 from the same source code; the inner loop uses
FMA to accumulate `a * b + acc`, one vector instruction per
lane-width chunk. "What used to be many scalar multiply-adds
becomes a smaller number of vector `fma()` operations plus a
reduction — same algorithm, much better use of the CPU's vector
units." (Source:
sources/2026-03-03-netflix-optimizing-recommendation-systems-with-jdks-vector-api).
Net result at the assembly level: "from loop-unrolled scalar dot products to a vectorized matrix multiply on AVX-512 hardware." Operator CPU share: 7.5% → ~1%; node CPU: ~7% drop.
Preconditions for SIMD to pay off¶
- Contiguous memory layout. Pointer-chasing between elements defeats SIMD. `double[][]` in Java has a pointer per row; a flat `double[M*D]` is row-major contiguous. Netflix's flat-buffer rework (patterns/flat-buffer-threadlocal-reuse) was the enabling move for the Vector API to win on cosine similarity.
- Aligned or unaligned-with-penalty access. Modern x86 makes the alignment penalty small; older hardware still cares.
- Lane-width-divisible loop bounds. Tail iterations need a scalar cleanup when `N % lane_width != 0`.
- Independent arithmetic. Dependencies across lanes (`a[i] = a[i-1] + x`) serialize and kill SIMD. Reductions need a horizontal step (`reduceLanes` in the Vector API).
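The layout precondition can be shown without any SIMD at all: a nested `double[][]` stores one pointer per row, while a flat row-major buffer keeps every element of an M×D matrix contiguous, so a vector load can pull lane-width neighbors in one instruction. A small sketch (the indexing helper is illustrative, not from the source):

```java
public class FlatLayout {
    // Row-major flat buffer: element (row, col) of an M x D matrix
    // lives at index row * D + col. All M*D doubles are contiguous.
    static double get(double[] flat, int d, int row, int col) {
        return flat[row * d + col];
    }

    public static void main(String[] args) {
        int d = 3;
        double[] flat = {1, 2, 3, 4, 5, 6};         // rows [1,2,3] and [4,5,6], contiguous
        double[][] nested = {{1, 2, 3}, {4, 5, 6}}; // one heap pointer per row
        System.out.println(get(flat, d, 1, 2) == nested[1][2]);
    }
}
```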
Related instruction-level primitives¶
- FMA — per-lane `a*b + c` with one rounding step; the load-bearing arithmetic primitive behind dot-product SIMD loops.
- MMA on GPU Tensor Cores is the fixed-tile-size analog of SIMD FMA on CPU. Both are fused `C ← A×B + C`; MMA operates on `n×n` tiles per instruction; SIMD-FMA operates on `w`-wide lanes per instruction.
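The single-rounding property is observable from plain Java via `Math.fma` (JDK 9+), which compiles to the hardware FMA instruction where available. In the corner case below (values chosen for illustration), `a*b + c` rounds the product first and discards a low-order term that the fused form keeps:

```java
public class FmaDemo {
    public static void main(String[] args) {
        double a = 1.0 + Math.ulp(1.0); // 1 + 2^-52, exactly representable
        double p = a * a;               // rounded product: the 2^-104 term is lost
        // fma evaluates a*a - p exactly, then rounds once, so it
        // recovers the rounding error of the plain multiply.
        System.out.println(Math.fma(a, a, -p)); // 2^-104, not 0.0
        System.out.println(a * a - p);          // 0.0: two roundings lose the term
    }
}
```

Inside a dot-product loop this matters twice over: the fused form is both one instruction instead of two and slightly more accurate per accumulation step.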
Seen in¶
- sources/2026-03-03-netflix-optimizing-recommendation-systems-with-jdks-vector-api — canonical wiki instance. Netflix applies SIMD to video serendipity scoring via the JDK Vector API on AVX-512 / AVX2 Ranker hosts; flat-buffer memory layout + FMA + runtime lane-width selection all co-designed. Scalar fallback via loop-unrolling when the Vector API module isn't loaded.
Related¶
- systems/jdk-vector-api — portable SIMD for the JVM.
- concepts/fused-multiply-add — core arithmetic primitive used inside SIMD inner loops.
- concepts/matrix-multiplication-accumulate — GPU counterpart at tile granularity.
- patterns/runtime-capability-dispatch-pure-java-simd — deployment-safety pattern for SIMD with scalar fallback.
- patterns/flat-buffer-threadlocal-reuse — memory-layout enabler.
- companies/netflix — canonical wiki adopter.