
CONCEPT

SIMD vectorization

Definition

SIMD (Single Instruction, Multiple Data) is a class of CPU instructions in which one instruction applies the same arithmetic operation, in parallel, to multiple data elements stored contiguously in a vector register. The number of elements per register is the lane count and depends on both register size and element type:

  • SSE: 128-bit registers → 4 × float, 2 × double
  • AVX2: 256-bit registers → 8 × float, 4 × double
  • AVX-512: 512-bit registers → 16 × float, 8 × double
  • NEON (ARM): 128-bit → 4 × float, 2 × double
  • SVE/SVE2 (ARM): variable-length, typically 128–2048-bit
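The fixed-width entries in the table are just register bits divided by element bits; a trivial sketch of the arithmetic (class and method names are illustrative, not from any API):

```java
public class LaneCount {
    // Lanes per register = register width / element width, both in bits.
    static int lanes(int registerBits, int elementBits) {
        return registerBits / elementBits;
    }

    public static void main(String[] args) {
        System.out.println(lanes(256, 32)); // AVX2:    8 x float
        System.out.println(lanes(256, 64)); // AVX2:    4 x double
        System.out.println(lanes(512, 32)); // AVX-512: 16 x float
        System.out.println(lanes(512, 64)); // AVX-512: 8 x double
    }
}
```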

A single vaddpd ymm0, ymm1, ymm2 on AVX2 adds 4 pairs of doubles in one instruction; the same high-level source, recompiled for AVX-512, becomes vaddpd zmm0, zmm1, zmm2 and adds 8 pairs. Per-instruction throughput doubles without source changes.

Why it matters

SIMD is the cheapest form of parallelism a program can reach: no threads, no synchronization, no shared memory. Any loop that applies the same operation to contiguous array elements is a candidate. In ML and numerics workloads (dot products, matrix multiplies, convolutions, softmax, norms), the inner loop is almost always a SIMD opportunity, with per-core throughput scaling roughly with register width.

Three access paths in practice

  1. Compiler auto-vectorization. The compiler pattern-matches a scalar loop and emits SIMD. Works for textbook shapes; fails unpredictably on complex control flow, unclear aliasing, non-contiguous access, or operations it has no vector pattern for. C / C++ / Rust / HotSpot all have auto-vectorizers of varying strength.

  2. Explicit intrinsics or inline assembly. Platform-specific (_mm256_fmadd_pd for AVX2 FMA), gives full control, loses portability.

  3. Portable SIMD APIs. The middle ground: write in terms of abstract lane-counted vectors and let the runtime or compiler map them to the host's widest SIMD. Java's Vector API (jdk.incubator.vector) is this shape — DoubleVector.SPECIES_PREFERRED picks the lane count at runtime; fromArray / fma / reduceLanes compile to host SIMD instructions.
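A minimal sketch of path 3 using the Vector API calls named above (SPECIES_PREFERRED, fromArray, fma, reduceLanes); the loop shape is an assumption, and running it requires the incubator module flag (--add-modules jdk.incubator.vector):

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class VectorDot {
    static final VectorSpecies<Double> S = DoubleVector.SPECIES_PREFERRED;

    // Dot product: vector body with per-lane FMA, then a horizontal
    // reduction, then a scalar tail for the remaining elements.
    static double dot(double[] a, double[] b) {
        DoubleVector acc = DoubleVector.zero(S);
        int i = 0;
        int bound = S.loopBound(a.length);        // largest lane-count multiple <= n
        for (; i < bound; i += S.length()) {
            DoubleVector va = DoubleVector.fromArray(S, a, i);
            DoubleVector vb = DoubleVector.fromArray(S, b, i);
            acc = va.fma(vb, acc);                // per lane: a*b + acc
        }
        double sum = acc.reduceLanes(VectorOperators.ADD); // horizontal step
        for (; i < a.length; i++) sum += a[i] * b[i];      // scalar cleanup
        return sum;
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3, 4, 5};
        double[] b = {5, 4, 3, 2, 1};
        System.out.println(dot(a, b)); // 35.0
    }
}
```

The same bytecode gets 4-wide vectors on AVX2 and 8-wide on AVX-512; only the JIT-emitted machine code differs.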

Canonical wiki instance (Netflix Ranker)

Netflix's Ranker serendipity-scoring hot path uses the JDK Vector API to compute dot products inside a batched matmul. DoubleVector.SPECIES_PREFERRED delivers 4 lanes on AVX2 and 8 lanes on AVX-512 from the same source code; the inner loop uses FMA to accumulate a * b + acc with one vector instruction per lane-count-wide chunk. "What used to be many scalar multiply-adds becomes a smaller number of vector fma() operations plus a reduction — same algorithm, much better use of the CPU's vector units." (Source: sources/2026-03-03-netflix-optimizing-recommendation-systems-with-jdks-vector-api).

Net result at the assembly level: "from loop-unrolled scalar dot products to a vectorized matrix multiply on AVX-512 hardware." Operator CPU share: 7.5% → ~1%; node CPU: ~7% drop.

Preconditions for SIMD to pay off

  • Contiguous memory layout. Pointer-chasing between elements defeats SIMD. double[][] in Java has a pointer per row; flat double[M*D] is row-major contiguous. Netflix's flat-buffer rework (patterns/flat-buffer-threadlocal-reuse) was the enabling move for the Vector API to win on cosine similarity.
  • Aligned (or cheaply unaligned) access. Modern x86 makes the unaligned-access penalty small; older hardware still cares.
  • Lane-width-divisible loop bounds. Tail iterations need a scalar cleanup when N % lane_width != 0.
  • Independent arithmetic. Dependencies across lanes (a[i] = a[i-1] + x) serialize and kill SIMD. Reductions need a horizontal step (reduceLanes in the Vector API).
  • FMA (fused multiply-add): per-lane a*b + c with one rounding step; the load-bearing arithmetic primitive behind dot-product SIMD loops.
  • MMA on GPU Tensor Cores is the fixed-tile-size analog of SIMD FMA on CPU. Both fuse C ← A×B + C; MMA operates on fixed m×n×k tiles per instruction, SIMD-FMA on w-wide lanes per instruction.
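The tail-cleanup, independent-accumulator, and horizontal-reduction preconditions can be sketched without the Vector API in plain Java. Math.fma is the real JDK method (single-rounding a*b + c); the fixed width W and the manual unrolling are this sketch's stand-in for hardware lanes:

```java
public class StripMine {
    static final int W = 4; // illustrative "lane count"; real code queries the hardware

    // Dot product strip-mined by W: W independent accumulators (no
    // cross-"lane" dependency), a horizontal reduction, then a scalar
    // tail for the n % W leftover elements.
    static double dot(double[] a, double[] b) {
        double[] acc = new double[W];
        int i = 0;
        int bound = a.length - a.length % W;          // lane-width-divisible bound
        for (; i < bound; i += W) {
            for (int l = 0; l < W; l++) {
                // Math.fma: a*b + c with one rounding, like a SIMD FMA lane
                acc[l] = Math.fma(a[i + l], b[i + l], acc[l]);
            }
        }
        double sum = 0;
        for (int l = 0; l < W; l++) sum += acc[l];     // horizontal step
        for (; i < a.length; i++) sum += a[i] * b[i];  // scalar cleanup
        return sum;
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3, 4, 5, 6};
        double[] b = {6, 5, 4, 3, 2, 1};
        System.out.println(dot(a, b)); // 56.0
    }
}
```

Note how the serialized form a[i] = a[i-1] + x could not be split into independent accumulators this way; that is exactly the dependency that kills SIMD.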
