JDK Vector API¶
The JDK Vector API is an incubating Java feature
(jdk.incubator.vector) that provides a portable way to
express SIMD operations in Java —
"SIMD without intrinsics." Source code is written in terms of
vectors and lanes; the JIT maps those operations to the
widest SIMD instructions available on the host CPU (SSE / AVX2 /
AVX-512 on x86; NEON / SVE on ARM; scalar fallback where
unsupported) (Source:
sources/2026-03-03-netflix-optimizing-recommendation-systems-with-jdks-vector-api).
Why it exists¶
Before the Vector API, getting SIMD throughput from the JVM meant:
- Hoping the HotSpot auto-vectorizer recognized your loop shape (unpredictable, version-sensitive).
- Writing a native library and accepting JNI transition overhead (the netlib-java → BLAS path).
- Dropping to sun.misc.Unsafe or architecture-specific intrinsics (non-portable).
The Vector API makes SIMD a first-class Java API with a runtime-chosen lane width, so the same source compiles to AVX-512 on modern Intel, AVX2 on older x86, NEON on ARM, and falls back to scalar on anything else — no native build, no JNI.
Core shape¶
import jdk.incubator.vector.*;   // requires --add-modules=jdk.incubator.vector

static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;

// Dot product of row i of bufA and row j of bufB (flat row-major buffers).
DoubleVector acc = DoubleVector.zero(SPECIES);
int k = 0;
for (; k + SPECIES.length() <= D; k += SPECIES.length()) {
    DoubleVector a = DoubleVector.fromArray(SPECIES, bufA, i * D + k);
    DoubleVector b = DoubleVector.fromArray(SPECIES, bufB, j * D + k);
    acc = a.fma(b, acc);                      // fused multiply-add (one instruction)
}
double dot = acc.reduceLanes(VectorOperators.ADD);
for (; k < D; k++) {                          // scalar tail for D % SPECIES.length() leftovers
    dot += bufA[i * D + k] * bufB[j * D + k];
}
Load-bearing pieces:
- DoubleVector.SPECIES_PREFERRED — a species descriptor that encodes the native lane width. Runtime-chosen: 4 doubles on AVX2, 8 doubles on AVX-512, scalar fallback elsewhere. Same source, different instruction widths on different hardware.
- fromArray(SPECIES, array, offset) — loads one lane-width chunk from a flat Java array. Pairs well with flat buffers from patterns/flat-buffer-threadlocal-reuse — no pointer chasing, predictable contiguous access.
- a.fma(b, acc) — fused multiply-add: computes a * b + acc in one hardware instruction with one rounding step. Compiles to vfmadd231pd or similar on x86 with FMA3.
- reduceLanes(ADD) — horizontal reduction across the lanes of the accumulator to yield a scalar.
- Tail handling — if D % SPECIES.length() != 0, the remaining elements are handled in a scalar tail. Correctness is preserved without sacrificing throughput on the main loop.
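A scalar reference implementation is handy for checking the vector kernel, since the two must agree up to FMA's single-rounding difference. A sketch with hypothetical names (the source does not show this helper):

```java
final class DotReference {
    // Scalar reference for the vector kernel: dot product of row i of bufA and
    // row j of bufB, both flat row-major double[rows * D] buffers. The Vector API
    // kernel should match this up to FMA's single-rounding difference.
    static double dot(double[] bufA, double[] bufB, int i, int j, int D) {
        double dot = 0.0;
        for (int k = 0; k < D; k++) {
            dot += bufA[i * D + k] * bufB[j * D + k];
        }
        return dot;
    }
}
```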
Runtime flag + incubator status¶
The Vector API ships behind the incubator module flag --add-modules=jdk.incubator.vector, required at both compile and run time.
As an incubating feature:
- The API may still change before graduation to a final feature.
- Services that depend on it must accept either (a) pinning the JDK version to match the incubator version, or (b) runtime capability dispatch with scalar fallback when the module isn't loaded.
- Netflix (systems/netflix-ranker) picked (b): at class-load time, probe for jdk.incubator.vector; if present, use the Vector API kernel; if absent, use a loop-unrolled scalar dot product. "Services can opt in to the Vector API for maximum performance, but the system remains safe and predictable without it."
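A minimal sketch of that dispatch, with hypothetical names (the source describes the behavior, not Netflix's code): a reflective probe keeps the class loadable whether or not the incubator module is present, and the fallback is a loop-unrolled scalar dot product in the spirit of Lucene's.

```java
final class DotKernel {
    /** True when jdk.incubator.vector is on the module path. */
    static final boolean VECTOR_API_PRESENT = probeVectorApi();

    static boolean probeVectorApi() {
        try {
            Class.forName("jdk.incubator.vector.DoubleVector");
            return true;                       // real code would select the Vector API kernel
        } catch (ClassNotFoundException e) {
            return false;                      // module absent: select the scalar fallback
        }
    }

    /** Loop-unrolled scalar fallback (4-way unroll keeps the FP pipes busy). */
    static double scalarDot(double[] a, double[] b) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int i = 0;
        for (; i + 4 <= a.length; i += 4) {
            s0 += a[i] * b[i];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
        }
        double dot = s0 + s1 + s2 + s3;
        for (; i < a.length; i++) dot += a[i] * b[i];   // scalar tail
        return dot;
    }
}
```

The probe runs once at class-initialization time, so the branch is resolved before any hot-path call.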
What the Vector API replaces vs. what it complements¶
Replaces (in Netflix's case):
- BLAS via JNI — netlib-java + native BLAS lost in the full pipeline to JNI setup costs, layout translation, and extra allocations/copies. "BLAS was still a useful experiment — it clarified where time was being spent, but it wasn't the drop-in win we wanted."
- Naive scalar matmul — the scalar kernel shipped in the first batched version of the Ranker serendipity-scoring path regressed 5% against the nested-loop baseline because it couldn't exploit SIMD.
Complements:
- Flat double[] buffers — the Vector API's load/store operations are keyed on contiguous Java arrays. double[][] has pointer chasing per row; flat double[M*D] has none. patterns/flat-buffer-threadlocal-reuse is the enabling memory-layout pattern.
- Lucene-style scalar fallback — Netflix's scalar path is inspired by Lucene's VectorUtilDefaultProvider (highly optimized loop-unrolled dot product). The Vector API kernel and the scalar kernel live behind the same factory interface.
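The layout point can be made concrete with a small sketch (names here are illustrative, not from the pattern page): a flat row-major buffer replaces the per-row objects of a nested array.

```java
final class FlatBuffer {
    // Row-major flat layout: element (i, k) of an M x D matrix lives at i * D + k.
    static int index(int i, int k, int D) {
        return i * D + k;
    }

    // Copy a nested double[][] into one contiguous block: no per-row heap objects,
    // so vector loads walk a single predictable address stream.
    static double[] flatten(double[][] nested) {
        int M = nested.length, D = nested[0].length;
        double[] flat = new double[M * D];
        for (int i = 0; i < M; i++) {
            System.arraycopy(nested[i], 0, flat, i * D, D);
        }
        return flat;
    }
}
```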
Comparison to CPU/GPU MMA primitives¶
Neighbouring hardware primitive: MMA on GPU Tensor Cores issues C ← A × B + C at fixed tile sizes as a single instruction. The Vector API's fma() is the per-lane CPU analog — a fused a*b + acc at whatever the host's lane width is. Not an MMA (no tile size, no block-scaled modifier), but the same arithmetic primitive at narrower granularity.
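The single-rounding property is observable on the scalar side with Math.fma (standard since Java 9): a fused a*b + c can recover the rounding error that the two-step form discards. A small illustration (helper name is mine, not from the source):

```java
final class FmaDemo {
    // Math.fma(a, b, c) rounds a*b + c once, like the hardware vfmadd* forms.
    // Passing c = -(x*x) yields the exact rounding error of the plain product:
    // exact(x*x) - round(x*x), which two-step arithmetic would compute as 0.
    static double squaringError(double x) {
        return Math.fma(x, x, -(x * x));
    }
}
```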
Caveats¶
- Still incubating — API surface can change. Production use (Netflix on Ranker) predates stabilization; the scalar fallback path is a hedge against the API changing under Netflix.
- Lane-width variance matters. On CPUs with only SSE, SPECIES_PREFERRED is 2 doubles; on AVX-512 it's 8. Kernel efficiency varies 4× across hardware on the same source code — a capacity-planning input.
- Runtime flag requirement. Forgetting --add-modules=jdk.incubator.vector turns the module into a load-time failure; the capability-dispatch pattern handles it gracefully.
- Auto-vectorizer interaction. Code written for the Vector API isn't the same as auto-vectorized code: explicit SIMD in the source prevents the auto-vectorizer from pattern-matching the loop. Usually net-positive (predictable perf) but a named trade-off.
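Concretely, the flag has to reach both the compiler and the launcher; a typical invocation looks like the following (file and class names are illustrative):

```shell
# Compile and run with the incubator module enabled (names are illustrative).
javac --add-modules jdk.incubator.vector DotKernel.java
java  --add-modules jdk.incubator.vector DotKernel
# Without the flag, any class referencing jdk.incubator.vector fails at load time.
```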
Seen in¶
- sources/2026-03-03-netflix-optimizing-recommendation-systems-with-jdks-vector-api — canonical wiki introduction. Netflix uses DoubleVector.SPECIES_PREFERRED + fma() + reduceLanes(ADD) in the inner loop of Ranker's batched serendipity-scoring matmul on AVX-512 hardware; assembly-level win vs. loop-unrolled scalar confirmed. ~7% Ranker CPU drop at the node level; hot-path operator CPU share dropped from 7.5% to ~1%.
Related¶
- concepts/simd-vectorization — hardware primitive the Vector API exposes portably.
- concepts/fused-multiply-add — the per-lane instruction the Vector API's fma() compiles to.
- concepts/matrix-multiplication-accumulate — GPU Tensor Cores' fixed-tile MMA; the Vector API's fma() is the narrower CPU analog.
- patterns/runtime-capability-dispatch-pure-java-simd — deployment-safety pattern for services adopting an incubating SIMD API.
- patterns/flat-buffer-threadlocal-reuse — memory-layout pattern that pairs with the Vector API's contiguous-array load/store.
- systems/lucene — source of the loop-unrolled scalar fallback (VectorUtilDefaultProvider).
- systems/netflix-ranker — canonical production consumer on the wiki.
- companies/netflix — parent.