JDK Vector API

The JDK Vector API is an incubating Java feature (jdk.incubator.vector) that provides a portable way to express SIMD operations in Java — "SIMD without intrinsics." Source code is written in terms of vectors and lanes; the JIT maps those operations to the widest SIMD instructions available on the host CPU (SSE / AVX2 / AVX-512 on x86; NEON / SVE on ARM; scalar fallback where unsupported) (Source: sources/2026-03-03-netflix-optimizing-recommendation-systems-with-jdks-vector-api).

Why it exists

Before the Vector API, getting SIMD throughput from the JVM meant:

  • Hoping the HotSpot auto-vectorizer recognized your loop shape (unpredictable, version-sensitive).
  • Writing a native library + accepting JNI transition overhead (netlib-java → BLAS path).
  • Dropping to sun.misc.Unsafe or architecture-specific intrinsics (non-portable).

The Vector API makes SIMD a first-class Java API with a runtime-chosen lane width, so the same source compiles to AVX-512 on modern Intel, AVX2 on older x86, NEON on ARM, and falls back to scalar on anything else — no native build, no JNI.

Core shape

VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;

DoubleVector acc = DoubleVector.zero(SPECIES);
int k = 0;
for (; k + SPECIES.length() <= D; k += SPECIES.length()) {
  DoubleVector a = DoubleVector.fromArray(SPECIES, bufA, i*D + k);
  DoubleVector b = DoubleVector.fromArray(SPECIES, bufB, j*D + k);
  acc = a.fma(b, acc);  // fused multiply-add (one instruction)
}
double dot = acc.reduceLanes(VectorOperators.ADD);
for (; k < D; k++) {  // scalar tail for the last D % SPECIES.length() elements
  dot += bufA[i*D + k] * bufB[j*D + k];
}

Load-bearing pieces:

  • DoubleVector.SPECIES_PREFERRED — a species descriptor that encodes the native lane width. Runtime-chosen: 4 doubles on AVX2, 8 doubles on AVX-512, scalar fallback elsewhere. Same source, different instruction widths on different hardware.
  • fromArray(SPECIES, array, offset) — loads one lane-width chunk from a flat Java array. Pairs well with flat buffers from patterns/flat-buffer-threadlocal-reuse — no pointer chasing, predictable contiguous access.
  • a.fma(b, acc) — fused multiply-add: computes a * b + acc in one hardware instruction with one rounding step. Compiles to vfmadd231pd or similar on x86 with FMA3.
  • reduceLanes(ADD) — horizontal reduction across the lanes of the accumulator to yield a scalar.
  • Tail handling — if D % SPECIES.length() != 0, the remaining elements are handled in a scalar tail. Correctness is preserved without sacrificing throughput on the main loop.
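The one-rounding-step property of fma can be demonstrated without the incubator module at all, using the scalar java.lang.Math.fma (the per-lane Vector API fma has the same semantics). This sketch picks x so that x*x has one more significant bit than a double can hold: the two-step multiply-then-subtract loses that bit, the fused version keeps it.

```java
public class FmaDemo {
    public static void main(String[] args) {
        // x*x = 1 + 2^-26 + 2^-54 exactly. A plain multiply must round that
        // 55-bit result to 53 bits (dropping the 2^-54 term) before the
        // subtraction; Math.fma rounds only once, after the exact a*b + c.
        double x = 1.0 + Math.pow(2, -27);
        double twoStep = x * x - 1.0;           // two roundings: loses 2^-54
        double oneStep = Math.fma(x, x, -1.0);  // one rounding: keeps 2^-54
        System.out.println(twoStep == oneStep); // prints false
        System.out.println(oneStep - twoStep);  // the recovered low-order bit
    }
}
```

Beyond accuracy, the throughput argument for fma in the dot-product loop is that one instruction replaces a multiply plus an add on every lane.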

Runtime flag + incubator status

The Vector API ships behind the incubator module flag:

--add-modules=jdk.incubator.vector

As an incubating feature:

  • The API may still change before graduation to a final feature.
  • Services that depend on it must accept either (a) pinning the JDK version to match the incubator version, or (b) runtime capability dispatch with scalar fallback when the module isn't loaded.
  • Netflix (systems/netflix-ranker) picked (b): at class load time, probe for jdk.incubator.vector; if present, use the Vector API kernel; if absent, use a loop-unrolled scalar dot product. "Services can opt in to the Vector API for maximum performance, but the system remains safe and predictable without it."
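Option (b) can be sketched as a class-load-time probe plus a static dispatch flag. The class and method names below are illustrative, not Netflix's actual code; the vector kernel itself is elided (it would live in a separately compiled class, reached reflectively, so this class still links when the module is absent).

```java
public final class DotProducts {
    // Probe once, at class load: was --add-modules=jdk.incubator.vector supplied?
    static final boolean VECTOR_API_PRESENT = probeVectorApi();

    private static boolean probeVectorApi() {
        try {
            Class.forName("jdk.incubator.vector.DoubleVector");
            return true;   // incubator module is on the module path
        } catch (ClassNotFoundException e) {
            return false;  // module absent: use the scalar kernel
        }
    }

    public static double dot(double[] a, double[] b) {
        if (VECTOR_API_PRESENT) {
            // Here: delegate to the Vector API kernel, loaded reflectively
            // so this class never references jdk.incubator.vector directly.
        }
        return scalarDot(a, b);  // loop-unrolled in the real fallback
    }

    static double scalarDot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(dot(new double[]{1, 2, 3}, new double[]{4, 5, 6}));
    }
}
```

The probe runs exactly once, so the per-call cost of the dispatch is a single branch on a final static — cheap enough for a hot inner kernel.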

What the Vector API replaces vs. what it complements

Replaces (in Netflix's case):

  • BLAS via JNI — netlib-java + native BLAS lost in the full pipeline to JNI setup costs, layout translation, and extra allocations/copies. "BLAS was still a useful experiment — it clarified where time was being spent, but it wasn't the drop-in win we wanted."
  • Naive scalar matmul — the scalar kernel shipped in the first batched version of the Ranker serendipity-scoring path regressed 5% against the nested-loop baseline because it couldn't exploit SIMD.

Complements:

  • Flat double[] buffers — the Vector API's load/store operations are keyed on contiguous Java arrays. double[][] has pointer chasing per row; flat double[M*D] has none. patterns/flat-buffer-threadlocal-reuse is the enabling memory-layout pattern.
  • Lucene-style scalar fallback — Netflix's scalar path is inspired by Lucene's VectorUtilDefaultProvider (highly-optimized loop-unrolled dot product). The Vector API kernel and the scalar kernel live behind the same factory interface.
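A minimal sketch of the shared-factory shape, with a Lucene-inspired unrolled scalar kernel (interface and class names are illustrative). The unrolling trick is four independent accumulators: they break the serial add dependency chain so the CPU can keep several multiply-adds in flight even without SIMD.

```java
// Both kernels implement one interface; a factory picks an implementation
// at startup based on the capability probe.
interface DotKernel {
    double dot(double[] a, double[] b);
}

final class UnrolledScalarDot implements DotKernel {
    public double dot(double[] a, double[] b) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;  // independent accumulators
        int i = 0;
        int limit = a.length & ~3;  // largest multiple of 4 <= length
        for (; i < limit; i += 4) {
            s0 += a[i]     * b[i];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
        }
        for (; i < a.length; i++) s0 += a[i] * b[i];  // scalar tail
        return s0 + s1 + s2 + s3;
    }

    public static void main(String[] args) {
        DotKernel k = new UnrolledScalarDot();
        System.out.println(k.dot(new double[]{1, 2, 3, 4, 5},
                                 new double[]{1, 1, 1, 1, 1}));
    }
}
```

Because callers see only DotKernel, swapping in the Vector API implementation changes nothing above the factory — which is what keeps the fallback "safe and predictable".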

Comparison to CPU/GPU MMA primitives

Neighbouring hardware primitive: MMA on GPU Tensor Cores issues C ← A × B + C at fixed tile sizes as a single instruction. The Vector API's fma() is the per-lane CPU analog — a fused a*b + acc at whatever the host's lane width is. Not an MMA (no tile size, no block-scaled modifier), but the same arithmetic primitive at narrower granularity.

Caveats

  • Still incubating — API surface can change. Production use (Netflix on Ranker) predates stabilization; the scalar fallback path is a hedge against the API changing underneath the service.
  • Lane-width variance matters. On CPUs with only SSE, SPECIES_PREFERRED is 2 doubles; on AVX-512 it's 8. Kernel efficiency varies 4× across hardware on the same source code — a capacity-planning input.
  • Runtime flag requirement. Forgetting --add-modules=jdk.incubator.vector turns the module into a load-time failure; the capability-dispatch pattern handles it gracefully.
  • Auto-vectorizer interaction. Code written for the Vector API isn't the same as auto-vectorized code: explicit SIMD in the source prevents the auto-vectorizer from pattern-matching the loop. Usually net-positive (predictable perf) but a named trade-off.

Seen in

  • sources/2026-03-03-netflix-optimizing-recommendation-systems-with-jdks-vector-api — canonical wiki introduction. Netflix uses DoubleVector.SPECIES_PREFERRED + fma() + reduceLanes(ADD) in the inner loop of Ranker's batched serendipity-scoring matmul on AVX-512 hardware; assembly-level win vs. loop-unrolled scalar confirmed. ~7% Ranker CPU drop at the node level; hot-path operator CPU share dropped from 7.5% → ~1%.