
PATTERN Cited by 1 source

Runtime capability dispatch — pure-Java SIMD

Problem

A service wants the performance of SIMD-accelerated math kernels but can't assume the SIMD capability is available at deploy time:

  • The JDK Vector API is still an incubating feature — enabling it requires the runtime flag --add-modules=jdk.incubator.vector. If a container, staging environment, or operator change drops the flag, the code must still run correctly.
  • Different hosts in the fleet may have different CPU capabilities (AVX2 vs AVX-512 vs NEON vs no SIMD). The kernel shouldn't crash or silently misbehave on the less-capable ones.
  • Long-tail JVM versions in production may lack the Vector API entirely.

The service owner wants: "opt-in to the Vector API for maximum performance, but remain safe and predictable without it."

Solution

Detect the SIMD capability at class load, bind the implementation via a factory, and ship a high-quality scalar fallback.

interface MatMul {
    void compute(double[] A, double[] B, double[] C,
                 int M, int N, int D);
}

class MatMulFactory {
    // Bound once, at class initialisation; every caller shares the choice.
    static final MatMul INSTANCE = create();

    private static MatMul create() {
        try {
            // Probe for the incubator module without linking against it;
            // if the flag or module is missing this throws rather than links.
            Class.forName("jdk.incubator.vector.DoubleVector");
            return new VectorApiMatMul();
        } catch (ClassNotFoundException | NoClassDefFoundError e) {
            return new ScalarMatMul();
        }
    }
}
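To make the call site concrete, here is a self-contained sketch. The ScalarMatMul below is a deliberately naive placeholder, not the tuned fallback discussed later, and the row-major layout (A is M×D, B is D×N, C is M×N) and the Demo class are assumptions for illustration:

```java
interface MatMul {
    void compute(double[] A, double[] B, double[] C, int M, int N, int D);
}

// Naive placeholder implementation; fixes the contract, not the performance.
// Row-major: A is M x D, B is D x N, C is M x N.
class ScalarMatMul implements MatMul {
    public void compute(double[] A, double[] B, double[] C, int M, int N, int D) {
        for (int m = 0; m < M; m++) {
            for (int n = 0; n < N; n++) {
                double sum = 0.0;
                for (int d = 0; d < D; d++) {
                    sum += A[m * D + d] * B[d * N + n];
                }
                C[m * N + n] = sum;
            }
        }
    }
}

class Demo {
    public static void main(String[] args) {
        // In practice this would be MatMulFactory.INSTANCE.
        MatMul mm = new ScalarMatMul();
        double[] A = {1, 2, 3, 4};   // 2x2
        double[] B = {5, 6, 7, 8};   // 2x2
        double[] C = new double[4];
        mm.compute(A, B, C, 2, 2, 2);
        System.out.println(java.util.Arrays.toString(C)); // [19.0, 22.0, 43.0, 50.0]
    }
}
```

Callers never learn which implementation they got; the interface is the only contract.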

Three deployment properties this enables

  1. Correctness in every environment. The scalar fallback is the functional contract; the Vector API path is an opportunistic speedup. Operator error with JVM flags can't break the service.
  2. Drop-in upgrade path. When the Vector API graduates out of incubation, the factory swaps to the stable-API class without the service-owning team touching anything.
  3. Per-host adaptation. The Vector API's own DoubleVector.SPECIES_PREFERRED picks the widest lane width available on the host (4 doubles on AVX2, 8 on AVX-512, 2 on 128-bit NEON) — no platform-specific code in the service.
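The per-host adaptation can be sketched with the Vector API directly. This compiles only under --add-modules=jdk.incubator.vector, and everything beyond the Vector API's own names (the class and method names here) is illustrative, not Netflix's code:

```java
// Requires: java --add-modules=jdk.incubator.vector
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

class VectorDot {
    // SPECIES_PREFERRED resolves to the widest lane width the host supports:
    // 4 doubles on AVX2, 8 on AVX-512, 2 on 128-bit NEON.
    static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;

    static double dot(double[] a, double[] b) {
        DoubleVector acc = DoubleVector.zero(SPECIES);
        int i = 0;
        int upper = SPECIES.loopBound(a.length); // largest multiple of the lane count
        for (; i < upper; i += SPECIES.length()) {
            DoubleVector va = DoubleVector.fromArray(SPECIES, a, i);
            DoubleVector vb = DoubleVector.fromArray(SPECIES, b, i);
            acc = va.fma(vb, acc); // fused multiply-add accumulator
        }
        double sum = acc.reduceLanes(VectorOperators.ADD);
        for (; i < a.length; i++) sum += a[i] * b[i]; // scalar tail
        return sum;
    }
}
```

The same bytecode adapts per host: SPECIES and loopBound do the platform dispatch that would otherwise be hand-written per CPU.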

Scalar fallback quality matters

The fallback isn't just "the naive nested loop." Netflix's scalar path is a hand-optimised loop-unrolled dot product inspired by Lucene's VectorUtilDefaultProvider. Two reasons to invest:

  • In failure modes where the Vector API isn't available, the service should still perform well — not fall off a cliff.
  • The scalar path is the correctness oracle during development; keeping it well-tuned catches bugs where the vector path miscomputes tail elements or boundary cases.
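Netflix's exact kernel isn't reproduced here; as a hedged illustration, an unrolled scalar dot product in the Lucene style looks roughly like this (the unroll factor of 4 and the class name are assumptions):

```java
class ScalarDot {
    static double dot(double[] a, double[] b) {
        // Four independent accumulators break the loop-carried dependency,
        // letting the CPU keep several multiply-adds in flight at once.
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int i = 0;
        int upper = a.length & ~3; // largest multiple of 4
        for (; i < upper; i += 4) {
            s0 += a[i]     * b[i];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
        }
        double sum = s0 + s1 + s2 + s3;
        for (; i < a.length; i++) sum += a[i] * b[i]; // tail elements
        return sum;
    }
}
```

Note the tail loop: it is exactly the boundary case the scalar oracle helps verify against the vector path.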

Why this beats JNI-based alternatives

Alternative capability-dispatch strategies exist outside pure Java — native kernels compiled per-platform behind a JNI bridge, or jextract-generated FFI bindings. They satisfy the "safe fallback" property less well than pure-Java dispatch:

  • JNI adds transition overhead on every call.
  • Native kernels need per-platform builds + shipping, plus a fallback anyway for unsupported platforms.
  • The fallback itself is a different language/toolchain than the hot path, fragmenting the build + testing surface.

Pure-Java SIMD + scalar fallback keeps the entire pipeline in one language with one build.

Caveats

  • Class-load dispatch is one-shot. The factory decides once per process; dynamic CPU feature additions (hotplug, VM migration) are not accommodated. For server workloads this is almost always fine.
  • Version pinning the detection class. Using Class.forName on a concrete incubator class couples the detection to the current class name; when the API graduates, the detector must update.
  • Benchmark discipline. Scalar and vector kernels must be benchmarked with the same memory-layout assumptions (patterns/flat-buffer-threadlocal-reuse) — otherwise dispatch can inadvertently mask a regression on the scalar path.
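One way to keep the scalar path exercisable on Vector-capable hosts — for benchmarks and as the correctness oracle — is an explicit override in the factory. The matmul.forceScalar property below is a hypothetical name for illustration, not something from the source:

```java
class DispatchWithOverride {
    static boolean vectorApiPresent() {
        try {
            Class.forName("jdk.incubator.vector.DoubleVector");
            return true;
        } catch (ClassNotFoundException | NoClassDefFoundError e) {
            return false;
        }
    }

    // Hypothetical escape hatch: -Dmatmul.forceScalar=true pins the scalar
    // implementation even when the Vector API is present, so both kernels
    // can be benchmarked on the same host under the same memory layout.
    static String choose() {
        boolean forceScalar = Boolean.getBoolean("matmul.forceScalar");
        return (!forceScalar && vectorApiPresent()) ? "vector" : "scalar";
    }
}
```

Without such a switch, the scalar kernel only ever runs on hosts where the vector one cannot — which is precisely where a masked regression hurts most.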

Seen in

  • sources/2026-03-03-netflix-optimizing-recommendation-systems-with-jdks-vector-api — Netflix Ranker binds MatMulFactory at class load: if jdk.incubator.vector is present, uses a Vector API matmul with fma() accumulators; otherwise falls back to a Lucene-inspired scalar loop-unrolled dot product. Single-video requests continue on the per-item implementation unchanged. Netflix frames the fallback as a production safety property, not just a performance detail.