Batched matmul for pairwise similarity

Problem
A service needs to score every candidate against every reference
item via a per-pair similarity function — cosine, dot product,
Euclidean, etc. The naive implementation is a nested loop over M
candidates and N reference items computing M × N independent
per-pair kernels. Even when each kernel is cheap, the
pattern is expensive at scale:
- Repeated lookups of the same reference embedding across candidates defeat cache locality.
- Scattered memory access across non-contiguous embedding storage prevents SIMD vectorisation.
- Per-pair function-call overhead dominates when `D` (the embedding dimension) is small to moderate.
- The JIT can't hoist much — each pair is its own tiny computation.
Solution

Reshape the M × N per-pair computation into a single matrix
multiply:
- Stack all candidate embeddings into a dense matrix `A` of shape `M × D`.
- Stack all reference embeddings into a dense matrix `B` of shape `N × D`.
- Unit-normalise both (`‖row‖ = 1`) if the kernel is cosine similarity; normalisation is now amortised once per matrix, not once per pair.
- Compute `C = A × Bᵀ`. `C[i][j]` is the cosine similarity between candidate `i` and reference `j`.
- Reduce `C` per row to get the per-candidate aggregate (max, top-k, etc.).
C = A × Bᵀ is what CPUs and GPUs are built to do fast — an
architecture-optimised path exists for it on every platform via
concepts/matrix-multiplication-accumulate hardware primitives
(CPU SIMD + FMA, GPU tensor cores).
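The steps above can be sketched in plain Java over flat row-major buffers. This is an illustrative sketch, not Netflix's actual code; the class and method names are invented, and the kernel is written as scalar loops (a real deployment would swap the inner loop for a Vector API or BLAS kernel):

```java
// Sketch: batched cosine similarity as C = A × Bᵀ over flat
// row-major buffers. All names here are illustrative.
public final class BatchedCosine {

    /** L2-normalise each row of a flat rows×d row-major buffer in place. */
    static void normaliseRows(float[] m, int rows, int d) {
        for (int i = 0; i < rows; i++) {
            int base = i * d;
            double sq = 0.0;
            for (int j = 0; j < d; j++) sq += (double) m[base + j] * m[base + j];
            float inv = (float) (1.0 / Math.sqrt(sq));
            for (int j = 0; j < d; j++) m[base + j] *= inv;
        }
    }

    /** C = A × Bᵀ: C[i * n + j] = dot(row i of A, row j of B). */
    static float[] matmulTransposed(float[] a, int m, float[] b, int n, int d) {
        float[] c = new float[m * n];
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                float acc = 0f;
                for (int k = 0; k < d; k++) acc += a[i * d + k] * b[j * d + k];
                c[i * n + j] = acc;
            }
        }
        return c;
    }

    public static void main(String[] args) {
        int d = 3;
        float[] a = {1, 0, 0,  0, 2, 0};   // M = 2 candidate embeddings
        float[] b = {1, 0, 0,  0, 0, 5};   // N = 2 reference embeddings
        normaliseRows(a, 2, d);            // amortised once per matrix
        normaliseRows(b, 2, d);
        float[] c = matmulTransposed(a, 2, b, 2, d);
        System.out.println(c[0]);          // candidate 0 vs reference 0 → 1.0
        System.out.println(c[3]);          // candidate 1 vs reference 1 → 0.0
    }
}
```

Because both matrices are unit-normalised up front, the single matmul yields every pairwise cosine similarity; a per-row reduction over `c` then gives the per-candidate aggregate.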
Why this helps

- Cache-friendly. Both matrices are contiguous; rows fit in L1/L2 and stream sequentially through SIMD units.
- SIMD-native. The inner loop is `D` parallel multiply-adds with trivial data dependence — exactly the shape vector hardware accelerates. A single `fma` instruction accumulates 4 (AVX2) or 8 (AVX-512) doubles per step.
- Amortised overhead. Any per-call setup (thread handoff, JIT warm-up, layout conversion) pays once per batch instead of `M × N` times.
- Composable with downstream steps. The output matrix `C` is itself amenable to further vectorised operations (row-max for top-1, row-softmax for attention, threshold filtering).
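The multiply-add accumulation at the heart of the kernel can be shown with `java.lang.Math.fma` (present since Java 9), the scalar counterpart of the per-lane SIMD instruction; the Vector API applies the same operation across all lanes at once. A minimal sketch:

```java
// Sketch: dot product accumulated with fused multiply-add.
// Math.fma(x, y, z) computes x * y + z with a single rounding and
// compiles to the hardware FMA instruction where one is available.
public final class FmaDot {
    static double dot(double[] a, double[] b) {
        double acc = 0.0;
        for (int k = 0; k < a.length; k++) {
            acc = Math.fma(a[k], b[k], acc); // one multiply-add per step
        }
        return acc;
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0, 3.0};
        double[] y = {4.0, 5.0, 6.0};
        System.out.println(dot(x, y)); // 1*4 + 2*5 + 3*6 = 32.0
    }
}
```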
Caveats

- Batching overhead dominates if `M` or `N` is small. For a single candidate against a handful of references, the per-pair loop may still win. Netflix's Ranker keeps the single-item implementation for the ~98% of requests that ship one candidate.
- The reshape alone is not enough. The first implementation must co-design memory layout — flat row-major buffers, not `double[][]` — and allocation strategy (patterns/flat-buffer-threadlocal-reuse), or the per-batch allocation and GC pressure can erase the kernel win. Netflix measured a ~5% regression on its first cut of batching precisely for this reason.
- Library choice matters. Netflix tried BLAS via `netlib-java` and lost to JNI transition overhead plus row-vs-column-major translation. Pure-Java SIMD via the JDK Vector API won instead.
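The layout caveat can be illustrated in a few lines (names are illustrative, not from the source): a flat row-major buffer keeps all `M × D` values in one contiguous array, while `double[][]` stores each row as a separate heap object, costing a pointer chase per row and breaking sequential streaming:

```java
// Sketch of the layout caveat: flat row-major vs. jagged double[][].
public final class FlatLayout {
    static final int D = 4; // embedding dimension (illustrative)

    // Flat storage: element (i, j) lives at i * D + j — one contiguous
    // array, so consecutive k in the dot-product loop are adjacent in memory.
    static double get(double[] flat, int i, int j) {
        return flat[i * D + j];
    }

    public static void main(String[] args) {
        double[] flat = new double[2 * D];
        double[][] jagged = new double[2][D]; // each row a separate object
        flat[1 * D + 2] = 7.0;
        jagged[1][2] = 7.0;
        System.out.println(get(flat, 1, 2) == jagged[1][2]); // same value,
        // but only the flat buffer streams contiguously through SIMD units
    }
}
```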
Seen in

- sources/2026-03-03-netflix-optimizing-recommendation-systems-with-jdks-vector-api
— Netflix Ranker's video serendipity scoring reshaped from
`O(M × N)` per-pair cosine similarities to `C = A × Bᵀ`. Combined with the flat-buffer rewrite and Vector API kernel, this drove the feature's CPU share from 7.5% → ~1% of total Ranker node CPU.
Related
- concepts/cosine-similarity — the canonical per-pair kernel this pattern batches.
- concepts/matrix-multiplication-accumulate — the underlying hardware primitive.
- concepts/fused-multiply-add — the per-lane SIMD instruction.
- concepts/vector-embedding — the input to each row.
- patterns/flat-buffer-threadlocal-reuse — the enabling memory layout; pairing is load-bearing.
- patterns/runtime-capability-dispatch-pure-java-simd — how to deploy the SIMD kernel safely when the API is incubating.
- systems/jdk-vector-api — Netflix's chosen SIMD substrate.