
Flat-buffer + ThreadLocal reuse

Problem

A hot path needs short-lived working buffers — input matrices, scratch space, output accumulators — on every request. The natural JVM idiom of allocating fresh double[M][D] arrays per call has two compounding costs:

  1. GC pressure. Short-lived, large allocations fill the young generation quickly, triggering frequent minor collections that steal CPU from the hot path and cause latency jitter.
  2. Non-contiguous memory layout. double[][] in the JVM is a reference array where each row is a separately-allocated double[]; walking across rows chases pointers through heap, defeating cache locality and preventing SIMD-friendly sequential access.

The combined effect: even a correct algorithm pays a structural overhead that can erase wins from better kernels (SIMD, BLAS, etc.).

Solution

Two co-operating changes:

1. Flatten to double[] row-major

Replace double[M][D] with a single flat double[] of length M * D, indexed as buf[i * D + k] for row i, column k (row-major). One allocation, contiguous memory, predictable strides that vectorise cleanly.
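A minimal sketch of the indexing change (class and method names here are illustrative, not from the source):

```java
// Flat row-major buffer: one contiguous allocation instead of M separate rows.
class FlatDemo {
    // Row-major offset of row i, column k in an M x D matrix.
    static int idx(int i, int k, int D) {
        return i * D + k;
    }

    public static void main(String[] args) {
        int M = 3, D = 4;
        double[][] nested = new double[M][D]; // M + 1 allocations, rows scattered across the heap
        double[] flat = new double[M * D];    // one contiguous allocation

        // The same logical element in both layouts:
        nested[2][1] = 42.0;
        flat[idx(2, 1, D)] = 42.0;
    }
}
```

Walking `flat` with a single inner loop over `i * D + k` touches memory sequentially, which is what makes the stride predictable for the hardware prefetcher and for vectorised kernels.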

2. Reuse per-thread via ThreadLocal<BufferHolder>

class BufferHolder {
  double[] candidatesFlat = new double[0];
  double[] historyFlat = new double[0];
  double[] scratchFlat = new double[0];

  double[] getCandidatesFlat(int required) {
    if (candidatesFlat.length < required) {
      candidatesFlat = new double[required]; // grow, never shrink
    }
    return candidatesFlat;
  }

  double[] getHistoryFlat(int required) {
    if (historyFlat.length < required) {
      historyFlat = new double[required];
    }
    return historyFlat;
  }

  double[] getScratchFlat(int required) {
    if (scratchFlat.length < required) {
      scratchFlat = new double[required];
    }
    return scratchFlat;
  }
}

private static final ThreadLocal<BufferHolder> threadBuffers =
    ThreadLocal.withInitial(BufferHolder::new);

Key properties:

  • Grow-but-never-shrink. Buffers expand to the largest request this thread has seen, then stay sized. Steady-state allocation is zero; peak allocation is one array per buffer per thread.
  • Per-thread ownership. No cross-thread contention — each thread reads and writes its own BufferHolder without locks.
  • No explicit cleanup. Buffers live as long as the thread; sized to workload, not to request.
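Putting the two pieces together, a hot-path call site might look like the following sketch (the `Scorer` class and its `score` method are illustrative, not from the source):

```java
// Hot-path sketch: acquire a per-thread flat buffer, fill it row-major, never free it.
class Scorer {
    static class BufferHolder {
        double[] candidatesFlat = new double[0];

        double[] getCandidatesFlat(int required) {
            if (candidatesFlat.length < required) {
                candidatesFlat = new double[required]; // grow, never shrink
            }
            return candidatesFlat;
        }
    }

    private static final ThreadLocal<BufferHolder> threadBuffers =
        ThreadLocal.withInitial(BufferHolder::new);

    double score(int M, int D) {
        // Steady state: zero allocation once the buffer has reached this thread's peak size.
        double[] buf = threadBuffers.get().getCandidatesFlat(M * D);
        double sum = 0.0;
        for (int i = 0; i < M; i++) {
            for (int k = 0; k < D; k++) {
                buf[i * D + k] = i + k; // row-major fill
                sum += buf[i * D + k];
            }
        }
        return sum;
    }
}
```

Note that the buffer is requested by size, not by shape: callers pass `M * D` and do their own row-major indexing, so one holder serves any matrix dimensions the thread encounters.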

When this is worth it

  • Hot paths with predictable buffer sizes (e.g. matrix dimensions driven by embedding size + batch size).
  • Pipelines where SIMD-style kernels (e.g. the JDK Vector API) are on the table: the flat layout provides the contiguous, stride-1 access those kernels need, which double[][] prevents.
  • Services where GC-induced tail-latency jitter is visible in p99 or p99.9 metrics.

Caveats

  • ThreadLocal retention on pooled-thread executors. In frameworks with long-lived worker threads (Tomcat, Netty, Spring Boot), ThreadLocal buffers can add up to substantial heap usage across the fleet of threads. Budget peak buffer size × thread-pool size and verify it fits in the heap.
  • Not useful for irregular workloads. If buffer sizes vary wildly call-to-call, the grow-never-shrink contract holds the peak allocation indefinitely — potentially worse than per-call allocation + collection.
  • Escape analysis may already be doing this. For small, clearly-scoped local allocations, the JIT may scalar-replace them entirely; profile before adding pooling machinery.
  • Virtual threads change the calculus. With Java 21 virtual threads, ThreadLocal is per-VT, not per-carrier, and the number of VTs can be orders of magnitude larger than the carrier pool — blowing the "peak per thread" budget. Use ScopedValue or a carrier-thread-local alternative in VT-heavy services.
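For VT-heavy services, one carrier-friendly alternative is a bounded shared pool sized to the carrier pool rather than to the virtual-thread count. A hedged sketch, with all names hypothetical:

```java
import java.util.concurrent.ArrayBlockingQueue;

// Sketch: cap the number of live buffers at a fixed bound instead of one per (virtual) thread.
class BufferPool {
    private final ArrayBlockingQueue<double[]> pool;
    private final int bufferLen;

    BufferPool(int maxBuffers, int bufferLen) {
        this.pool = new ArrayBlockingQueue<>(maxBuffers);
        this.bufferLen = bufferLen;
    }

    // Lease a buffer; allocate fresh only when the pool is empty.
    double[] acquire() {
        double[] buf = pool.poll();
        return (buf != null) ? buf : new double[bufferLen];
    }

    // Return the buffer for reuse; silently drop it if the pool is already full.
    void release(double[] buf) {
        pool.offer(buf);
    }
}
```

Unlike ThreadLocal, this trades lock-free per-thread ownership for a bounded worst case: at most `maxBuffers` arrays are retained no matter how many virtual threads pass through, at the cost of queue operations on acquire/release.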

Seen in

  • sources/2026-03-03-netflix-optimizing-recommendation-systems-with-jdks-vector-api — Netflix Ranker's serendipity scoring hot path. The first-cut batched implementation regressed ~5% because double[][] per-batch allocations caused GC pressure and non-contiguous access. Replacing with flat double[] row-major buffers behind a ThreadLocal<BufferHolder> eliminated per-request allocation, improved cache locality, and made the subsequent JDK Vector API kernel pay off. Netflix frames this step as the load-bearing substrate without which the kernel swap would have stayed a wash.