PATTERN Cited by 1 source
Flat-buffer + ThreadLocal reuse¶
Problem¶
A hot path needs short-lived working buffers — input matrices,
scratch space, output accumulators — on every request. The natural
JVM idiom of allocating fresh double[M][D] arrays per call has two
compounding costs:
- GC pressure. Short-lived, large allocations fill the young generation quickly, triggering frequent minor collections that steal CPU from the hot path and cause latency jitter.
- Non-contiguous memory layout. double[][] in the JVM is a reference array where each row is a separately-allocated double[]; walking across rows chases pointers through the heap, defeating cache locality and preventing SIMD-friendly sequential access.
The combined effect: even a correct algorithm pays a structural overhead that can erase wins from better kernels (SIMD, BLAS, etc.).
Solution¶
Two co-operating changes:
1. Flatten to double[] row-major¶
Replace double[M][D] with a single flat double[] of length
M * D, indexed as buf[i * D + k] for row i, column k
(row-major). One allocation,
contiguous memory, predictable strides that vectorise cleanly.
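A minimal sketch of the indexing convention (class and method names are illustrative, not from the source):

```java
public class FlatIndexDemo {
    // Element (i, k) of an M x D row-major matrix lives at buf[i * D + k].
    static double get(double[] buf, int D, int i, int k) {
        return buf[i * D + k];
    }

    // Walking one row is a stride-1 scan over contiguous memory.
    static double sumRow(double[] buf, int D, int i) {
        double s = 0.0;
        for (int k = 0; k < D; k++) {
            s += buf[i * D + k];
        }
        return s;
    }

    public static void main(String[] args) {
        int M = 3, D = 4;
        double[] buf = new double[M * D];    // one contiguous allocation
        for (int i = 0; i < M; i++) {
            for (int k = 0; k < D; k++) {
                buf[i * D + k] = i * 10 + k; // fill in row-major order
            }
        }
        assert get(buf, D, 1, 2) == 12.0;
        assert sumRow(buf, D, 2) == 86.0;    // 20 + 21 + 22 + 23
    }
}
```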
2. Reuse per-thread via ThreadLocal<BufferHolder>¶
class BufferHolder {
    double[] candidatesFlat = new double[0];
    double[] historyFlat = new double[0];
    double[] scratchFlat = new double[0];

    double[] getCandidatesFlat(int required) {
        if (candidatesFlat.length < required) {
            candidatesFlat = new double[required];
        }
        return candidatesFlat;
    }

    // similar for historyFlat, scratchFlat
}
private static final ThreadLocal<BufferHolder> threadBuffers =
ThreadLocal.withInitial(BufferHolder::new);
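A hypothetical call site might look like the sketch below. The names scoreBatch, candidateCount, and dim are invented for illustration, and BufferHolder is trimmed to one buffer to keep the sketch self-contained:

```java
public class HotPathDemo {
    // Trimmed copy of the pattern's holder, kept to one buffer for brevity.
    static class BufferHolder {
        double[] candidatesFlat = new double[0];

        double[] getCandidatesFlat(int required) {
            if (candidatesFlat.length < required) {
                candidatesFlat = new double[required]; // grow, never shrink
            }
            return candidatesFlat;
        }
    }

    private static final ThreadLocal<BufferHolder> threadBuffers =
            ThreadLocal.withInitial(BufferHolder::new);

    // Hypothetical hot-path entry point: candidateCount and dim stand in for
    // batch size and embedding dimension.
    static double[] scoreBatch(int candidateCount, int dim) {
        double[] candidates =
                threadBuffers.get().getCandidatesFlat(candidateCount * dim);
        // ... fill candidates[i * dim + k] and run the scoring kernel over it ...
        return candidates;
    }

    public static void main(String[] args) {
        double[] a = scoreBatch(8, 16);   // first call on this thread: grows to 128
        double[] b = scoreBatch(4, 16);   // smaller request: the same array is reused
        assert a == b && a.length == 128; // steady state: zero allocation per call
    }
}
```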
Key properties:
- Grow-but-never-shrink. Buffers expand to the largest request this thread has seen, then stay sized. Steady-state allocation is zero; peak allocation is one array per buffer per thread.
- Per-thread ownership. No cross-thread contention — each thread reads and writes its own BufferHolder without locks.
- No explicit cleanup. Buffers live as long as the thread; sized to workload, not to request.
When this is worth it¶
- Hot paths with predictable buffer sizes (e.g. matrix dimensions driven by embedding size + batch size).
- Pipelines that already make SIMD-style vectorisation available — the flat-buffer rewrite enables SIMD speedups that double[][] prevents.
- Services where GC-induced tail-latency jitter is visible in p99 or p99.9 metrics.
Caveats¶
- ThreadLocal retention on pooled-thread executors. In frameworks with long-lived worker threads (Tomcat, Netty, Spring Boot), ThreadLocal buffers can sum to substantial heap usage across a fleet of threads. Size peak workload × thread pool size and verify it fits the heap.
- Not useful for irregular workloads. If buffer sizes vary wildly call-to-call, the grow-never-shrink contract holds the peak allocation indefinitely — potentially worse than per-call allocation + collection.
- Escape analysis may already be doing this. For small, clearly-scoped local allocations, the JIT may scalar-replace them; profile first.
- Virtual threads change the calculus. With Java 21 virtual threads, ThreadLocal is per-VT, not per-carrier, and the number of VTs can be orders of magnitude larger than the carrier pool — blowing the "peak per thread" budget. Use ScopedValue or a carrier-thread-local alternative in VT-heavy services.
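The "peak workload × thread pool size" check from the first caveat can be made concrete. A back-of-envelope sketch — every figure below (200 threads, three buffers each, a 1024 × 64 peak batch) is assumed for illustration, not from the source:

```java
public class HeapBudget {
    public static void main(String[] args) {
        // All figures below are assumed for illustration, not from the source.
        int threads = 200;              // long-lived worker threads
        int buffersPerThread = 3;       // candidatesFlat, historyFlat, scratchFlat
        long peakDoubles = 1024L * 64;  // peak batch: 1024 rows x 64-dim embedding
        long bytes = threads * buffersPerThread * peakDoubles * Double.BYTES;
        System.out.printf("retained buffer heap: %d MiB%n", bytes / (1024 * 1024));
        // prints: retained buffer heap: 300 MiB
    }
}
```

At those numbers the fleet of holders retains about 300 MiB indefinitely — acceptable or not depending on heap size, which is exactly the verification the caveat asks for.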
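One carrier-agnostic alternative for VT-heavy services is a small bounded pool of holders, sized to the carrier pool rather than the VT count. This is a sketch of that idea, not the source's implementation; BufferPool and withBuffers are invented names:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.function.Function;

public class BufferPool {
    static class BufferHolder {
        double[] scratchFlat = new double[0];

        double[] getScratchFlat(int required) {
            if (scratchFlat.length < required) {
                scratchFlat = new double[required]; // grow, never shrink
            }
            return scratchFlat;
        }
    }

    // Bounded by (roughly) the carrier pool size, so retained heap stays at
    // poolSize x peak rather than numVTs x peak.
    private final ArrayBlockingQueue<BufferHolder> pool;

    BufferPool(int poolSize) {
        pool = new ArrayBlockingQueue<>(poolSize);
        for (int i = 0; i < poolSize; i++) {
            pool.add(new BufferHolder());
        }
    }

    <T> T withBuffers(int required, Function<double[], T> work)
            throws InterruptedException {
        BufferHolder h = pool.take();   // blocks if all holders are in use
        try {
            return work.apply(h.getScratchFlat(required));
        } finally {
            pool.put(h);                // return the holder for reuse
        }
    }

    public static void main(String[] args) throws Exception {
        BufferPool p = new BufferPool(2);
        int len = p.withBuffers(128, buf -> buf.length);
        assert len == 128;
    }
}
```

The trade-off versus ThreadLocal is a queue operation per call; the gain is a hard cap on retained buffer heap that virtual-thread counts cannot blow through.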
Seen in¶
- sources/2026-03-03-netflix-optimizing-recommendation-systems-with-jdks-vector-api — Netflix Ranker's serendipity scoring hot path. The first-cut batched implementation regressed ~5% because double[][] per-batch allocations caused GC pressure and non-contiguous access. Replacing with flat double[] row-major buffers behind a ThreadLocal<BufferHolder> eliminated per-request allocation, improved cache locality, and made the subsequent JDK Vector API kernel pay off. Netflix frames this step as the load-bearing substrate without which the kernel swap would have stayed a wash.
Related¶
- concepts/cache-locality — the axis flat buffers restore.
- concepts/row-vs-column-major-layout — which direction to pick when flattening.
- patterns/batched-matmul-for-pairwise-similarity — the algorithmic shape this pattern enables to pay off.
- patterns/runtime-capability-dispatch-pure-java-simd — the co-deployed pattern for SIMD kernels.
- systems/jdk-vector-api — the consumer of flat buffers Netflix eventually landed on.