
Flat-buffer + ThreadLocal reuse

Problem

A hot path needs short-lived working buffers — input matrices, scratch space, output accumulators — on every request. The natural JVM idiom of allocating fresh double[M][D] arrays per call has two compounding costs:

  1. GC pressure. Short-lived, large allocations fill the young generation quickly, triggering frequent minor collections that steal CPU from the hot path and cause latency jitter.
  2. Non-contiguous memory layout. double[][] in the JVM is a reference array where each row is a separately-allocated double[]; walking across rows chases pointers through heap, defeating cache locality and preventing SIMD-friendly sequential access.

The combined effect: even a correct algorithm pays a structural overhead that can erase wins from better kernels (SIMD, BLAS, etc.).

Solution

Two co-operating changes:

1. Flatten to double[] row-major

Replace double[M][D] with a single flat double[] of length M * D, indexed as buf[i * D + k] for row i, column k (row-major). One allocation, contiguous memory, predictable strides that vectorise cleanly.
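A minimal sketch of the indexing change (class and method names here are illustrative, not from the source):

```java
// Flat row-major buffer: one contiguous allocation instead of M separate rows.
class FlatDemo {
    // Row-major offset of row i, column k in an M x D matrix.
    static int idx(int i, int k, int D) {
        return i * D + k;
    }

    public static void main(String[] args) {
        int M = 3, D = 4;
        double[][] nested = new double[M][D]; // M + 1 allocations, rows scattered across the heap
        double[] flat = new double[M * D];    // one contiguous allocation

        // The same logical element in both layouts:
        nested[2][1] = 42.0;
        flat[idx(2, 1, D)] = 42.0;
    }
}
```

Walking `flat` with a single inner loop over `i * D + k` touches memory sequentially, which is what makes the stride predictable for the hardware prefetcher and for vectorised kernels.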

2. Reuse per-thread via ThreadLocal<BufferHolder>

class BufferHolder {
  double[] candidatesFlat = new double[0];
  double[] historyFlat = new double[0];
  double[] scratchFlat = new double[0];

  double[] getCandidatesFlat(int required) {
    if (candidatesFlat.length < required) {
      candidatesFlat = new double[required]; // grow, never shrink
    }
    return candidatesFlat;
  }

  double[] getHistoryFlat(int required) {
    if (historyFlat.length < required) {
      historyFlat = new double[required];
    }
    return historyFlat;
  }

  double[] getScratchFlat(int required) {
    if (scratchFlat.length < required) {
      scratchFlat = new double[required];
    }
    return scratchFlat;
  }
}

private static final ThreadLocal<BufferHolder> threadBuffers =
    ThreadLocal.withInitial(BufferHolder::new);

Key properties:

  • Grow-but-never-shrink. Buffers expand to the largest request this thread has seen, then stay sized. Steady-state allocation is zero; peak allocation is one array per buffer per thread.
  • Per-thread ownership. No cross-thread contention — each thread reads and writes its own BufferHolder without locks.
  • No explicit cleanup. Buffers live as long as the thread; sized to workload, not to request.
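Putting the two pieces together, a hot-path call site might look like the following sketch (the `Scorer` class and its `score` method are illustrative, not from the source):

```java
// Hot-path sketch: acquire a per-thread flat buffer, fill it row-major, never free it.
class Scorer {
    static class BufferHolder {
        double[] candidatesFlat = new double[0];

        double[] getCandidatesFlat(int required) {
            if (candidatesFlat.length < required) {
                candidatesFlat = new double[required]; // grow, never shrink
            }
            return candidatesFlat;
        }
    }

    private static final ThreadLocal<BufferHolder> threadBuffers =
        ThreadLocal.withInitial(BufferHolder::new);

    double score(int M, int D) {
        // Steady state: zero allocation once the buffer has reached this thread's peak size.
        double[] buf = threadBuffers.get().getCandidatesFlat(M * D);
        double sum = 0.0;
        for (int i = 0; i < M; i++) {
            for (int k = 0; k < D; k++) {
                buf[i * D + k] = i + k; // row-major fill
                sum += buf[i * D + k];
            }
        }
        return sum;
    }
}
```

Note that the buffer is requested by size, not by shape: callers pass `M * D` and do their own row-major indexing, so one holder serves any matrix dimensions the thread encounters.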

When this is worth it

  • Hot paths with predictable buffer sizes (e.g. matrix dimensions driven by embedding size + batch size).
  • Pipelines where SIMD-style kernels (e.g. the JDK Vector API) are on the table: the flat layout provides the contiguous, stride-1 access those kernels need, which double[][] prevents.
  • Services where GC-induced tail-latency jitter is visible in p99 or p99.9 metrics.

Caveats

  • ThreadLocal retention on pooled-thread executors. In frameworks with long-lived worker threads (Tomcat, Netty, Spring Boot), ThreadLocal buffers can add up to substantial heap usage across the fleet of threads. Budget peak buffer size × thread-pool size and verify it fits in the heap.
  • Not useful for irregular workloads. If buffer sizes vary wildly call-to-call, the grow-never-shrink contract holds the peak allocation indefinitely — potentially worse than per-call allocation + collection.
  • Escape analysis may already be doing this. For small, clearly-scoped local allocations, the JIT may scalar-replace them entirely; profile before adding pooling machinery.
  • Virtual threads change the calculus. With Java 21 virtual threads, ThreadLocal is per-VT, not per-carrier, and the number of VTs can be orders of magnitude larger than the carrier pool — blowing the "peak per thread" budget. Use ScopedValue or a carrier-thread-local alternative in VT-heavy services.
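For VT-heavy services, one carrier-friendly alternative is a bounded shared pool sized to the carrier pool rather than to the virtual-thread count. A hedged sketch, with all names hypothetical:

```java
import java.util.concurrent.ArrayBlockingQueue;

// Sketch: cap the number of live buffers at a fixed bound instead of one per (virtual) thread.
class BufferPool {
    private final ArrayBlockingQueue<double[]> pool;
    private final int bufferLen;

    BufferPool(int maxBuffers, int bufferLen) {
        this.pool = new ArrayBlockingQueue<>(maxBuffers);
        this.bufferLen = bufferLen;
    }

    // Lease a buffer; allocate fresh only when the pool is empty.
    double[] acquire() {
        double[] buf = pool.poll();
        return (buf != null) ? buf : new double[bufferLen];
    }

    // Return the buffer for reuse; silently drop it if the pool is already full.
    void release(double[] buf) {
        pool.offer(buf);
    }
}
```

Unlike ThreadLocal, this trades lock-free per-thread ownership for a bounded worst case: at most `maxBuffers` arrays are retained no matter how many virtual threads pass through, at the cost of queue operations on acquire/release.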

Seen in

  • sources/2026-03-03-netflix-optimizing-recommendation-systems-with-jdks-vector-api — Netflix Ranker's serendipity scoring hot path. The first-cut batched implementation regressed ~5% because double[][] per-batch allocations caused GC pressure and non-contiguous access. Replacing with flat double[] row-major buffers behind a ThreadLocal<BufferHolder> eliminated per-request allocation, improved cache locality, and made the subsequent JDK Vector API kernel pay off. Netflix frames this step as the load-bearing substrate without which the kernel swap would have stayed a wash.