
PATTERN

Zero-allocation cache payload

Problem

On a low-latency JVM serving tier, GC pauses directly become tail latency: a 50 ms young-generation pause on a Netty NIO thread freezes every channel that thread owns. The dominant source of young-gen pressure in cache-heavy serving tiers is the cached value's object graph, plus the transient allocations created while assembling a response — decoded JSON trees, map/list wrappers, per-request byte-buffer copies, gunzip-then-regzip pipelines.

On Zalando's PRAPI (1,000-line product JSON, single-digit-ms P99 target), these allocations are the delta between "blazingly fast" and "ordinary fast."

Pattern

Apply the principle the post extracts:

"The best way to eliminate GC pauses is to avoid object allocation altogether." (Source: sources/2025-03-06-zalando-from-event-driven-chaos-to-a-blazingly-fast-serving-api.)

Two concrete tactics, both from PRAPI:

1. Cache single items as ByteArray, not ObjectNode

Instead of deserialising into a parsed object graph (com.fasterxml.jackson.databind.node.ObjectNode), keep the value as a raw byte array matching the on-wire bytes of the response:

  • One allocation per cache entry instead of a tree of N nodes.
  • Zero deserialisation cost on cache read.
  • Zero re-serialisation cost on response (write the bytes straight into the response buffer).

Trade-off: you lose ergonomic in-process transformation of the value. The serving-tier code operates on opaque bytes, so format transforms have to happen at a different layer.
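A minimal Java sketch of this tactic, assuming a plain `ConcurrentHashMap` as the store (the post does not show PRAPI's code; the class and method names here are illustrative). The cached value is the exact on-wire `byte[]`, so a hit is one map lookup plus one stream write — no Jackson tree, no re-serialisation:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch: cache values are the serialized response bytes, not a parsed
// ObjectNode tree. One byte[] per entry; nothing to decode or re-encode
// on the hot path.
public final class RawByteCache {
    private final ConcurrentMap<String, byte[]> entries = new ConcurrentHashMap<>();

    /** Caller serialises (and optionally compresses) once, before caching. */
    public void put(String key, byte[] onWireBytes) {
        entries.put(key, onWireBytes);
    }

    /**
     * Writes the cached payload straight into the response stream.
     * Returns false on a miss so the caller can fall back to origin.
     */
    public boolean writeTo(String key, OutputStream response) throws IOException {
        byte[] bytes = entries.get(key);
        if (bytes == null) return false;
        response.write(bytes); // no decode, no re-encode
        return true;
    }
}
```

Note the inversion of responsibility: all serialisation (and any format negotiation) happens once, at `put` time, which is exactly the trade-off described below.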

2. Store pre-gzipped chunks in Okio buffers, concatenate for batch

For batch responses, instead of:

  • Decode each chunk from its gzipped form;
  • Merge them into one JSON array;
  • Re-gzip the result;

…keep each item already gzipped in an Okio Buffer. Concatenating Okio buffers is segment-pointer assignment, not byte copy. Gzip is self-concatenating — gzip(A) + gzip(B) is a valid gzip stream decodable as A + B. So the batch response is:

  • Iterate the requested item IDs.
  • For each, look up the already-gzipped chunk.
  • Append its Okio buffer to the response's Okio buffer (pointer, not copy).
  • Write the combined buffer straight to the wire.

Zero gunzip, zero re-gzip, zero intermediate ObjectNode/ArrayList allocation. The per-batch work is linear in item count at pointer-assignment cost.
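The gzip property the batch path rests on is the multi-member rule of RFC 1952: concatenated gzip members decode to the concatenation of their contents, and the JDK's `GZIPInputStream` reads members back-to-back. The whole flow can be sketched with the JDK alone (this is not PRAPI's code: plain `byte[]` copies stand in for Okio's segment-pointer moves, and the ids and helper names are illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public final class GzipBatch {
    /** Compress one item at cache-fill time; the hot path never does this. */
    public static byte[] gzip(byte[] plain) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(plain);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Batch response: append already-gzipped chunks. No gunzip, no re-gzip. */
    public static byte[] assemble(Map<String, byte[]> gzippedCache, List<String> ids) {
        ByteArrayOutputStream response = new ByteArrayOutputStream();
        for (String id : ids) {
            response.writeBytes(gzippedCache.get(id)); // each chunk is a gzip member
        }
        return response.toByteArray();
    }

    /** Client-side view: a multi-member gzip stream inflates to A + B + ... */
    public static byte[] gunzip(byte[] gzipped) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = gz.read(buf)) != -1) out.write(buf, 0, n);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

One detail the sketch leaves out: the decoded batch is item bodies back to back, not a JSON array, so the cached chunks have to carry their own framing (brackets/commas or newline delimiters) for the client to parse the result.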

Why this matters specifically on Netty

Netty's EventLoop threads serve many channels per thread. A GC pause on an EventLoop thread stalls every channel that thread owns. The two disciplines combine:

  • No blocking tasks on EventLoop — prescribed directly by the Netty model; PRAPI enforces it with JFR and JDK Mission Control inspection.
  • No allocations on EventLoop either, when avoidable — the zero-allocation cache payload is the discipline for the data the EventLoop touches most often.

Trade-offs

  • Transform cost moves elsewhere. If cache values are opaque bytes, any transformation (field projection, content-type negotiation) happens either before caching (pre-compute N formats) or after — not on read.
  • Memory bookkeeping is manual. Buffer pools and ByteArray lifetimes need explicit ownership. Leaked buffers are native-memory leaks (in Netty's direct buffer case).
  • Debug ergonomics are worse. ByteArray contents aren't visible in a debugger the way an ObjectNode tree is.

When it's overkill

  • Latency budget is tens of ms, not single-digit ms. For most serving tiers, JSON parsing/re-serialisation is acceptable; the engineering effort to go zero-allocation is not repaid.
  • Cache values mutate per-request before emission. If every response needs the payload transformed, zero-allocation caching just shifts allocation to the transform stage.
  • Cache is cold / rarely hit. The dominant allocation is the origin fetch, not the cache hit.
