Skip to content

PATTERN Cited by 1 source

Deferred-copy cached collection

Intent

When a hot-path operation needs read access to a large shared collection, and the collection is read-mostly (many concurrent readers, occasional writers), avoid the per-reader full-collection copy by maintaining an internal cached snapshot of the collection that:

  • Readers read directly without copying the whole collection.
  • Writers regenerate the snapshot on every modification.

The reader's per-call cost drops from O(N) (where N is the collection size) to O(1) snapshot pointer access plus O(K) for the small filtered subset they actually consume. The cost moves to the rare-write path, which now pays O(N) regeneration on every modification.

Canonical instance: ClickHouse parts-list snapshot

After Cloudflare landed Optimization 1 (shared lock) on ClickHouse's MergeTreeData parts mutex, the lock-contention bottleneck vanished but the new bottleneck shifted to the per-query copy of the parts vector:

"The new flame graph showed the bottleneck had simply moved. Now, time was being spent copying the giant vector of parts, even with the shared lock. Intuitively, copying a vector sounds cheap, but when it contains tens of thousands of elements, and you do it hundreds of times a second, it adds up." (Source: sources/2026-05-14-cloudflare-clickhouse-query-plan-contention)

The fix:

"We deferred the copy entirely. We created a 'shared copy' of the parts list. Read-only operations (like query planning) just read from this copy. Any operation that modifies the set of parts (like a new insert) regenerates the cache. Planners now only copy the filtered list of parts they actually need."

Per-query cost dropped from O(parts) × concurrent-queries to O(filtered-parts) × concurrent-queries, with the collection regeneration cost paid on the writer side (inserts / merges / part-eviction). Optimization 2 of the three-patch fix stack; merged upstream with Opt 1 as ClickHouse PR #85535 in ClickHouse 25.11.

When the pattern applies

The fix is correct iff the access pattern is genuinely read-mostly: the cost moved to the writer side must be acceptable, which requires:

  • Writes are infrequent relative to reads. If writes approach read frequency, regeneration cost dominates.
  • Writers tolerate the regeneration latency — typically by being asynchronous (background merges, retention) rather than synchronous (insert path).
  • Snapshot freshness requirements are weak. Readers see the snapshot from the last write; in flight reads during a write see a slightly stale snapshot, but the staleness is bounded by writer cadence.
  • Memory pressure is acceptable. The pattern keeps one canonical collection plus a snapshot, doubling steady-state memory for the collection. For a parts-list of hundreds of MB this is acceptable; for multi-GB collections may require care.

ClickHouse's MergeTreeData parts list satisfies all four: read traffic (planning) dominates write traffic (parts are created in seconds-scale batches by the merge scheduler); inserts are batched; planners can tolerate millisecond- scale staleness against the parts list (the new part will be visible on the next plan); the parts-list memory cost is modest.

Implementation shapes

The pattern admits several implementations:

  1. Atomic pointer swap to immutable snapshot (the typical C++ shape, what Cloudflare's patch likely implements). The collection is stored behind a shared_ptr<const Collection>; readers atomically load the pointer and read from the immutable snapshot; writers construct a new collection, atomically swap the pointer, and let the old snapshot drop when the last reader finishes. Lock-free on the read side.
  2. RW lock + shared snapshot field (simpler, slightly slower). Same shape but readers acquire a shared lock to read the snapshot pointer rather than using atomics. Cleaner code; lock-acquisition cost on the read path. Composes with Opt 1's shared lock.
  3. Read-Copy-Update (RCU) — Linux kernel canonical form. The most aggressive read-side fast path (zero-cost reads); writers are delayed to the next grace period for old-snapshot reclamation. Reaches for in kernel critical paths.
  4. Sequence locks (seqlocks) — variant for trivially copyable data. Reader retries on detected concurrent write. Niche; useful for very short snapshots.

Trade-offs

  • Memory: collection size during steady state, potentially more during writer regeneration if the reclamation lag is non-trivial.
  • Writer cost: every write pays full-collection regeneration. For collections that grow over time, this is O(N) per write, which compounds if the write rate is non-trivial. Compare against per-write cost without the snapshot (e.g., a single O(1) insertion plus shared-lock release).
  • Staleness: readers may see the previous snapshot for a brief window after a write commits. Iff readers require strict consistency with concurrent writers, this pattern is wrong. For most metadata-list use cases weak consistency is acceptable.
  • Reader-side simplicity: the code becomes simpler, not more complex, on the read path — no copy, no iteration over the full collection, just snapshot read + filter to the subset the reader needs.

Adjacent patterns

  • Copy-on-write data structures generally — persistent functional data structures, RRB trees, immutable.js collections — all bake the same idea (read-side fast path, write-side regeneration cost) into the data structure itself. The deferred-copy cached collection is the explicit-snapshot variant applied to a single shared collection.
  • Cache invalidation pull vs. push — the deferred- copy snapshot is a pull-side cached read; writers invalidate by regenerating; readers read whatever is current. Sibling shape.
  • Materialized views in databases — same intent at a different altitude; precompute the read-shape, refresh on write. The deferred-copy cached collection is the in-process / single-process version.

Hard problems

  • Snapshot freshness vs. write cost trade-off — snapshot regeneration is O(N) on writes. If writes become frequent enough, the cost outweighs the read- side benefit. The crossover point depends on read / write ratio and N.
  • Reclamation timing — the old snapshot must remain valid for in-flight readers. Using shared_ptr / Arc makes reclamation automatic but adds atomic refcount cost; lock-based variants need careful reasoning about when the old snapshot can be freed.
  • Partial-write atomicity — the pattern requires writes to atomically replace the snapshot. If collection regeneration cannot be made atomic (e.g., it must be incremental for memory reasons), the pattern degrades.
  • Multi-collection consistency — the pattern works per-collection. If a reader needs a consistent view across multiple collections (e.g., parts list + schema + index list at a point in time), per-collection snapshots can drift; transactional MVCC at a different altitude becomes the right answer instead.

Seen in

  • sources/2026-05-14-cloudflare-clickhouse-query-plan-contention — canonical wiki instance. ClickHouse MergeTreeData parts list converted from per-query copy of the full collection to a shared cached snapshot regenerated by modifying operations. Cloudflare's flame graphs after Optimization 1 showed the per-query copy as the remaining bottleneck; Optimization 2 deferred the copy. Together with the shared-lock change ships upstream as PR #85535. "Planners now only copy the filtered list of parts they actually need."
Last updated · 542 distilled / 1,571 read