PATTERN Cited by 1 source
Deferred-copy cached collection¶
Intent¶
When a hot-path operation needs read access to a large shared collection, and the collection is read-mostly (many concurrent readers, occasional writers), avoid the per-reader full-collection copy by maintaining an internal cached snapshot of the collection that:
- Readers read directly without copying the whole collection.
- Writers regenerate the snapshot on every modification.
The reader's per-call cost drops from O(N) (where N is
the collection size) to O(1) snapshot pointer access plus
O(K) for the small filtered subset they actually consume.
The cost moves to the rare-write path, which now pays
O(N) regeneration on every modification.
Canonical instance: ClickHouse parts-list snapshot¶
After Cloudflare landed
Optimization 1
(shared lock) on ClickHouse's MergeTreeData parts mutex,
the lock-contention bottleneck vanished but the new bottleneck
shifted to the per-query copy of the parts vector:
"The new flame graph showed the bottleneck had simply moved. Now, time was being spent copying the giant vector of parts, even with the shared lock. Intuitively, copying a vector sounds cheap, but when it contains tens of thousands of elements, and you do it hundreds of times a second, it adds up." (Source: sources/2026-05-14-cloudflare-clickhouse-query-plan-contention)
The fix:
"We deferred the copy entirely. We created a 'shared copy' of the parts list. Read-only operations (like query planning) just read from this copy. Any operation that modifies the set of parts (like a new insert) regenerates the cache. Planners now only copy the filtered list of parts they actually need."
Per-query cost dropped from O(parts) × concurrent-queries
to O(filtered-parts) × concurrent-queries, with the
collection regeneration cost paid on the writer side
(inserts / merges / part-eviction). Optimization 2 of the
three-patch fix stack; merged upstream with Opt 1 as
ClickHouse PR #85535
in ClickHouse 25.11.
When the pattern applies¶
The fix is correct iff the access pattern is genuinely read-mostly: the cost moved to the writer side must be acceptable, which requires:
- Writes are infrequent relative to reads. If writes approach read frequency, regeneration cost dominates.
- Writers tolerate the regeneration latency — typically by being asynchronous (background merges, retention) rather than synchronous (insert path).
- Snapshot freshness requirements are weak. Readers see the snapshot from the last write; in flight reads during a write see a slightly stale snapshot, but the staleness is bounded by writer cadence.
- Memory pressure is acceptable. The pattern keeps one canonical collection plus a snapshot, doubling steady-state memory for the collection. For a parts-list of hundreds of MB this is acceptable; for multi-GB collections may require care.
ClickHouse's MergeTreeData parts list satisfies all four:
read traffic (planning) dominates write traffic (parts are
created in seconds-scale batches by the merge scheduler);
inserts are batched; planners can tolerate millisecond-
scale staleness against the parts list (the new part will
be visible on the next plan); the parts-list memory cost is
modest.
Implementation shapes¶
The pattern admits several implementations:
- Atomic pointer swap to immutable snapshot (the
typical C++ shape, what Cloudflare's patch likely
implements). The collection is stored behind a
shared_ptr<const Collection>; readers atomically load the pointer and read from the immutable snapshot; writers construct a new collection, atomically swap the pointer, and let the old snapshot drop when the last reader finishes. Lock-free on the read side. - RW lock + shared snapshot field (simpler, slightly slower). Same shape but readers acquire a shared lock to read the snapshot pointer rather than using atomics. Cleaner code; lock-acquisition cost on the read path. Composes with Opt 1's shared lock.
- Read-Copy-Update (RCU) — Linux kernel canonical form. The most aggressive read-side fast path (zero-cost reads); writers are delayed to the next grace period for old-snapshot reclamation. Reaches for in kernel critical paths.
- Sequence locks (seqlocks) — variant for trivially copyable data. Reader retries on detected concurrent write. Niche; useful for very short snapshots.
Trade-offs¶
- Memory:
2×collection size during steady state, potentially more during writer regeneration if the reclamation lag is non-trivial. - Writer cost: every write pays full-collection
regeneration. For collections that grow over time, this
is
O(N)per write, which compounds if the write rate is non-trivial. Compare against per-write cost without the snapshot (e.g., a singleO(1)insertion plus shared-lock release). - Staleness: readers may see the previous snapshot for a brief window after a write commits. Iff readers require strict consistency with concurrent writers, this pattern is wrong. For most metadata-list use cases weak consistency is acceptable.
- Reader-side simplicity: the code becomes simpler, not more complex, on the read path — no copy, no iteration over the full collection, just snapshot read + filter to the subset the reader needs.
Adjacent patterns¶
- Copy-on-write data structures generally — persistent functional data structures, RRB trees, immutable.js collections — all bake the same idea (read-side fast path, write-side regeneration cost) into the data structure itself. The deferred-copy cached collection is the explicit-snapshot variant applied to a single shared collection.
- Cache invalidation pull vs. push — the deferred- copy snapshot is a pull-side cached read; writers invalidate by regenerating; readers read whatever is current. Sibling shape.
- Materialized views in databases — same intent at a different altitude; precompute the read-shape, refresh on write. The deferred-copy cached collection is the in-process / single-process version.
Hard problems¶
- Snapshot freshness vs. write cost trade-off —
snapshot regeneration is
O(N)on writes. If writes become frequent enough, the cost outweighs the read- side benefit. The crossover point depends on read / write ratio andN. - Reclamation timing — the old snapshot must remain
valid for in-flight readers. Using
shared_ptr/Arcmakes reclamation automatic but adds atomic refcount cost; lock-based variants need careful reasoning about when the old snapshot can be freed. - Partial-write atomicity — the pattern requires writes to atomically replace the snapshot. If collection regeneration cannot be made atomic (e.g., it must be incremental for memory reasons), the pattern degrades.
- Multi-collection consistency — the pattern works per-collection. If a reader needs a consistent view across multiple collections (e.g., parts list + schema + index list at a point in time), per-collection snapshots can drift; transactional MVCC at a different altitude becomes the right answer instead.
Seen in¶
- sources/2026-05-14-cloudflare-clickhouse-query-plan-contention
— canonical wiki instance. ClickHouse
MergeTreeDataparts list converted from per-query copy of the full collection to a shared cached snapshot regenerated by modifying operations. Cloudflare's flame graphs after Optimization 1 showed the per-query copy as the remaining bottleneck; Optimization 2 deferred the copy. Together with the shared-lock change ships upstream as PR #85535. "Planners now only copy the filtered list of parts they actually need."
Related¶
- systems/clickhouse
- systems/cloudflare-ready-analytics
- concepts/lock-contention-in-query-planning
- concepts/clickhouse-data-part
- concepts/shared-lock-vs-exclusive-lock
- patterns/shared-lock-for-read-only-metadata — the prior optimisation step; this pattern composes with it.
- patterns/binary-search-on-sorted-partition-prefix — the next optimisation step on top of this snapshot.
- patterns/upstream-the-fix — the contribution stance for the patch.