
PATTERN

Reference-based copy optimization

Reference-based copy optimization is a compaction/merge-side pattern for copy-on-write systems over immutable object storage: when a merge produces a new snapshot, files that are unchanged in the compaction window are referenced into the new snapshot rather than rewritten.

The naive compactor rewrites every file on every compaction — trivially correct, but each pass costs work proportional to the whole table, so cumulative rewrite cost grows quadratically as the table grows. Reference-based copy recognises that most files in a typical compaction window are unaffected by the incoming delta; only files touched by an update or delete actually need to be rewritten.
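The gap between the two strategies is easy to quantify. A back-of-the-envelope sketch (the file count, file size, and touched fraction are illustrative numbers, not figures from the source):

```python
# Bytes rewritten in one compaction pass over a 1,000-file table
# when the incoming delta touches only 2% of files.
FILES = 1_000
FILE_BYTES = 256 * 1024**2   # 256 MiB per file (illustrative)
TOUCHED_FRACTION = 0.02

naive_bytes = FILES * FILE_BYTES                         # rewrite everything
ref_bytes = int(FILES * TOUCHED_FRACTION) * FILE_BYTES   # rewrite only touched files

print(naive_bytes // ref_bytes)  # prints 50: 50x fewer bytes rewritten
```

Because compaction recurs on every merge, that multiplier applies to every pass, not just once.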

Where it matters

  • Exabyte-scale tables — rewriting the entire table each compaction is prohibitive in CPU, network bandwidth, and object-store egress cost.
  • Heavily-skewed update patterns — most CDC updates touch only a small fraction of files; rewriting all of them is wasted work.
  • Cost-sensitive workloads — compaction is a recurring cost; the difference between "rewrite all" and "reference unchanged" is a multiplicative factor on the compaction bill that compounds forever.

Mechanism

  • The compactor's merge step is aware of which input files it has actually modified.
  • For each modified file, it writes a new file representing the merged output.
  • For each unmodified file, it writes a reference (metadata pointer) in the new snapshot instead of a copy of the bytes.
  • The snapshot's file-list now contains a mix of new-blob references and old-blob references.
  • Old files are garbage-collected only when they fall outside the snapshot retention window, not merely because the latest compaction didn't touch them.
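The mechanism above can be sketched end to end. A minimal sketch, assuming a snapshot is just a list of file entries; all names (`FileEntry`, `compact`, `gc_candidates`) are hypothetical, not DeltaCAT's actual API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FileEntry:
    key: str            # object-store key (e.g. an S3 object key)
    is_reference: bool  # True = carried over from a prior snapshot

def compact(prev_snapshot, modified_keys, write_merged):
    """Build the next snapshot: rewrite touched files, reference the rest."""
    entries = []
    for entry in prev_snapshot:
        if entry.key in modified_keys:
            # The merge actually changed this file: write new bytes.
            entries.append(FileEntry(write_merged(entry.key), is_reference=False))
        else:
            # Untouched: record a metadata pointer, copy no bytes.
            entries.append(FileEntry(entry.key, is_reference=True))
    return entries

def gc_candidates(all_keys, retained_snapshots):
    """An old object is deletable only once no retained snapshot references it."""
    live = {e.key for snap in retained_snapshots for e in snap}
    return set(all_keys) - live

prev = [FileEntry("f1", False), FileEntry("f2", False), FileEntry("f3", False)]
new = compact(prev, {"f2"}, lambda k: k + ".v2")
# new snapshot: f1 and f3 referenced, f2 rewritten as f2.v2
```

Note that `gc_candidates` only reclaims `f2` once the previous snapshot ages out of the retention window; while both snapshots are retained, nothing is deletable.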

Canonical instantiation

Amazon BDT's Ray compactor, contributed to systems/deltacat, implements this explicitly. From the Spark → Ray migration post:

"BDT was able to add low-level optimizations like copying untouched Amazon S3 files by reference during a distributed merge on Ray instead of rewriting them." (Source: sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2)


Why specialists win here and generalists lose

A Spark compactor that treats "compact this table" as a standard dataflow DAG has no first-class mechanism for "exclude this input from the rewritten output, but still preserve it in the new snapshot" — it can be approximated with plan-level filtering, but much of the fidelity is lost. A hand-crafted Ray compactor (concepts/task-and-actor-model + systems/ray) can express this directly because the compaction algorithm is spelled out task-by-task.
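In a hand-written compactor the rewrite-or-reference decision is an explicit, per-file step of the algorithm. A plain-Python sketch of that structure (in the Ray version each call below would be a `@ray.remote` task; `plan_file` and `plan_compaction` are illustrative names, not the real code):

```python
def plan_file(key, touched_keys):
    # The decision is first-class in the algorithm, not an artifact
    # of plan-level filtering inside a dataflow engine.
    if key in touched_keys:
        return ("rewrite", key)    # merge output will replace this file
    return ("reference", key)      # carry the existing object forward

def plan_compaction(input_keys, touched_keys):
    # One task per input file, mirroring a task-by-task Ray compactor.
    return [plan_file(k, touched_keys) for k in input_keys]

plan = plan_compaction(["a", "b", "c"], {"b"})
# plan == [("reference", "a"), ("rewrite", "b"), ("reference", "c")]
```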

This is the concrete mechanism behind Amazon BDT's "specialist beats generalist" outcome: an 82% cost-efficiency improvement vs Spark on compaction, of which this is one named contributor (alongside zero-copy intranode shuffles, autoscaling, memory-aware scheduling, and locality-aware placement).
