Reference-based copy optimization¶
Reference-based copy optimization is a compaction/merge-side pattern for copy-on-write systems over immutable object storage: when a merge produces a new snapshot, files that are unchanged in this window are referenced into the new snapshot, not rewritten.
The naive compactor rewrites every file on every compaction — trivially correct, but quadratic in cumulative cost as tables grow. Reference-based copying recognises that most files in a typical compaction window are unaffected by the incoming delta; only files touched by an update/delete actually need to be rewritten.
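A toy cost model makes the asymptotics concrete. The numbers (files per window, touched fraction) are illustrative assumptions, not figures from the source: if a table gains files each window and the naive compactor rewrites all of them every round, total work grows quadratically, while reference-based copying scales it down by the touched fraction.

```python
# Hypothetical cost model; `rounds`, `new_files`, and `touched_frac`
# are illustrative parameters, not values from the source.

def rewrite_all_cost(rounds: int, new_files: int) -> int:
    # Naive compactor: round r rewrites all r * new_files files in the table.
    return sum(r * new_files for r in range(1, rounds + 1))

def reference_copy_cost(rounds: int, new_files: int, touched_frac: float) -> float:
    # Reference-based: each round rewrites only the delta-touched fraction;
    # unmodified files cost (approximately) nothing to reference.
    return sum(r * new_files * touched_frac for r in range(1, rounds + 1))

naive = rewrite_all_cost(rounds=100, new_files=10)               # 50500 file-writes
ref = reference_copy_cost(rounds=100, new_files=10, touched_frac=0.05)  # 2525.0
print(naive, ref)  # the ratio is 1 / touched_frac = 20x here
```

The multiplicative gap (1 / touched_frac) is exactly the "compounds forever" factor described below: it applies to every compaction, indefinitely.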
Where it matters¶
- Exabyte-scale tables — rewriting the entire table each compaction is prohibitive in CPU, network bandwidth, and object-store egress cost.
- Heavily-skewed update patterns — most CDC log updates only touch a small fraction of files. Rewriting all of them is waste.
- Cost-sensitive workloads — compaction is a recurring cost; the difference between "rewrite all" and "reference unchanged" is a multiplicative factor on the compaction bill that compounds forever.
Mechanism¶
- The compactor's merge step is aware of which input files it has actually modified.
- For each modified file, it writes a new file representing the merged output.
- For each unmodified file, it writes a reference (metadata pointer) in the new snapshot instead of a copy of the bytes.
- The snapshot's file-list now contains a mix of new-blob references and old-blob references.
- Old files get garbage-collected only when they fall outside the snapshot retention window (not when "this compaction didn't touch them").
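The mechanism above can be sketched in a few lines. This is a minimal illustration with hypothetical types and names (`FileRef`, `merge_snapshot`, `gc_candidates`), not the DeltaCAT API; the real implementation lives in the compactor_v2 files referenced below.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FileRef:
    path: str        # object-store key of the blob
    rewritten: bool  # True if this snapshot wrote a new blob at `path`

def merge_snapshot(old_files, touched, write_merged):
    """Build the new snapshot's file list.

    old_files:    file paths in the previous snapshot
    touched:      set of paths modified by the incoming delta
    write_merged: callback that writes the merged bytes, returns the new path
    """
    new_manifest = []
    for path in old_files:
        if path in touched:
            # Modified: write a new blob holding the merged output.
            new_manifest.append(FileRef(write_merged(path), rewritten=True))
        else:
            # Unmodified: reference the existing blob by pointer only.
            new_manifest.append(FileRef(path, rewritten=False))
    return new_manifest

def gc_candidates(all_blobs, retained_snapshots):
    # A blob is collectable only when no snapshot inside the retention
    # window references it — not merely because one compaction skipped it.
    live = {ref.path for snap in retained_snapshots for ref in snap}
    return [b for b in all_blobs if b not in live]
```

Note that the snapshot manifest is the unit of correctness here: old blobs stay readable through older snapshots until retention expires, which is why referencing instead of copying is safe.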
Canonical instantiation¶
Amazon BDT's Ray compactor, contributed to systems/deltacat, implements this pattern explicitly. From the Spark → Ray migration post:
"BDT was able to add low-level optimizations like copying untouched Amazon S3 files by reference during a distributed merge on Ray instead of rewriting them." (Source: sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2)
With code references:
- deltacat/compute/compactor_v2/steps/merge.py#L199-L257 — the reference-copy step in the DeltaCAT Ray compactor.
- deltacat/compute/compactor_v2/compaction_session.py#L251 — the distributed-merge entrypoint that uses it.
Why specialists win here and generalists lose¶
A Spark compactor that treats "compact this table" as a standard dataflow DAG has no first-class way to say "exclude this input from the rewritten output, but still preserve it in the new snapshot" — plan-level filtering can approximate it, but only coarsely. A hand-crafted Ray compactor (concepts/task-and-actor-model + systems/ray) can express this directly because the compaction algorithm is spelled out task-by-task.
This is the concrete mechanism behind Amazon BDT's "specialist beats generalist" outcome: an 82% cost-efficiency improvement vs Spark on compaction, of which this is one named contributor (alongside zero-copy intranode shuffles, autoscaling, memory-aware scheduling, and locality-aware placement).
Related patterns¶
- concepts/copy-on-write-merge — the workload shape that makes this optimisation possible and necessary.
- patterns/streaming-k-way-merge — the CPU-bound merge primitive that reference-based copying keeps from dominating I/O.
- systems/apache-iceberg, systems/delta-lake, systems/apache-hudi — open table formats whose snapshot model natively supports per-file references, making this pattern first-class rather than bolted on.
Seen in¶
- sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2 — the systems/deltacat Ray compactor's reference-copy step; named as a key contributor to the 82% cost-efficiency gain vs Spark at exabyte scale on Amazon Retail's BDT catalog.