
PATTERN

Reference-based copy optimization

Reference-based copy optimization is a compaction/merge-side pattern for copy-on-write systems over immutable object storage: when a merge produces a new snapshot, files that are unchanged in the compaction window are referenced into the new snapshot rather than rewritten.

The naive compactor rewrites every file on every compaction — trivially correct, but each pass costs work proportional to the whole table, so cumulative rewrite cost grows quadratically as the table grows. Reference-based copy recognises that most files in a typical compaction window are unaffected by the incoming delta; only files touched by an update or delete actually need to be rewritten.
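The gap between the two strategies is easy to quantify. A back-of-the-envelope sketch (the file count, file size, and touched fraction are illustrative numbers, not figures from the source):

```python
# Bytes rewritten in one compaction pass over a 1,000-file table
# when the incoming delta touches only 2% of files.
FILES = 1_000
FILE_BYTES = 256 * 1024**2   # 256 MiB per file (illustrative)
TOUCHED_FRACTION = 0.02

naive_bytes = FILES * FILE_BYTES                         # rewrite everything
ref_bytes = int(FILES * TOUCHED_FRACTION) * FILE_BYTES   # rewrite only touched files

print(naive_bytes // ref_bytes)  # prints 50: 50x fewer bytes rewritten
```

Because compaction recurs on every merge, that multiplier applies to every pass, not just once.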

Where it matters

  • Exabyte-scale tables — rewriting the entire table each compaction is prohibitive in CPU, network bandwidth, and object-store egress cost.
  • Heavily-skewed update patterns — most CDC updates touch only a small fraction of files; rewriting all of them is wasted work.
  • Cost-sensitive workloads — compaction is a recurring cost; the difference between "rewrite all" and "reference unchanged" is a multiplicative factor on the compaction bill that compounds forever.

Mechanism

  • The compactor's merge step is aware of which input files it has actually modified.
  • For each modified file, it writes a new file representing the merged output.
  • For each unmodified file, it writes a reference (metadata pointer) in the new snapshot instead of a copy of the bytes.
  • The snapshot's file-list now contains a mix of new-blob references and old-blob references.
  • Old files are garbage-collected only when they fall outside the snapshot retention window, not merely because the latest compaction didn't touch them.
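The mechanism above can be sketched end to end. A minimal sketch, assuming a snapshot is just a list of file entries; all names (`FileEntry`, `compact`, `gc_candidates`) are hypothetical, not DeltaCAT's actual API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FileEntry:
    key: str            # object-store key (e.g. an S3 object key)
    is_reference: bool  # True = carried over from a prior snapshot

def compact(prev_snapshot, modified_keys, write_merged):
    """Build the next snapshot: rewrite touched files, reference the rest."""
    entries = []
    for entry in prev_snapshot:
        if entry.key in modified_keys:
            # The merge actually changed this file: write new bytes.
            entries.append(FileEntry(write_merged(entry.key), is_reference=False))
        else:
            # Untouched: record a metadata pointer, copy no bytes.
            entries.append(FileEntry(entry.key, is_reference=True))
    return entries

def gc_candidates(all_keys, retained_snapshots):
    """An old object is deletable only once no retained snapshot references it."""
    live = {e.key for snap in retained_snapshots for e in snap}
    return set(all_keys) - live

prev = [FileEntry("f1", False), FileEntry("f2", False), FileEntry("f3", False)]
new = compact(prev, {"f2"}, lambda k: k + ".v2")
# new snapshot: f1 and f3 referenced, f2 rewritten as f2.v2
```

Note that `gc_candidates` only reclaims `f2` once the previous snapshot ages out of the retention window; while both snapshots are retained, nothing is deletable.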

Canonical instantiation

Amazon BDT's Ray compactor, contributed to systems/deltacat, implements this explicitly. From the Spark → Ray migration post:

"BDT was able to add low-level optimizations like copying untouched Amazon S3 files by reference during a distributed merge on Ray instead of rewriting them." (Source: sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2)


Why specialists win here and generalists lose

A Spark compactor that treats "compact this table" as a standard dataflow DAG has no first-class mechanism for "exclude this input from the rewritten output, but still preserve it in the new snapshot" — it can be approximated with plan-level filtering, but much of the fidelity is lost. A hand-crafted Ray compactor (concepts/task-and-actor-model + systems/ray) can express this directly because the compaction algorithm is spelled out task-by-task.
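In a hand-written compactor the rewrite-or-reference decision is an explicit, per-file step of the algorithm. A plain-Python sketch of that structure (in the Ray version each call below would be a `@ray.remote` task; `plan_file` and `plan_compaction` are illustrative names, not the real code):

```python
def plan_file(key, touched_keys):
    # The decision is first-class in the algorithm, not an artifact
    # of plan-level filtering inside a dataflow engine.
    if key in touched_keys:
        return ("rewrite", key)    # merge output will replace this file
    return ("reference", key)      # carry the existing object forward

def plan_compaction(input_keys, touched_keys):
    # One task per input file, mirroring a task-by-task Ray compactor.
    return [plan_file(k, touched_keys) for k in input_keys]

plan = plan_compaction(["a", "b", "c"], {"b"})
# plan == [("reference", "a"), ("rewrite", "b"), ("reference", "c")]
```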

This is the concrete mechanism behind Amazon BDT's "specialist beats generalist" outcome: an 82% cost-efficiency improvement vs Spark on compaction, of which this is one named contributor (alongside zero-copy intranode shuffles, autoscaling, memory-aware scheduling, and locality-aware placement).
