
Copy-on-write merge (compaction)

Copy-on-write merge is the compaction strategy that collapses a stream of CDC deltas (inserts, updates, deletes) into a new, read-optimised set of files representing the current table state. The compactor never mutates its input files; it writes new files that supersede the old ones. The alternative, merge-on-read, defers the merge to query time: the table keeps a base file plus a short delta log, and every reader combines them on scan.
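The distinction can be sketched in a few lines. This is a hypothetical toy model (a "file" is an immutable dict, a delta is a `(key, value)` pair with `None` marking a delete); the function names are illustrative, not any format's API:

```python
def copy_on_write(base: dict, deltas: list[tuple]) -> dict:
    """Merge once at compaction time; readers then scan the new file directly."""
    merged = dict(base)                 # never mutate the input file
    for key, value in deltas:
        if value is None:
            merged.pop(key, None)       # delete delta
        else:
            merged[key] = value         # insert / update delta
    return merged                       # new file supersedes the base

def merge_on_read(base: dict, deltas: list[tuple]) -> dict:
    """Defer the merge: every reader replays the delta log over the base."""
    return copy_on_write(base, deltas)  # same logic, but run per query

base = {"a": 1, "b": 2}
deltas = [("b", 20), ("c", 3), ("a", None)]
print(copy_on_write(base, deltas))      # {'b': 20, 'c': 3}
print(base)                             # input untouched: {'a': 1, 'b': 2}
```

The merge logic is identical either way; what differs is *when* it runs (once, at write time) and *who* pays for it (the compactor rather than every reader).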

The three canonical open table formats (systems/apache-iceberg, systems/apache-hudi, systems/delta-lake) all support this pattern; Iceberg and Delta Lake default to it, while systems/apache-hudi exposes both copy-on-write and merge-on-read as explicit table modes.

Why copy-on-write

  • Read-path simplicity. Readers open a single set of immutable files. No per-row merge logic, no log-replay. Cloud query engines that don't know your format can still scan the output.
  • Stable file sizing. The compactor can enforce target file sizes (split the oversized, merge the undersized) independent of ingest shape.
  • Predictable scan cost. Query planners see the final row cardinality; pruning works normally; Parquet min/max stats are honest.
  • Schema evolution is tractable. The new schema is applied once at compaction time; readers never have to reconcile old-schema and new-schema rows at the row level.

Trade-off: every compaction rewrites the data. That is why reference-based file copies (patterns/reference-based-copy-optimization) are so important at scale — any file whose contents didn't change in this compaction window can be referenced by the new snapshot instead of rewritten.
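That optimisation can be sketched as follows. This is a hypothetical model (a snapshot is a list of file descriptors with a key set per file; names like `compact` and the `-v2` suffix are illustrative): only files whose keys intersect the delta set get rewritten, and every other file is carried into the new snapshot by reference.

```python
def compact(snapshot: list[dict], delta_keys: set) -> list[dict]:
    """Build a new snapshot: rewrite touched files, reference the rest."""
    new_snapshot = []
    for f in snapshot:
        if delta_keys & f["keys"]:
            # file contents changed in this window: rewrite it
            new_snapshot.append({"id": f["id"] + "-v2", "keys": f["keys"]})
        else:
            # untouched file: carried by reference, zero bytes copied
            new_snapshot.append(f)
    return new_snapshot

old = [{"id": "f1", "keys": {"a", "b"}}, {"id": "f2", "keys": {"c"}}]
new = compact(old, {"a"})
# f1 is rewritten; f2 appears in the new snapshot as the same object
```

At exabyte scale the "referenced as-is" branch is the common case, which is what keeps full-table rewrites affordable.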

Why it exists as a named pattern

The pattern originated as the standard strategy in LSM storage engines (concepts/lsm-compaction) and was formalised at the data-lake layer by open table formats starting in the late 2010s. Amazon BDT was building it in-house before the open formats existed — their 2019-era "Spark compactor" is a direct ancestor of the strategy Iceberg, Hudi, and Delta Lake now ship:

"BDT leveraged Apache Spark on Amazon EMR to run the merge once and then write back a read-optimized version of the table for other subscribers to use (a process now referred to as a 'copy-on-write' merge by open-source projects like Apache Iceberg and Apache Hudi)." (Source: sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2)

Two compaction shapes

  • Append-only compaction. All deltas are inserts, so the merge is pure concatenation. The compactor still has to size output files appropriately (merge tiny files, split huge ones).
  • Upsert compaction. Append plus upsert deltas, resolved against a merge-key schema. For each merge key, the most recent delta wins; a delete delta drops the row from the output.
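The upsert shape can be sketched as a last-writer-wins reduction. This is a minimal illustration, not any format's implementation; it assumes deltas carry an explicit sequence number and that `None` is the delete tombstone:

```python
def upsert_compact(base_rows: dict, deltas: list[tuple]) -> dict:
    """Collapse (key, seq, value) deltas onto base rows; value=None deletes."""
    latest = {}
    for key, seq, value in deltas:
        if key not in latest or seq >= latest[key][0]:
            latest[key] = (seq, value)          # most recent delta wins
    out = dict(base_rows)
    for key, (_, value) in latest.items():
        if value is None:
            out.pop(key, None)                  # tombstone: drop the row
        else:
            out[key] = value                    # insert or update
    return out

rows = upsert_compact(
    {"u1": "old", "u2": "keep"},
    [("u1", 1, "new"), ("u1", 2, "newer"), ("u3", 1, "add"), ("u2", 3, None)],
)
print(rows)   # {'u1': 'newer', 'u3': 'add'}
```

Append-only compaction is the degenerate case of the same loop: no merge keys, no tombstones, just concatenation plus file sizing.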

Scale evidence

Amazon BDT's copy-on-write compactor, Q1 2024 (running on Ray):

  • 1.5 EiB input Apache Parquet on S3 merged.
  • 4 EiB of in-memory Apache Arrow processed to do it.
  • Average job: >10 TiB input → merged → rewritten → cluster torn down in <7 minutes.
  • 100% on-time delivery over Q4 2023 + Q1 2024.
  • >90% of new table updates queryable <60 min after arrival.
