
Copy-on-write merge (compaction)

Copy-on-write merge is the compaction strategy that collapses a stream of CDC deltas (inserts, updates, deletes) into a new, read-optimised set of files representing the current table state. The compactor never mutates its input files; it writes new files that supersede the old ones. The alternative, merge-on-read, defers the merge to query time: the table keeps a base file plus a short delta log, and every reader combines them on scan.
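The distinction can be sketched in a few lines. This is a hypothetical toy model (a "file" is an immutable dict, a delta is a `(key, value)` pair with `None` marking a delete); the function names are illustrative, not any format's API:

```python
def copy_on_write(base: dict, deltas: list[tuple]) -> dict:
    """Merge once at compaction time; readers then scan the new file directly."""
    merged = dict(base)                 # never mutate the input file
    for key, value in deltas:
        if value is None:
            merged.pop(key, None)       # delete delta
        else:
            merged[key] = value         # insert / update delta
    return merged                       # new file supersedes the base

def merge_on_read(base: dict, deltas: list[tuple]) -> dict:
    """Defer the merge: every reader replays the delta log over the base."""
    return copy_on_write(base, deltas)  # same logic, but run per query

base = {"a": 1, "b": 2}
deltas = [("b", 20), ("c", 3), ("a", None)]
print(copy_on_write(base, deltas))      # {'b': 20, 'c': 3}
print(base)                             # input untouched: {'a': 1, 'b': 2}
```

The merge logic is identical either way; what differs is *when* it runs (once, at write time) and *who* pays for it (the compactor rather than every reader).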

The three canonical open table formats (systems/apache-iceberg, systems/apache-hudi, systems/delta-lake) all support this pattern; Iceberg and Delta Lake default to it, while systems/apache-hudi exposes both copy-on-write and merge-on-read as explicit table modes.

Why copy-on-write

  • Read-path simplicity. Readers open a single set of immutable files. No per-row merge logic, no log-replay. Cloud query engines that don't know your format can still scan the output.
  • Stable file sizing. The compactor can enforce target file sizes (split the oversized, merge the undersized) independent of ingest shape.
  • Predictable scan cost. Query planners see the final row cardinality; pruning works normally; Parquet min/max stats are honest.
  • Schema evolution is tractable. The new schema is applied once at compaction time; readers never have to reconcile old-schema and new-schema rows at the row level.

Trade-off: every compaction rewrites the data. That is why reference-based file copies (patterns/reference-based-copy-optimization) are so important at scale — any file whose contents didn't change in this compaction window can be referenced by the new snapshot instead of rewritten.
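That optimisation can be sketched as follows. This is a hypothetical model (a snapshot is a list of file descriptors with a key set per file; names like `compact` and the `-v2` suffix are illustrative): only files whose keys intersect the delta set get rewritten, and every other file is carried into the new snapshot by reference.

```python
def compact(snapshot: list[dict], delta_keys: set) -> list[dict]:
    """Build a new snapshot: rewrite touched files, reference the rest."""
    new_snapshot = []
    for f in snapshot:
        if delta_keys & f["keys"]:
            # file contents changed in this window: rewrite it
            new_snapshot.append({"id": f["id"] + "-v2", "keys": f["keys"]})
        else:
            # untouched file: carried by reference, zero bytes copied
            new_snapshot.append(f)
    return new_snapshot

old = [{"id": "f1", "keys": {"a", "b"}}, {"id": "f2", "keys": {"c"}}]
new = compact(old, {"a"})
# f1 is rewritten; f2 appears in the new snapshot as the same object
```

At exabyte scale the "referenced as-is" branch is the common case, which is what keeps full-table rewrites affordable.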

Why it exists as a named pattern

The pattern originated as the standard strategy in LSM storage engines (concepts/lsm-compaction) and was formalised at the data-lake layer by open table formats starting in the late 2010s. Amazon BDT was building it in-house before the open formats existed — their 2019-era "Spark compactor" is a direct ancestor of the strategy Iceberg, Hudi, and Delta Lake now ship:

"BDT leveraged Apache Spark on Amazon EMR to run the merge once and then write back a read-optimized version of the table for other subscribers to use (a process now referred to as a 'copy-on-write' merge by open-source projects like Apache Iceberg and Apache Hudi)." (Source: sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2)

Two compaction shapes

  • Append-only compaction. All deltas are inserts, so the merge is pure concatenation. The compactor still has to size output files appropriately (merge tiny files, split huge ones).
  • Upsert compaction. Append plus upsert deltas, resolved against a merge-key schema. For each merge key, the most recent delta wins; a delete delta drops the row from the output.
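The upsert shape can be sketched as a last-writer-wins reduction. This is a minimal illustration, not any format's implementation; it assumes deltas carry an explicit sequence number and that `None` is the delete tombstone:

```python
def upsert_compact(base_rows: dict, deltas: list[tuple]) -> dict:
    """Collapse (key, seq, value) deltas onto base rows; value=None deletes."""
    latest = {}
    for key, seq, value in deltas:
        if key not in latest or seq >= latest[key][0]:
            latest[key] = (seq, value)          # most recent delta wins
    out = dict(base_rows)
    for key, (_, value) in latest.items():
        if value is None:
            out.pop(key, None)                  # tombstone: drop the row
        else:
            out[key] = value                    # insert or update
    return out

rows = upsert_compact(
    {"u1": "old", "u2": "keep"},
    [("u1", 1, "new"), ("u1", 2, "newer"), ("u3", 1, "add"), ("u2", 3, None)],
)
print(rows)   # {'u1': 'newer', 'u3': 'add'}
```

Append-only compaction is the degenerate case of the same loop: no merge keys, no tombstones, just concatenation plus file sizing.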

Scale evidence

Amazon BDT's copy-on-write compactor, Q1 2024 (running on Ray):

  • 1.5 EiB input Apache Parquet on S3 merged.
  • 4 EiB of in-memory Apache Arrow processed to do it.
  • Average job: >10 TiB input → merged → rewritten → cluster torn down in <7 minutes.
  • 100% on-time delivery over Q4 2023 + Q1 2024.
  • >90% of new table updates queryable <60 min after arrival.
