CONCEPT Cited by 2 sources
Copy-on-write merge (compaction)¶
Copy-on-write merge is the compaction strategy that collapses a stream of CDC deltas (insert / update / delete) into a new, read-optimised set of files representing current table state. The compactor never mutates the input files; it writes new files that supersede the old ones. The alternative, merge-on-read, defers the merge to query time: a merge-on-read table keeps a base file plus a short delta log and lets readers combine them.
The three canonical open table formats (systems/apache-iceberg, systems/apache-hudi, systems/delta-lake) all support this pattern, and Iceberg/Delta default to it. systems/apache-hudi offers both copy-on-write and merge-on-read as explicit table modes.
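The copy-on-write contract can be sketched in a few lines. This is a toy model, not any table format's API; the in-memory row representation and function name are illustrative. The key property is that the merge reads base state plus deltas and emits a new state, leaving its inputs untouched:

```python
def copy_on_write_merge(base_rows, deltas):
    """Collapse CDC deltas into a new read-optimised row set.

    base_rows: dict mapping key -> row (current table state)
    deltas:    ordered list of (op, key, row) with op in
               {"insert", "update", "delete"}
    Returns a NEW dict; neither input is mutated.
    """
    merged = dict(base_rows)           # write new state, never touch input
    for op, key, row in deltas:
        if op == "delete":
            merged.pop(key, None)      # drop the key from the new state
        else:                          # insert and update both upsert
            merged[key] = row
    return merged

base = {1: {"id": 1, "v": "a"}, 2: {"id": 2, "v": "b"}}
deltas = [("update", 1, {"id": 1, "v": "a2"}),
          ("delete", 2, None),
          ("insert", 3, {"id": 3, "v": "c"})]

new_state = copy_on_write_merge(base, deltas)
# base still holds the old snapshot; new_state supersedes it
```

In a real compactor the "dicts" are Parquet files on object storage, but the immutability property is the same: the old snapshot remains readable until the new one is committed.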
Why copy-on-write¶
- Read-path simplicity. Readers open a single set of immutable files. No per-row merge logic, no log-replay. Cloud query engines that don't know your format can still scan the output.
- Stable file sizing. The compactor can enforce target file sizes (split the oversized, merge the undersized) independent of ingest shape.
- Predictable scan cost. Query planners see the final row cardinality; pruning works normally; Parquet min/max stats are honest.
- Schema evolution is tractable. New schema applied at compaction time; readers never have to reconcile old-schema+new-schema at the row level.
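The file-sizing point above can be sketched as a small planning pass. Everything here is illustrative (function name, the greedy packing, the 128 MiB target are assumptions, not any engine's implementation): split inputs larger than the target, then pack the rest into near-target bins.

```python
import math

def plan_output_files(sizes_mib, target_mib=128):
    """Hypothetical compaction sizing pass: split files larger than the
    target into equal pieces, then greedily pack pieces into bins whose
    total stays at or under the target file size."""
    pieces = []
    for s in sizes_mib:
        if s > target_mib:
            n = math.ceil(s / target_mib)      # split the oversized
            pieces.extend([s / n] * n)
        else:
            pieces.append(s)
    bins, current, total = [], [], 0.0
    for p in sorted(pieces, reverse=True):     # merge the undersized
        if current and total + p > target_mib:
            bins.append(current)
            current, total = [], 0.0
        current.append(p)
        total += p
    if current:
        bins.append(current)
    return bins

plan = plan_output_files([300, 10, 20, 50])   # one huge file, three tiny ones
```

Real compactors plan over row counts and encoded sizes rather than raw MiB, but the shape of the decision (split oversized, coalesce undersized, independent of ingest shape) is the same.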
Trade-off: every compaction rewrites the data. That is why reference-based file copies (patterns/reference-based-copy-optimization) are so important at scale: any file whose contents didn't change in this compaction window can be referenced by the new snapshot instead of rewritten.
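A sketch of that reference-based optimisation, with hypothetical field names (`path`, `min_key`, `max_key`) standing in for a real manifest entry: files whose key range does not intersect the changed keys are carried into the new snapshot by reference, with no I/O.

```python
def build_next_snapshot(prev_files, changed_keys, rewrite):
    """Illustrative snapshot builder: rewrite only the files touched by
    this compaction window's changed keys; carry every other file into
    the new snapshot by reference (same manifest entry, no data copy)."""
    next_files = []
    for f in prev_files:
        touched = any(f["min_key"] <= k <= f["max_key"] for k in changed_keys)
        next_files.append(rewrite(f) if touched else f)   # reference, not copy
    return next_files

prev = [{"path": "a.parquet", "min_key": 0,  "max_key": 9},
        {"path": "b.parquet", "min_key": 10, "max_key": 19}]

# Only key 12 changed, so only b.parquet is rewritten (here: renamed to .v2)
nxt = build_next_snapshot(prev, changed_keys={12},
                          rewrite=lambda f: {**f, "path": f["path"] + ".v2"})
```

The payoff is proportional to how localised the deltas are: a window that touches 1% of key ranges rewrites roughly 1% of the data.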
Why it exists as a named pattern¶
Originated as the standard strategy in LSM storage engines (concepts/lsm-compaction); formalised at the data-lake layer by open table formats starting in the late 2010s. Amazon BDT was building it in-house before the open formats existed; their 2019-era "Spark compactor" is a direct ancestor of the strategy Iceberg, Hudi, and Delta Lake now ship:
"BDT leveraged Apache Spark on Amazon EMR to run the merge once and then write back a read-optimized version of the table for other subscribers to use (a process now referred to as a 'copy-on-write' merge by open-source projects like Apache Iceberg and Apache Hudi)." (Source: sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2)
Two compaction shapes¶
- Append-only compaction. All deltas are inserts; the merge is concatenation. Compactor still has to size output files appropriately (merge tiny files, split huge ones).
- Upsert compaction. Append + upsert deltas, with a merge key schema. For each key, the most recent non-delete value wins.
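The upsert shape can be sketched in a few lines. This assumes one common convention (last delta per key wins, and a trailing delete tombstones the key); the function name and the `(key, row_or_None)` delta encoding are illustrative, not a specific format's semantics.

```python
def upsert_compact(deltas):
    """Illustrative upsert compaction: deltas arrive in commit order as
    (merge_key, row) pairs, with row=None marking a delete. The latest
    delta per key wins; keys whose latest delta is a delete are dropped
    from the compacted output."""
    latest = {}
    for key, row in deltas:        # later deltas overwrite earlier ones
        latest[key] = row
    return {k: v for k, v in latest.items() if v is not None}

deltas = [("k1", {"v": 1}),
          ("k2", {"v": 2}),
          ("k1", {"v": 3}),        # supersedes the first k1 delta
          ("k2", None)]            # tombstone: k2 disappears from output

compacted = upsert_compact(deltas)
```

Append-only compaction is the degenerate case of this loop: no key ever repeats, so the merge reduces to concatenation plus the file-sizing pass.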
Scale evidence¶
Amazon BDT's copy-on-write compactor, Q1 2024 (running on Ray):
- 1.5 EiB input Apache Parquet on S3 merged.
- 4 EiB of in-memory Apache Arrow processed to do it.
- Average job: >10 TiB input → merged → rewritten → cluster torn down in <7 minutes.
- 100% on-time delivery over Q4 2023 + Q1 2024.
- >90% of new table updates queryable <60 min after arrival.
Related patterns¶
- patterns/reference-based-copy-optimization — do not rewrite files that didn't change; reference them into the new snapshot.
- patterns/streaming-k-way-merge — the efficient merge primitive at the file-system level.
- concepts/lsm-compaction — the closest cousin pattern in write-optimised storage engines.
- concepts/open-table-format — the family of formats that ship this as a first-class capability.
Seen in¶
- sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2 — Amazon BDT ran a copy-on-write compactor on Spark (2019+) and migrated the largest ~1% of tables to a Ray-native implementation (published here 2024-07-29); Ray implementation contributed to open-source systems/deltacat.
- sources/2025-09-30-expedia-prefer-merge-into-over-insert-overwrite — Expedia Group Tech names copy-on-write as both an Iceberg row-level update strategy and the compaction loop that keeps MOR healthy. The post's load-bearing caveat — "compaction becomes necessary as the number of delta files grows to maintain optimal performance" — is a direct dependency statement: MOR-backed MERGE INTO is only cost-efficient averaged over the copy-on-write compaction window. Qualitative only; no numbers.