PATTERN Cited by 1 source
Prefer MERGE INTO over INSERT OVERWRITE (Iceberg row-level updates)¶
On systems/apache-iceberg (and other open table formats that
expose both surfaces), prefer MERGE INTO over INSERT OVERWRITE
for row-level updates. Reserve INSERT OVERWRITE for genuine
whole-partition rewrites.
The two SQL surfaces¶
INSERT OVERWRITE — partition-grain replacement¶
- What it does: completely replaces the data in a table or partition with the result of a query.
- Granularity: whole partition (or whole table).
- Schema requirements: strict — column names, data types, and order must match.
- Good fit: batch refresh of an entire partition that is being
recomputed from scratch (e.g. a daily reprocessing job that
always re-derives
dt=2025-09-30from upstream). - Bad fit: any workload touching a fraction of rows; anything where the partitioning scheme evolves over the table's lifetime.
MERGE INTO — row-level conditional upsert¶
- What it does: conditionally updates, deletes, or inserts rows
based on a matching condition (
ON …); only modifies affected rows; leaves the rest untouched. - Granularity: row.
- Schema requirements: more flexible — does not require exact column order or names to match.
- Good fit: CDC ingests, slowly-changing dimensions, targeted upserts, delete-by-predicate, merge-on-key workloads.
- Paired strategy: usually MOR for the write-side efficiency, with periodic copy-on-write compaction to keep the read side fast.
(Source: sources/2025-09-30-expedia-prefer-merge-into-over-insert-overwrite)
Why MERGE INTO is the default¶
The practical win is write amplification avoidance. INSERT
OVERWRITE rewrites every file in the touched partition even when
only a handful of rows changed; MERGE INTO over MOR writes only the
delta. Expedia names the wins explicitly:
- Faster writes — deltas, not base-file rewrites.
- Reduced I/O — fewer object-store operations per mutation.
- Improved query performance — when paired with compaction; without compaction MOR loses its read-side story.
Lower storage costs and lower compute costs fall out of the same cause: you're not rewriting data that didn't change.
The decision rubric¶
| Question | Answer |
|---|---|
| Are you replacing an entire partition from upstream recompute? | INSERT OVERWRITE |
| Is the partitioning scheme stable and the batch always full? | INSERT OVERWRITE |
| Are you applying a stream of CDC updates? | MERGE INTO + MOR |
| Are you maintaining an SCD dimension table? | MERGE INTO + MOR |
| Are you doing targeted upsert-by-key? | MERGE INTO |
| Do you have a natural merge key? | MERGE INTO is straightforward |
| No natural merge key? | MERGE INTO still works but scope the merge carefully — see the companion Expedia article on well-scoped MERGE statements linked from the source post |
Load-bearing caveat: compaction is not optional¶
MOR-backed MERGE INTO depends on periodic
copy-on-write compaction to keep
read-side performance bounded. Without it, accumulated delta files
force per-query merge work that grows linearly with mutation volume.
"Compaction becomes necessary as the number of delta files grows to maintain optimal performance." (Source: sources/2025-09-30-expedia-prefer-merge-into-over-insert-overwrite)
This means adopting MERGE INTO is a two-step commitment: the
write-time SQL surface and an operational compaction pipeline
(Iceberg's rewrite_data_files, Hudi's Compactor, Delta's
OPTIMIZE, or a bespoke compactor like Amazon BDT's
systems/deltacat Flash Compactor).
Anti-patterns¶
INSERT OVERWRITEas the row-level update mechanism — common legacy Hive-style habit that survives into Iceberg deployments because the syntax is familiar. Rewrites orders of magnitude more data than necessary on any targeted update.- MOR with no compaction — a table that grows an unbounded delta log loses all read-side properties; the "fast writes" savings are eaten by unbounded read cost.
- Mixing
INSERT OVERWRITEandMERGE INTOon the same partition — schema-order strictness ofINSERT OVERWRITEcan silently corrupt data if columns are reordered by an upstream change, especially when interleaved withMERGE INTOwrites that don't enforce order.
Seen in¶
- sources/2025-09-30-expedia-prefer-merge-into-over-insert-overwrite — Expedia Group Tech primer. Origin of this pattern's statement in this wiki.