Skip to content

PATTERN Cited by 1 source

Prefer MERGE INTO over INSERT OVERWRITE (Iceberg row-level updates)

On systems/apache-iceberg (and other open table formats that expose both surfaces), prefer MERGE INTO over INSERT OVERWRITE for row-level updates. Reserve INSERT OVERWRITE for genuine whole-partition rewrites.

The two SQL surfaces

INSERT OVERWRITE — partition-grain replacement

  • What it does: completely replaces the data in a table or partition with the result of a query.
  • Granularity: whole partition (or whole table).
  • Schema requirements: strict — column names, data types, and order must match.
  • Good fit: batch refresh of an entire partition that is being recomputed from scratch (e.g. a daily reprocessing job that always re-derives dt=2025-09-30 from upstream).
  • Bad fit: any workload touching a fraction of rows; anything where the partitioning scheme evolves over the table's lifetime.

MERGE INTO — row-level conditional upsert

  • What it does: conditionally updates, deletes, or inserts rows based on a matching condition (ON …); only modifies affected rows; leaves the rest untouched.
  • Granularity: row.
  • Schema requirements: more flexible — does not require exact column order or names to match.
  • Good fit: CDC ingests, slowly-changing dimensions, targeted upserts, delete-by-predicate, merge-on-key workloads.
  • Paired strategy: usually MOR for the write-side efficiency, with periodic copy-on-write compaction to keep the read side fast.

(Source: sources/2025-09-30-expedia-prefer-merge-into-over-insert-overwrite)

Why MERGE INTO is the default

The practical win is write amplification avoidance. INSERT OVERWRITE rewrites every file in the touched partition even when only a handful of rows changed; MERGE INTO over MOR writes only the delta. Expedia names the wins explicitly:

  • Faster writes — deltas, not base-file rewrites.
  • Reduced I/O — fewer object-store operations per mutation.
  • Improved query performance — when paired with compaction; without compaction MOR loses its read-side story.

Lower storage costs and lower compute costs fall out of the same cause: you're not rewriting data that didn't change.

The decision rubric

Question Answer
Are you replacing an entire partition from upstream recompute? INSERT OVERWRITE
Is the partitioning scheme stable and the batch always full? INSERT OVERWRITE
Are you applying a stream of CDC updates? MERGE INTO + MOR
Are you maintaining an SCD dimension table? MERGE INTO + MOR
Are you doing targeted upsert-by-key? MERGE INTO
Do you have a natural merge key? MERGE INTO is straightforward
No natural merge key? MERGE INTO still works but scope the merge carefully — see the companion Expedia article on well-scoped MERGE statements linked from the source post

Load-bearing caveat: compaction is not optional

MOR-backed MERGE INTO depends on periodic copy-on-write compaction to keep read-side performance bounded. Without it, accumulated delta files force per-query merge work that grows linearly with mutation volume.

"Compaction becomes necessary as the number of delta files grows to maintain optimal performance." (Source: sources/2025-09-30-expedia-prefer-merge-into-over-insert-overwrite)

This means adopting MERGE INTO is a two-step commitment: the write-time SQL surface and an operational compaction pipeline (Iceberg's rewrite_data_files, Hudi's Compactor, Delta's OPTIMIZE, or a bespoke compactor like Amazon BDT's systems/deltacat Flash Compactor).

Anti-patterns

  • INSERT OVERWRITE as the row-level update mechanism — common legacy Hive-style habit that survives into Iceberg deployments because the syntax is familiar. Rewrites orders of magnitude more data than necessary on any targeted update.
  • MOR with no compaction — a table that grows an unbounded delta log loses all read-side properties; the "fast writes" savings are eaten by unbounded read cost.
  • Mixing INSERT OVERWRITE and MERGE INTO on the same partition — schema-order strictness of INSERT OVERWRITE can silently corrupt data if columns are reordered by an upstream change, especially when interleaved with MERGE INTO writes that don't enforce order.

Seen in

Last updated · 200 distilled / 1,178 read