Expedia — Why You Should Prefer MERGE INTO Over INSERT OVERWRITE in Apache Iceberg¶
Summary¶
A short Expedia Group Tech post arguing that on systems/apache-iceberg
tables, teams should default to MERGE INTO (row-level
conditional upsert, usually paired with Merge-on-Read / MOR) instead
of the legacy INSERT OVERWRITE (full-partition rewrite). The post
walks the two primary Iceberg row-level strategies — Copy-on-Write
(COW) and Merge-on-Read (MOR) — positions INSERT OVERWRITE as a
third, coarser alternative that works at partition granularity, and
enumerates the performance and cost advantages of MOR-backed
MERGE INTO for incremental workloads (CDC, slowly-changing
dimensions, targeted updates).
Key takeaways¶
- Iceberg has two row-level update strategies: Copy-on-Write (COW) rewrites entire data files when any row changes (strong consistency, immediate visibility, resource-intensive), and Merge-on-Read (MOR) stores updates as separate delta files that are combined with base files at query time (write-optimized; concepts/merge-on-read). Both are distinct from INSERT OVERWRITE, which operates at whole-partition granularity. (Source: this post)
- INSERT OVERWRITE is a partition-level blunt instrument: it completely replaces data in a table or partition with the result of a query. Good fit for batch partition refreshes where the entire partition is being recomputed. Bad fit for targeted row updates: it rewrites far more data than necessary, and its strict schema requirements (column names, data types, and order must match) become a trap when the partitioning scheme changes. (Source: this post)
- MERGE INTO is the row-level surgical tool: it conditionally updates, deletes, or inserts rows based on a matching condition and only modifies affected rows, leaving the rest of the data intact. Schema requirements are more flexible (does not require exact column order or names). Canonical fit for change data capture (CDC), slowly-changing dimensions (SCD), and other incremental update workloads. (Source: this post)
- MOR-backed MERGE INTO has three named performance wins:
    - Faster writes: changes are appended as delta files rather than rewriting entire base files.
    - Reduced I/O: only affected data is modified, which minimizes the number of object-store operations and cuts cost.
    - Query performance: MOR combines delta files with base files at read time; queries stay fast provided delta-file count is kept bounded by periodic compaction. (Source: this post)
- The load-bearing caveat: compaction becomes mandatory on MOR. "Compaction becomes necessary as the number of delta files grows to maintain optimal performance." Without a copy-on-write compaction pass collapsing accumulated deltas back into read-optimized files, MOR's write win is eaten by read-side merge cost. This is the canonical MOR trade-off. (Source: this post)
- Cost framing: MOR is positioned as cost-saving on three axes: lower storage costs (avoids rewriting whole partitions), efficient resource utilization on writes, and scale-with-volume behavior that preserves write performance as data volume grows. The post does not disclose numbers; this is a qualitative prescription, not a benchmark. (Source: this post)
- Decision rubric (implicit in the post):
    - Periodic full-partition refresh, known partition boundaries → INSERT OVERWRITE still fits.
    - Targeted row-level updates (CDC, SCD, merges with a natural key) → MERGE INTO over MOR by default.
    - MERGE INTO without a natural key → possible but requires care with merge-statement scoping; the post links out to a companion article on well-scoped MERGE statements for exactly this case.
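The granularity difference between the three write paths can be made concrete with a toy model. The sketch below is not Iceberg's implementation and does not come from the post: the file layout, partition name, and row counts are hypothetical, and it only counts how many rows each strategy would physically write when the same three existing rows are updated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    partition: str
    keys: frozenset  # row keys stored in this file


def rows_written(files, changed_keys, strategy):
    """Count rows physically written to apply updates to `changed_keys`.

    Toy model: only updates to existing rows are considered, and metadata
    or delete-file overhead is ignored.
    """
    changed = set(changed_keys)
    touched_files = [f for f in files if f.keys & changed]
    if strategy == "insert_overwrite":
        # Every file in every touched partition is replaced.
        touched_partitions = {f.partition for f in touched_files}
        return sum(len(f.keys) for f in files if f.partition in touched_partitions)
    if strategy == "copy_on_write":
        # Only files containing a changed row are rewritten, but in full.
        return sum(len(f.keys) for f in touched_files)
    if strategy == "merge_on_read":
        # Only the changed rows are appended as deltas.
        return sum(len(f.keys & changed) for f in touched_files)
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical layout: one partition holding three files of 1,000 rows each.
files = [
    DataFile("2025-09-30", frozenset(range(0, 1000))),
    DataFile("2025-09-30", frozenset(range(1000, 2000))),
    DataFile("2025-09-30", frozenset(range(2000, 3000))),
]
changed = {5, 17, 1500}  # three updated rows spread over two files

for strategy in ("insert_overwrite", "copy_on_write", "merge_on_read"):
    print(strategy, rows_written(files, changed, strategy))
```

In this layout the overwrite path writes 3,000 rows, COW writes 2,000, and MOR writes 3. The numbers are illustrative only; they say nothing about real-world cost, which the post does not quantify.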
Systems¶
- systems/apache-iceberg: the open table format whose MERGE INTO and INSERT OVERWRITE SQL surfaces are the subject of the post. Both COW and MOR are Iceberg-native row-level-update strategies.
Concepts¶
- concepts/copy-on-write-merge: Iceberg's COW strategy; also the compaction pattern that keeps MOR healthy over time. The post names both roles (update-strategy + compaction-for-MOR) within the same article.
- concepts/merge-on-read: the delta-log + base-file model the post recommends pairing with MERGE INTO.
- concepts/change-data-capture: the upstream workload shape that makes MERGE INTO the right default.
- concepts/slowly-changing-dimension: named use case; classic warehouse/dimensional-modeling pattern that MERGE handles correctly and INSERT OVERWRITE handles coarsely.
- concepts/open-table-format: the family that makes row-level semantics even possible over object storage.
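The relationship between merge-on-read and compaction can be sketched in a few lines. This is a deliberately simplified model, assuming deltas are plain key-to-row maps with `None` marking a delete; Iceberg itself tracks deletes through dedicated delete files, and none of the names below come from the post.

```python
def read_merged(base, deltas):
    """Merge-on-read: apply delta files, oldest first, on top of the base rows.

    `base` maps key -> row. Each delta maps key -> row, or key -> None for a delete.
    """
    view = dict(base)
    for delta in deltas:
        for key, row in delta.items():
            if row is None:
                view.pop(key, None)  # delete marker
            else:
                view[key] = row      # insert or update
    return view


def compact(base, deltas):
    """Compaction: fold every delta into a new base and leave no deltas behind."""
    return read_merged(base, deltas), []


base = {1: "alice", 2: "bob", 3: "carol"}
deltas = [
    {2: "bob-v2"},         # update
    {3: None, 4: "dave"},  # delete plus insert
]

before = read_merged(base, deltas)
new_base, new_deltas = compact(base, deltas)
after = read_merged(new_base, new_deltas)

# Compaction changes the physical layout, not the query result.
assert before == after
print("delta files merged per read:", len(deltas), "->", len(new_deltas))
```

Every read pays for the loop over `deltas`, which is why the delta count has to stay bounded; after compaction the same read touches only the rewritten base.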
Patterns¶
- patterns/merge-into-over-insert-overwrite: the operational prescription this post is named for. Default to MERGE INTO on Iceberg row-level updates; reserve INSERT OVERWRITE for genuine partition-scale rewrites.
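As a rough illustration of what the pattern looks like in practice, the helpers below render the three statements involved: switching a table to merge-on-read, upserting with MERGE INTO, and compacting afterwards. This is a sketch under stated assumptions, not something shown in the post: it assumes Spark SQL with the Iceberg extensions enabled and a format-v2 table, and the catalog, table, source, and key names (`lake`, `sales.bookings`, `booking_updates`, `booking_id`) are hypothetical.

```python
def enable_mor_sql(target):
    """Render the ALTER TABLE that sets Iceberg's row-level write modes to MOR."""
    return (
        f"ALTER TABLE {target} SET TBLPROPERTIES (\n"
        "  'write.merge.mode' = 'merge-on-read',\n"
        "  'write.update.mode' = 'merge-on-read',\n"
        "  'write.delete.mode' = 'merge-on-read'\n"
        ")"
    )


def merge_into_sql(target, source, key_columns):
    """Render a MERGE INTO that upserts `source` into `target` on the given keys."""
    if not key_columns:
        raise ValueError("at least one key column is required for the ON clause")
    on_clause = " AND ".join(f"t.{col} = s.{col}" for col in key_columns)
    return (
        f"MERGE INTO {target} t\n"
        f"USING {source} s\n"
        f"ON {on_clause}\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


def compaction_sql(catalog, table):
    """Render a call to Iceberg's rewrite_data_files procedure for `table`."""
    return f"CALL {catalog}.system.rewrite_data_files(table => '{table}')"


# Hypothetical names, for illustration only.
print(enable_mor_sql("lake.sales.bookings"))
print(merge_into_sql("lake.sales.bookings", "booking_updates", ["booking_id"]))
print(compaction_sql("lake", "sales.bookings"))
```

The statements are only rendered here, not executed; in a Spark job each string would typically be passed to `spark.sql(...)`. How often the compaction call should run is not something the post specifies.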
Operational numbers¶
None disclosed. The post is a qualitative recommendation, not a benchmark or incident retrospective. Reported trade-offs are directional (faster / cheaper / more efficient) rather than measured.
Caveats¶
- No benchmarks: no ingest rate, no query-latency numbers, no cost breakdown, no before/after comparison at Expedia scale. The recommendation is sound but unquantified in this post.
- No Expedia-specific production detail: the post reads as a general Iceberg-best-practices primer rather than an Expedia retrospective. No mention of internal data-platform architecture, job-orchestration, or incidents that drove the prescription.
- MOR's read-side compaction story is hand-waved: "compaction becomes necessary" is named as the load-bearing caveat but the post doesn't describe Expedia's compaction policy (trigger, cadence, file-size targets, scheduler). The companion article linked in the body — Boost Iceberg performance and cut compute costs with well-scoped merge statements — addresses MERGE-statement scoping but not end-to-end compaction operations.
- COW vs MOR is presented as a binary; in practice both Iceberg and systems/apache-hudi allow per-operation or per-partition choice of strategy, and mixing is common.
- INSERT OVERWRITE's schema-order-strict footgun is named ("column names, data types, and order must match") but not demonstrated with an example.
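The column-order footgun is easy to reproduce in miniature. The sketch below is not from the post. It assumes the engine resolves the query's output columns by position (as Spark SQL does for an INSERT without an explicit column list) and contrasts that with a name-based mapping. The column names and row values are hypothetical.

```python
def insert_by_position(target_columns, source_columns, source_row):
    """Positional resolution: the i-th source value lands in the i-th target column."""
    if len(target_columns) != len(source_columns):
        raise ValueError("column count mismatch")
    return dict(zip(target_columns, source_row))


def insert_by_name(target_columns, source_columns, source_row):
    """Name-based resolution: each target column takes the same-named source value."""
    named = dict(zip(source_columns, source_row))
    return {col: named[col] for col in target_columns}


target_columns = ["booking_id", "status", "currency"]
# The source query lists the same columns, with the last two swapped.
source_columns = ["booking_id", "currency", "status"]
source_row = (42, "USD", "CONFIRMED")

print(insert_by_position(target_columns, source_columns, source_row))
print(insert_by_name(target_columns, source_columns, source_row))
```

Because `status` and `currency` share a type, the positional path raises no error and stores `"USD"` as the status. Writing explicit column assignments in a MERGE clause corresponds to the name-based path.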
Raw source¶
- Raw file: raw/expedia/2025-09-30-why-you-should-prefer-merge-into-over-insert-overwrite-in-ap-a8e5fcb9.md
- Original URL: https://medium.com/expedia-group-tech/why-you-should-prefer-merge-into-over-insert-overwrite-in-apache-iceberg-b6b130cc27d2
- Fetched: 2026-04-21
- Published: 2025-09-30