Apache Iceberg¶
Apache Iceberg is an open table format that sits above columnar data files (typically systems/apache-parquet) on an object store to provide a table abstraction — atomic row-level updates, schema evolution, hidden partitioning, and snapshot-based versioning — over storage that is fundamentally immutable and object-scoped. It was originally developed at Netflix and open-sourced in 2017.
Why it exists (per the S3-at-19 post)¶
Parquet became the de facto tabular on-object format in the early 2010s, combined with Hadoop/Hive for "data lake" tables. But customer demand moved beyond append-only scans to:
- Row-level insert/update without rewriting entire tables.
- Schema evolution (add/remove/rename columns without migration).
- Time travel / versioning — query the state of the table at a specific point in the past.
These are hard to achieve directly over immutable objects — see concepts/immutable-object-storage.
(Source: sources/2025-03-14-allthingsdistributed-s3-simplicity-is-table-stakes)
How Iceberg solves it¶
A metadata layer over the data objects: each logical table version is a snapshot that describes the set of data files making up the table at that moment. Mutations produce a new snapshot that references mostly-unchanged files plus the small delta; consumers read by resolving the current snapshot to its file set.
Result:
- Small updates don't require rewriting the whole table.
- The table is implicitly versioned — stepping backward/forward in time is reading an older/newer snapshot pointer.
- Snapshots are the atomic unit, which gives databases the transactional semantics they expect.
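The snapshot mechanism above can be modelled in a few lines of Python. This is a toy sketch, not Iceberg's actual metadata format; the `Snapshot` and `Table` names and the file paths are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Snapshot:
    """One immutable table version: the set of data files it comprises."""
    snapshot_id: int
    files: frozenset  # paths of immutable data files on object storage

class Table:
    """Append-only log of snapshots; the last entry is the current version."""
    def __init__(self):
        self.snapshots = []

    def commit(self, added=(), removed=()):
        prev = self.snapshots[-1].files if self.snapshots else frozenset()
        # A mutation reuses the unchanged files and records only the delta.
        new = Snapshot(len(self.snapshots),
                       (prev - frozenset(removed)) | frozenset(added))
        self.snapshots.append(new)
        return new

    def current_files(self):
        return self.snapshots[-1].files

    def files_at(self, snapshot_id):
        # Time travel: resolve an older snapshot instead of the latest one.
        return self.snapshots[snapshot_id].files

t = Table()
t.commit(added={"data-0.parquet", "data-1.parquet"})            # snapshot 0
t.commit(added={"data-2.parquet"}, removed={"data-1.parquet"})  # snapshot 1

print(sorted(t.current_files()))  # ['data-0.parquet', 'data-2.parquet']
print(sorted(t.files_at(0)))      # ['data-0.parquet', 'data-1.parquet']
```

Note that the second commit writes one new object and flips a pointer; `data-0.parquet` is never rewritten, which is the whole trick for mutability over immutable storage.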
Externalisation cost (the gap S3 Tables closes)¶
Because Iceberg's structure is externalised — customer code owns the data/metadata object relationships — several burdens fall on the customer:
- Compaction to merge small snapshot deltas into larger files and recover scan performance.
- Garbage collection to reclaim space from superseded snapshot files.
- Tiering-policy awareness — S3 Intelligent-Tiering doesn't know about Iceberg's logical layout, so tiering can misbehave.
Andy Warfield (2025):
"Iceberg and other open table formats like it are effectively storage systems in their own right, but because their structure is externalised – customer code manages the relationship between iceberg data and metadata objects, and performs tasks like garbage collection – some challenges emerge."
This gap is what motivated systems/s3-tables — S3 absorbing compaction / GC / tiering as managed operations, and exposing the table itself as the first-class policy resource.
Ecosystem¶
- Iceberg REST Catalog (IRC) API — standard catalog protocol Iceberg clients speak; S3 Tables added IRC support within 14 weeks of launch.
- DuckDB Iceberg — collaboration called out in the 2025 S3 post to accelerate in-engine Iceberg reads.
- Native readers/writers in Spark, Flink, Trino, Snowflake, Presto, and others.
Seen in¶
- sources/2025-03-14-allthingsdistributed-s3-simplicity-is-table-stakes — Iceberg's design, what it externalises, and the motivation for pulling it into S3 as a first-class construct.
- sources/2026-04-20-databricks-mercedes-benz-cross-cloud-data-mesh — Iceberg on AWS Glue as the Mercedes-Benz after-sales source of truth; systems/unity-catalog federates the Iceberg tables so they can be shared out as Delta via systems/delta-sharing without rewriting into Delta first — a concrete case of OTF→OTF translation at the catalog/federation boundary.
- sources/2026-04-07-allthingsdistributed-s3-files-and-the-changing-face-of-s3 — Warfield's S3 Files launch essay positions Iceberg as the structural-data precursor that led to the platform's patterns/presentation-layer-over-storage thesis: "Iceberg was clearly helping lift the level of abstraction for tabular data on S3, [but] it also still carried a set of sharp edges because it was having to surface tables strictly over the object API." Customer scale cited: over 2M tables stored in systems/s3-tables today, off the back of the Iceberg installed base.
- sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2 — named as one of the three open table formats Amazon BDT's Ray compactor (contributed to systems/deltacat) is designed to extend to, alongside systems/apache-hudi and systems/delta-lake. The post also recognises Iceberg (and Hudi) as having canonicalised the "copy-on-write merge" name for the compaction strategy Amazon BDT was already running in-house in 2019 over its exabyte-scale catalog — years before publishing the shared vocabulary.
Row-level update surfaces: MERGE INTO vs INSERT OVERWRITE¶
Iceberg exposes two SQL surfaces for updating table state, and they operate at very different granularities:
- INSERT OVERWRITE — replaces an entire partition (or table). Strict schema requirements (column names, types, and order must match). Good for batch full-partition refreshes; a footgun for targeted row-level updates because it rewrites orders of magnitude more data than necessary.
- MERGE INTO — conditional row-level upsert/delete/insert against a matching condition; only touches affected rows; more flexible schema matching. Canonical fit for CDC ingest, slowly-changing dimensions, and incremental merges.
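The granularity difference is easiest to see in a toy model, where a partition is a plain dict rather than a set of Parquet files. The function names here are illustrative stand-ins for the two SQL surfaces, not Iceberg APIs:

```python
# Toy partition: key -> row value. Real Iceberg partitions are Parquet files.
partition = {1: "a", 2: "b", 3: "c"}

def insert_overwrite(partition, new_rows):
    """Partition-level: the old contents are discarded wholesale."""
    return dict(new_rows)  # anything not restated in new_rows is gone

def merge_into(partition, updates):
    """Row-level upsert: only matching or new keys are touched."""
    merged = dict(partition)
    merged.update(updates)
    return merged

# MERGE INTO touches one row; the other rows survive untouched.
print(merge_into(partition, {2: "B"}))        # {1: 'a', 2: 'B', 3: 'c'}

# INSERT OVERWRITE with the same payload silently drops rows 1 and 3;
# this is the footgun when it is misused for a targeted update.
print(insert_overwrite(partition, {2: "B"}))  # {2: 'B'}
```

The same asymmetry holds at scale: a one-row MERGE INTO writes a small delta, while a one-row INSERT OVERWRITE forces the caller to restate (and rewrite) the entire partition.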
Underneath MERGE INTO, Iceberg offers two row-level update strategies:
- Copy-on-Write (COW) — rewrite the entire data file for any row change. Strong consistency, immediate visibility, resource-intensive. See concepts/copy-on-write-merge.
- Merge-on-Read (MOR) — write updates as separate delta files merged with base data at query time. Optimises write performance; requires periodic compaction to keep the read side fast. See concepts/merge-on-read.
Operational prescription (patterns/merge-into-over-insert-overwrite): default to MERGE INTO over MOR for incremental workloads; reserve INSERT OVERWRITE for genuine full-partition rewrites. The load-bearing caveat is that MOR-backed MERGE INTO only stays fast if an operator runs periodic copy-on-write compaction (Iceberg's rewrite_data_files action, or — at exabyte scale — systems/deltacat).
(Source: sources/2025-09-30-expedia-prefer-merge-into-over-insert-overwrite)
Seen in (additional)¶
- sources/2025-09-30-expedia-prefer-merge-into-over-insert-overwrite — Expedia Group Tech primer naming MERGE INTO (row-level, MOR) as the default for Iceberg incremental updates and INSERT OVERWRITE (partition-level) as the legacy blunt instrument; qualitative trade-offs only, no numbers. Source for concepts/merge-on-read and patterns/merge-into-over-insert-overwrite on this wiki.
- sources/2025-11-04-datadog-replication-redefined-multi-tenant-cdc-platform — Iceberg as one of the downstream sink destinations in Datadog's managed multi-tenant CDC platform. Postgres → Iceberg pipelines are cited for "scalable, event-driven analytics" — one of five sink classes generalised from the original Postgres-to-search seed pipeline. No Iceberg-internals detail in the post (the reference is to Apache Flink CDC's Iceberg connector); value is the data point that CDC-to-Iceberg is a first-class replication shape at Datadog scale alongside CDC-to-search and DB-to-DB replication.