
Apache Iceberg

Apache Iceberg is an open table format that sits above columnar data files (typically systems/apache-parquet) on an object store to provide a table abstraction — atomic row-level updates, schema evolution, hidden partitioning, and snapshot-based versioning — over storage that is fundamentally immutable and object-scoped. It was originally open-sourced by Netflix in 2017.

Why it exists (per the S3-at-19 post)

Parquet became the de facto tabular format on object storage in the early 2010s, combined with Hadoop/Hive to build "data lake" tables. But customer demand moved beyond append-only scans to:

  • Row-level insert/update without rewriting entire tables.
  • Schema evolution (add/remove/rename columns without migration).
  • Time travel / versioning — query the state of the table at a specific point in the past.

These are hard to achieve directly over immutable objects — see concepts/immutable-object-storage.

(Source: sources/2025-03-14-allthingsdistributed-s3-simplicity-is-table-stakes)

How Iceberg solves it

A metadata layer over the data objects: each logical table version is a snapshot that describes the set of data files making up the table at that moment. Mutations produce a new snapshot that references mostly-unchanged files plus the small delta; consumers read by resolving the current snapshot to its file set.

Result:

  • Small updates don't require rewriting the whole table.
  • The table is implicitly versioned — stepping backward/forward in time is reading an older/newer snapshot pointer.
  • Snapshots are the atomic unit, which gives databases the transactional semantics they expect.
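The snapshot mechanics above can be sketched in a few lines of plain Python. This is an illustrative model, not the real Iceberg metadata layout (which involves metadata files, manifest lists, and manifests): a snapshot is an immutable tuple of data-file keys, a commit writes only the delta plus a new snapshot that reuses the unchanged file references, and time travel is just reading an older snapshot pointer.

```python
from dataclasses import dataclass

# Hypothetical, minimal model of Iceberg's snapshot layer.
@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    data_files: tuple  # immutable set of object keys making up the table

class Table:
    def __init__(self):
        self.snapshots = []   # full history, which is what enables time travel
        self.current = None   # pointer to the latest snapshot

    def commit(self, added=(), removed=()):
        base = self.current.data_files if self.current else ()
        files = tuple(f for f in base if f not in set(removed)) + tuple(added)
        snap = Snapshot(len(self.snapshots), files)
        self.snapshots.append(snap)
        self.current = snap   # atomic pointer swap = the transaction boundary
        return snap

    def as_of(self, snapshot_id):
        return self.snapshots[snapshot_id]  # time travel: read an old pointer

t = Table()
t.commit(added=("part-000.parquet", "part-001.parquet"))
t.commit(added=("part-002.parquet",))  # append: existing files untouched
# Row-level change: rewrite only the one affected file, not the table.
t.commit(added=("part-001b.parquet",), removed=("part-001.parquet",))

assert t.current.data_files == (
    "part-000.parquet", "part-002.parquet", "part-001b.parquet")
assert t.as_of(0).data_files == ("part-000.parquet", "part-001.parquet")
```

Note that no object is ever modified in place: every commit only adds objects and moves a pointer, which is exactly what immutable object storage can support.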

Externalisation cost (the gap S3 Tables closes)

Because Iceberg's structure is externalised — customer code owns the data/metadata object relationships — several burdens fall on the customer:

  • Compaction to merge small snapshot deltas into larger files and recover scan performance.
  • Garbage collection to reclaim space from superseded snapshot files.
  • Tiering-policy awareness — S3 Intelligent-Tiering doesn't know about Iceberg's logical layout, so tiering can misbehave.
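The garbage-collection burden in particular reduces to a reachability problem the customer must solve themselves: once old snapshots are expired, any data file referenced by no retained snapshot is dead weight in the bucket. A hedged sketch (names are illustrative, not Iceberg API):

```python
# Each snapshot is modeled as the set of data files it references.
def live_files(snapshots):
    """Union of data files referenced by any retained snapshot."""
    return set().union(*(set(s) for s in snapshots)) if snapshots else set()

all_objects = {"a.parquet", "b.parquet", "b2.parquet", "c.parquet"}
history = [
    {"a.parquet", "b.parquet"},                # snapshot 0 (to be expired)
    {"a.parquet", "b2.parquet", "c.parquet"},  # snapshot 1 (current)
]

retained = history[1:]                         # expire everything but current
orphans = all_objects - live_files(retained)   # now safe to delete
assert orphans == {"b.parquet"}
```

Until S3 Tables, running this sweep (and the compaction that creates the superseded files in the first place) was the customer's job.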

Andy Warfield (2025):

"Iceberg and other open table formats like it are effectively storage systems in their own right, but because their structure is externalised – customer code manages the relationship between iceberg data and metadata objects, and performs tasks like garbage collection – some challenges emerge."

This gap is what motivated systems/s3-tables — S3 absorbing compaction / GC / tiering as managed operations, and exposing the table itself as the first-class policy resource.

Ecosystem

  • Iceberg REST Catalog (IRC) API — standard catalog protocol Iceberg clients speak; S3 Tables added IRC support within 14 weeks of launch.
  • DuckDB Iceberg — a collaboration called out in the 2025 S3 post to accelerate in-engine Iceberg reads.
  • Native readers/writers in Spark, Flink, Trino, Snowflake, Presto, and others.

Seen in

Row-level update surfaces: MERGE INTO vs INSERT OVERWRITE

Iceberg exposes two SQL surfaces for updating table state, and they operate at very different granularities:

  • INSERT OVERWRITE — replaces an entire partition (or table). Strict schema requirements (column names, types, and order must match). Good for batch full-partition refreshes; a footgun for targeted row-level updates because it rewrites orders of magnitude more data than necessary.
  • MERGE INTO — conditional row-level upsert/delete/insert against a matching condition; only touches affected rows; more flexible schema matching. Canonical fit for CDC ingest, slowly-changing dimensions, and incremental merges.
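The granularity gap between the two surfaces can be made concrete with a toy model (plain Python standing in for the SQL semantics; row counts are illustrative):

```python
# One partition of keyed rows, of which only two actually change.
partition = {i: f"v{i}" for i in range(1000)}
updates = {3: "v3-new", 7: "v7-new"}

# INSERT OVERWRITE semantics: the whole partition is replaced, so every
# row is rewritten even though only two changed.
overwritten = {**partition, **updates}
rows_rewritten_overwrite = len(overwritten)

# MERGE INTO semantics: only rows matching the ON condition are touched.
rows_rewritten_merge = sum(1 for k in updates if k in partition)

assert rows_rewritten_overwrite == 1000
assert rows_rewritten_merge == 2
```

The 500x difference here is the "orders of magnitude" footgun: at real partition sizes, an INSERT OVERWRITE used for a targeted update rewrites gigabytes to change kilobytes.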

Underneath MERGE INTO, Iceberg offers two row-level update strategies:

  • Copy-on-Write (COW) — rewrite the entire data file for any row change. Strong consistency, immediate visibility, resource-intensive. See concepts/copy-on-write-merge.
  • Merge-on-Read (MOR) — write updates as separate delta files merged with base data at query time. Optimizes write performance; requires periodic compaction to keep the read side fast. See concepts/merge-on-read.

Operational prescription (patterns/merge-into-over-insert-overwrite): default to MERGE INTO backed by MOR for incremental workloads; reserve INSERT OVERWRITE for genuine full-partition rewrites. The load-bearing caveat is that MOR-backed MERGE INTO only stays fast if an operator runs periodic copy-on-write compaction (Iceberg's rewrite_data_files action, or — at exabyte scale — systems/deltacat).
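Why compaction is load-bearing falls out of a minimal merge-on-read sketch: updates land in delta files, every read merges base plus deltas, and "compaction" folds the deltas back into a single base so the read path stays cheap. This is analogous in spirit to rewrite_data_files, but it is illustrative Python, not the real API (Iceberg's MOR actually uses positional/equality delete files rather than key-value deltas):

```python
base = {1: "a", 2: "b", 3: "c"}
deltas = [{2: "b2"}, {3: None}, {4: "d"}]  # upsert, delete (None), insert

def read(base, deltas):
    """Query-time merge: cost grows with every un-compacted delta file."""
    merged = dict(base)
    for delta in deltas:
        for key, value in delta.items():
            if value is None:
                merged.pop(key, None)  # a delete marker hides the base row
            else:
                merged[key] = value    # an upsert shadows the base row
    return merged

def compact(base, deltas):
    """Copy-on-write compaction: materialize one new base, drop the deltas."""
    return read(base, deltas), []

assert read(base, deltas) == {1: "a", 2: "b2", 4: "d"}
base, deltas = compact(base, deltas)
assert deltas == [] and read(base, deltas) == {1: "a", 2: "b2", 4: "d"}
```

The two reads return the same table state; what compaction buys is that the second one no longer pays the per-delta merge loop — which is exactly the scan performance an operator is expected to recover.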

(Source: sources/2025-09-30-expedia-prefer-merge-into-over-insert-overwrite)
