CONCEPT Cited by 4 sources
Open Table Format
An open table format (OTF) is a metadata layer over columnar data files on object storage that adds table semantics — atomic row-level updates, schema evolution, time-travel / snapshot versioning — to a data model that is fundamentally whole-object and immutable. Canonical implementations: Apache Iceberg, Delta Lake, Apache Hudi.
(Source: sources/2025-03-14-allthingsdistributed-s3-simplicity-is-table-stakes)
The gap OTFs fill
concepts/immutable-object-storage offers strong primitives (replication, durability, versioning, simple semantics) but stops at the object boundary. Analytical workloads want table-level primitives — mutate a row, add a column, query "as of yesterday". Implementing these by rewriting entire tables on every change is prohibitive.
The OTF pattern decouples:
- Data files (typically systems/apache-parquet) — immutable, columnar, written once.
- Metadata layer — a snapshot manifest describing "which set of data files constitutes the table at version N". Mutations produce a new manifest that references mostly the same files plus deltas.
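The split between immutable data files and a versioned manifest can be sketched in a few lines. This is a toy model, not the on-disk layout of Iceberg, Delta Lake, or Hudi; the `Snapshot` class, the `commit` helper, and the file names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """One table version: a number plus the set of data files it references."""
    version: int
    files: frozenset  # names of immutable data files (hypothetical names below)


def commit(prev: Snapshot, added=(), removed=()) -> Snapshot:
    """Build the next snapshot. The previous snapshot is never modified."""
    files = (prev.files - frozenset(removed)) | frozenset(added)
    return Snapshot(prev.version + 1, files)


v1 = Snapshot(1, frozenset({"part-000.parquet", "part-001.parquet", "part-002.parquet"}))
# A small mutation adds one delta file and leaves the bulk files alone.
v2 = commit(v1, added={"delta-000.parquet"})

shared = v1.files & v2.files
print(f"{len(shared)} of {len(v2.files)} files in v2 are reused from v1")
```

Both versions stay readable after the commit, because `v2` only adds a reference and `v1` is left as it was.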
Properties this enables
- Row-level insert/update/delete — expressed as a new snapshot that adds delta files, not by rewriting bulk Parquet.
- Schema evolution — the manifest carries the logical schema; data files carry physical column layouts; readers reconcile the two at read time.
- Time travel / branching — snapshots are addressable by id or timestamp; "read the table as of point X" is just following an older manifest pointer.
- Atomic multi-file commits — the commit is a single metadata update, even though it references many data files.
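Two of the properties above, time travel and atomic multi-file commits, both come from keeping an append-only snapshot log behind a single mutable pointer. The sketch below illustrates that idea in memory only. The `Table` class and its methods are hypothetical, timestamps are assumed to increase with each commit, and real formats perform the pointer swap through a catalog or a conditional write rather than a Python attribute.

```python
import bisect


class Table:
    """Toy table: an append-only snapshot log plus one 'current' pointer."""

    def __init__(self):
        self.log = []      # list of (timestamp, frozenset of file names)
        self.current = -1  # index of the current snapshot; -1 means empty table

    def commit(self, expected, timestamp, files):
        """Publish a snapshot only if no one else committed since `expected`.

        However many files the snapshot references, it becomes visible in
        one step: the update of `self.current`.
        """
        if self.current != expected:
            return False  # a concurrent writer won; the caller must retry
        self.log.append((timestamp, frozenset(files)))
        self.current = len(self.log) - 1
        return True

    def as_of(self, timestamp):
        """Return the files of the latest snapshot at or before `timestamp`."""
        times = [t for t, _ in self.log]
        i = bisect.bisect_right(times, timestamp) - 1
        if i < 0:
            raise LookupError("no snapshot at or before that time")
        return self.log[i][1]


t = Table()
first = t.commit(-1, 100, {"a.parquet", "b.parquet"})
second = t.commit(0, 200, {"a.parquet", "b.parquet", "delete-b-7.parquet"})
stale = t.commit(0, 300, {"c.parquet"})  # based on an outdated pointer

print(first, second, stale)
print(sorted(t.as_of(150)))
```

The third commit is rejected because it was prepared against snapshot 0 while the table had already moved to snapshot 1, and reading as of time 150 still returns the first snapshot's files.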
The externalisation cost
Because the metadata, compaction, and garbage-collection loop runs in customer code, several operational burdens sit outside the storage platform:
- Compaction — snapshot-based small updates fragment the table; periodic compaction passes are needed to keep scan performance up.
- Garbage collection — superseded snapshots and their unreferenced files have to be reclaimed by a customer-owned job.
- Storage-feature mismatch — object-level features (S3 Intelligent-Tiering, cross-region replication) don't know the logical table; they can tier or replicate inconsistently.
- Access control — IAM / ACLs are typically object-scoped; the logical table isn't a policy resource.
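The first two burdens in this list reduce to simple set and grouping logic, even though running them safely at scale is the hard part. The sketch below is a simplified illustration under stated assumptions: `unreferenced_files` and `plan_compaction` are hypothetical helpers, file sizes are plain integers, and the greedy batching is one possible policy rather than what any particular engine does.

```python
def unreferenced_files(all_files, retained_snapshots):
    """Garbage-collection candidates: files no retained snapshot points to."""
    live = set().union(*retained_snapshots)
    return set(all_files) - live


def plan_compaction(file_sizes, target_bytes):
    """Greedily group files smaller than `target_bytes` into rewrite batches.

    Batches of a single file are skipped, since rewriting one file alone
    does not reduce fragmentation.
    """
    small = sorted((size, name) for name, size in file_sizes.items()
                   if size < target_bytes)
    batches, batch, total = [], [], 0
    for size, name in small:
        if batch and total + size > target_bytes:
            if len(batch) > 1:
                batches.append(batch)
            batch, total = [], 0
        batch.append(name)
        total += size
    if len(batch) > 1:
        batches.append(batch)
    return batches


# Hypothetical table state: two retained snapshots and one orphaned file.
on_storage = {"a", "b", "c", "old"}
retained = [{"a", "b"}, {"a", "b", "c"}]
garbage = unreferenced_files(on_storage, retained)

sizes = {"a": 10, "b": 20, "c": 30, "big": 500}
plan = plan_compaction(sizes, target_bytes=64)

print(garbage)
print(plan)
```

A file is only safe to delete once every snapshot that references it has itself been expired, which is why the retention policy and the reclamation job have to agree.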
Warfield's 2025 framing: customers "were really… building their own table primitive over S3 objects." systems/s3-tables is S3 absorbing those responsibilities so the table becomes a first-class storage resource.
Trade-off axis
OTFs let customers keep their data in an open format (so any engine can read it), at the cost of running the table-management loop themselves. Managed offerings (S3 Tables, Databricks Unity Catalog, Snowflake-managed Iceberg) reduce that cost but reintroduce a form of platform coupling — typically at the catalog and compaction policy level.
Seen in
- sources/2025-03-14-allthingsdistributed-s3-simplicity-is-table-stakes — origin story of Iceberg-on-S3, gaps of customer-managed OTFs, and S3's response.
- sources/2026-04-20-databricks-mercedes-benz-cross-cloud-data-mesh — Iceberg (AWS producer) and Delta (Azure consumers) used in one mesh, with format translation happening at the Unity-Catalog federation boundary; Delta Deep Clone as the incremental-replication primitive between OTFs.
- sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2 — Amazon Retail BDT's internal copy-on-write compactor (in production since 2019 on Spark; 2024 on Ray) is an older sibling of these open formats. BDT's Flash Compactor, contributed to systems/deltacat, is designed to extend the same copy-on-write merge benefits to managed systems/apache-iceberg / systems/apache-hudi / systems/delta-lake catalogs on Ray. The post itself credits Iceberg and Hudi with canonicalising the "copy-on-write merge" vocabulary (concepts/copy-on-write-merge).
- sources/2025-09-30-expedia-prefer-merge-into-over-insert-overwrite — Expedia Group Tech primer on Iceberg update strategies: names all three surfaces the OTF pattern exposes (INSERT OVERWRITE at partition grain, COW and MOR at row grain via MERGE INTO) and prescribes MERGE INTO + MOR as the default for CDC / SCD / incremental workloads. A practical operator's-eye view of the three-way choice that the OTF metadata layer makes available.
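The copy-on-write (COW) and merge-on-read (MOR) strategies named in the entries above differ mainly in when the merge cost is paid. The toy model below is a sketch only: data files are modelled as Python dicts keyed by row id, updates are assumed to touch existing keys, and `cow_update`, `mor_update`, and `mor_read` are hypothetical names, not engine APIs.

```python
def cow_update(files, updates):
    """Copy-on-write: rewrite every file that contains an updated key."""
    new_files, rows_written = [], 0
    for f in files:
        if any(k in f for k in updates):
            f = {k: updates.get(k, v) for k, v in f.items()}
            rows_written += len(f)  # the whole file is rewritten
        new_files.append(f)
    return new_files, rows_written


def mor_update(files, deltas, updates):
    """Merge-on-read: append a small delta and leave base files untouched."""
    return files, deltas + [dict(updates)], len(updates)


def mor_read(files, deltas):
    """Readers pay the merge cost: apply deltas over the base files."""
    merged = {}
    for f in files:
        merged.update(f)
    for d in deltas:  # later deltas override earlier ones
        merged.update(d)
    return merged


base = [{1: "a", 2: "b", 3: "c"}, {4: "d", 5: "e", 6: "f"}]
change = {2: "B"}

cow_files, cow_rows = cow_update(base, change)
mor_files, mor_deltas, mor_rows = mor_update(base, [], change)

print("rows written, COW:", cow_rows, "MOR:", mor_rows)
print(mor_read(mor_files, mor_deltas) == {k: v for f in cow_files for k, v in f.items()})
```

In this model a one-row change rewrites a full file under COW but writes a single row under MOR, while both approaches present the same logical table to a reader.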