CONCEPT Cited by 5 sources

Open Table Format¶

An open table format (OTF) is a metadata layer over columnar data files on object storage that adds table semantics — atomic row-level updates, schema evolution, time-travel / snapshot versioning — to a data model that is fundamentally whole-object and immutable. Canonical implementations: Apache Iceberg, Delta Lake, Apache Hudi.

(Source: sources/2025-03-14-allthingsdistributed-s3-simplicity-is-table-stakes)

The gap OTFs fill¶

concepts/immutable-object-storage offers strong primitives (replication, durability, versioning, simple semantics) but stops at the object boundary. Analytical workloads want table-level primitives — mutate a row, add a column, query "as of yesterday". Implementing these by rewriting entire tables on every change is prohibitive.

The OTF pattern decouples:

Data files (typically systems/apache-parquet) — immutable, columnar, written once.
Metadata layer — a snapshot manifest describing "which set of data files constitutes the table at version N". Mutations produce a new manifest that references mostly the same files plus deltas.
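The split between immutable data files and a versioned metadata layer can be illustrated with a minimal sketch. The `Snapshot` class, the `commit` helper, and the file names are hypothetical and chosen only for illustration; real formats store considerably richer metadata (statistics, partition info, schemas).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical manifest: the set of data files that make up one table version."""
    version: int
    data_files: frozenset  # object keys of immutable, write-once Parquet files


def commit(parent: Snapshot, added=frozenset(), removed=frozenset()) -> Snapshot:
    """Build the next snapshot. No data file is modified; only the file set changes."""
    files = (parent.data_files - frozenset(removed)) | frozenset(added)
    return Snapshot(parent.version + 1, files)


v1 = Snapshot(1, frozenset({"data/a.parquet", "data/b.parquet"}))
v2 = commit(v1, added={"data/c.parquet"})

# Version 2 references every file of version 1 plus one new delta file.
shared = v1.data_files & v2.data_files
```

Because `v1` is never mutated, both versions stay readable side by side, and the cost of a small change is one new file plus one new manifest.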

Properties this enables¶

Row-level insert/update/delete — expressed as a new snapshot that adds delta files, not by rewriting bulk Parquet.
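Schema evolution — the manifest carries the logical schema and data files carry their physical column layouts; readers reconcile the two at read time, so a change such as adding a column does not require rewriting existing files.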
Time travel / branching — snapshots are addressable by id or timestamp; "read the table as of point X" is just following an older manifest pointer.
Atomic multi-file commits — the commit is a single metadata update, even though it references many data files.
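Time travel and atomic commits can be sketched together as a toy in-memory model. Everything here (`ToyTable`, `CommitConflict`, the millisecond timestamps) is a hypothetical illustration rather than any format's API; the lock stands in for whatever atomic pointer swap a real catalog or object store provides, and the sketch assumes commits arrive with non-decreasing timestamps.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    timestamp_ms: int
    data_files: frozenset


class CommitConflict(Exception):
    """Raised when a writer's base snapshot is no longer the table head."""


class ToyTable:
    def __init__(self):
        self._lock = threading.Lock()  # stands in for an atomic metadata-pointer swap
        self._history = [Snapshot(0, 0, frozenset())]

    def head(self):
        return self._history[-1]

    def commit(self, base_id, timestamp_ms, added, removed=frozenset()):
        """One metadata update publishes any number of data files at once."""
        with self._lock:
            head = self._history[-1]
            if head.snapshot_id != base_id:
                raise CommitConflict(f"base {base_id} is stale; head is {head.snapshot_id}")
            files = (head.data_files - frozenset(removed)) | frozenset(added)
            snapshot = Snapshot(head.snapshot_id + 1, timestamp_ms, files)
            self._history.append(snapshot)
            return snapshot

    def as_of(self, timestamp_ms):
        """Time travel: follow the newest snapshot at or before the timestamp."""
        older = [s for s in self._history if s.timestamp_ms <= timestamp_ms]
        if not older:
            raise LookupError("no snapshot at or before that timestamp")
        return older[-1]


table = ToyTable()
table.commit(base_id=0, timestamp_ms=1_000, added={"a.parquet", "b.parquet"})
table.commit(base_id=1, timestamp_ms=2_000, added={"c.parquet"}, removed={"a.parquet"})

earlier = table.as_of(1_500)  # resolves to snapshot 1

try:
    # A second writer that still believes snapshot 1 is the head.
    table.commit(base_id=1, timestamp_ms=3_000, added={"d.parquet"})
    conflict = False
except CommitConflict:
    conflict = True  # the writer would rebase on snapshot 2 and retry
```

Readers never observe a half-applied change in this model: a snapshot is either in the history with its full file set or it is absent.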

The externalisation cost¶

Because the metadata + compaction + GC loop runs in customer code, several operational burdens sit outside the platform:

Compaction — snapshot-based small updates fragment the table; periodic compaction passes are needed to keep scan performance up.
Garbage collection — superseded snapshots and their unreferenced files have to be reclaimed by a customer-owned job.
Storage-feature mismatch — object-level features (S3 Intelligent-Tiering, cross-region replication) operate on individual objects and don't know the logical table, so the files behind a single snapshot can end up tiered or replicated inconsistently.
Access control — IAM / ACLs are typically object-scoped; the logical table isn't a policy resource.
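The compaction and garbage-collection duties listed above can be sketched as two small planning functions. The function names, thresholds, and file names are assumptions made for illustration; production jobs would also consider partitioning, sort order, and files that are still being written by in-flight commits.

```python
def plan_compaction(file_sizes, small_threshold, target_size):
    """Group files smaller than small_threshold into bins of at most target_size bytes.

    Returns a list of bins; each bin is a list of file names to rewrite as one file.
    """
    small = sorted(name for name, size in file_sizes.items() if size < small_threshold)
    bins, current, current_bytes = [], [], 0
    for name in small:
        size = file_sizes[name]
        if current and current_bytes + size > target_size:
            bins.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        bins.append(current)
    # Rewriting a single file on its own gains nothing, so drop singleton bins.
    return [b for b in bins if len(b) > 1]


def orphaned_files(stored, retained_snapshots):
    """Files present in storage but referenced by no retained snapshot."""
    referenced = set().union(*retained_snapshots)
    return set(stored) - referenced


sizes = {"f1": 10, "f2": 20, "f3": 30, "big": 500}
plan = plan_compaction(sizes, small_threshold=100, target_size=50)

# After f1 and f2 are merged and older snapshots are expired,
# only the latest snapshot is retained.
stored = {"f1", "f2", "f3", "big", "merged"}
garbage = orphaned_files(stored, retained_snapshots=[{"merged", "f3", "big"}])
```

In this example the plan merges `f1` and `f2`, and once the snapshots that referenced them are expired, those two files are the ones a GC job could reclaim.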

Warfield's 2025 framing: customers "were really… building their own table primitive over S3 objects." systems/s3-tables is S3 absorbing those responsibilities so the table becomes a first-class storage resource.

Trade-off axis¶

OTFs let customers keep their data in an open format (so any engine can read it), at the cost of running the table-management loop themselves. Managed offerings (S3 Tables, Databricks Unity Catalog, Snowflake-managed Iceberg) reduce that cost but reintroduce a form of platform coupling — typically at the catalog and compaction policy level.

Seen in¶

sources/2025-03-14-allthingsdistributed-s3-simplicity-is-table-stakes — origin story of Iceberg-on-S3, gaps of customer-managed OTFs, and S3's response.
sources/2026-04-20-databricks-mercedes-benz-cross-cloud-data-mesh — Iceberg (AWS producer) and Delta (Azure consumers) used in one mesh, with format translation happening at the Unity-Catalog federation boundary; Delta Deep Clone as the incremental-replication primitive between OTFs.
sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2 — Amazon Retail BDT's internal copy-on-write compactor (in production on Spark since 2019 and on Ray since 2024) is an older sibling of these open formats. BDT's Flash Compactor, contributed to systems/deltacat, is designed to extend the same copy-on-write merge benefits to managed systems/apache-iceberg / systems/apache-hudi / systems/delta-lake catalogs on Ray. The post itself credits Iceberg and Hudi with canonicalising the "copy-on-write merge" vocabulary (concepts/copy-on-write-merge).
sources/2025-09-30-expedia-prefer-merge-into-over-insert-overwrite — Expedia Group Tech primer on Iceberg update strategies: names all three surfaces the OTF pattern exposes (INSERT OVERWRITE at partition grain, COW and MOR at row grain via MERGE INTO) and prescribes MERGE INTO + MOR as the default for CDC / SCD / incremental workloads. A practical operator's-eye view of the three-way choice that the OTF metadata layer makes available.
sources/2025-01-21-redpanda-implementing-the-medallion-architecture-with-redpanda — OTFs as the table-semantics layer beneath the Medallion Architecture. Redpanda's pedagogical explainer canonicalises a two-layer storage substrate as the foundation for the three-tier Bronze/Silver/Gold data-quality progression: open file formats (Parquet / ORC) provide the object-level columnar layout, and OTFs (Iceberg) provide the metadata layer. Verbatim: "storing only data in open file formats wouldn't be sufficient. We need a metadata layer on top of it to provide transactional guarantees, schema evolution, and many more. This is where table formats come into the scene." It also canonicalises the architectural move in which a streaming broker (Redpanda, via Iceberg topics) writes into the OTF directly, so the broker serves as the Bronze tier without external ETL (patterns/streaming-broker-as-lakehouse-bronze-sink).
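
The two row-grain strategies named in the Expedia entry above, copy-on-write (COW) and merge-on-read (MOR), can be contrasted with a toy sketch. Rows are plain dictionaries keyed by a hypothetical `id` field and a "file" is just a list of rows; the delta layout is an assumption made for illustration and is much simpler than the delete-file encodings that real formats use.

```python
def cow_update(base_file, updates):
    """Copy-on-write: produce a full replacement file with the updates applied."""
    return [updates.get(row["id"], row) for row in base_file]


def mor_write(upserts, deletes=frozenset()):
    """Merge-on-read: write only a small delta; the base file is left untouched."""
    return {"upserts": dict(upserts), "deletes": set(deletes)}


def mor_read(base_file, deltas):
    """Readers merge the base file with its deltas, oldest delta first."""
    rows = {row["id"]: row for row in base_file}
    for delta in deltas:
        for key in delta["deletes"]:
            rows.pop(key, None)
        rows.update(delta["upserts"])
    return list(rows.values())


base = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
change = {2: {"id": 2, "v": "B"}}

cow_result = cow_update(base, change)             # write cost: whole file
mor_result = mor_read(base, [mor_write(change)])  # write cost: one small delta
after_delete = mor_read(base, [mor_write({}, deletes={1})])
```

Both paths yield the same logical rows; they differ in where the work lands. COW pays at write time by rewriting the file, while MOR defers the merge to readers until compaction folds the deltas back into base files.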