
Open File Format

An open file format is a non-proprietary, language-agnostic representation for tabular data at the level of individual files (objects) on storage. The canonical pair in the 2020s:

  • Apache Parquet (2013) — the de facto standard columnar format on cloud object stores. Row-group + column-chunk layout, per-column compression/encoding, and per-chunk min/max statistics that enable row-group skipping.
  • ORC (Optimized Row Columnar) (2013) — Hortonworks-originated alternative with a similar columnar design: stripes instead of row groups, built-in indexes, and primitives supporting Hive's ACID transactions. Historically strong in the Hadoop/Hive ecosystem; Parquet has broader uptake outside Hadoop.

Both sit at the object-level layer in a lakehouse stack — below an open table format (Iceberg / Delta Lake / Hudi) and above raw object storage (S3 / GCS / ADLS).
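
The per-column compression mentioned above comes from value locality: a column chunk stores values of one type (often low-cardinality or sorted) contiguously, which general-purpose compressors exploit far better than interleaved rows. A stdlib-only sketch of the effect — this is an illustration of the principle, not Parquet's actual encoding, and the synthetic data is invented for the demo:

```python
import zlib

# Synthetic table: a low-cardinality 'color' column and a pseudo-random 'id' column.
rows = [("red" if i < 1500 else "blue", str(i * 7919 % 10000)) for i in range(2000)]

# Row-oriented serialisation: values from different columns interleaved per record.
row_major = "\n".join(f"{c}|{n}" for c, n in rows).encode()

# Column-oriented serialisation: each column's values stored contiguously,
# as in a Parquet column chunk.
col_major = ("\n".join(c for c, _ in rows) + "\x00" + "\n".join(n for _, n in rows)).encode()

row_size = len(zlib.compress(row_major, 9))
col_size = len(zlib.compress(col_major, 9))

print(f"row-major compressed:    {row_size} bytes")
print(f"column-major compressed: {col_size} bytes")
assert col_size < row_size  # the long run of identical 'color' values compresses to almost nothing
```

Real Parquet writers go further than this: dictionary and run-length encodings are applied per column before the general-purpose codec (Snappy, ZSTD, gzip) runs.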

What "open" means here

  • Specification is public — the binary layout is documented and multiple independent implementations exist (Java, C++, Rust, Python, Go).
  • No vendor lock-in at the storage layer — customers can read data written by one engine using a different engine without re-serialisation.
  • Language-agnostic ecosystem — readers/writers in every major language used for data work.

Why open file formats matter for lakehouse architectures

  1. Efficient columnar queries on object storage — analytical queries typically touch only a few of a table's columns; a columnar layout lets readers fetch just those columns (often a 10-100× I/O reduction versus a row-oriented format).
  2. Predicate pushdown + row-group skipping — per-column min/max statistics let readers prove a row group contains no matching rows and skip it without reading its data pages.
  3. Per-column compression — storing similarly typed values contiguously yields compression ratios 2-10× better than general-purpose compression of row-oriented data.
  4. Multi-engine compatibility — the same Parquet file is readable by Spark, Flink, Trino, Presto, ClickHouse, DuckDB, and a dozen others. Decouples storage from compute choice.
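
Points 1 and 2 can be sketched together with a toy in-memory model of a columnar file — invented names (`ColumnChunk`, `RowGroup`, `scan`), not any real reader's API, but the same skip-then-prune logic Parquet readers apply:

```python
from dataclasses import dataclass

@dataclass
class ColumnChunk:
    values: list

    @property
    def stats(self):
        # Per-column min/max, as Parquet stores per column chunk.
        return (min(self.values), max(self.values))

@dataclass
class RowGroup:
    columns: dict  # column name -> ColumnChunk

def scan(row_groups, wanted_cols, predicate_col, lo, hi):
    """Read only `wanted_cols`, skipping any row group whose min/max
    statistics prove `predicate_col` has no value in [lo, hi]."""
    out, groups_read = [], 0
    for rg in row_groups:
        mn, mx = rg.columns[predicate_col].stats
        if mx < lo or mn > hi:      # row-group skipping: never touch its pages
            continue
        groups_read += 1
        chunks = {c: rg.columns[c].values for c in wanted_cols}  # column pruning
        for i, v in enumerate(rg.columns[predicate_col].values):
            if lo <= v <= hi:
                out.append({c: chunks[c][i] for c in wanted_cols})
    return out, groups_read

# Three row groups holding sorted timestamps 0-99, 100-199, 200-299.
rgs = [
    RowGroup({"ts": ColumnChunk(list(range(b, b + 100))),
              "val": ColumnChunk([x * 2 for x in range(b, b + 100)])})
    for b in (0, 100, 200)
]
rows, read = scan(rgs, ["val"], "ts", 150, 160)
print(read, len(rows))  # 1 of 3 row groups read; 11 matching rows
```

Skipping works best when data is clustered on the predicate column (as here, with sorted timestamps); on randomly ordered data the min/max ranges overlap and few groups can be eliminated.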

The gap open file formats don't close

File formats are object-level. They know nothing about:

  • How many files together constitute a logical table
  • Atomic multi-file commits (writing N files and having them become visible simultaneously)
  • Schema evolution (renaming a column across all files)
  • Time travel (reading the table as of a past point)
  • Partitioning as a first-class table property
  • ACID row-level updates / deletes
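
The first two gaps — file membership and atomic multi-file commits — are what a table format's metadata layer closes. A minimal sketch of the core trick, assuming nothing about Iceberg's or Delta's actual metadata layout: a manifest file lists the table's data files, and a commit writes a new manifest then atomically swaps a pointer, so N new files become visible in one step (all function names here are invented for the illustration):

```python
import json, os, tempfile

def current_files(table_dir):
    """List the table's data files per the current manifest."""
    ptr = os.path.join(table_dir, "current")
    if not os.path.exists(ptr):
        return []
    with open(ptr) as f:
        manifest_name = f.read().strip()
    with open(os.path.join(table_dir, manifest_name)) as m:
        return json.load(m)["files"]

def commit(table_dir, new_files):
    """Add N data files to the table atomically: write a new manifest
    listing old + new files, then swap the 'current' pointer with one
    os.replace (atomic on POSIX). Readers see all N files or none."""
    files = current_files(table_dir) + list(new_files)
    manifest_name = f"manifest-{len(files)}.json"
    with open(os.path.join(table_dir, manifest_name), "w") as m:
        json.dump({"files": files}, m)
    tmp = os.path.join(table_dir, "current.tmp")
    with open(tmp, "w") as f:
        f.write(manifest_name)
    os.replace(tmp, os.path.join(table_dir, "current"))  # the atomic step

d = tempfile.mkdtemp()
commit(d, ["part-0.parquet", "part-1.parquet"])
commit(d, ["part-2.parquet"])
print(current_files(d))
```

Because old manifests are never mutated, keeping them around also yields time travel for free: reading yesterday's manifest reads yesterday's table. Real table formats add concurrency control on the pointer swap, schema metadata, and partition/statistics indexes on top of this skeleton.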

These are table-level concerns, which is why open table formats (concepts/open-table-format, canonicalised as Iceberg / Delta Lake / Hudi) emerged as a metadata layer on top:

"storing only data in open file formats wouldn't be sufficient. We need a metadata layer on top of it to provide transactional guarantees, schema evolution, and many more. This is where table formats come into the scene." (Source: sources/2025-01-21-redpanda-implementing-the-medallion-architecture-with-redpanda)

The resulting two-layer substrate:

┌─────────────────────────────────────────┐
│  Open Table Format (Iceberg / Delta)    │  ← metadata, ACID, schema evolution
├─────────────────────────────────────────┤
│  Open File Format (Parquet / ORC)       │  ← columnar layout
├─────────────────────────────────────────┤
│  Object Storage (S3 / GCS / ADLS)       │  ← durable, scalable bytes
└─────────────────────────────────────────┘
