Open File Format¶
An open file format is a non-proprietary, language-agnostic, object-level representation of tabular data on storage. The canonical pair in the 2020s:
- Apache Parquet (2013) — the de facto columnar format on cloud object stores. Row-group + column-chunk layout, per-column compression/encoding, min/max statistics for row-group skipping.
- ORC (Optimized Row Columnar) (2013) — Hortonworks-originated alternative with a similar columnar design: stripes, indexes, and ACID primitives baked in. Historically strong in the Hadoop/Hive ecosystem; Parquet has broader non-Hadoop uptake.
Both sit at the object-level layer in a lakehouse stack — below an open table format (Iceberg / Delta Lake / Hudi) and above raw object storage (S3 / GCS / ADLS).
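The row-group + column-chunk layout can be sketched in plain Python (a toy model, not Parquet's actual binary encoding): rows are partitioned into groups, each group stores one chunk per column, and each chunk carries the min/max statistics that later enable row-group skipping.

```python
from dataclasses import dataclass

@dataclass
class ColumnChunk:
    """One column's values within a single row group, plus min/max stats."""
    name: str
    values: list
    min: object
    max: object

def to_row_groups(columns: dict[str, list], row_group_size: int) -> list[list[ColumnChunk]]:
    """Split a column-oriented table into row groups of column chunks,
    computing per-chunk min/max statistics as a real writer would."""
    n_rows = len(next(iter(columns.values())))
    groups = []
    for start in range(0, n_rows, row_group_size):
        group = []
        for name, values in columns.items():
            chunk = values[start:start + row_group_size]
            group.append(ColumnChunk(name, chunk, min(chunk), max(chunk)))
        groups.append(group)
    return groups

table = {"id": [1, 2, 3, 4, 5, 6], "amount": [10, 70, 30, 5, 90, 40]}
groups = to_row_groups(table, row_group_size=3)
# Row group 0 holds rows 0-2; its "amount" chunk records min=10, max=70.
```

Because values of one column sit contiguously inside a chunk, a reader can fetch a single column without touching the others — the property the bullets below build on.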
What "open" means here¶
- Specification is public — the binary layout is documented and multiple independent implementations exist (Java, C++, Rust, Python, Go).
- No vendor lock-in at the storage layer — customers can read data written by one engine using a different engine without re-serialisation.
- Language-agnostic ecosystem — readers/writers in every major language used for data work.
Why open file formats matter for lakehouse architectures¶
- Efficient columnar queries on object storage — analytical queries typically touch a small subset of columns; a columnar layout lets readers fetch only the columns they need (often a 10-100× I/O reduction vs a row-oriented format).
- Predicate pushdown + row-group skipping — per-column statistics let readers skip entire row groups without reading them.
- Per-column compression — storing similar values contiguously compresses far better, with ratios typically 2-10× better than general-purpose compression of row-oriented data.
- Multi-engine compatibility — the same Parquet file is readable by Spark, Flink, Trino, Presto, ClickHouse, DuckDB, and a dozen others. Decouples storage from compute choice.
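The first two bullets combine into one scan pattern: project only the requested columns, and consult each row group's min/max statistics before reading any values, skipping groups that provably contain no matches. A stdlib-only sketch (the data and names are illustrative, not a real reader API):

```python
# Each row group carries per-column min/max statistics, mirroring the
# footer metadata a Parquet reader consults before issuing any I/O.
row_groups = [
    {"stats": {"amount": (5, 40)},  "cols": {"id": [1, 2, 3], "amount": [5, 40, 30]}},
    {"stats": {"amount": (50, 90)}, "cols": {"id": [4, 5, 6], "amount": [50, 90, 70]}},
    {"stats": {"amount": (10, 20)}, "cols": {"id": [7, 8, 9], "amount": [10, 15, 20]}},
]

def scan(row_groups, columns, col, lo, hi):
    """Return rows projected to `columns` where lo <= row[col] <= hi.

    Row groups whose [min, max] range for `col` cannot overlap [lo, hi]
    are skipped entirely: none of their values are ever read."""
    out, skipped = [], 0
    for rg in row_groups:
        mn, mx = rg["stats"][col]
        if mx < lo or mn > hi:        # statistics prove no match: skip group
            skipped += 1
            continue
        for i, v in enumerate(rg["cols"][col]):   # read only the needed columns
            if lo <= v <= hi:
                out.append({c: rg["cols"][c][i] for c in columns})
    return out, skipped

rows, skipped = scan(row_groups, columns=["id"], col="amount", lo=50, hi=100)
# Only the middle row group survives the min/max check; the other two are skipped.
```

The same min/max check generalises to any range or equality predicate, which is why writers that sort or cluster data by a filter column get much better skipping.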
The gap open file formats don't close¶
File formats are object-level. They know nothing about:
- How many files together constitute a logical table
- Atomic multi-file commits (writing N files and having them become visible simultaneously)
- Schema evolution (renaming a column across all files)
- Time travel (reading the table as of a past point)
- Partitioning as a first-class table property
- ACID row-level updates / deletes
These are table-level concerns, which is why open table formats (concepts/open-table-format — canonically Iceberg / Delta Lake / Hudi) emerged as a metadata layer on top:
"storing only data in open file formats wouldn't be sufficient. We need a metadata layer on top of it to provide transactional guarantees, schema evolution, and many more. This is where table formats come into the scene." (Source: sources/2025-01-21-redpanda-implementing-the-medallion-architecture-with-redpanda)
The resulting two-layer substrate:
┌─────────────────────────────────────────┐
│ Open Table Format (Iceberg / Delta) │ ← metadata, ACID, schema evolution
├─────────────────────────────────────────┤
│ Open File Format (Parquet / ORC) │ ← columnar layout
├─────────────────────────────────────────┤
│ Object Storage (S3 / GCS / ADLS) │ ← durable, scalable bytes
└─────────────────────────────────────────┘
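The core trick the table-format layer adds — many files becoming visible as one atomic snapshot — can be sketched with a manifest plus a pointer swap. This is a deliberately minimal model with hypothetical file names (`manifest-new.json`, `current.json`); real Iceberg/Delta metadata is far richer (schemas, partitions, snapshot history):

```python
import json
import os
import tempfile

def commit(table_dir: str, data_files: list[str]) -> None:
    """Write a manifest listing `data_files`, then atomically swap the
    current-snapshot pointer so all the files become visible at once."""
    manifest = os.path.join(table_dir, "manifest-new.json")
    with open(manifest, "w") as f:
        json.dump({"files": data_files}, f)
    # os.replace is atomic on POSIX: a reader sees the old snapshot or
    # the new one, never a half-written mix of files.
    os.replace(manifest, os.path.join(table_dir, "current.json"))

def current_files(table_dir: str) -> list[str]:
    """Readers resolve the table through the pointer, never by listing
    the directory -- so in-flight writes are invisible to them."""
    with open(os.path.join(table_dir, "current.json")) as f:
        return json.load(f)["files"]

table_dir = tempfile.mkdtemp()
commit(table_dir, ["part-0.parquet"])
commit(table_dir, ["part-0.parquet", "part-1.parquet"])  # both appear together
```

Contrast this with reading a bare directory of Parquet files, where a listing taken mid-write can observe some new files and not others — exactly the gap the list above describes.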
Seen in¶
- sources/2025-01-21-redpanda-implementing-the-medallion-architecture-with-redpanda — canonicalises the open-file-format / open-table-format distinction as the two-layer storage substrate beneath the Medallion Architecture. Names Parquet and ORC as the file-format pair; Iceberg as the table-format layer on top.
Related¶
- concepts/open-table-format — the metadata layer above.
- concepts/data-lakehouse — the architectural class that relies on both.
- concepts/medallion-architecture — the pattern that organises data across layers, all sitting on Parquet / ORC storage.
- systems/apache-parquet — canonical Parquet entry.
- systems/apache-iceberg — the table format most-paired with Parquet in practice.