
Open File Format

An open file format is a non-proprietary, language-agnostic representation for tabular data at the level of individual files (objects) on storage. The canonical pair in the 2020s:

  • Apache Parquet (2013) — the de facto standard columnar format on cloud object stores. Row-group + column-chunk layout, per-column compression/encoding, and per-chunk min/max statistics that enable row-group skipping.
  • ORC (Optimized Row Columnar) (2013) — Hortonworks-originated alternative with a similar columnar design: stripes instead of row groups, built-in indexes, and primitives supporting Hive's ACID transactions. Historically strong in the Hadoop/Hive ecosystem; Parquet has broader uptake outside Hadoop.

Both sit at the object-level layer in a lakehouse stack — below an open table format (Iceberg / Delta Lake / Hudi) and above raw object storage (S3 / GCS / ADLS).
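
The per-column compression mentioned above comes from value locality: a column chunk stores values of one type (often low-cardinality or sorted) contiguously, which general-purpose compressors exploit far better than interleaved rows. A stdlib-only sketch of the effect — this is an illustration of the principle, not Parquet's actual encoding, and the synthetic data is invented for the demo:

```python
import zlib

# Synthetic table: a low-cardinality 'color' column and a pseudo-random 'id' column.
rows = [("red" if i < 1500 else "blue", str(i * 7919 % 10000)) for i in range(2000)]

# Row-oriented serialisation: values from different columns interleaved per record.
row_major = "\n".join(f"{c}|{n}" for c, n in rows).encode()

# Column-oriented serialisation: each column's values stored contiguously,
# as in a Parquet column chunk.
col_major = ("\n".join(c for c, _ in rows) + "\x00" + "\n".join(n for _, n in rows)).encode()

row_size = len(zlib.compress(row_major, 9))
col_size = len(zlib.compress(col_major, 9))

print(f"row-major compressed:    {row_size} bytes")
print(f"column-major compressed: {col_size} bytes")
assert col_size < row_size  # the long run of identical 'color' values compresses to almost nothing
```

Real Parquet writers go further than this: dictionary and run-length encodings are applied per column before the general-purpose codec (Snappy, ZSTD, gzip) runs.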

What "open" means here

  • Specification is public — the binary layout is documented and multiple independent implementations exist (Java, C++, Rust, Python, Go).
  • No vendor lock-in at the storage layer — customers can read data written by one engine using a different engine without re-serialisation.
  • Language-agnostic ecosystem — readers/writers in every major language used for data work.

Why open file formats matter for lakehouse architectures

  1. Efficient columnar queries on object storage — analytical queries typically touch only a few of a table's columns; a columnar layout lets readers fetch just those columns (often a 10-100× I/O reduction versus a row-oriented format).
  2. Predicate pushdown + row-group skipping — per-column min/max statistics let readers prove a row group contains no matching rows and skip it without reading its data pages.
  3. Per-column compression — storing similarly typed values contiguously yields compression ratios 2-10× better than general-purpose compression of row-oriented data.
  4. Multi-engine compatibility — the same Parquet file is readable by Spark, Flink, Trino, Presto, ClickHouse, DuckDB, and a dozen others. Decouples storage from compute choice.
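
Points 1 and 2 can be sketched together with a toy in-memory model of a columnar file — invented names (`ColumnChunk`, `RowGroup`, `scan`), not any real reader's API, but the same skip-then-prune logic Parquet readers apply:

```python
from dataclasses import dataclass

@dataclass
class ColumnChunk:
    values: list

    @property
    def stats(self):
        # Per-column min/max, as Parquet stores per column chunk.
        return (min(self.values), max(self.values))

@dataclass
class RowGroup:
    columns: dict  # column name -> ColumnChunk

def scan(row_groups, wanted_cols, predicate_col, lo, hi):
    """Read only `wanted_cols`, skipping any row group whose min/max
    statistics prove `predicate_col` has no value in [lo, hi]."""
    out, groups_read = [], 0
    for rg in row_groups:
        mn, mx = rg.columns[predicate_col].stats
        if mx < lo or mn > hi:      # row-group skipping: never touch its pages
            continue
        groups_read += 1
        chunks = {c: rg.columns[c].values for c in wanted_cols}  # column pruning
        for i, v in enumerate(rg.columns[predicate_col].values):
            if lo <= v <= hi:
                out.append({c: chunks[c][i] for c in wanted_cols})
    return out, groups_read

# Three row groups holding sorted timestamps 0-99, 100-199, 200-299.
rgs = [
    RowGroup({"ts": ColumnChunk(list(range(b, b + 100))),
              "val": ColumnChunk([x * 2 for x in range(b, b + 100)])})
    for b in (0, 100, 200)
]
rows, read = scan(rgs, ["val"], "ts", 150, 160)
print(read, len(rows))  # 1 of 3 row groups read; 11 matching rows
```

Skipping works best when data is clustered on the predicate column (as here, with sorted timestamps); on randomly ordered data the min/max ranges overlap and few groups can be eliminated.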

The gap open file formats don't close

File formats are object-level. They know nothing about:

  • How many files together constitute a logical table
  • Atomic multi-file commits (writing N files and having them become visible simultaneously)
  • Schema evolution (renaming a column across all files)
  • Time travel (reading the table as of a past point)
  • Partitioning as a first-class table property
  • ACID row-level updates / deletes
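
The first two gaps — file membership and atomic multi-file commits — are what a table format's metadata layer closes. A minimal sketch of the core trick, assuming nothing about Iceberg's or Delta's actual metadata layout: a manifest file lists the table's data files, and a commit writes a new manifest then atomically swaps a pointer, so N new files become visible in one step (all function names here are invented for the illustration):

```python
import json, os, tempfile

def current_files(table_dir):
    """List the table's data files per the current manifest."""
    ptr = os.path.join(table_dir, "current")
    if not os.path.exists(ptr):
        return []
    with open(ptr) as f:
        manifest_name = f.read().strip()
    with open(os.path.join(table_dir, manifest_name)) as m:
        return json.load(m)["files"]

def commit(table_dir, new_files):
    """Add N data files to the table atomically: write a new manifest
    listing old + new files, then swap the 'current' pointer with one
    os.replace (atomic on POSIX). Readers see all N files or none."""
    files = current_files(table_dir) + list(new_files)
    manifest_name = f"manifest-{len(files)}.json"
    with open(os.path.join(table_dir, manifest_name), "w") as m:
        json.dump({"files": files}, m)
    tmp = os.path.join(table_dir, "current.tmp")
    with open(tmp, "w") as f:
        f.write(manifest_name)
    os.replace(tmp, os.path.join(table_dir, "current"))  # the atomic step

d = tempfile.mkdtemp()
commit(d, ["part-0.parquet", "part-1.parquet"])
commit(d, ["part-2.parquet"])
print(current_files(d))
```

Because old manifests are never mutated, keeping them around also yields time travel for free: reading yesterday's manifest reads yesterday's table. Real table formats add concurrency control on the pointer swap, schema metadata, and partition/statistics indexes on top of this skeleton.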

These are table-level concerns, which is why open table formats (concepts/open-table-format, canonicalised as Iceberg / Delta Lake / Hudi) emerged as a metadata layer on top:

"storing only data in open file formats wouldn't be sufficient. We need a metadata layer on top of it to provide transactional guarantees, schema evolution, and many more. This is where table formats come into the scene." (Source: sources/2025-01-21-redpanda-implementing-the-medallion-architecture-with-redpanda)

The resulting two-layer substrate:

┌─────────────────────────────────────────┐
│  Open Table Format (Iceberg / Delta)    │  ← metadata, ACID, schema evolution
├─────────────────────────────────────────┤
│  Open File Format (Parquet / ORC)       │  ← columnar layout
├─────────────────────────────────────────┤
│  Object Storage (S3 / GCS / ADLS)       │  ← durable, scalable bytes
└─────────────────────────────────────────┘
