
Apache Parquet

Apache Parquet (2013) is a columnar on-disk file format for tabular data. It became the de-facto object-level format for tables on cloud object stores, enabling the "data lake over S3" pattern at scale — and the basis on which richer table formats like systems/apache-iceberg are built.

Why it won

  • Columnar layout — reads only the columns needed by a query, cutting I/O dramatically for analytical workloads.
  • Per-column compression and encoding (dictionary, RLE, delta), exploiting columnar value locality.
  • Statistics per row-group — min/max/null-count let readers skip entire row groups that can't match a predicate.
  • Language-agnostic — Java, C++, Python, Rust, Go all have mature readers/writers; no proprietary lock-in.
  • Good fit for immutable object storage — one Parquet file per object, append-oriented write pattern, no in-place updates required.
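The dictionary and run-length encodings mentioned above can be illustrated with a toy sketch in pure Python. This is the general idea only, not Parquet's actual wire format: dictionary encoding maps repeated column values to small integer codes, and run-length encoding collapses runs of identical codes.

```python
# Toy sketch of dictionary + run-length encoding, in the spirit of
# Parquet's column encodings. NOT the actual Parquet wire format.

def dictionary_encode(values):
    """Map each distinct value to a small integer code."""
    dictionary, codes, index = [], [], {}
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes

def run_length_encode(codes):
    """Collapse runs of identical codes into (code, run_length) pairs."""
    runs = []
    for c in codes:
        if runs and runs[-1][0] == c:
            runs[-1][1] += 1
        else:
            runs.append([c, 1])
    return [tuple(r) for r in runs]

column = ["de", "de", "de", "us", "us", "de", "fr", "fr"]
dictionary, codes = dictionary_encode(column)
print(dictionary)                # ['de', 'us', 'fr']
print(run_length_encode(codes))  # [(0, 3), (1, 2), (0, 1), (2, 2)]
```

Low-cardinality columns with long runs of repeated values (country codes, enum fields) compress especially well under this combination, which is why columnar value locality matters.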

(Source: sources/2025-03-14-allthingsdistributed-s3-simplicity-is-table-stakes)
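The per-row-group statistics bullet above amounts to predicate pushdown: a reader consults min/max values in the footer and skips any row group that cannot contain a match. A minimal sketch, with an invented in-memory structure standing in for Parquet footer metadata:

```python
# Toy sketch of row-group pruning via min/max statistics, in the spirit
# of Parquet footer metadata. The data structures are invented for
# illustration; real readers get these stats from the file footer.

row_groups = [
    {"rows": [3, 7, 9],    "min": 3,  "max": 9},
    {"rows": [12, 15, 18], "min": 12, "max": 18},
    {"rows": [21, 25, 30], "min": 21, "max": 30},
]

def scan_greater_than(row_groups, threshold):
    """Evaluate `value > threshold`, skipping unmatchable row groups."""
    hits = []
    for rg in row_groups:
        if rg["max"] <= threshold:
            continue  # whole row group pruned without reading its rows
        hits.extend(v for v in rg["rows"] if v > threshold)
    return hits

print(scan_greater_than(row_groups, 16))  # [18, 21, 25, 30]
```

The payoff on object storage is that a pruned row group is never fetched at all, so selective queries touch a fraction of the bytes in the object.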

Scale (per the S3-at-19 post, 2025)

"S3 stores exabytes of parquet data and serves hundreds of petabytes of Parquet data every day."

This is the rare case of an open format becoming de-facto infrastructure. It's why Iceberg and Delta Lake both adopted Parquet as their data file layer — piggybacking on a decade-plus of reader/writer maturity and installed base.

Where Parquet stops and table formats begin

Parquet answers "how do I store a row-group of rows efficiently in one object?" It does not answer:

  • How do I mutate individual rows without rewriting the object?
  • How do I evolve the schema across many objects?
  • How do I version the logical table?
  • How do I atomically commit a set of objects as "the new table state"?

These are the questions an open table format like systems/apache-iceberg layers on top — typically by writing a metadata / snapshot layer that points at Parquet data files.
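That metadata/snapshot layering can be sketched in miniature. The sketch below is a toy under invented names (`snap-N.json`, `version-hint.txt`); real table formats like Iceberg store far richer metadata, but the core move is the same: immutable data files, immutable snapshot files listing them, and a commit that atomically swaps a single pointer.

```python
# Toy sketch of a snapshot layer over immutable data files, in the spirit
# of open table formats. File names and layout are invented for
# illustration; real Iceberg metadata is much richer.
import json, os, tempfile

TABLE_DIR = tempfile.mkdtemp()

def write_snapshot(snapshot_id, data_files):
    """Write an immutable snapshot listing the table's data files."""
    path = os.path.join(TABLE_DIR, f"snap-{snapshot_id}.json")
    with open(path, "w") as f:
        json.dump({"snapshot_id": snapshot_id, "data_files": data_files}, f)
    return path

def commit(snapshot_path):
    """Atomically point the table at a new snapshot (os.replace is atomic)."""
    tmp = os.path.join(TABLE_DIR, "version-hint.tmp")
    with open(tmp, "w") as f:
        f.write(snapshot_path)
    os.replace(tmp, os.path.join(TABLE_DIR, "version-hint.txt"))

def current_data_files():
    """Resolve the pointer to the current snapshot's data file list."""
    with open(os.path.join(TABLE_DIR, "version-hint.txt")) as f:
        snapshot_path = f.read()
    with open(snapshot_path) as f:
        return json.load(f)["data_files"]

commit(write_snapshot(1, ["part-000.parquet"]))
# A "row update" rewrites an immutable object and commits a new snapshot:
commit(write_snapshot(2, ["part-000-rewritten.parquet", "part-001.parquet"]))
print(current_data_files())  # ['part-000-rewritten.parquet', 'part-001.parquet']
```

Note how each of the four questions above is answered by the metadata layer, not by Parquet: mutation becomes rewrite-plus-new-snapshot, versioning is the retained snapshot files, and atomicity is the single pointer swap.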
