SYSTEM Cited by 7 sources

Apache Parquet

Apache Parquet (2013) is a columnar on-disk file format for tabular data. It became the de-facto object-level format for tables on cloud object stores, enabling the "data lake over S3" pattern at scale, and it is the basis on which richer table formats like systems/apache-iceberg are built.

Why it won

  • Columnar layout — reads only the columns needed by a query, cutting I/O dramatically for analytical workloads.
  • Per-column compression and encoding (dictionary, RLE, delta), exploiting columnar value locality.
  • Statistics per row-group — min/max/null-count let readers skip entire row groups that can't match a predicate.
  • Language-agnostic — Java, C++, Python, Rust, Go all have mature readers/writers; no proprietary lock-in.
  • Good fit for immutable object storage — one Parquet file per object, append-oriented write pattern, no in-place updates required.

(Source: sources/2025-03-14-allthingsdistributed-s3-simplicity-is-table-stakes)
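The column-projection and row-group-statistics points can be sketched with a toy in-memory model. This is not Parquet's actual on-disk layout or any library's API: `ColumnStats`, `RowGroup`, `make_row_group`, and `scan` are hypothetical names invented for illustration, and the sketch assumes the data is sorted on the filter column so that min/max ranges are tight.

```python
from dataclasses import dataclass


@dataclass
class ColumnStats:
    """Per-column-chunk statistics (the min/max/null-count idea)."""
    min_value: object
    max_value: object
    null_count: int


@dataclass
class RowGroup:
    columns: dict  # column name -> list of values (one column chunk)
    stats: dict    # column name -> ColumnStats


def make_row_group(rows, names):
    """Pivot row-oriented tuples into per-column lists and compute stats."""
    columns = {name: [row[i] for row in rows] for i, name in enumerate(names)}
    stats = {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        stats[name] = ColumnStats(
            min_value=min(present) if present else None,
            max_value=max(present) if present else None,
            null_count=len(values) - len(present),
        )
    return RowGroup(columns, stats)


def scan(row_groups, select, column, lo, hi):
    """Return (selected values, row groups skipped) for lo <= column <= hi."""
    out, skipped = [], 0
    for rg in row_groups:
        s = rg.stats[column]
        # Statistics-based skip: no value in this group can fall in [lo, hi].
        if s.min_value is None or s.max_value < lo or s.min_value > hi:
            skipped += 1
            continue
        # Projection: touch only the predicate column and the selected column.
        for key, value in zip(rg.columns[column], rg.columns[select]):
            if key is not None and lo <= key <= hi:
                out.append(value)
    return out, skipped


names = ("ts", "status", "bytes")
rows = [(t, 200 if t % 3 else 500, t * 10) for t in range(1, 13)]
# Four row groups of three rows each, in ts order.
groups = [make_row_group(rows[i:i + 3], names) for i in range(0, 12, 3)]

values, skipped = scan(groups, select="bytes", column="ts", lo=5, hi=6)
print(values, skipped)  # [50, 60] 3
```

Only one of the four row groups is read, and within it the `status` column is never touched. If the rows were not clustered on `ts`, the min/max ranges would overlap the predicate more often and fewer groups could be skipped.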

Scale (per the S3-at-19 post, 2025)

"S3 stores exabytes of parquet data and serves hundreds of petabytes of Parquet data every day."

This is a rare case of an open format becoming de-facto infrastructure. It is why Iceberg and Delta Lake both adopted Parquet as their data file layer, piggybacking on a decade-plus of reader/writer maturity and installed base.

Where Parquet stops and table formats begin

Parquet answers "how do I store a set of rows efficiently in one object?" It does not answer:

  • How do I mutate individual rows without rewriting the object?
  • How do I evolve the schema across many objects?
  • How do I version the logical table?
  • How do I atomically commit a set of objects as "the new table state"?

These are the questions an open table format like systems/apache-iceberg layers on top — typically by writing a metadata / snapshot layer that points at Parquet data files.
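The metadata / snapshot idea can be sketched as a pointer file that names the current list of data files. This is a minimal illustration, not Iceberg's actual metadata layout: `commit`, `read_snapshot`, `current.json`, and the `snap-N.json` naming are all hypothetical, and the atomic swap relies on `os.replace` on a local filesystem. An object store has no such rename, so a real table format needs a catalog or conditional write for that step.

```python
import json
import os
import tempfile


def commit(table_dir, data_files, schema):
    """Publish a new snapshot listing the data files that make up the table."""
    pointer = os.path.join(table_dir, "current.json")
    version = 0
    if os.path.exists(pointer):
        with open(pointer) as f:
            version = json.load(f)["version"]
    snapshot = {
        "version": version + 1,
        "schema": list(schema),
        "files": sorted(data_files),
    }
    # Immutable per-version snapshot file: older table states stay readable.
    with open(os.path.join(table_dir, f"snap-{version + 1}.json"), "w") as f:
        json.dump(snapshot, f)
    # Swap the pointer with a single rename, so a reader sees either the
    # old table state or the new one, never a mix of the two.
    fd, tmp = tempfile.mkstemp(dir=table_dir)
    with os.fdopen(fd, "w") as f:
        json.dump({"version": version + 1}, f)
    os.replace(tmp, pointer)
    return snapshot


def read_snapshot(table_dir, version=None):
    """Load the current snapshot, or an older one by version number."""
    if version is None:
        with open(os.path.join(table_dir, "current.json")) as f:
            version = json.load(f)["version"]
    with open(os.path.join(table_dir, f"snap-{version}.json")) as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as table_dir:
    commit(table_dir, ["part-0.parquet"], ["ts", "status"])
    # A "row update" becomes: write new data files, then commit a new file list.
    commit(table_dir, ["part-1.parquet", "part-2.parquet"], ["ts", "status", "bytes"])
    latest = read_snapshot(table_dir)
    first = read_snapshot(table_dir, version=1)
```

The data files themselves are never modified. Mutation, schema evolution, versioning, and atomic commit all reduce to writing new immutable files and publishing a new snapshot that points at them.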

Seen in

  • sources/2025-03-14-allthingsdistributed-s3-simplicity-is-table-stakes — Parquet framed as the on-object data layer under Iceberg; cited at exabyte-stored / hundreds-of-petabytes-served-per-day scale on S3.
  • sources/2025-01-29-datadog-husky-efficient-compaction-at-datadog-scale — Datadog's Husky uses a Parquet-like custom columnar format ("similar to Parquet with one row group and many pages, but specially designed for observability data"). Notable deltas vs. stock Parquet: inline column headers for streaming-discovery during compaction (vs. Parquet's footer-at-end), adaptive row-group size sized against the heaviest input column (logs message up to 75 KiB/event), and per-column fragment-metadata that goes beyond min/max to a trimmed-FSA-regex (patterns/trimmed-automaton-predicate-filter).
  • sources/2026-04-07-allthingsdistributed-s3-files-and-the-changing-face-of-s3 — Warfield cites Parquet's scale on S3 as the structural-data context for the 2024-2026 multi-primitive expansion: S3 "stores exabytes of parquet data and averages over 25 million requests per second to that format alone." The magnitude of that installed base is the reason Iceberg-over-Parquet became a de-facto table layer and why S3 Tables absorbed the managed-Iceberg role.
  • sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2 — Amazon Retail BDT's Ray compactor reads Parquet from S3 and materialises to systems/apache-arrow in-memory. Q1 2024: 1.5 EiB of Parquet input decoded into ~4 EiB of in-memory Arrow during compaction. Joint optimisation with systems/daft on Parquet I/O yielded +24% production cost-efficiency; median single-column Parquet read was −55% vs PyArrow and −91% vs S3Fs. One of the largest public Parquet-at-scale numbers outside S3's own fleet-wide exabyte / 25M-rps statistic.
  • sources/2025-08-27-instacart-simplifying-large-scale-llm-processing-with-maple: Parquet as intermediate storage in a batch-LLM pipeline. Instacart's Maple splits large CSV inputs into Parquet per-batch files on S3, encodes each batch into the LLM provider's format, and stores per-batch results back as Parquet before the final merge. The post cites Parquet specifically for up to 25× size reduction vs CSV and for non-linear (random-access) reads into the file. The design shape (CSV-at-the-boundary / Parquet-internal / output-format-mirrors-input) is canonicalised as patterns/csv-in-parquet-intermediate-output-merge. Different use-case from data-lake / analytics Parquet (this is transient intermediate storage for a multi-step batch pipeline, not an append-only analytic table); same format wins for the same compression + columnar-random-access reasons.
  • sources/2025-09-26-yelp-s3-server-access-logs-at-scale: Parquet as the compaction target for raw-text access-log volumes at fleet scale. Yelp converts TiBs/day of raw-text S3 Server Access Logs into Parquet via daily Athena INSERT batches, reporting 85% storage reduction and 99.99% object-count reduction, two headline datapoints for the raw-to-columnar log compaction pattern. Canonicalises Parquet's row-group metadata pruning as the load-bearing query-engine benefit over raw text: "It includes metadata that allows skipping row groups or pages based on filter criteria which reduces data scanned." Distinct from the data-lake / analytics use case and the Maple transient-intermediate use case: Parquet here is the permanent-warm-tier compacted form of a best-effort-delivered log stream, queried via Athena for debugging / cost attribution / incident response over a retention window longer than the measured SAL straggler tail. Also cited as the substrate that makes Athena's post-query count verification fast (via GetQueryRuntimeStatistics.Rows and "count query on compacted tables, which is fast due to parquet format").
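Two entries above lean on Parquet's footer-at-end layout: the Husky comparison, and the random-access reads cited for Maple. The sketch below is a toy file with the same overall shape: column chunks first, then a footer that indexes them, then a 4-byte little-endian footer length and trailing magic bytes. Real Parquet encodes its footer with Thrift and uses the magic `PAR1`. The JSON encoding, the `TOY1` magic, and the names `write_file` and `read_column` are assumptions made for illustration only.

```python
import io
import json
import struct

MAGIC = b"TOY1"  # hypothetical magic bytes for this toy format


def write_file(columns):
    """Serialise column chunks back to back, then a footer that indexes them."""
    buf = io.BytesIO()
    buf.write(MAGIC)
    index = {}
    for name, values in columns.items():
        chunk = json.dumps(values).encode()
        index[name] = (buf.tell(), len(chunk))  # byte offset and length
        buf.write(chunk)
    footer = json.dumps(index).encode()
    buf.write(footer)
    buf.write(struct.pack("<I", len(footer)))  # footer length, little-endian
    buf.write(MAGIC)
    return buf.getvalue()


def read_column(data, name):
    """Read one column: tail first, then footer, then a single ranged read."""
    f = io.BytesIO(data)
    f.seek(-8, io.SEEK_END)
    (footer_len,) = struct.unpack("<I", f.read(4))
    if f.read(4) != MAGIC:
        raise ValueError("not a toy columnar file")
    f.seek(-8 - footer_len, io.SEEK_END)
    index = json.loads(f.read(footer_len))
    offset, length = index[name]
    f.seek(offset)  # jump straight to the requested chunk
    return json.loads(f.read(length))


data = write_file({"ts": [1, 2, 3], "message": ["a", "b", "c"]})
message = read_column(data, "message")
```

A reader needs the end of the file before it can locate anything, which suits ranged reads against a finished object but not discovering columns while streaming. That trade-off is what Husky's inline column headers change, and this sketch does not model them.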