
Apache Arrow

Apache Arrow is a language-agnostic in-memory columnar format for tabular data. Where systems/apache-parquet is the on-disk/object-storage columnar format, Arrow is the in-process exchange format that processes, language runtimes, and engines use to avoid re-serialising data as they hand it to one another. It underpins the PyArrow / Arrow Flight / Arrow DataFusion / Polars / Ray data-plane ecosystem and serves as the zero-copy interchange format inside most modern columnar data-processing stacks.

Why it matters

  • Zero-copy intra-node sharing — processes on the same host (or different language runtimes, e.g. Python ↔ Rust ↔ C++) can read the same Arrow buffers without copying. This is what makes Python-on-C++ pipelines (PyArrow over Arrow-C++ readers) competitive with native code.
  • Vectorised execution — fixed-width Arrow arrays make SIMD / branch-predictable column processing natural.
  • Read Parquet once, share many times — Parquet → Arrow decode happens at read time; downstream steps re-use the Arrow buffers without further decode.

Scale evidence

Amazon BDT's Q1 2024 Ray compaction: 1.5 EiB of Parquet on S3 decoded into ~4 EiB of in-memory Apache Arrow during processing — a conservative lower bound for real-world Arrow throughput at a single organisation. (Source: sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2)
