
Apache Arrow

Apache Arrow is a language-agnostic in-memory columnar format for tabular data. Where systems/apache-parquet is the on-disk/object-storage columnar format, Arrow is the in-process exchange format that processes, language runtimes, and engines use to avoid re-serialising data as they hand it to one another. It underpins the PyArrow / Arrow Flight / Arrow DataFusion / Polars / Ray data-plane ecosystem and serves as the zero-copy interchange format inside most modern columnar data-processing stacks.

Why it matters

  • Zero-copy intra-node sharing — processes on the same host (or different language runtimes, e.g. Python ↔ Rust ↔ C++) can read the same Arrow buffers without copying. This is what makes Python-on-C++ pipelines (PyArrow over Arrow-C++ readers) competitive with native code.
  • Vectorised execution — fixed-width Arrow arrays make SIMD / branch-predictable column processing natural.
  • Read Parquet once, share many times — Parquet → Arrow decode happens at read time; downstream steps re-use the Arrow buffers without further decode.

Scale evidence

Amazon BDT's Q1 2024 Ray compaction: 1.5 EiB of Parquet on S3 decoded into ~4 EiB of in-memory Apache Arrow during processing — a conservative lower bound for real-world Arrow throughput at a single organisation. (Source: sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2)
