SYSTEM Cited by 1 source
Apache Arrow¶
Apache Arrow is a language-agnostic in-memory columnar format for tabular data. Where systems/apache-parquet is the on-disk/object-storage columnar format, Arrow is the in-process exchange format that processes, languages, and engines use to avoid re-serialising data when handing it to one another. It underpins the PyArrow / Arrow Flight / Arrow DataFusion / Polars / Ray data-plane ecosystem and is the zero-copy interchange format inside most modern columnar data-processing stacks.
Why it matters¶
- Zero-copy intranode sharing — processes on the same host (or across language runtimes like Python ↔ Rust ↔ C++) can read the same Arrow buffer without copying. This is what makes Python-on-C++ pipelines (PyArrow over Arrow-C++ readers) competitive with native code.
- Vectorised execution — fixed-width Arrow arrays make SIMD / branch-predictable column processing natural.
- Read Parquet once, share many times — Parquet → Arrow decode happens at read time; downstream steps re-use the Arrow buffers without further decode.
Scale evidence¶
Amazon BDT's Q1 2024 Ray compaction: 1.5 EiB of Parquet on S3 decoded into ~4 EiB of in-memory Apache Arrow during processing — a conservative lower-bound for real-world Arrow throughput at a single organisation. (Source: sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2)
Seen in¶
- sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2 — Arrow as the in-memory representation during Ray-based compaction on Amazon BDT; 4 EiB / quarter in-memory volume. Ray's zero-copy intranode object sharing and locality-aware scheduler both operate over Arrow buffers.
Related¶
- systems/apache-parquet — the on-object format Arrow encodes / decodes against.
- systems/ray — Ray's object store + zero-copy sharing targets Arrow buffers.
- systems/deltacat — Ray DeltaCAT materialises Parquet into Arrow during merge.
- concepts/columnar-storage-format — the broader category (disk + memory).
- concepts/zero-copy-sharing — the in-memory exchange mechanism.