SYSTEM Cited by 1 source
Daft¶
Daft (github.com/Eventual-Inc/Daft) is a Python + Rust distributed DataFrame library optimised for multimodal and columnar data on cloud object storage. It ships its own Rust-based Parquet + I/O stack and integrates with systems/ray as a distributed runtime. Its S3 Parquet + delimited-text reader is a notable performance win over PyArrow and s3fs for this class of workload.
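A minimal sketch of the reader in use, assuming the `getdaft` package (import name `daft`) is installed; the bucket path and column name below are illustrative placeholders, not from the source:

```python
# Illustrative S3 glob; replace with a real bucket/prefix.
PATH = "s3://example-bucket/logs/*.parquet"

try:
    import daft

    # Daft's native reader prunes at the I/O layer: selecting a single
    # column fetches only that column's pages from S3, which is the
    # "single-column read" case in the benchmark table below.
    df = daft.read_parquet(PATH).select("request_id")
    print(df.collect())
except ImportError:
    print("daft not installed; see github.com/Eventual-Inc/Daft")
```

The column-pruned `select` is the access pattern where the benchmarked gap over PyArrow and s3fs is widest.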
Benchmarked performance (Amazon BDT, 2024)¶
Joint work with Amazon's BDT team produced a 24% production cost-efficiency improvement in Ray-based compaction when Daft replaced the prior I/O stack. Microbenchmarks from the same post (negative values are reductions in median read time):
| Operation | Read time, Daft vs PyArrow | Read time, Daft vs s3fs |
|---|---|---|
| Median single-column read | −55% | −91% |
| Median full-file read | −19% | −77% |
(Source: sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2)
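The deltas in the table are percent reductions in read time, so they translate into multiplicative speedups as `1 / (1 − X/100)`. A small stdlib-only check of that arithmetic, using the table's own numbers:

```python
def speedup(pct_reduction: float) -> float:
    """Convert a percent reduction in read time into a speedup multiple."""
    return 1.0 / (1.0 - pct_reduction / 100.0)

# Median single-column read: -55% vs PyArrow, -91% vs s3fs
print(f"{speedup(55):.1f}x vs PyArrow")   # 2.2x
print(f"{speedup(91):.1f}x vs s3fs")      # 11.1x
# Median full-file read: -19% vs PyArrow, -77% vs s3fs
print(f"{speedup(19):.2f}x vs PyArrow")   # 1.23x
print(f"{speedup(77):.1f}x vs s3fs")      # 4.3x
```

So the headline −91% figure is roughly an 11× speedup on single-column reads against s3fs.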
Seen in¶
- sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2 — Daft's optimised S3 Parquet + delimited-text reader drops into Amazon BDT's Ray compaction pipeline for an additional 24% production cost-efficiency gain on top of Ray alone.
Related¶
- systems/ray — distributed runtime.
- systems/apache-arrow, systems/apache-parquet — core formats Daft reads/writes.
- systems/aws-s3 — primary storage target.
- systems/deltacat — the compaction surface in which Daft's wins materialise.