
DeltaCAT

DeltaCAT is an open-source Ray project (ray-project/deltacat) that provides Ray-native data-catalog and compaction tooling for open table formats on object storage. It initially supports Amazon's internal catalog, with the stated goal of generalising to systems/apache-iceberg, systems/apache-hudi, and systems/delta-lake.

Amazon Retail's Business Data Technologies (BDT) team contributed their Ray-based compactor — internally called "The Flash Compactor" — and its design document as a first step toward letting other users realise similar benefits when using Ray on EC2 to manage open catalogs.

What DeltaCAT provides

Production scale (Amazon BDT, Q1 2024)

Running DeltaCAT-lineage code at production scale within Amazon Retail:

  • 1.5 EiB input Parquet compacted from S3, corresponding to 4 EiB of in-memory Apache Arrow.
  • >10,000 vCPU-years of EC2 compute consumed in a single quarter.
  • Single-job clusters up to 26,846 vCPUs / 210 TiB RAM.
  • >20 PiB/day input across >1,600 Ray jobs/day.
  • Average job reads >10 TiB of input and completes in <7 min, including cluster setup/teardown.
  • 82% improvement in cost per GiB of input vs the prior Spark compactor.
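A quick back-of-the-envelope check ties these figures together (pure arithmetic on the numbers reported above; no DeltaCAT APIs involved):

```python
# Sanity-check the reported production figures (all inputs from the article).
PIB_PER_DAY = 20      # >20 PiB of input per day
JOBS_PER_DAY = 1600   # across >1,600 Ray jobs per day
TIB_PER_PIB = 1024

# Average input per job, in TiB.
avg_tib_per_job = PIB_PER_DAY * TIB_PER_PIB / JOBS_PER_DAY
# ~12.8 TiB, consistent with the ">10 TiB average input per job" claim.

# In-memory expansion: 4 EiB of Arrow from 1.5 EiB of S3 Parquet.
arrow_expansion = 4.0 / 1.5
# ~2.67x growth from compressed/encoded Parquet to decoded Arrow.

print(round(avg_tib_per_job, 1), round(arrow_expansion, 2))
```

The ~2.67x Parquet-to-Arrow expansion is why the cluster RAM figures above are so much larger than the on-disk input sizes.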

(Source: sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2)

Ecosystem improvements

The BDT team works jointly with the systems/daft project on Ray data-I/O improvements. One reported outcome: moving S3 Parquet and delimited-text I/O to Daft on Ray improved production cost efficiency by 24% versus Ray without Daft. In read-level benchmarks, median single-column read latency was 55% lower than PyArrow's and 91% lower than S3Fs', and median full-file read latency was 19% lower than PyArrow's and 77% lower than S3Fs'.
