Ray¶
Ray is an open-source distributed compute framework out of UC Berkeley's RISELab (arXiv:1712.05889). It exposes a low-level tasks + actors programming model on top of a distributed object store with zero-copy intranode sharing and an autoscaling + locality-aware scheduler. Originally pitched as a framework for scaling ML workloads, it has since been shown — decisively at exabyte scale — to also be a world-class specialist data-processing substrate.
Programming model¶
Ray's primitives are deliberately lower-level than Spark's dataflow abstractions (concepts/task-and-actor-model):
- Tasks — stateless remote function invocations.
- Actors — stateful remote classes.
- Object store — a horizontally scalable store of immutable serialized objects, keyed by object-ref. One per node, pooled across the cluster.
- Zero-copy intranode sharing — references to objects in the local node's store are handed to co-located tasks/actors without re-serialization.
- Locality-aware scheduler — prefers to place a task on the node that already holds its input references, minimising cross-node shuffle (concepts/locality-aware-scheduling).
- Autoscaling clusters — fine-grained add/remove of workers matching load; supports heterogeneous instance types.
- Memory-aware scheduling — tasks can declare expected memory usage at invocation time, letting the scheduler pack by the real bottleneck resource rather than CPU alone (concepts/memory-aware-scheduling).
Positioning vs. Apache Spark¶
- Spark offers rich, safe, abstract dataflow operators; great for generalist data processing; tuned primarily for RAM+disk clusters.
- Ray offers a lower-level escape hatch: you hand-craft the distributed program. Reward: for specialist workloads you can beat Spark by large factors on both throughput and cost. Cost: no free optimiser, no SQL, application complexity is yours.
In Amazon BDT's case the ability to copy untouched S3 files by reference during a distributed merge (patterns/reference-based-copy-optimization) and to locality-schedule shuffles with zero-copy intranode sharing produced an 82% better cost efficiency per GiB of S3 input vs the prior Spark compactor (sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2).
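The reference-based copy idea can be sketched in plain Python (the names `FileRef` and `plan_merge` are hypothetical illustrations, not DeltaCAT's API): a file whose key range is untouched by the incoming delta is carried forward as a metadata reference, so its S3 bytes are never read or rewritten.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FileRef:
    path: str     # e.g. an S3 key
    min_key: int  # min/max primary key in the file, from footer stats
    max_key: int

def plan_merge(existing, delta_keys):
    """Split existing files into by-reference copies vs. real rewrites.

    A file is rewritten only if some incoming key falls inside its
    [min_key, max_key] range; everything else is copied by reference,
    i.e. only its metadata moves into the new table version."""
    copy_by_ref, rewrite = [], []
    for f in existing:
        if any(f.min_key <= k <= f.max_key for k in delta_keys):
            rewrite.append(f)
        else:
            copy_by_ref.append(f)
    return copy_by_ref, rewrite

files = [FileRef("s3://t/a.parquet", 0, 99),
         FileRef("s3://t/b.parquet", 100, 199),
         FileRef("s3://t/c.parquet", 200, 299)]
kept, rewritten = plan_merge(files, delta_keys={150, 160})
print([f.path for f in kept])       # a.parquet, c.parquet: move by reference
print([f.path for f in rewritten])  # b.parquet: the only file read + rewritten
```

The win compounds with locality: the few files that do need rewriting can be scheduled onto nodes already holding the relevant delta objects.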
Production operational realities (from the BDT migration)¶
- Out-of-memory errors were the most common failure mode at exabyte scale; resolved by memory-aware scheduling using past workload memory profiles to size tasks.
- EC2 instance-management pain — Ray clusters wanted more EC2 instances than a single instance type could reliably supply; resolved by heterogeneous cluster provisioning (patterns/heterogeneous-cluster-provisioning): pick a set of instance types that can meet the resource shape, then provision whichever are most readily available across AZs. Trade-off: applications must not assume a fixed CPU arch / disk type.
- Reliability gap vs Spark is real: BDT reports 2023 Ray first-time success rate up to 15% behind Spark; 2024 average 99.15% vs Spark's 99.91% (0.76 pp). Closing that gap took years.
- Memory utilisation caps the efficiency win — BDT averaged 54.6% cluster-memory utilisation in Q1 2024; raising that to ~90% would widen Ray's cost-efficiency advantage over Spark from 82% to more than 90%.
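The profile-driven sizing in the first bullet can be sketched as follows (the percentile, headroom factor, and function name are illustrative assumptions, not BDT's published method): request roughly the high-percentile historical peak plus headroom, then pass that to the scheduler as the task's memory demand.

```python
import math

def memory_request(past_peaks_mib, percentile=0.99, headroom=1.2):
    """Estimate a task's memory request from observed peak usage.

    past_peaks_mib: peak memory (MiB) from prior runs of this workload.
    Returns a request sized at the given percentile of history plus
    headroom, suitable for e.g. task.options(memory=...) in Ray."""
    peaks = sorted(past_peaks_mib)
    idx = min(len(peaks) - 1, math.ceil(percentile * len(peaks)) - 1)
    return int(peaks[idx] * headroom)

# Ten prior runs mostly near 1 GiB, with one 4 GiB outlier; at p99 the
# outlier dominates, so the request protects against the worst case.
history = [950, 980, 1000, 1010, 990, 1020, 970, 1005, 995, 4096]
print(memory_request(history))  # 4915 (MiB)
```

Oversized requests waste cluster memory (the utilisation problem in the last bullet), while undersized ones reproduce the OOM failures, so the percentile/headroom choice is exactly the knob that trades the two off.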
Ray's reach¶
- ML training/serving/hyperparameter-search was its original positioning.
- Data processing at scale: Exoshuffle (arXiv:2203.05072) holds the 2022 CloudSort record (lowest cost to sort 100 TB). Amazon BDT runs exabyte-scale production compaction on it.
- Open-source data-catalog tooling: systems/deltacat (Ray project) targets managed open table formats — systems/apache-iceberg, systems/apache-hudi, systems/delta-lake.
- Serverless Ray: systems/anyscale-platform (commercial Ray company) and systems/aws-glue-for-ray (AWS managed) mean users no longer need to build their own serverless-Ray job management.
Seen in¶
- sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2 — Ray as the successor to Spark for Amazon Retail BDT's exabyte-scale compactor. Q1 2024: 1.5 EiB input Parquet, 4 EiB in-memory Arrow, >10,000 vCPU-years, 26,846-vCPU / 210-TiB RAM peak cluster; 82% better cost efficiency per GiB vs Spark; 100% on-time delivery across Q4 2023 + Q1 2024; 99.15% first-time success (0.76 pp behind Spark).
- Ray Summit 2021 talk — petabyte-scale datalake slides.
Related¶
- systems/apache-spark — the generalist Ray is displacing in specialist Amazon-internal workloads.
- systems/deltacat — open-source Ray project for catalogued table-format compaction; Amazon contributed The Flash Compactor.
- systems/anyscale-platform, systems/aws-glue-for-ray — managed-Ray runtimes.
- systems/apache-arrow — in-memory columnar format Ray's DeltaCAT workloads materialise into.
- concepts/task-and-actor-model, concepts/locality-aware-scheduling, concepts/memory-aware-scheduling — the mechanism-level concepts.
- patterns/heterogeneous-cluster-provisioning, patterns/reference-based-copy-optimization — deployment and optimisation patterns seen in production Ray use.