
Ray

Ray is an open-source distributed compute framework from UC Berkeley's RISELab (arXiv:1712.05889). It exposes a low-level tasks-plus-actors programming model on top of a distributed object store with zero-copy intranode sharing and an autoscaling, locality-aware scheduler. Originally pitched as a framework for scaling ML workloads, it has since proven itself at exabyte scale as a first-rate specialist data-processing substrate.

Programming model

Ray's primitives are deliberately lower-level than Spark's dataflow abstractions (concepts/task-and-actor-model):

  • Tasks — stateless remote function invocations.
  • Actors — stateful remote classes.
  • Object store — a horizontally-scalable store of immutable serialized objects, keyed by object-ref. One per node, pooled across the cluster.
  • Zero-copy intranode sharing — references to objects in the local node's store are handed to co-located tasks/actors without re-serialization.
  • Locality-aware scheduler — prefers to place a task on the node that already holds its input references, minimising cross-node shuffle (concepts/locality-aware-scheduling).
  • Autoscaling clusters — fine-grained add/remove of workers matching load; supports heterogeneous instance types.
  • Memory-aware scheduling — tasks can declare expected memory usage at invocation time, letting the scheduler pack by the real bottleneck resource rather than CPU alone (concepts/memory-aware-scheduling).

Positioning vs. Apache Spark

  • Spark offers rich, safe, abstract dataflow operators; great for generalist data processing; tuned primarily for RAM+disk clusters.
  • Ray offers a lower-level escape hatch: you hand-craft the distributed program. Reward: for specialist workloads you can beat Spark by large factors on both throughput and cost. Cost: no free optimiser, no SQL, application complexity is yours.

In Amazon BDT's case, the ability to copy untouched S3 files by reference during a distributed merge (patterns/reference-based-copy-optimization), combined with locality-scheduled shuffles and zero-copy intranode sharing, produced 82% better cost efficiency per GiB of S3 input than the prior Spark compactor (sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2).
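The reference-copy idea can be illustrated with a toy compaction manifest (a pure-Python sketch; `ManifestEntry`, `compact`, and the field names are invented for illustration, not BDT's actual API):

```python
from dataclasses import dataclass

@dataclass
class ManifestEntry:
    """One file in a compacted table: either rewritten bytes or a
    reference back to the untouched source object in S3."""
    s3_key: str
    byte_copy: bool  # False => carried forward by reference, no S3 I/O

def compact(source_files, changed_keys):
    """Merge step: only files containing changed rows are rewritten;
    untouched files are 'copied' by emitting a reference to the
    original object, skipping the read-and-rewrite entirely."""
    manifest = []
    for key in source_files:
        if key in changed_keys:
            # Rewrite path: read, merge, write a new object.
            manifest.append(ManifestEntry(key + ".compacted", byte_copy=True))
        else:
            # Reference-based copy: point at the original S3 object.
            manifest.append(ManifestEntry(key, byte_copy=False))
    return manifest

m = compact(["a.parquet", "b.parquet", "c.parquet"],
            changed_keys={"b.parquet"})
assert sum(e.byte_copy for e in m) == 1  # only one file actually rewritten
```

The cost win scales with the fraction of input files a merge leaves untouched: every referenced file is S3 bandwidth the job never pays for.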

Production operational realities (from the BDT migration)

  • Out-of-memory errors were the most common failure mode at exabyte scale; resolved by memory-aware scheduling using past workload memory profiles to size tasks.
  • EC2 instance-management pain — Ray clusters wanted more EC2 instances than a single instance type could reliably supply; resolved by heterogeneous cluster provisioning (patterns/heterogeneous-cluster-provisioning): pick a set of instance types that can meet the resource shape, then provision whichever are most readily available across AZs. Trade-off: applications must not assume a fixed CPU arch / disk type.
  • Reliability gap vs Spark is real: BDT reports Ray's 2023 first-time success rate ran up to 15% behind Spark's; the 2024 average was 99.15% vs Spark's 99.91% (a 0.76 pp gap). Closing that gap took years.
  • Memory utilisation caps the efficiency win: BDT averaged 54.6% cluster-memory utilisation in Q1 2024; raising that to 90% would push the Ray-vs-Spark efficiency gap from 82% to over 90%.
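The memory-aware fix above amounts to sizing each task's declared memory from historical profiles of similar tasks. A toy sketch (the percentile, headroom factor, and function name are assumptions for illustration, not BDT's actual heuristic):

```python
import math

def memory_request(past_peaks_bytes, percentile=0.99, headroom=1.2):
    """Size a task's declared memory from past peak usage of similar
    tasks: take a high percentile of observed peaks, then add headroom,
    so the scheduler packs by memory rather than CPU alone."""
    peaks = sorted(past_peaks_bytes)
    idx = min(len(peaks) - 1, math.ceil(percentile * len(peaks)) - 1)
    return int(peaks[idx] * headroom)

# Peak memory (bytes) from earlier runs of the same workload.
history = [900e6, 1.1e9, 1.0e9, 2.4e9, 1.2e9]
req = memory_request(history)
assert req >= 2.4e9  # covers the observed worst case plus headroom
```

In Ray terms, the result would feed the per-invocation declaration, e.g. `task.options(memory=req).remote(...)`, letting the scheduler avoid oversubscribing a node's memory even when CPUs are free.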
