

Memory-aware scheduling

Memory-aware scheduling is a scheduler policy that sizes and places work against memory as a first-class resource, not CPU alone. In distributed compute frameworks, each task declares an expected memory footprint at invocation time, and the scheduler packs workers so that the sum of declared footprints stays within each worker's physical memory budget.
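As a concrete sketch of the packing rule, here is a minimal, self-contained admission check in plain Python. This is illustrative only, not Ray's implementation (Ray exposes the hint as the `memory` argument to `@ray.remote`); the `Worker` class and its names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Worker:
    """Hypothetical worker with a fixed physical-memory budget (bytes)."""
    memory_budget: int
    reserved: list = field(default_factory=list)  # declared footprints of running tasks

    def can_admit(self, expected_memory: int) -> bool:
        # Admit only if the sum of declared footprints stays within the budget.
        return sum(self.reserved) + expected_memory <= self.memory_budget

    def admit(self, expected_memory: int) -> bool:
        if self.can_admit(expected_memory):
            self.reserved.append(expected_memory)
            return True
        return False  # caller queues or redirects the task instead of risking OOM

GiB = 1024 ** 3
w = Worker(memory_budget=16 * GiB)
print(w.admit(10 * GiB))  # True  -- fits within the 16 GiB budget
print(w.admit(8 * GiB))   # False -- would over-commit; task waits instead of OOM-ing
```

The key behaviour is the `False` branch: an over-budget task is delayed rather than started, trading a queueing wait for the avoided OOM kill.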

Why it matters

  • In many columnar / merge / ML workloads, memory — not CPU — is the bottleneck. Packing by CPU over-commits memory and produces OOM kills instead of queueing.
  • Without a hint, schedulers default to CPU packing and over-commit memory; OOM is not graceful — it terminates the task, throws away progress, and often triggers a cascading worker restart.
  • With a credible memory hint, the scheduler can reject or delay the task rather than OOM. Workloads with steady memory profiles can therefore run at much higher cluster utilisation without the tail of OOM failures.

Ray documents this explicitly under memory-aware-scheduling.

Sourcing the expected memory

  • Past-trends lookup — query run history for this task class / input shape, and use the p50 or p99 of past memory peaks as the hint.
  • Static upper bound from the plan — for dataflow-shaped work, derive a bound from the plan's row-group size × column count × input count.
  • Dynamic introspection — run a cheap sizing pass to measure the input data before scheduling the heavy pass.
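The first two strategies can be sketched in a few lines. The function names, the nearest-rank percentile choice, and the sample numbers are illustrative assumptions, not from the source:

```python
def hint_from_history(peaks_bytes: list[int], percentile: float = 0.99) -> int:
    """Past-trends lookup: take a high percentile of observed memory peaks."""
    ordered = sorted(peaks_bytes)
    # Nearest-rank percentile: smallest observed value covering `percentile`.
    idx = min(len(ordered) - 1, int(percentile * len(ordered)))
    return ordered[idx]

def hint_from_plan(row_group_bytes: int, column_count: int, input_count: int) -> int:
    """Static upper bound: plan-derived worst case for dataflow-shaped work."""
    return row_group_bytes * column_count * input_count

MiB = 1024 ** 2
history = [900 * MiB, 950 * MiB, 1100 * MiB, 980 * MiB, 4000 * MiB]  # one outlier
print(hint_from_history(history, 0.50) // MiB)  # 980  -- p50 ignores the outlier
print(hint_from_history(history, 0.99) // MiB)  # 4000 -- p99 covers it
```

The p50 vs p99 choice is the utilisation/safety dial: p99 hints almost never OOM but reserve far more memory per task, which is exactly the kind of over-reservation that caps cluster utilisation.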

Amazon BDT's experience

In the 2022 exabyte-scale data-quality service (Amazon BDT's Ray test-bed before business-critical compaction), OOM errors were the dominant failure class. They were resolved by telegraphing expected memory usage at Ray task invocation time via memory-aware scheduling, with hints derived from past trends — i.e. running-history-backed estimates.

(Source: sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2)

Why it caps efficiency gains

The flip side: BDT's Q1 2024 average cluster-memory utilisation was 54.6% of the 36 TiB allocated. Ray's efficiency gain over Spark (82%) is capped by that headroom: if memory could safely be run at 90%+ utilisation, the efficiency gap would widen to more than 90%. Closing that delta is what the next generation of the Flash Compactor targets, by tightening memory-aware-scheduling estimates and adding pipelining.
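The headroom behind that claim works out as follows; the 90% target is the figure named above, and treating it as a safe operating point is the assumption being made:

```python
ALLOCATED_TIB = 36.0
current_util = 0.546   # Q1 2024 average cluster-memory utilisation
target_util = 0.90     # hypothetical safe operating point

used_now = ALLOCATED_TIB * current_util        # ~19.7 TiB actually in use
used_at_target = ALLOCATED_TIB * target_util   # 32.4 TiB usable at 90%
reclaimable = used_at_target - used_now        # ~12.7 TiB of idle headroom

print(f"in use now:        {used_now:.1f} TiB")
print(f"usable at target:  {used_at_target:.1f} TiB")
print(f"headroom to claim: {reclaimable:.1f} TiB")
```

In other words, roughly a third of the allocated fleet memory sits idle as a safety margin for imprecise hints; tighter estimates convert that margin directly into capacity.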

Seen in

  • sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2 — named as the mechanism Amazon BDT used to resolve the OOM failures they hit when first running Ray at exabyte scale on their data-quality service (2022), and referenced as the lever that can close the gap from 54.6% → 90% memory utilisation (and therefore the 82% → 90%+ cost-efficiency improvement) in the Flash Compactor's next revision.