SYSTEM Cited by 1 source
Amazon EMR
Amazon EMR (Elastic MapReduce) is AWS's managed runtime for the Hadoop ecosystem. It hosts systems/apache-spark, systems/apache-hive, Presto, Flink, HBase, and other OSS big-data engines on systems/aws-ec2 (and, more recently, on EKS and in a serverless mode). It is the canonical "big data cluster as a service" on AWS and the substrate behind much of the post-Hadoop data-lake workload on systems/aws-s3.
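As a rough illustration of what "hosting" Spark on EMR looks like in practice, the sketch below builds a step definition that runs `spark-submit` through EMR's `command-runner.jar` and hands it to a running cluster with boto3's `add_job_flow_steps`. The helper names (`build_spark_step`, `submit_step`), the bucket, the script path, and the job arguments are all hypothetical and chosen only for illustration; this is not the configuration used by any source cited on this page.

```python
from typing import Any


def build_spark_step(name: str, script_uri: str, *script_args: str) -> dict[str, Any]:
    """Build one EMR step that runs spark-submit through command-runner.jar."""
    return {
        "Name": name,
        # Keep the cluster running if the step fails, so it can be inspected.
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri, *script_args],
        },
    }


def submit_step(cluster_id: str, step: dict[str, Any], region: str = "us-east-1") -> str:
    """Submit the step to a running cluster; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without it

    emr = boto3.client("emr", region_name=region)
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"][0]


if __name__ == "__main__":
    step = build_spark_step(
        "compaction-sketch",
        "s3://example-bucket/jobs/compact.py",  # hypothetical script location
        "--table", "example_table",             # hypothetical job arguments
    )
    print(step)
```

Keeping the step builder separate from the submission call means the request shape can be inspected or unit-tested without AWS access; only `submit_step` touches the network.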
Role for this wiki
EMR typically shows up as the thing you were running Spark on before something changed (a scale-out, a cost crunch, a move to a managed warehouse or a specialist engine like systems/ray). In the Amazon BDT Spark → Ray story, the Spark compactor ran on EMR clusters; Ray clusters run directly on EC2 (via the serverless job management substrate BDT built on top of systems/dynamodb + systems/aws-sns + systems/aws-sqs + systems/aws-s3).
Seen in
- sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2: Amazon Retail BDT's original "compactor" Spark job ran on EMR; it was progressively displaced by a Ray-on-EC2 compactor for the largest ~1% of tables.
Related
- systems/apache-spark, systems/apache-hive — the most common EMR workloads.
- systems/aws-glue — serverless Spark + catalog alternative to EMR.
- systems/aws-ec2 — EMR's compute substrate.
- systems/aws-s3 — EMR's canonical storage substrate.