SYSTEM Cited by 2 sources
Amazon EMR
Amazon EMR (Elastic MapReduce) is AWS's managed runtime for the Hadoop ecosystem. It hosts systems/apache-spark, systems/apache-hive, Presto, Flink, HBase, and other OSS big-data engines on systems/aws-ec2 (and, more recently, on EKS and serverless). It is the canonical "big data cluster as a service" on AWS and the substrate behind much of the post-Hadoop data-lake workload on systems/aws-s3.
Role for this wiki
EMR typically shows up as the thing you were running Spark on before something changed (a scale-out, a cost crunch, a move to a managed warehouse or a specialist engine like systems/ray). In the Amazon BDT Spark → Ray story, the Spark compactor ran on EMR clusters; Ray clusters run directly on EC2 (via the serverless job management substrate BDT built on top of systems/dynamodb + systems/aws-sns + systems/aws-sqs + systems/aws-s3).
In Slack's 2026-05-05 SSH-deprecation retrospective, EMR is the substrate underneath an org-wide modernisation: 700+ production jobs across 8 independent data regions had been orchestrated by Airflow SSH-ing into the EMR master node. The 3-quarter migration to a single REST gateway (Quarry) over YARN and YARN Distributed Shell is the wiki's first end-to-end retrospective on eliminating direct SSH access to EMR clusters at industrial scale. It was the unblocker for EMR-on-EKS adoption and for Slack's "Whitecastle" main-account → child-account migration. Two latent failure modes surfaced: concepts/master-node-resource-contention (jobs ran on the master node rather than being distributed across NodeManagers) and concepts/resource-enforcement-bypass-via-ssh (vmem-check failures that had previously been hidden, fixed via the AWS-recommended setting yarn.nodemanager.vmem-check-enabled: false).
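As a rough illustration of the two mechanisms named above, the following Python sketch builds (without sending) a request body for the YARN ResourceManager's application-submission endpoint, POST /ws/v1/cluster/apps, and an EMR configuration override for the yarn-site classification that carries the vmem-check setting. This is not Quarry's implementation, which the source does not describe at this level. The function names, the example application ID, and the ApplicationMaster command are hypothetical placeholders.

```python
import json


def yarn_submit_payload(app_id, name, am_command, memory_mb=1024, vcores=1):
    """Build the JSON body for POST /ws/v1/cluster/apps (YARN ResourceManager).

    app_id is the ID returned by an earlier POST to
    /ws/v1/cluster/apps/new-application. am_command is the command that starts
    the ApplicationMaster; the real Distributed Shell invocation is not given
    in the source, so it is left as a caller-supplied placeholder.
    """
    return {
        "application-id": app_id,
        "application-name": name,
        "application-type": "YARN",
        "am-container-spec": {"commands": {"command": am_command}},
        "resource": {"memory": memory_mb, "vCores": vcores},
    }


def yarn_site_override():
    """EMR configuration entry disabling the NodeManager virtual-memory check."""
    return [
        {
            "Classification": "yarn-site",
            # EMR expects property values as strings.
            "Properties": {"yarn.nodemanager.vmem-check-enabled": "false"},
        }
    ]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    body = yarn_submit_payload(
        "application_0000000000000_0001",
        "demo-job",
        "<ApplicationMaster launch command>",
    )
    print(json.dumps(body, indent=2))
    print(json.dumps(yarn_site_override(), indent=2))
```

The configuration list is in the shape EMR accepts at cluster creation, for example through the Configurations parameter of boto3's run_job_flow. Because the job is submitted to YARN rather than run in an SSH session on the master node, its containers are scheduled on NodeManagers and are subject to YARN's resource checks.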
Seen in
- sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2 — Amazon Retail BDT's original "compactor" Spark job ran on EMR; was progressively displaced by a Ray-on-EC2 compactor for the largest ~1% of tables.
- sources/2026-05-05-slack-from-ssh-to-rest-a-security-driven-modernization-of-slacks-emr-data-pipelines — EMR is the multi-region data-platform substrate Slack modernised. Eight independent regions with data-sovereignty boundaries; 700+ Airflow-orchestrated jobs migrated from SSH-to-master-node to Quarry-to-YARN-REST over 3 quarters.
Related
- systems/apache-spark, systems/apache-hive — the most common EMR workloads.
- systems/apache-yarn — the resource manager EMR ships.
- systems/yarn-distributed-shell — the YARN feature Slack used to migrate arbitrary shell jobs onto EMR's REST surface.
- systems/aws-glue — serverless Spark + catalog alternative to EMR.
- systems/aws-ec2 — EMR's compute substrate.
- systems/aws-s3 — EMR's canonical storage substrate.
- systems/slack-quarry — Slack's REST gateway over EMR/YARN.
- concepts/master-node-resource-contention, concepts/resource-enforcement-bypass-via-ssh — the EMR-substrate failure modes Slack's migration surfaced.