SYSTEM Cited by 2 sources

Amazon EMR¶

Amazon EMR (Elastic MapReduce) is AWS's managed Hadoop-ecosystem runtime: it hosts systems/apache-spark, systems/apache-hive, Presto, Flink, HBase, and other OSS big-data engines on systems/aws-ec2 (and, more recently, on EKS and as a serverless offering). It is the canonical "big data cluster as a service" on AWS and the substrate behind much of the post-Hadoop data-lake workload on systems/aws-s3.

Role for this wiki¶

EMR typically shows up as the thing you were running Spark on before something changed (a scale-out, a cost crunch, a move to a managed warehouse or a specialist engine like systems/ray). In the Amazon BDT Spark → Ray story, the Spark compactor ran on EMR clusters; Ray clusters run directly on EC2 (via the serverless job management substrate BDT built on top of systems/dynamodb + systems/aws-sns + systems/aws-sqs + systems/aws-s3).

In Slack's 2026-05-05 SSH-deprecation retrospective, EMR is the substrate underneath an org-wide modernisation: 700+ production jobs across 8 independent data regions had been orchestrated by Airflow SSH-ing into the EMR master node. The 3-quarter migration to a single REST gateway (Quarry) over YARN + YARN Distributed Shell is the wiki's first end-to-end retrospective on eliminating direct SSH access to EMR clusters at industrial scale; it was the unblocker for EMR-on-EKS adoption and for Slack's "Whitecastle" main-account → child-account migration. The migration surfaced two latent failure modes: concepts/master-node-resource-contention (jobs running on the master node rather than distributed across NodeManagers) and concepts/resource-enforcement-bypass-via-ssh (vmem-check failures that had previously been hidden, fixed via the AWS-recommended setting yarn.nodemanager.vmem-check-enabled: false).
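
As an illustration of the vmem-check fix, the sketch below builds an EMR configuration entry that overrides that one property through the `yarn-site` classification. The source does not say how Slack applied the setting, so treat this as one plausible route rather than their method; the helper name `yarn_site_override` is hypothetical.

```python
import json
from typing import Dict, List


def yarn_site_override(vmem_check_enabled: bool) -> List[Dict]:
    """Build an EMR configuration list that overrides one yarn-site property.

    Hypothetical helper for illustration; only the classification name and
    the property key come from EMR/YARN themselves.
    """
    return [
        {
            "Classification": "yarn-site",
            "Properties": {
                # EMR configuration property values are passed as strings.
                "yarn.nodemanager.vmem-check-enabled": str(vmem_check_enabled).lower(),
            },
        }
    ]


if __name__ == "__main__":
    # Print the JSON form, as it might be supplied at cluster creation.
    print(json.dumps(yarn_site_override(False), indent=2))
```

A list in this shape is what EMR accepts as cluster configurations at creation time (for example, the `Configurations` argument of boto3's `run_job_flow`). Whether disabling the check is appropriate depends on the workload.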

Seen in¶

sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2 — Amazon Retail BDT's original "compactor" Spark job ran on EMR; was progressively displaced by a Ray-on-EC2 compactor for the largest ~1% of tables.
sources/2026-05-05-slack-from-ssh-to-rest-a-security-driven-modernization-of-slacks-emr-data-pipelines — EMR is the multi-region data-platform substrate Slack modernised. Eight independent regions with data-sovereignty boundaries; 700+ Airflow-orchestrated jobs migrated from SSH-to-master-node to Quarry-to-YARN-REST over 3 quarters.

Related¶

systems/apache-spark, systems/apache-hive — the most common EMR workloads.
systems/apache-yarn — the resource manager EMR ships.
systems/yarn-distributed-shell — the YARN feature Slack used to migrate arbitrary shell jobs onto EMR's REST surface.
systems/aws-glue — serverless Spark + catalog alternative to EMR.
systems/aws-ec2 — EMR's compute substrate.
systems/aws-s3 — EMR's canonical storage substrate.
systems/slack-quarry — Slack's REST gateway over EMR/YARN.
concepts/master-node-resource-contention, concepts/resource-enforcement-bypass-via-ssh — the EMR-substrate failure modes Slack's migration surfaced.
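
To make the "REST surface" mentioned under systems/yarn-distributed-shell more concrete, here is a minimal sketch of submitting an application through the YARN ResourceManager REST API (the `new-application` and `apps` endpoints). It is not Quarry's API, which the source does not describe, and it does not reproduce Distributed Shell's own application master. The function names, defaults, and the `command` placeholder are assumptions; a real submission needs a command that launches an actual application master, and secured clusters need authentication that is omitted here.

```python
import json
import urllib.request


def build_submission(app_id, name, command, memory_mb=1024, vcores=1, queue="default"):
    """Build the JSON body for POST /ws/v1/cluster/apps (hypothetical helper)."""
    return {
        "application-id": app_id,
        "application-name": name,
        "application-type": "YARN",
        "queue": queue,
        # Placeholder: must be a command that starts an application master.
        "am-container-spec": {"commands": {"command": command}},
        "resource": {"memory": memory_mb, "vCores": vcores},
    }


def post_json(url, payload=None):
    """POST an optional JSON payload and return the decoded JSON response, if any."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else b""
    req = urllib.request.Request(
        url, data=data, method="POST",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
    return json.loads(body) if body else {}


def submit(rm_url, name, command):
    """Ask the ResourceManager for an application id, then submit against it."""
    new_app = post_json(f"{rm_url}/ws/v1/cluster/apps/new-application")
    app_id = new_app["application-id"]
    post_json(f"{rm_url}/ws/v1/cluster/apps", build_submission(app_id, name, command))
    return app_id
```

The point of the contrast with SSH is that the job is handed to YARN for scheduling on NodeManagers instead of being executed as a process on the master node.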
