SYSTEM Cited by 2 sources
Apache Hadoop
Apache Hadoop is the canonical open-source batch-processing ecosystem: HDFS for distributed storage, MapReduce for batch computation, YARN for resource management and job scheduling, plus related projects (Hive, HBase, etc.).
This is a stub-tier page. Sub-pages cover the components of Hadoop most often referenced from wiki sources:
- systems/apache-yarn — the resource manager and scheduler at the heart of Hadoop. The substrate Slack's Quarry gateway forwards to.
- systems/yarn-distributed-shell — YARN's ApplicationMaster for arbitrary shell commands; the "little-known feature" that made one REST gateway viable for Slack's heterogeneous job mix.
- systems/apache-spark, systems/apache-hive — common Hadoop-ecosystem compute engines.
Seen in
- sources/2024-12-12-stripe-the-secret-life-of-dns-packets-investigating-complex-networks — Hadoop jobs on a multi-thousand-node cluster can concentrate unusual workloads (e.g. reverse-DNS resolution of every IP in network-activity logs) into a short burst that saturates downstream infrastructure — here, Stripe's central DNS server cluster hitting the AWS VPC-resolver 1,024-pps-per-ENI cap during the hourly runs of a log-analysis job.
- sources/2026-05-05-slack-from-ssh-to-rest-a-security-driven-modernization-of-slacks-emr-data-pipelines — Slack's data platform ran on Hadoop's YARN substrate via EMR. The 700+ SSH-based jobs across 8 regions all eventually became YARN submissions through Quarry, including the 300+ shell-class jobs that ran via YARN Distributed Shell.
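The DNS-saturation entry above comes down to rate arithmetic: many nodes each issuing a modest number of lookups can together exceed a fixed per-interface packet budget. The sketch below illustrates that reasoning only. The 1,024 packets-per-second-per-ENI figure comes from the Stripe source; the node count, per-node lookup rate, and number of resolver ENIs are hypothetical values chosen for illustration, and the model assumes one packet per lookup with no caching.

```python
# Rough burst arithmetic for the DNS-saturation scenario described above.
# Only the 1,024 pps-per-ENI cap comes from the source; every other number
# here is a hypothetical illustration, not a measurement.

VPC_RESOLVER_PPS_PER_ENI = 1024  # cap cited in the Stripe source


def aggregate_query_rate(nodes: int, lookups_per_node_per_sec: float) -> float:
    """Total lookups per second if every node issues lookups at the same rate.

    Assumes one packet per lookup and no caching (a simplification).
    """
    return nodes * lookups_per_node_per_sec


def exceeds_cap(nodes: int, lookups_per_node_per_sec: float, resolver_enis: int) -> bool:
    """True if the aggregate rate exceeds the combined cap of the resolver ENIs."""
    capacity = resolver_enis * VPC_RESOLVER_PPS_PER_ENI
    return aggregate_query_rate(nodes, lookups_per_node_per_sec) > capacity


if __name__ == "__main__":
    # Hypothetical: 3,000 nodes, 5 reverse lookups/s each, 10 resolver ENIs.
    rate = aggregate_query_rate(3000, 5)
    capacity = 10 * VPC_RESOLVER_PPS_PER_ENI
    print(f"aggregate={rate:.0f} pps, capacity={capacity} pps, "
          f"saturated={exceeds_cap(3000, 5, 10)}")
```

Under these assumed numbers the job would ask for 15,000 pps against a 10,240 pps budget, which is the shape of problem the source describes: no single node is misbehaving, but the cluster as a whole overwhelms a shared downstream dependency.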
Related
- systems/apache-yarn — Hadoop's resource manager (own page).
- systems/yarn-distributed-shell — the breakthrough YARN feature for arbitrary shell jobs.
- systems/apache-spark, systems/apache-hive
- systems/amazon-emr — managed Hadoop on AWS.
- systems/slack-quarry — REST gateway over a Hadoop fleet.