SYSTEM Cited by 1 source
Slack Quarry¶
Quarry is Slack's internal REST-based job-submission gateway. It sits between callers (most prominently Airflow) and multiple compute engines (YARN on EMR, Trino, Snowflake). In this wiki it is the canonical instance of the "REST gateway for compute-engine job submission" pattern.
Origin¶
Quarry was "originally built to provide a unified interface for submitting jobs across multiple compute engines (EMR/YARN, Trino, Snowflake)" (Source: sources/2026-05-05-slack-from-ssh-to-rest-a-security-driven-modernization-of-slacks-emr-data-pipelines). It pre-dated the SSH-deprecation initiative — the gateway already solved authentication, reliability, and observability — and became "exactly what we needed" when Slack decided to eliminate SSH-based job execution entirely.
What Quarry does¶
Per the 2026-05-05 retrospective, Quarry handles five concerns on behalf of its callers:
- Authentication — service-to-service tokens replace SSH keys on individual orchestration workers.
- Job submission — REST APIs to YARN, Trino, and Snowflake.
- State tracking — server-side monitoring of job status, so client crashes don't lose job state.
- Lifecycle management — clean cancellation and cleanup through REST APIs (DELETE on a job ID).
- Observability — structured logs, metrics, and tracing for every job submission.
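
The five concerns above can be sketched as a toy in-memory gateway. This is an illustration only: Quarry's internals are not disclosed, so the class name `ToyGateway`, its method names, the token value, and the `cancelled` status are all hypothetical. The point it tries to show is that job state lives on the gateway side and is keyed by job ID, so a caller that loses its connection can ask again later.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Job:
    engine: str
    command: str
    status: str = "running"


@dataclass
class ToyGateway:
    """In-memory stand-in for a job-submission gateway (hypothetical, not Quarry's code)."""

    valid_tokens: set
    jobs: dict = field(default_factory=dict)
    audit_log: list = field(default_factory=list)

    def _authorize(self, token, action, job_id):
        # Authentication: every call carries a service token, checked at the edge.
        if token not in self.valid_tokens:
            raise PermissionError("unknown service token")
        # Observability: one structured record per authorized call.
        self.audit_log.append({"action": action, "job_id": job_id})

    def submit(self, token, engine, command):
        # Job submission: the caller gets back an ID, not a live connection.
        job_id = uuid.uuid4().hex
        self._authorize(token, "submit", job_id)
        self.jobs[job_id] = Job(engine, command)
        return job_id

    def status(self, token, job_id):
        # State tracking: status is read from gateway-side state.
        self._authorize(token, "status", job_id)
        return self.jobs[job_id].status

    def cancel(self, token, job_id):
        # Lifecycle management: cancellation is an operation on the job ID.
        self._authorize(token, "cancel", job_id)
        job = self.jobs[job_id]
        if job.status == "running":
            job.status = "cancelled"  # assumed status name
        return job.status
```

A caller that restarts only needs to have persisted the job ID; calling `status` again with a valid token returns whatever the gateway last recorded.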
The architecture shift it enabled¶
Verbatim from the post:
Before:
Airflow → SSH Connection → EMR Master Node → Execute Command
After:
Airflow → Quarry REST API → YARN ResourceManager → EMR Container
Three things change at once: the transport (SSH → HTTP), the state model (stateful connection → stateless RESTful resource with job ID), and the execution location (master node → YARN container with proper resource isolation).
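
From the caller's side, the move from a stateful connection to a job-ID resource might look like the request shapes below. This is a minimal sketch using only Python's standard library. The host name, the `/jobs` paths, the JSON field names, and the bearer-token header are assumptions made for illustration; the source only states that submission is a REST call, that status is a GET on the job ID, and that cancellation is a DELETE on the job ID. The functions build requests without sending them.

```python
import json
import urllib.request

BASE_URL = "https://quarry.example.internal"  # hypothetical host


def build_request(method, path, token, body=None):
    """Build (but do not send) an authenticated HTTP request."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(BASE_URL + path, data=data, method=method)
    # Assumed auth scheme: a service token sent as a bearer token.
    req.add_header("Authorization", f"Bearer {token}")
    if data is not None:
        req.add_header("Content-Type", "application/json")
    return req


def submit_request(token, engine, command):
    # Hypothetical payload shape: which engine to use and what to run.
    return build_request("POST", "/jobs", token, {"engine": engine, "command": command})


def status_request(token, job_id):
    return build_request("GET", f"/jobs/{job_id}", token)


def cancel_request(token, job_id):
    return build_request("DELETE", f"/jobs/{job_id}", token)
```

Each request is self-describing: nothing about the job depends on a connection staying open between the submit call and a later status or cancel call.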
YARN Distributed Shell as universal executor¶
The architectural breakthrough that made Quarry-via-YARN viable
for all of Slack's job types — not just Hadoop workloads —
was YARN Distributed Shell.
Spark and Hive already had REST APIs (Livy
and HiveServer2). MapReduce and the 300+ CLI-based jobs
running arbitrary shell commands (aws s3 sync, hadoop distcp,
custom Python scripts) had no native REST option until Slack
discovered DistShell — the YARN ApplicationMaster that runs an
arbitrary shell script in a YARN container with proper resource
limits, isolation, retry, cancellation, and logging. With
DistShell, "whether you're running a Spark job, a Hive query,
or a simple shell script, it all goes through the same REST
API."
See patterns/yarn-distributed-shell-as-universal-shell-executor for the named pattern.
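
For a sense of what a DistShell submission carries, the sketch below assembles the argument list for the stock Distributed Shell client that ships with Hadoop. How Quarry actually drives DistShell is not described in the source, so this is not a reconstruction of Slack's code; the helper name `distshell_argv`, the default sizes, the application name, and the jar path are placeholders chosen for illustration.

```python
DISTSHELL_CLIENT = "org.apache.hadoop.yarn.applications.distributedshell.Client"


def distshell_argv(shell_command, jar_path, memory_mb=1024, vcores=1, app_name="shell-job"):
    """Return the argv for running one shell command in one YARN container."""
    return [
        "yarn", DISTSHELL_CLIENT,
        "-jar", jar_path,                      # DistShell jar (path is a placeholder)
        "-appname", app_name,
        "-shell_command", shell_command,       # the arbitrary command to execute
        "-num_containers", "1",
        "-container_memory", str(memory_mb),   # memory limit enforced by YARN, in MB
        "-container_vcores", str(vcores),      # CPU limit enforced by YARN
    ]


# Example with placeholder bucket names and jar path.
argv = distshell_argv(
    "aws s3 sync s3://example-src s3://example-dst",
    jar_path="/path/to/hadoop-yarn-applications-distributedshell.jar",
)
```

The explicit memory and vCore arguments are the visible difference from running the same command over SSH: the command now runs under limits that YARN enforces.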
Migration footprint¶
Quarry was the universal point through which Slack migrated:
- 700+ production jobs
- 7 operator types (named: CrunchExecOperator, S3SyncOperator, plus 5 others)
- 8 independent data regions with separate Quarry configurations, cluster endpoints, and network routing rules
- 5 teams (Search Infrastructure, Data Engineering & Analytics, ML Services, plus marketing-domain teams)
- 3 quarters end-to-end, zero downtime for business-critical services
What Quarry replaced (operationally)¶
Each of the operational improvements below maps to a property of the gateway architecture:
| Operational property | Quarry's mechanism |
|---|---|
| No more SSH keys on Airflow workers | Service-to-service tokens at the Quarry edge |
| No more zombie jobs after pod restart | Server-side state in Quarry; client crash ≠ job failure |
| No more master-node resource contention | All non-Hadoop jobs run in YARN containers via DistShell |
| Audit trail per job submission | Structured logs at Quarry's REST surface |
| Clean cancellation | DELETE on the job ID; Quarry forwards to YARN |
| Distinct status from terminate-on-success | GET on job ID returns running / completed / failed |
| Observability across multiple engines | Quarry's logs cover YARN + Trino + Snowflake uniformly |
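
The status row in the table implies some translation from engine-level state to the three values a caller sees. A minimal sketch of such a mapping for the YARN path is shown below. The input values are YARN's own application `state` and `finalStatus` fields; the mapping itself, the function name `gateway_status`, and the choice to fold killed applications into `failed` are assumptions, since the source does not describe how Quarry derives its statuses.

```python
# YARN application states that mean the job has not finished yet.
IN_FLIGHT = {"NEW", "NEW_SAVING", "SUBMITTED", "ACCEPTED", "RUNNING"}


def gateway_status(yarn_state, final_status="UNDEFINED"):
    """Collapse YARN state/finalStatus into running, completed, or failed."""
    if yarn_state in IN_FLIGHT:
        return "running"
    # FINISHED only means the application ended; finalStatus says how it ended.
    if yarn_state == "FINISHED" and final_status == "SUCCEEDED":
        return "completed"
    # Assumed simplification: FAILED, KILLED, and unsuccessful FINISHED all map to failed.
    return "failed"
```

Separating "the application ended" from "the application succeeded" is what lets a caller distinguish a clean finish from a job that merely stopped.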
What's not publicly disclosed¶
The 2026-05-05 post is a retrospective on the SSH-elimination initiative, not a Quarry architecture deep-dive. Not disclosed:
- Internal architecture of Quarry itself — process model, storage substrate for job state, how it handles HA, etc.
- Details of Trino + Snowflake adapters — only the YARN/DistShell path is explained.
- Token rotation cadence and surface — the article notes service-to-service tokens replace SSH keys, but does not disclose how token issuance / rotation / scoping works at the 700-jobs × 8-regions scale.
- No public API / open-source release — Quarry remains a Slack-internal system. The architectural shape is the generalisable artefact.
- Container resource sizing for arbitrary shell commands — how Slack picks per-shell-job memory / vCores when YARN now enforces limits that SSH had been silently bypassing.
Seen in¶
- sources/2026-05-05-slack-from-ssh-to-rest-a-security-driven-modernization-of-slacks-emr-data-pipelines — canonical wiki source. Quarry is positioned as the universal REST job-submission gateway that enabled 100% SSH elimination across 8 data regions, the unblocker for Spark-on-Kubernetes and Whitecastle child-account migration, and the substrate for service-token authentication + per-job audit trails. Sole source for Quarry as of 2026-05-21.
Related¶
- companies/slack
- systems/yarn-distributed-shell — the YARN feature that let Quarry serve arbitrary shell jobs through a single protocol.
- systems/apache-yarn — the resource manager Quarry submits to.
- systems/amazon-emr — the cluster substrate where Quarry's YARN backend runs.
- systems/apache-airflow — Quarry's biggest caller; SSH operators were replaced by Quarry operators in Airflow DAGs.
- systems/trino, systems/snowflake — additional engines Quarry fronts.
- patterns/rest-gateway-for-compute-engine-job-submission — the named architectural pattern Quarry canonicalises.
- patterns/yarn-distributed-shell-as-universal-shell-executor — the breakthrough enabler.
- concepts/rest-based-job-submission — the paradigm shift.
- concepts/ssh-job-execution-anti-pattern — what Quarry replaced.
- concepts/audit-trail — Quarry's per-submission logs are the new audit substrate.
- concepts/attack-surface-minimization — eliminating SSH keys across 8 regions × 700+ jobs is a textbook attack-surface-shrink project.