
SYSTEM Cited by 1 source

Slack Quarry

Quarry is Slack's internal REST-based job-submission gateway. It sits between callers (most prominently Airflow) and multiple compute engines (YARN on EMR, Trino, Snowflake). It is the canonical instance in the wiki of the "REST gateway for compute-engine job submission" pattern.

Origin

Quarry was "originally built to provide a unified interface for submitting jobs across multiple compute engines (EMR/YARN, Trino, Snowflake)" (Source: sources/2026-05-05-slack-from-ssh-to-rest-a-security-driven-modernization-of-slacks-emr-data-pipelines). It pre-dated the SSH-deprecation initiative and already solved authentication, reliability, and observability, so it became "exactly what we needed" when Slack decided to eliminate SSH-based job execution entirely.

What Quarry does

Per the 2026-05-05 retrospective, Quarry handles five concerns on behalf of its callers:

  1. Authentication — service-to-service tokens replace SSH keys on individual orchestration workers.
  2. Job submission — REST APIs to YARN, Trino, and Snowflake.
  3. State tracking — server-side monitoring of job status, so client crashes don't lose job state.
  4. Lifecycle management — clean cancellation and cleanup through REST APIs (DELETE on a job ID).
  5. Observability — structured logs, metrics, and tracing for every job submission.
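
The five concerns above can be illustrated with a toy in-memory model. This is a minimal sketch, not Quarry's implementation: the class name ToyGateway, its method names, the record fields, and the "cancelled" status are all hypothetical, since the post does not describe Quarry's API surface or storage.

```python
import uuid
from dataclasses import dataclass, field

ENGINES = ("yarn", "trino", "snowflake")


@dataclass
class ToyGateway:
    """In-memory stand-in for a job gateway (hypothetical, not Quarry's design)."""

    valid_tokens: set
    jobs: dict = field(default_factory=dict)       # job_id -> job record
    audit_log: list = field(default_factory=list)  # one structured entry per action

    def _authenticate(self, token):
        # Concern 1: a service token, not an SSH key, identifies the caller.
        if token not in self.valid_tokens:
            raise PermissionError("unknown service token")

    def submit(self, token, engine, command):
        self._authenticate(token)
        if engine not in ENGINES:
            raise ValueError(f"unsupported engine: {engine}")
        # Concerns 2 and 3: the job record lives on the server side,
        # so a crashed client can still look the job up by its ID.
        job_id = uuid.uuid4().hex
        self.jobs[job_id] = {"engine": engine, "command": command, "status": "running"}
        # Concern 5: every submission leaves a structured log entry.
        self.audit_log.append({"event": "submit", "job_id": job_id, "engine": engine})
        return job_id

    def status(self, token, job_id):
        self._authenticate(token)
        return self.jobs[job_id]["status"]

    def cancel(self, token, job_id):
        # Concern 4: cancellation is an explicit operation on the job ID.
        self._authenticate(token)
        job = self.jobs[job_id]
        if job["status"] == "running":
            job["status"] = "cancelled"  # hypothetical status name
        self.audit_log.append({"event": "cancel", "job_id": job_id})
        return job["status"]
```

In a real gateway, submit, status, and cancel would sit behind HTTP routes (for example POST, GET, and DELETE on a job resource), and the dictionaries would be replaced by durable storage.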

The architecture shift it enabled

Verbatim from the post:

Before: Airflow → SSH Connection → EMR Master Node → Execute Command

After: Airflow → Quarry REST API → YARN ResourceManager → EMR Container

Three things change at once: the transport (SSH → HTTP), the state model (stateful connection → stateless RESTful resource with job ID), and the execution location (master node → YARN container with proper resource isolation).
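
The state-model change is the one that most affects client code: once a job is a resource addressed by ID, a caller needs only that ID to find out what happened. The sketch below shows a polling loop under that assumption. The function name wait_for_job and the fetch_status callable are hypothetical. In practice fetch_status would wrap an HTTP GET against the gateway, whose URL scheme the post does not disclose.

```python
import time

# Terminal statuses as named in the post's status list.
TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_job(job_id, fetch_status, poll_seconds=5.0, max_polls=100, sleep=time.sleep):
    """Poll a job by ID until it reaches a terminal status.

    fetch_status is any callable mapping job_id -> status string.
    No connection or session is held between polls, so a restarted
    client can call this again with the same job_id and pick up
    where it left off.
    """
    for _ in range(max_polls):
        status = fetch_status(job_id)
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} still not terminal after {max_polls} polls")


if __name__ == "__main__":
    # Stand-in for the server: reports "running" twice, then "completed".
    canned = iter(["running", "running", "completed"])
    print(wait_for_job("job-123", lambda _id: next(canned), sleep=lambda s: None))
```

With an SSH session, by contrast, the job's fate was tied to the connection: if the worker holding the session died, so did the only handle on the job.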

YARN Distributed Shell as universal executor

The architectural breakthrough that made Quarry-via-YARN viable for all of Slack's job types, not just Hadoop workloads, was YARN Distributed Shell (DistShell). Spark and Hive already had REST APIs (Livy and HiveServer2). MapReduce and the 300+ CLI-based jobs running arbitrary shell commands (aws s3 sync, hadoop distcp, custom Python scripts) had no native REST option until Slack discovered DistShell, a YARN ApplicationMaster that runs an arbitrary shell script in a YARN container with proper resource limits, isolation, retry, cancellation, and logging. With DistShell, "whether you're running a Spark job, a Hive query, or a simple shell script, it all goes through the same REST API."
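
To make concrete what DistShell takes as input, the sketch below builds the argument list for the stock Hadoop DistributedShell command-line client. This illustrates the upstream Hadoop tool only. Quarry submits over REST, and the post does not describe how it drives DistShell. The helper name distshell_argv, the default sizes, and the jar path in the usage example are assumptions.

```python
import shlex

# Client class shipped with Apache Hadoop's distributed shell example application.
DISTSHELL_CLIENT = "org.apache.hadoop.yarn.applications.distributedshell.Client"


def distshell_argv(shell_command, jar_path, memory_mb=1024, vcores=1, num_containers=1):
    """Build the argv for running one shell command in YARN containers.

    jar_path must point at the distributedshell jar of the local Hadoop
    install; its location varies by distribution, so it is not guessed here.
    """
    if memory_mb <= 0 or vcores <= 0 or num_containers <= 0:
        raise ValueError("memory_mb, vcores, and num_containers must be positive")
    return [
        "yarn", DISTSHELL_CLIENT,
        "-jar", jar_path,
        "-shell_command", shell_command,      # the arbitrary command to run
        "-num_containers", str(num_containers),
        "-container_memory", str(memory_mb),  # limit enforced by YARN, in MB
        "-container_vcores", str(vcores),
    ]


if __name__ == "__main__":
    argv = distshell_argv(
        "aws s3 sync s3://src-bucket s3://dst-bucket",  # hypothetical buckets
        jar_path="/path/to/hadoop-yarn-applications-distributedshell.jar",
    )
    print(shlex.join(argv))  # shell-quoted form, for inspection only
```

The explicit memory and vCore arguments are the point of contrast with SSH: a command run on the master node had no declared limits, whereas here every command carries a container size that YARN enforces.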

See patterns/yarn-distributed-shell-as-universal-shell-executor for the named pattern.

Migration footprint

Quarry was the universal point through which Slack migrated:

  • 700+ production jobs
  • 7 operator types (named: CrunchExecOperator, S3SyncOperator, plus 5 others)
  • 8 independent data regions with separate Quarry configurations, cluster endpoints, and network routing rules
  • 5 teams (Search Infrastructure, Data Engineering & Analytics, ML Services, plus marketing-domain teams)
  • 3 quarters end-to-end, zero downtime for business-critical services

What Quarry replaced (operationally)

Each of the operational improvements below maps to a property of the gateway architecture:

Operational property                      | Quarry's mechanism
No more SSH keys on Airflow workers       | Service-to-service tokens at the Quarry edge
No more zombie jobs after pod restart     | Server-side state in Quarry; client crash ≠ job failure
No more master-node resource contention   | All non-Hadoop jobs run in YARN containers via DistShell
Audit trail per job submission            | Structured logs at Quarry's REST surface
Clean cancellation                        | DELETE on the job ID; Quarry forwards to YARN
Distinct status from terminate-on-success | GET on job ID returns running / completed / failed
Observability across multiple engines     | Quarry's logs cover YARN + Trino + Snowflake uniformly
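
The status row implies some translation from engine-level state to the three values a caller sees. How Quarry does this is not disclosed. The sketch below is one plausible mapping for the YARN path, using the application state and finalStatus values reported by the YARN ResourceManager REST API. The function name coarse_status and the choice to report killed applications as "failed" are assumptions.

```python
# Application states and final statuses as reported by the YARN ResourceManager.
IN_FLIGHT_STATES = {"NEW", "NEW_SAVING", "SUBMITTED", "ACCEPTED", "RUNNING"}
ENDED_STATES = {"FINISHED", "FAILED", "KILLED"}


def coarse_status(yarn_state, final_status):
    """Collapse YARN's (state, finalStatus) pair into running / completed / failed.

    A FINISHED application is only "completed" when its finalStatus is
    SUCCEEDED. An application can finish and still report FAILED, which is
    why termination alone is not treated as success.
    """
    if yarn_state in IN_FLIGHT_STATES:
        return "running"
    if yarn_state in ENDED_STATES:
        if yarn_state == "FINISHED" and final_status == "SUCCEEDED":
            return "completed"
        return "failed"  # assumption: KILLED is surfaced as "failed"
    raise ValueError(f"unrecognised YARN application state: {yarn_state}")


if __name__ == "__main__":
    for pair in [("ACCEPTED", "UNDEFINED"), ("FINISHED", "SUCCEEDED"),
                 ("FINISHED", "FAILED"), ("KILLED", "KILLED")]:
        print(pair, "->", coarse_status(*pair))
```

An SSH-based operator typically infers success from the remote command's exit or from the session closing. Here, success is read from an explicit status, separate from the fact that the job has ended.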

What's not publicly disclosed

The 2026-05-05 post is a retrospective on the SSH-elimination initiative, not a Quarry architecture deep-dive. Not disclosed:

  • Internal architecture of Quarry itself — process model, storage substrate for job state, how it handles HA, etc.
  • Details of Trino + Snowflake adapters — only the YARN/DistShell path is explained.
  • Token rotation cadence and surface — the article notes service-to-service tokens replace SSH keys, but does not disclose how token issuance / rotation / scoping works at the 700-jobs × 8-regions scale.
  • No public API / open-source release — Quarry remains a Slack-internal system. The architectural shape is the generalisable artefact.
  • Container resource sizing for arbitrary shell commands — how Slack picks per-shell-job memory / vCores when YARN now enforces limits that SSH had been silently bypassing.

