SLACK 2026-05-05 Tier 2

From SSH to REST: A Security-Driven Modernization of Slack's EMR Data Pipelines

Summary

Slack's data platform was built around 2017 with a simple orchestration pattern: Airflow ran jobs on EMR clusters by SSH-ing into the cluster's master node and executing commands directly. By 2024 that pattern had proliferated into 700+ production jobs across 7 operator types: daily search indexing on terabytes of data, analytics jobs, MapReduce, Spark, Hive, and raw shell commands like aws s3 sync and hadoop distcp. Every one of them required SSH credentials with master-node access, and the cumulative attack surface had become a structural blocker: it prevented Slack from migrating to Spark-on-Kubernetes, from moving the last main-account EMR clusters to child accounts (the "Whitecastle" initiative), and from implementing proper job-level observability. This article is the retrospective on how Slack eliminated SSH entirely across 8 independent data regions with zero downtime in 3 quarters, by funnelling every job submission through a single REST gateway, Quarry, backed by YARN and its Distributed Shell application, the breakthrough enabler.

Key takeaways

  1. The architectural breakthrough was discovering YARN Distributed Shell. Spark and Hive already had REST APIs (Livy and HiveServer2 respectively). The hard part was MapReduce plus 300+ CLI-based jobs running arbitrary shell commands with no native REST option. Slack considered building a custom remote-execution wrapper, using Ansible/Salt, or creating a new YARN job type from scratch; all were rejected as too complex or as requiring custom security layers. Then they discovered org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster, "a little-known feature […] already part of YARN, used the same REST APIs, and required no custom security layer." It runs an arbitrary shell script in a YARN container, with proper resource limits, isolation, retry, cancellation, and logging, via the standard YARN REST API. Canonical instance of patterns/yarn-distributed-shell-as-universal-shell-executor. (Source: this article.)

  2. One REST gateway across multiple compute engines unifies security, observability, and lifecycle management. Quarry, originally built as a unified job-submission interface across YARN/EMR + Trino + Snowflake, became the universal point of authentication / state-tracking / cancellation for all 700+ jobs. "Whether you're running a Spark job, a Hive query, or a simple shell script, it all goes through the same REST API." The architecture shift, verbatim:

Before: Airflow → SSH Connection → EMR Master Node → Execute Command
After: Airflow → Quarry REST API → YARN ResourceManager → EMR Container

Canonical instance of patterns/rest-gateway-for-compute-engine-job-submission.

  3. SSH was hiding latent failure modes, which surfaced on first contact with proper YARN submission. Two illustrative challenges:

     • vmem-check failures. Jobs that ran fine via SSH started failing under Quarry with virtual-memory check errors. Root cause: SSH ran commands directly on the master node, bypassing YARN's resource enforcement entirely, whereas Quarry submits jobs to YARN, which actually enforces limits. Fix: follow AWS guidance and set "yarn.nodemanager.vmem-check-enabled": "false", because Linux virtual-memory accounting is unreliable and physical-memory limits are sufficient. Canonicalised as concepts/resource-enforcement-bypass-via-ssh.

     • EKM connectivity timeout. A search-infrastructure job migrated to a new staging cluster failed with Unable to execute HTTP request: Connect to sts.amazonaws.com:443 failed: connect timed out. Root cause: the original cluster had network routing to reach key-management endpoints; the new cluster, in a stricter network segment, did not. "The failure surfaced a hidden dependency on network topology that wasn't captured in the job's configuration." Slack moved jobs to clusters with the right routing rather than punching holes through the network.

Both failures were invisible under SSH. The migration acted as an audit. "SSH hides a lot of problems" is the article's verbatim takeaway.

  4. Multi-region migration is N times harder than single-region, with N unique failure modes. Slack runs EMR clusters across 8 independent data regions for data-sovereignty compliance. The SSH deprecation was "effectively 8 parallel migrations, each with its own special set of challenges." Each region required separate Quarry configurations, cluster endpoints, and network routing, and every code change needed validation across all 8 regions. Slack mitigated this by piloting in a single US region, documenting region-specific configuration upfront, building region-aware operators, and rolling out incrementally with separate progress tracking per region. The post is one of the more operationally honest disclosures about regional-multiplier overhead.

  5. Server-side job lifecycle survives client crashes, and that was the operational killer feature. Under the SSH model, a Kubernetes pod restart broke the SSH connection and the job either failed, became a zombie, or finished without anyone knowing whether it succeeded. Under Quarry/YARN, "the job keeps running, and Quarry maintains the connection", so client crashes are no longer job failures. "No more zombie processes. Jobs terminate properly when cancelled through REST APIs." The REST job-submission model puts state on the server, not in the SSH session.

  6. "Master node resource contention" was the silent second-order cost of SSH. Because SSH commands ran on the master node directly (not as YARN containers), 700+ jobs competed for resources on the EMR master rather than being distributed across worker nodes with proper resource allocation. "Master node resource contention: eliminated." This was an operational improvement most teams hadn't realised they were paying for. Canonicalised as concepts/master-node-resource-contention.

  7. Analytics-driven progress tracking kept the migration moving. Slack built an analytics dashboard backed by queries against the Airflow metadata database to identify remaining SSH-based tasks per team, per DAG, and per region. This gave project leadership real-time visibility into what was left and let teams prioritise without ambiguity. It is a data-driven progress mechanism rather than a plain burndown chart: "querying the Airflow database to identify remaining SSH-based tasks made it easy to see which teams/DAGs still needed migration." Best-practice recommendation from the post: "Build monitoring before you migrate."

  8. OKR-driven execution was the load-bearing organisational move. Phase 3 of the migration was explicitly labelled an OKR-Driven Execution phase: "Made it a Key Result with executive visibility. This created accountability and kept it prioritized. We hit the 80% migration milestone during this phase." The 700+ job migration would have stalled indefinitely without executive sponsorship, and the article identifies this organisational mechanism as decisive.

  9. Progressive operator deprecation, not big-bang switchover. Slack deprecated SSH operators one at a time (CrunchExecOperator, then S3SyncOperator, etc.): "each deprecation was its own mini-project with testing and validation." This was slower than migrating everything at once, but it "greatly mitigated the risk of the migration." Canonicalised as patterns/incremental-operator-by-operator-migration.
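
The progress-tracking query described above can be sketched as follows. This is a minimal illustration, not Slack's dashboard code: it assumes the Airflow metadata table task_instance with its dag_id, task_id, and operator columns, uses an in-memory SQLite database with made-up rows as a stand-in for the real metadata database, and treats only the two operator names mentioned in the article as SSH-based. Per-team and per-region breakdowns would need ownership and region data that the article does not describe.

```python
import sqlite3

# Operator class names treated as SSH-based. These two are named in the
# article; the full list of 7 operator types is not disclosed.
SSH_OPERATORS = ("CrunchExecOperator", "S3SyncOperator")


def remaining_ssh_tasks(conn, operators=SSH_OPERATORS):
    """Count distinct SSH-based tasks per DAG, largest backlog first."""
    placeholders = ", ".join("?" for _ in operators)
    query = f"""
        SELECT dag_id, operator, COUNT(DISTINCT task_id) AS tasks
        FROM task_instance
        WHERE operator IN ({placeholders})
        GROUP BY dag_id, operator
        ORDER BY tasks DESC, dag_id
    """
    return conn.execute(query, operators).fetchall()


def demo_connection():
    """In-memory stand-in for the metadata database (hypothetical rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE task_instance (dag_id TEXT, task_id TEXT, operator TEXT)"
    )
    conn.executemany(
        "INSERT INTO task_instance VALUES (?, ?, ?)",
        [
            ("search_index", "build", "CrunchExecOperator"),
            ("search_index", "build", "CrunchExecOperator"),  # second run, same task
            ("search_index", "publish", "CrunchExecOperator"),
            ("exports", "sync", "S3SyncOperator"),
            ("exports", "report", "PythonOperator"),  # already off SSH
        ],
    )
    return conn


if __name__ == "__main__":
    for dag_id, operator, tasks in remaining_ssh_tasks(demo_connection()):
        print(dag_id, operator, tasks)
```

Counting distinct task_id values rather than rows keeps repeated runs of the same task from inflating the backlog.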

Migration by the numbers (verbatim)

  • 700+ jobs migrated across 7 operator types
  • 8 independent data regions with coordinated rollouts
  • 5 teams transitioned to new operators (Search Infrastructure, Data Engineering & Analytics, ML Services, plus marketing domain mentioned in retrospective)
  • Zero downtime for business-critical services
  • 3 quarters, from initial pilot to 100% SSH elimination
  • "With two years of production experience since completion, the architectural decisions have proven sound. […] No regrets."

Architecture extracted

The five phases

  1. Proof of Concept — pilot jobs validate Quarry approach; first Quarry operators built; tested in dev environments.
  2. Security Review — security teams plan credential elimination, validate REST approach against requirements.
  3. OKR-Driven Execution — Key Result with executive visibility; 80% milestone hit during this phase.
  4. Bulk Migration — heavy cross-team coordination across all regions; multiple teams in parallel.
  5. Final Cleanup — overlooked DAGs migrated; legacy SSH-based operators deprecated; 100% SSH elimination.

This is structurally a five-phase migration — see patterns/phased-migration-with-soak-times and patterns/incremental-operator-by-operator-migration for the generalised shapes.

YARN Distributed Shell submission flow (extracted)

  1. Upload script to S3 — e.g. s3://bucket/command.sh containing aws s3 sync /tmp/data/ s3://bucket/output/.
  2. Submit to YARN with DistShell config — application-type MAPREDUCE, am-container-spec invokes org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster with environment variables DISTRIBUTEDSHELLSCRIPTLOCATION / DISTRIBUTEDSHELLSCRIPTLEN / DISTRIBUTEDSHELLSCRIPTTIMESTAMP.
  3. YARN allocates a container, downloads the script, and executes it with proper resource limits, container isolation, retry/fault tolerance, clean cancellation, and logging through YARN UI.
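
The submission in step 2 can be sketched as a request body for the YARN ResourceManager's application-submission endpoint (POST /ws/v1/cluster/apps). This is an illustrative sketch under assumptions, not Quarry's actual payload: the function name, application name, heap size, resource sizes, and retry count are hypothetical, and a real submission also needs an application ID obtained from the ResourceManager plus classpath and local-resource entries for the ApplicationMaster jar, which are omitted here.

```python
import json


def build_distshell_submission(app_id, script_uri, script_len, script_mtime_ms,
                               memory_mb=1024, vcores=1):
    """Build a JSON-serialisable body for a Distributed Shell submission."""
    # {{JAVA_HOME}} and <LOG_DIR> are placeholders that YARN expands itself.
    am_command = (
        "{{JAVA_HOME}}/bin/java -Xmx256m "
        "org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster "
        f"--container_memory {memory_mb} --container_vcores {vcores} "
        "--num_containers 1 --priority 0 "
        "1><LOG_DIR>/AppMaster.stdout 2><LOG_DIR>/AppMaster.stderr"
    )
    # Environment variables that tell the ApplicationMaster where the script is.
    environment = {
        "DISTRIBUTEDSHELLSCRIPTLOCATION": script_uri,
        "DISTRIBUTEDSHELLSCRIPTLEN": str(script_len),
        "DISTRIBUTEDSHELLSCRIPTTIMESTAMP": str(script_mtime_ms),
    }
    return {
        "application-id": app_id,
        "application-name": "distshell-command",  # hypothetical name
        "application-type": "MAPREDUCE",          # as stated in the article
        "max-app-attempts": 2,                    # assumed retry budget
        # Same size reused for the ApplicationMaster container, for brevity.
        "resource": {"memory": memory_mb, "vCores": vcores},
        "am-container-spec": {
            "commands": {"command": am_command},
            "environment": {
                "entry": [{"key": k, "value": v} for k, v in environment.items()]
            },
        },
    }


if __name__ == "__main__":
    body = build_distshell_submission(
        "application_0000000000000_0001",  # placeholder application ID
        "s3://bucket/command.sh", 42, 1700000000000,
    )
    print(json.dumps(body, indent=2))
```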

The Quarry interception layer

Quarry sits between callers (Airflow being the biggest user) and multiple compute engines:

  • Authentication — service-to-service tokens replace SSH keys.
  • Job submission — REST APIs to YARN, Trino, Snowflake.
  • State tracking — server-side monitoring of job status.
  • Lifecycle management — clean cancellation + cleanup through REST APIs.
  • Observability — structured logs, metrics, tracing for all job submissions.
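
The responsibilities listed above can be illustrated with a toy in-memory gateway. Quarry is internal and its API is not public, so every class, method, and field name below is hypothetical; the sketch only shows the shape of the idea, in which a service token is checked on each call, job state is held by the gateway rather than by the caller, and each action leaves an audit record.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Job:
    job_id: str
    engine: str
    owner: str
    state: str = "RUNNING"


@dataclass
class Gateway:
    """Toy gateway; hypothetical names, not Quarry's real interface."""
    tokens: dict   # service token -> calling service name
    engines: set   # engine names this gateway accepts
    jobs: dict = field(default_factory=dict)
    audit_log: list = field(default_factory=list)

    def _authenticate(self, token):
        try:
            return self.tokens[token]
        except KeyError:
            raise PermissionError("unknown service token") from None

    def submit(self, token, engine, spec):
        owner = self._authenticate(token)
        if engine not in self.engines:
            raise ValueError(f"unsupported engine: {engine}")
        job = Job(job_id=uuid.uuid4().hex, engine=engine, owner=owner)
        self.jobs[job.job_id] = job  # state lives on the server side
        self.audit_log.append((owner, "submit", job.job_id, spec))
        return job.job_id

    def status(self, token, job_id):
        self._authenticate(token)
        return self.jobs[job_id].state

    def cancel(self, token, job_id):
        owner = self._authenticate(token)
        self.jobs[job_id].state = "CANCELLED"
        self.audit_log.append((owner, "cancel", job_id, None))


if __name__ == "__main__":
    gateway = Gateway(tokens={"tok-airflow": "airflow"},
                      engines={"yarn", "trino", "snowflake"})
    job_id = gateway.submit("tok-airflow", "yarn",
                            {"script": "s3://bucket/command.sh"})
    # A restarted client only needs the job ID to pick the job up again.
    print(gateway.status("tok-airflow", job_id))
    gateway.cancel("tok-airflow", job_id)
    print(gateway.status("tok-airflow", job_id))
```

Because the caller holds nothing but a job ID and a token, a restarted Airflow worker can query or cancel the same job, which is the property the SSH session model lacked.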

Operational numbers and configuration

  • yarn.nodemanager.vmem-check-enabled: false — applied across all clusters per AWS best practices, after vmem-check failures surfaced on jobs that had been silently exceeding limits under SSH.
  • 8 EMR data regions, each requiring its own Quarry configuration, cluster endpoints, and network routing rules.
  • 7 operator types deprecated (named ones in post: CrunchExecOperator, S3SyncOperator).
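
The vmem-check setting above can be expressed as an EMR configuration classification. The snippet is a sketch of that standard format only; the variable name is hypothetical, and the article does not say how Slack applies the setting (at cluster creation, through infrastructure-as-code, or otherwise).

```python
import json

# EMR configuration classification that turns off the NodeManager
# virtual-memory check. Physical-memory enforcement is left untouched.
YARN_SITE_OVERRIDE = [
    {
        "Classification": "yarn-site",
        "Properties": {"yarn.nodemanager.vmem-check-enabled": "false"},
    }
]

if __name__ == "__main__":
    # Emit the JSON form that would be supplied as cluster configuration.
    print(json.dumps(YARN_SITE_OVERRIDE, indent=2))
```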

What was unblocked

  • Spark-on-Kubernetes migration — required eliminating SSH dependencies first.
  • Whitecastle initiative — moving the last main-account EMR clusters to child accounts (cross-account audit boundary).
  • EMR-on-EKS adoption — also blocked on SSH dependencies.
  • Standardised job submission — consolidated all submission through Quarry, so future engine swaps don't need per-team changes.

Caveats and what's not disclosed

  • No exact $ savings figure, just qualitative "better security, operational stability, and infrastructure flexibility."
  • No public API or open-source release of Quarry — it remains an internal Slack system. Architecturally, similar shapes are openly described in the literature (see patterns/rest-gateway-for-compute-engine-job-submission for the generalised pattern).
  • No detail on how Trino + Snowflake adapters work inside Quarry — only the YARN/DistShell path is explained.
  • No quantified incident-rate change for the data platform pre/post migration; the customer-impact framing is "zero downtime for business-critical services" during the rollout rather than a year-over-year reliability number.
  • No discussion of how Quarry handles auth-token rotation at scale (700+ jobs × 8 regions implies a non-trivial token surface that replaces the SSH-key surface).
  • YARN Distributed Shell's resource-allocation model for arbitrary shell commands is described qualitatively. The article does not disclose how Slack sizes container memory / vCores per shell-job class, or whether they had to build a size-recommendation system.

What this canonicalises in the wiki

This is the wiki's first source on:

It also extends:

  • concepts/observability-before-migration: the Analytics-dashboard-on-Airflow-DB approach is a new instance of "build monitoring before you migrate", at the project-progress altitude rather than the transport-protocol-probes altitude (the prior Slack HTTP/3 instance).
  • concepts/attack-surface-minimization: eliminating SSH across 8 regions × 700+ jobs is a textbook attack-surface-shrink project. The structural-blocker framing ("we couldn't move forward on any infrastructure modernization") is a useful complement to Meta's "feature gating" WhatsApp framing.
  • concepts/long-lived-key-risk: SSH keys distributed across Kubernetes orchestration workers are the canonical "long-lived SSH key" risk class; this article is the wiki's first large-scale industrial elimination of that class via service-token replacement.
  • concepts/audit-trail: "replaced SSH key distribution with service-to-service token authentication, and gained proper audit trails through REST API logging." Quarry's per-submission logs are the new audit substrate; "No more 'who ran that command?' mysteries."

Source