CONCEPT Cited by 1 source

SSH job execution anti-pattern

Definition

The SSH job execution anti-pattern is the use of direct SSH into compute clusters as the submission and lifecycle substrate for batch jobs — typically because SSH was the simplest path when the platform was first built, and the choice ossified. It looks like this:

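from airflow.providers.ssh.operators.ssh import SSHOperator

# Airflow task that opens an SSH session to the host behind the
# 'emr_master' connection and runs spark-submit there.
task = SSHOperator(
    task_id='run_spark_job',
    ssh_conn_id='emr_master',
    command='spark-submit /path/to/job.py',
)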

It works at small scale. It accumulates structural debt at medium scale. At large scale it becomes a structural blocker to any infrastructure modernisation, as Slack's 2026-05-05 retrospective documents in detail.

The four classes of cost

The post enumerates the costs that grow with scale:

1. Security debt

  • Direct SSH access to compute clusters increases the potential attack surface (concepts/attack-surface-minimization is the inverse discipline).
  • Key distribution and rotation across orchestration workers adds operational overhead. The keys are typically long-lived (see concepts/long-lived-key-risk for the priority-ladder framing).
  • Auditing is coarse: answering who ran a given command requires correlating logs across multiple systems. "No more 'who ran that command?' mysteries" is the post-migration improvement (concepts/audit-trail).
  • Permission management grows complex, requiring dedicated security groups and custom configurations.

2. Operational fragility

  • Stateful connection failure modes. When Kubernetes pods restart, SSH connections break and jobs fail.
  • Zombie processes. Long-running jobs become "zombies" that keep executing after their connections terminate.
  • Status ambiguity. "No reliable way to determine if a job succeeded or failed when connections dropped (not ideal when you're processing terabytes)."
  • Resource contention on the master node. All jobs compete on a single shared host instead of being distributed via the cluster's resource manager — see concepts/master-node-resource-contention.
  • Resource-enforcement bypass. SSH-executed commands silently exceed limits that the resource manager would normally enforce, only to fail dramatically once those limits start being applied — see concepts/resource-enforcement-bypass-via-ssh.
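
The status-ambiguity failure mode above can be made concrete with a small model. This is a minimal sketch, not code from the post: it assumes the orchestrator's only signal is the exit status reported over the SSH channel, and that no status arrives when the connection drops first. `Outcome` and `classify_ssh_result` are hypothetical names introduced for illustration.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    UNKNOWN = "unknown"  # the remote job may still be running


def classify_ssh_result(exit_status: Optional[int]) -> Outcome:
    """Map what the SSH client observed to a job outcome.

    exit_status is None when the connection dropped before the remote
    command reported an exit code (assumption made for this sketch).
    """
    if exit_status is None:
        # No exit code was delivered, so the client cannot tell success
        # from failure, or even whether the job has stopped.
        return Outcome.UNKNOWN
    return Outcome.SUCCEEDED if exit_status == 0 else Outcome.FAILED
```

The third outcome is the important one: a retry issued on `UNKNOWN` can overlap with a job that is still executing on the master node.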

3. Structural-modernisation blocker

This is the highest-leverage cost and the one most often underweighted at small scale. Quoting the post:

"We were blocked: Couldn't start the path for Spark on Kubernetes nor EMR on EKS (required eliminating SSH dependencies first). Couldn't complete our Whitecastle initiative because we needed to move the last main-account EMR clusters to child accounts. Couldn't implement proper job monitoring and observability."

SSH-based execution couples what runs the job to which host you can reach, and any substrate change has to undo that coupling before it can proceed. The real cost of leaving SSH in place is not the day-to-day operational pain but the year-long projects that cannot start.

4. Loss of multi-tenant resource isolation

SSH commands run directly on the master node, with no container isolation, no per-job CPU/memory limits, and no proper retry semantics. Each job is silently a noisy-neighbor risk for every other job sharing the master.

The shape of the anti-pattern

It typically arises through incremental proliferation, not deliberate architecture:

"This pattern proliferated across the platform. Teams built custom SSH-based operators for different use cases (because hey, if SSH works for Spark, why not everything else). By the time we took stock, we had 700+ jobs in production running everything from MapReduce jobs to AWS CLI commands to custom Python scripts."

Once dozens of operator types exist, switching costs are prohibitive without a unified replacement. Slack's resolution was to build Quarry as a single REST gateway to which every SSH-based operator could be migrated, so that the SSH operators could then be retired.

How the anti-pattern hides its costs

Two of Slack's challenges illuminate this:

  • vmem-check failures. SSH bypassed YARN's resource enforcement, so jobs had been silently exceeding memory limits. They only failed once Quarry submitted them properly. "SSH hides a lot of problems."
  • EKM connectivity timeouts. SSH had been silently piggy-backing on the master node's network configuration, so jobs had hidden network-topology dependencies that were undocumented and opaque.
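
The vmem-check case comes down to simple arithmetic. The following is a rough sketch, assuming stock YARN NodeManager behaviour: when `yarn.nodemanager.vmem-check-enabled` is true, a container is killed once its virtual memory exceeds its allocated physical memory multiplied by `yarn.nodemanager.vmem-pmem-ratio` (2.1 by default). The function names and the numbers used are illustrative; the post does not state Slack's actual settings.

```python
def vmem_limit_mb(container_pmem_mb: int, vmem_pmem_ratio: float = 2.1) -> float:
    """Virtual-memory ceiling YARN applies to a container, in MB."""
    return container_pmem_mb * vmem_pmem_ratio


def would_fail_vmem_check(
    vmem_used_mb: float,
    container_pmem_mb: int,
    vmem_pmem_ratio: float = 2.1,
    check_enabled: bool = True,
) -> bool:
    """True if the NodeManager would kill the container for virtual memory.

    A process launched over SSH on the master node is not a YARN
    container, so this check is never evaluated for it.
    """
    if not check_enabled:
        return False
    return vmem_used_mb > vmem_limit_mb(container_pmem_mb, vmem_pmem_ratio)
```

For example, a job using 5000 MB of virtual memory in a 2048 MB container exceeds the roughly 4300 MB ceiling, which is consistent with jobs that ran unnoticed over SSH and then failed once they were submitted through the resource manager.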

The anti-pattern's most insidious property is that the substrate works — it just papers over things you'd want to know about.

The replacement architecture

The architectural alternative is concepts/rest-based-job-submission, typically implemented as a single REST gateway in front of YARN, Trino, Snowflake, and shell runners (with YARN Distributed Shell covering the arbitrary-shell case).
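
One practical consequence of REST-based submission is that job status comes from the resource manager rather than from a connection that may have dropped. The sketch below is not Quarry's API, which the post does not describe. It assumes direct access to the YARN ResourceManager REST endpoint `/ws/v1/cluster/apps/{appid}`, whose response carries `state` and `finalStatus` fields. `interpret_app`, `fetch_app`, and the `rm_url` parameter are hypothetical names.

```python
import json
import urllib.request

# YARN application states after which the outcome no longer changes.
TERMINAL_STATES = {"FINISHED", "FAILED", "KILLED"}


def interpret_app(app: dict) -> str:
    """Reduce a YARN application record to one of three outcomes."""
    state = app.get("state")
    final_status = app.get("finalStatus")
    if state not in TERMINAL_STATES:
        return "in_progress"
    if state == "FINISHED" and final_status == "SUCCEEDED":
        return "succeeded"
    return "failed"


def fetch_app(rm_url: str, app_id: str) -> dict:
    """Fetch one application record from the ResourceManager.

    rm_url is an assumed base URL such as http://<rm-host>:8088.
    """
    url = f"{rm_url}/ws/v1/cluster/apps/{app_id}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)["app"]
```

Because the state lives in the resource manager, an orchestrator worker that restarts can call `fetch_app` again and resume polling instead of guessing whether the job finished.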

Seen in
