
PATTERN Cited by 1 source

REST gateway for compute-engine job submission

Pattern

Place a single REST gateway between job-submitting clients (orchestrators, services, scheduled tasks) and the heterogeneous mix of compute engines the org actually runs (Hadoop/YARN, Trino-class SQL engines, Snowflake-class warehouses, arbitrary shell executors). The gateway:

  • Accepts job submissions via POST and returns a job ID.
  • Forwards the submission to the right backend engine.
  • Tracks job state server-side for GET status and DELETE cancel operations.
  • Owns authentication, audit logging, and observability.

All clients deal with one auth model, one URL surface, one log schema, one cancellation contract. The diversity of compute engines is hidden behind a single resource model.
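
The resource model above can be sketched in memory. This is a minimal illustration, not Quarry's code: the `Gateway` class, the `JobRecord` fields, the state names, and the `/jobs` paths in the comments are all assumptions made for the example, and the adapters are stand-ins that return fake engine handles.

```python
import uuid
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical adapter signature: takes a job spec, returns the engine's own handle.
Adapter = Callable[[dict], str]


@dataclass
class JobRecord:
    job_id: str
    engine: str
    engine_handle: str
    state: str = "SUBMITTED"  # state names are illustrative


class Gateway:
    """In-memory stand-in for the gateway's single resource model."""

    def __init__(self, adapters: Dict[str, Adapter]):
        self.adapters = adapters
        self.jobs: Dict[str, JobRecord] = {}

    def submit(self, spec: dict) -> str:
        """Roughly POST /jobs: forward to the right engine, return a job ID."""
        engine = spec["engine"]
        if engine not in self.adapters:
            raise ValueError(f"no adapter for engine {engine!r}")
        handle = self.adapters[engine](spec)
        job_id = uuid.uuid4().hex
        self.jobs[job_id] = JobRecord(job_id, engine, handle)
        return job_id

    def status(self, job_id: str) -> str:
        """Roughly GET /jobs/{id}: state lives server-side, keyed by job ID."""
        return self.jobs[job_id].state

    def cancel(self, job_id: str) -> str:
        """Roughly DELETE /jobs/{id}: same contract for every engine."""
        record = self.jobs[job_id]
        record.state = "CANCELLED"
        return record.state


gateway = Gateway({
    "yarn": lambda spec: "yarn-handle-1",    # fake handles for the demo
    "shell": lambda spec: "shell-handle-1",
})
job_id = gateway.submit({"engine": "yarn", "command": "spark-submit job.py"})
print(gateway.status(job_id))  # SUBMITTED
print(gateway.cancel(job_id))  # CANCELLED
```

Because the client only ever holds a job ID, it never needs to know which engine handle sits behind it.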

Canonical wiki instance

Slack's Quarry is the wiki's canonical instance, documented in the 2026-05-05 retrospective (sources/2026-05-05-slack-from-ssh-to-rest-a-security-driven-modernization-of-slacks-emr-data-pipelines). Quarry sits between Airflow and three backends: YARN, Trino, and Snowflake.

The architecture shift is, verbatim:

Before: Airflow → SSH Connection → EMR Master Node → Execute Command

After: Airflow → Quarry REST API → YARN ResourceManager → EMR Container
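
From the client's side, the "after" path reduces to three HTTP requests. The sketch below only constructs them with the standard library and does not send anything. The base URL, the token, the JSON fields, and the `/jobs/{id}` path shape are hypothetical; the source does not document Quarry's actual API surface.

```python
import json
import urllib.request

# Hypothetical endpoint and credential; neither comes from the source.
BASE_URL = "https://gateway.example.internal/jobs"
TOKEN = "example-service-token"


def build_request(method, job_id="", body=None):
    """Build (but do not send) a gateway request with a bearer token."""
    url = f"{BASE_URL}/{job_id}" if job_id else BASE_URL
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(url, data=data, method=method)
    req.add_header("Authorization", f"Bearer {TOKEN}")
    if data is not None:
        req.add_header("Content-Type", "application/json")
    return req


submit = build_request("POST", body={"engine": "yarn", "command": "spark-submit etl.py"})
poll = build_request("GET", job_id="job-123")
cancel = build_request("DELETE", job_id="job-123")

# Passing any of these to urllib.request.urlopen would actually send it.
for req in (submit, poll, cancel):
    print(req.get_method(), req.full_url)
```

No SSH key or host name of a master node appears anywhere in the client; the only credential is the token presented to the gateway.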

Why one gateway, not N adapters

Without a gateway, each job-type/engine pair grows its own client wrapper, its own auth integration, its own log format, its own cancellation idiom. The matrix is clients × engines and every cell needs maintenance.
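
The size of that matrix, and what a gateway does to it, is simple arithmetic. The counts below are illustrative only and are not figures from the post.

```python
def integrations_without_gateway(clients: int, engines: int) -> int:
    # Every client needs its own wrapper for every engine.
    return clients * engines


def integrations_with_gateway(clients: int, engines: int) -> int:
    # Each client integrates once with the gateway,
    # and each engine needs one adapter inside the gateway.
    return clients + engines


clients, engines = 4, 3  # made-up counts
print(integrations_without_gateway(clients, engines))  # 12
print(integrations_with_gateway(clients, engines))     # 7
```

Adding a fourth engine costs four new wrappers in the first case and one new adapter in the second.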

With a gateway, the matrix collapses: every client speaks Quarry's REST API. A new engine adapter is immediately available to every client, and a new client immediately gets every existing engine adapter. The post's framing:

"With YARN Distributed Shell support, Quarry became our universal job submission gateway. Whether you're running a Spark job, a Hive query, or a simple shell script, it all goes through the same REST API."

What the gateway owns (concretely)

Five concerns, per Quarry's documented role:

  1. Authentication — service-to-service tokens at the gateway edge, replacing per-worker SSH keys (see concepts/attack-surface-minimization for why this is the high-leverage move).
  2. Job submission — REST forwarding to YARN, Trino, Snowflake.
  3. State tracking — server-side state addressable by job ID, so client crashes don't lose jobs (see concepts/rest-based-job-submission).
  4. Lifecycle management — clean cancellation + resource cleanup via DELETE.
  5. Observability — structured logs, metrics, and tracing for every submission. Quarry's logs become the single audit trail across all engines.
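
A rough sketch of how concerns 1, 2, and 5 meet at the gateway edge: one token check and one log schema in front of every engine. The token table, the `audit` helper, the event names, and the adapter functions are all hypothetical; the source does not describe Quarry's internals at this level.

```python
import json
import time
from typing import Callable, Dict, List

VALID_TOKENS = {"token-airflow": "airflow"}  # hypothetical token -> caller name
AUDIT_LOG: List[str] = []                    # stand-in for a real log sink


def audit(event: str, caller: str, **fields) -> None:
    """Append one structured record; same schema regardless of engine."""
    record = {"ts": time.time(), "event": event, "caller": caller, **fields}
    AUDIT_LOG.append(json.dumps(record, sort_keys=True))


def submit(token: str, engine: str, command: str,
           adapters: Dict[str, Callable[[str], str]]) -> str:
    """Authenticate at the edge, forward to the engine, log the outcome."""
    caller = VALID_TOKENS.get(token)
    if caller is None:
        audit("submit_rejected", "unknown", engine=engine)
        raise PermissionError("invalid service token")
    handle = adapters[engine](command)
    audit("submit_accepted", caller, engine=engine, handle=handle)
    return handle


adapters = {"yarn": lambda cmd: "yarn-handle-1"}  # fake adapter for the demo
submit("token-airflow", "yarn", "spark-submit etl.py", adapters)
try:
    submit("stolen-or-stale", "yarn", "rm -rf /", adapters)
except PermissionError:
    pass

for line in AUDIT_LOG:
    print(line)
```

Both the accepted and the rejected attempt land in the same trail, which is the point of routing every submission through one place.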

Why it's a migration vehicle, not just an architecture

The gateway pattern's load-bearing operational property is that you can migrate to it incrementally. Slack didn't replace 700+ jobs in a big-bang switchover; they migrated one operator type at a time (see patterns/incremental-operator-by-operator-migration) over 3 quarters across 8 regions with zero downtime. The gateway is the destination and the staging ground — old SSH-based jobs and new REST-based jobs coexist, indexed by operator type, until the last SSH operator is deprecated.
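
Tracking an operator-by-operator migration comes down to counting what still runs on the legacy path, grouped by operator type. The sketch below works on an in-memory list; the task names and operator class names are invented for illustration, and a real deployment would presumably derive the rows from the orchestrator's metadata database instead.

```python
from collections import Counter

# Hypothetical (task_id, operator type) rows; all names are made up.
tasks = [
    ("load_events", "SSHSparkOperator"),
    ("daily_rollup", "RestSparkOperator"),
    ("export_users", "SSHHiveOperator"),
    ("sessionize", "SSHSparkOperator"),
    ("cleanup_tmp", "RestShellOperator"),
]
LEGACY_OPERATORS = {"SSHSparkOperator", "SSHHiveOperator"}


def remaining_by_operator(rows, legacy):
    """Count tasks still on a legacy operator, per operator type."""
    return Counter(op for _, op in rows if op in legacy)


remaining = remaining_by_operator(tasks, LEGACY_OPERATORS)
migrated = len(tasks) - sum(remaining.values())

for operator, count in remaining.most_common():
    print(f"{operator}: {count} task(s) left")
print(f"migrated {migrated} of {len(tasks)}")
```

When an operator type's count reaches zero, that operator can be deprecated without touching the others.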

The post identifies this as decisive for projects this size:

"Build monitoring before you migrate. Set up tracking dashboards early so you always know what's left to migrate. Airflow database queries made it easy to identify remaining work. Progress visibility kept the project moving."

When to use this pattern

Strong fit when all of these are true:

  • You have multiple compute engines you submit jobs to.
  • Today's submission paths are heterogeneous (SSH for one, framework REST for another, custom RPC for a third).
  • Auth is a per-engine concern, often with long-lived credentials distributed widely.
  • Audit/observability is scattered across engines.
  • You want to evolve the engine fleet without coupling every client to every engine.

Weaker fit when:

  • You have one compute engine and its native REST API is already adequate. (You can use that directly.)
  • The submission rate is high enough that latency from an extra hop matters more than the security/observability gains.
  • You don't have an organisation-wide auth substrate that the gateway can adopt — building one just for the gateway is a parallel project that may not pay back.

Composition with other patterns

The gateway alone does not cover heterogeneous workloads: the universality property requires a shell-runner backend for arbitrary commands. Slack's discovery of YARN Distributed Shell was what made one gateway for everything viable; see patterns/yarn-distributed-shell-as-universal-shell-executor.

Closely paired migration patterns:

Failure modes

  • Gateway becomes a bottleneck — single REST surface for every job in the company; needs HA + horizontal scaling treatment from day one.
  • Gateway becomes a control-plane SPOF — auth/audit decisions all flow through it; outage = nothing submits.
  • Adapter sprawl in the gateway — each backend's quirks leak into the gateway codebase if the abstraction isn't carefully designed.
  • Token-rotation surface replaces key-rotation surface — the security improvement is real but not zero-cost; you need automated token issuance and revocation tooling. (Slack does not disclose how Quarry handles this.)
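
To make the token-rotation cost concrete, here is one possible shape for short-lived, revocable service tokens: an HMAC-signed payload with an expiry, plus a revocation set. This is a generic technique chosen for illustration, not a description of Quarry, whose token handling is undisclosed. The signing key, the token format, and the function names are assumptions, and the format assumes service names contain no colon.

```python
import hashlib
import hmac

SECRET = b"demo-signing-key"  # hypothetical; a real key would come from a secret store
REVOKED = set()               # stand-in for a shared revocation list


def _sign(payload: str) -> str:
    return hmac.new(SECRET, payload.encode("utf-8"), hashlib.sha256).hexdigest()


def issue_token(service: str, ttl_seconds: int, now: float) -> str:
    """Issue 'service:expiry:signature' valid for ttl_seconds from now."""
    payload = f"{service}:{int(now) + ttl_seconds}"
    return f"{payload}:{_sign(payload)}"


def verify_token(token: str, now: float) -> bool:
    """Accept only well-formed, correctly signed, unrevoked, unexpired tokens."""
    parts = token.split(":")
    if len(parts) != 3:
        return False
    service, expires, sig = parts
    expected = _sign(f"{service}:{expires}")
    if not hmac.compare_digest(sig.encode("utf-8"), expected.encode("utf-8")):
        return False
    if token in REVOKED:
        return False
    return now < int(expires)


token = issue_token("airflow", ttl_seconds=60, now=1000.0)
print(verify_token(token, now=1030.0))  # True: inside the 60-second window
print(verify_token(token, now=1061.0))  # False: expired
REVOKED.add(token)
print(verify_token(token, now=1030.0))  # False: revoked before expiry
```

Short lifetimes bound the damage of a leaked token, but they only work if issuance is automated, which is the tooling cost this failure mode refers to.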

Seen in
