CONCEPT Cited by 1 source

Resource enforcement bypass via SSH

Definition

When jobs are executed via direct SSH onto a cluster's master node, they bypass the cluster resource manager's enforcement of per-job memory/CPU limits. Jobs that should have been rejected for exceeding their resource budget run anyway, until migration to a properly enforced submission path (REST-via-resource-manager) surfaces every previously silent violation at once.

This is a latent failure mode of the SSH job execution anti-pattern that only becomes visible at the point of migration off SSH.

The canonical instance: vmem checks

Slack's 2026-05-05 retrospective describes the vmem-check case in its own words:

"During migration of a data export DAG, we hit unexpected failures. Jobs that'd been running fine via SSH were now failing with vmem (virtual memory) check errors. What gives?

The root cause: SSH commands ran directly on the master node, bypassing YARN's resource enforcement entirely. Quarry submits jobs properly to YARN, which actually enforces resource limits. The vmem check was rejecting containers that exceeded virtual memory limits (which SSH had been quietly ignoring)."

The fix Slack applied, also verbatim — "Following AWS best practices, we disabled vmem checks across all clusters":

"yarn.nodemanager.vmem-check-enabled": "false"

The justification: "AWS explicitly recommends this because virtual memory accounting in Linux can be unreliable, and physical memory limits are sufficient. (Also, it's worth noting that vmem checks have been a source of spurious failures for years in the Hadoop ecosystem.)"
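To make the check concrete: YARN's NodeManager compares a container's virtual memory against its physical allocation multiplied by yarn.nodemanager.vmem-pmem-ratio (2.1 by default), and kills the container when the check is enabled and the ceiling is exceeded. The sketch below models that comparison and builds a configuration entry carrying the property quoted above. It assumes the property is delivered through EMR's yarn-site configuration classification; the function names and sample numbers are illustrative and do not come from the Slack post.

```python
import json

# YARN's documented default for yarn.nodemanager.vmem-pmem-ratio.
DEFAULT_VMEM_PMEM_RATIO = 2.1


def vmem_limit_mb(container_pmem_mb, ratio=DEFAULT_VMEM_PMEM_RATIO):
    """Virtual-memory ceiling derived from a container's physical allocation."""
    return container_pmem_mb * ratio


def violates_vmem_check(container_pmem_mb, observed_vmem_mb,
                        ratio=DEFAULT_VMEM_PMEM_RATIO, check_enabled=True):
    """Return True if the modelled vmem check would reject the container."""
    if not check_enabled:
        # Equivalent to yarn.nodemanager.vmem-check-enabled = false.
        return False
    return observed_vmem_mb > vmem_limit_mb(container_pmem_mb, ratio)


def yarn_site_vmem_disabled():
    """Configuration entry (assumed yarn-site classification) disabling the check."""
    return {
        "Classification": "yarn-site",
        "Properties": {"yarn.nodemanager.vmem-check-enabled": "false"},
    }


# Illustrative numbers: a 4096 MB container whose process maps 9000 MB of vmem.
print(violates_vmem_check(4096, 9000))                       # True
print(violates_vmem_check(4096, 9000, check_enabled=False))  # False
print(json.dumps([yarn_site_vmem_disabled()], indent=2))
```

With the check disabled, only the physical-memory limit remains in force, which matches the justification quoted above.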

The general shape

The vmem case is one instance of a broader pattern: policy is enforced only at the point where a substrate intercepts job submission. When the submission path goes around the substrate (SSH to the master node), the policy is silently absent. Migrating to the substrate's intended submission path (YARN's REST API) re-applies the policy, and every previously non-compliant job fails on day one.
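A toy model of that pattern, as a minimal sketch: the limit lives only on the substrate's submit() path, so any route that starts work without calling submit() never meets it. ResourceManager, Job, and run_via_ssh are hypothetical names invented for illustration; they do not correspond to YARN or Quarry APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    memory_mb: int


class PolicyViolation(Exception):
    """Raised when a job exceeds the limit on the enforced path."""


class ResourceManager:
    """Hypothetical substrate: policy is applied in submit() and nowhere else."""

    def __init__(self, max_memory_mb):
        self.max_memory_mb = max_memory_mb
        self.running = []

    def submit(self, job):
        # The only place the limit is checked.
        if job.memory_mb > self.max_memory_mb:
            raise PolicyViolation(
                f"{job.name}: {job.memory_mb} MB > {self.max_memory_mb} MB"
            )
        self._launch(job)

    def _launch(self, job):
        self.running.append(job.name)


def run_via_ssh(rm, job):
    """Stand-in for SSH-to-master: starts the work without going through submit()."""
    rm._launch(job)


rm = ResourceManager(max_memory_mb=1024)
export = Job("data-export", memory_mb=4096)

run_via_ssh(rm, export)   # runs; nothing ever checks the limit
try:
    rm.submit(export)     # the same job on the enforced path
except PolicyViolation as err:
    print("rejected:", err)
```

The job itself is unchanged between the two calls; only the path differs, which is why the violation stays invisible until the path changes.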

This generalises beyond YARN/vmem:

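Admission-time checks in Kubernetes (ResourceQuota, LimitRange, scheduler placement) are bypassed by using kubectl exec to start extra work in a pod that is already running, vs. submitting a fresh pod through the scheduler. The exec'd process still shares the container's cgroup limits, but nothing re-evaluates whether the new work should have been admitted.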
Database connection limits are bypassed by process-internal connection pools that the DB doesn't see, vs. the gateway-enforced limits at PgBouncer or similar.
Network egress controls are bypassed by direct curl from inside a host that has the egress rules disabled, vs. submitting through a proxy that applies them.

The structural lesson, from the retrospective:

"When migrating from SSH to proper YARN submission, expect to encounter resource limit issues that were previously invisible. SSH hides a lot of problems."

This concept is the resource-enforcement face of a broader SSH-substrate failure mode:

Resource enforcement bypass (this page) — limits aren't applied.
Network-topology enforcement bypass — the EKM-connectivity story from the same Slack post: SSH had been piggy-backing on the master's network routing; migration to a different cluster surfaced a previously-hidden network dependency.
Audit-trail bypass — see audit trail; SSH-executed commands don't surface in any structured per-submission log.

All three share the structure: the SSH path papers over a property the substrate would otherwise enforce, and the property's absence is undetectable until the path changes.

Migration implication

The discipline this concept implies, from the post:

"Test thoroughly in dev environments before production rollout. […] Recommendation: Include resource limit testing in the initial pilot migration phase. SSH bypasses a lot of stuff, and you want to know what that stuff is before it bites you in production."

The migration itself functions as an audit of every previously undetected resource violation across the entire job catalogue. This parallels concepts/observability-before-migration in spirit (build the visibility before the cutover so the cutover doesn't become a multi-week debugging exercise), but on the resource-enforcement axis rather than the metrics-pipeline axis.
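One way the pilot-phase recommendation could look in practice, as a sketch under stated assumptions: before cutover, compare each job's measured peak usage (collected while it still runs over SSH) with the container size it would request once enforcement applies. The JobProfile fields, the sample catalogue, and the idea that such measurements are available are all hypothetical; the post does not describe this tooling.

```python
from typing import NamedTuple


class JobProfile(NamedTuple):
    name: str
    requested_mb: int       # container size the job would request after migration
    observed_peak_mb: int   # peak usage measured while it still ran over SSH


def find_latent_violations(catalogue):
    """Jobs whose observed peak exceeds their request, worst overshoot first."""
    violations = [j for j in catalogue if j.observed_peak_mb > j.requested_mb]
    return sorted(
        violations,
        key=lambda j: j.observed_peak_mb - j.requested_mb,
        reverse=True,
    )


# Hypothetical catalogue for illustration only.
catalogue = [
    JobProfile("data-export", requested_mb=4096, observed_peak_mb=9000),
    JobProfile("daily-rollup", requested_mb=8192, observed_peak_mb=6000),
    JobProfile("backfill", requested_mb=2048, observed_peak_mb=2500),
]

for job in find_latent_violations(catalogue):
    overshoot = job.observed_peak_mb - job.requested_mb
    print(f"{job.name}: over by {overshoot} MB")
```

Running a check like this during the pilot turns the day-one failure list into a known backlog rather than a surprise.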

Seen in

sources/2026-05-05-slack-from-ssh-to-rest-a-security-driven-modernization-of-slacks-emr-data-pipelines — canonical wiki source. The vmem-check story is documented as Challenge 1 of the migration; AWS's yarn.nodemanager.vmem-check-enabled: false recommendation is the production-applied fix.

concepts/ssh-job-execution-anti-pattern — the broader anti-pattern this is one face of.
concepts/master-node-resource-contention — the resource-allocation face of the same SSH bypass.
concepts/rest-based-job-submission — the architectural alternative.
concepts/observability-before-migration — sibling discipline at a different axis.
systems/apache-yarn — the substrate whose enforcement was being bypassed.
systems/slack-quarry — the gateway whose introduction surfaced the latent violations.
patterns/rest-gateway-for-compute-engine-job-submission — the migration target.