PATTERN Cited by 1 source
OOM-aware VM-restart autoscaling¶
OOM-aware VM-restart autoscaling is the pattern of instrumenting the autoscaler with a task-level out-of-memory detector that triggers vertical scaling (a larger VM) for the affected task and retries the task on the new VM, turning what would otherwise be a job failure into a transparently absorbed workload-variance event.
Canonical production instance (Source: sources/2026-05-06-databricks-rethinking-distributed-systems-for-serverless-performance):
"When a task encounters an out-of-memory error, the autoscaler automatically detects it, restarts the task on a larger VM, and continues the job with no manual intervention or job failure required."
Implemented in the Databricks Serverless Autoscaler.
The control loop¶
┌──────────────────────────┐
│  Task executes on VM_N   │
└────────────┬─────────────┘
             │
       ┌─────▼─────┐
       │   OOM?    │
       └──┬─────┬──┘
      No  │     │ Yes
   ┌──────┘     │
   │    ┌───────▼────────┐
   │    │  Classify OOM  │
   │    │  (genuine vs   │
   │    │   transient)   │
   │    └───────┬────────┘
   │            │
   │    ┌───────▼────────┐
   │    │  Provision VM  │
   │    │  (size class   │
   │    │   N+1 or more) │
   │    └───────┬────────┘
   │            │
   │    ┌───────▼────────┐
   │    │ Re-execute task│
   │    │   on new VM    │
   │    └───────┬────────┘
   │            │
   └────────────┴──────────► Job continues
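A minimal sketch of this loop in Python, with a simulated task standing in for real execution. `VM`, `run_task`, the 16 GB base size, and the doubling size ladder are all illustrative assumptions; the Databricks post describes the behaviour, not the interface:

```python
from dataclasses import dataclass

@dataclass
class VM:
    size_class: int  # 0 = default size; each step up doubles memory

    @property
    def memory_gb(self) -> int:
        return 16 * (2 ** self.size_class)  # assumed 16 GB base, doubling ladder

class OOMError(Exception):
    """Raised when a task's footprint exceeds its VM's memory envelope."""

def run_task(footprint_gb: int, vm: VM) -> str:
    # Stand-in for real task execution: OOM iff the footprint doesn't fit.
    if footprint_gb > vm.memory_gb:
        raise OOMError(f"needs {footprint_gb} GB, VM has {vm.memory_gb} GB")
    return "task succeeded"

def execute_with_oom_recovery(footprint_gb: int, max_upsizes: int = 2) -> str:
    """The control loop above: on OOM, provision a larger VM and re-execute."""
    vm = VM(size_class=0)  # sized for the common case, not the worst case
    for attempt in range(max_upsizes + 1):
        try:
            return run_task(footprint_gb, vm)  # happy path: job continues
        except OOMError:
            if attempt == max_upsizes:
                raise  # bounded escalation; see "Escalation policy" below
            vm = VM(size_class=vm.size_class + 1)  # upsize, then retry

# A 40 GB task OOMs on the 16 GB and 32 GB VMs, then succeeds on 64 GB.
print(execute_with_oom_recovery(footprint_gb=40))
```

The real loop also classifies the OOM before upsizing; that step is covered under Preconditions below.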
Why the pattern matters¶
The alternative — failing the job on OOM — is the traditional Spark behaviour. It's the anti-pattern because it forces users into static-stability-shaped over-provisioning: size the cluster for the worst-case workload footprint so OOMs never happen. This sacrifices the cost benefits of autoscaling.
OOM-aware VM-restart autoscaling inverts this:
- Size for the common case at the default VM class
- Absorb the exceptional case by scaling the specific task that needs it
- Keep the job running — no user intervention, no retry-your-cluster-bigger operational burden
This is the load-bearing mechanism behind the stability-as-system-property design thesis for the memory axis specifically.
Preconditions¶
1. Task idempotency¶
The failed task must be safely re-executable without double-counting / double-writing / duplicating side effects. Spark's functional task model provides this natively for pure transformations; tasks with external side effects need idempotency enforcement at the task boundary.
For workloads that write to external systems (S3, databases, message queues), idempotency typically comes from:
- Transactional writes (commit-on-success, rollback-on-failure)
- Idempotent operation IDs (dedup on write)
- Write-ahead staging + atomic swap (sketched below)
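A minimal sketch of the third approach, staging plus atomic swap, on a POSIX filesystem; `stage_and_commit` is a hypothetical helper, not an API from the source:

```python
import os
import tempfile

def stage_and_commit(data: bytes, final_path: str) -> None:
    """Idempotent write: stage to a temp file, then atomically rename.

    If the task OOMs mid-write and is re-executed on a larger VM, the
    abandoned staging file is never visible to readers; only the atomic
    os.replace() publishes the output, so a retry cannot double-write.
    """
    directory = os.path.dirname(final_path) or "."
    # Stage in the same directory so the rename stays on one filesystem.
    fd, staging_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make the staged bytes durable first
        os.replace(staging_path, final_path)  # atomic swap on POSIX
    except BaseException:
        os.unlink(staging_path)  # drop the half-written staging file
        raise

stage_and_commit(b"partition output", "part-00042.out")
```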
2. Dynamic VM provisioning¶
The platform must provision larger VMs on demand. Cloud-native infrastructure (AWS / Azure / GCP) provides this; fixed-capacity deployments (on-prem, bare-metal) don't.
3. OOM classification¶
Not every OOM should trigger a VM upsize — some are transient memory pressure that clears without structural action. The autoscaler needs a classifier:
- Genuine OOM (data size exceeds VM memory) → upsize
- Transient OOM (spike, GC pressure, leak) → retry on same VM
- Skew OOM (pathological single partition) → might need re-partitioning, not just upsize
The Databricks post doesn't disclose the classifier's specifics; this is the part of the control loop that earns its complexity.
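To make the classification concrete, here is one plausible heuristic over per-task signals. `TaskStats`, its fields, and the 10x skew threshold are invented for illustration and are not Databricks' classifier:

```python
from dataclasses import dataclass
from enum import Enum, auto

class OOMKind(Enum):
    GENUINE = auto()    # data exceeds the VM's memory -> upsize
    TRANSIENT = auto()  # spike / GC pressure -> retry on the same size
    SKEW = auto()       # pathological partition -> repartition, don't upsize

@dataclass
class TaskStats:
    # Invented per-attempt signals a real autoscaler might collect.
    input_bytes: int                 # bytes this task read
    median_sibling_input_bytes: int  # median input across the stage's tasks
    vm_memory_bytes: int             # memory envelope of the current VM
    prior_oom_count: int             # OOMs already seen by this task

def classify_oom(s: TaskStats) -> OOMKind:
    # Skew: this task's input dwarfs its siblings'; a bigger VM only
    # postpones the problem, because the partition itself is pathological.
    if s.input_bytes > 10 * s.median_sibling_input_bytes:
        return OOMKind.SKEW
    # Genuine: the input plausibly cannot fit, or a same-size retry has
    # already failed once, so the memory pressure is structural.
    if s.input_bytes > s.vm_memory_bytes or s.prior_oom_count >= 1:
        return OOMKind.GENUINE
    # Otherwise treat it as transient pressure and retry in place.
    return OOMKind.TRANSIENT
```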
Escalation policy¶
A naive implementation could escalate to the largest VM size repeatedly, blowing the cost envelope. The pattern in practice typically includes:
- Bounded escalation — cap at N size classes above default
- Cooldown — don't escalate further if the task is actually pathological (skew) rather than merely big
- Per-job budget — fail the job if escalation has exceeded a cost threshold
The Databricks post doesn't disclose these specifics either, but production implementations of the pattern inevitably require them.
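A sketch of how those three guards might gate each proposed upsize; the size cap, budget, and `cost_of` pricing function are placeholders, not disclosed values:

```python
def cost_of(size_class: int) -> float:
    # Placeholder per-attempt cost; real pricing would be per VM-hour.
    return 1.0 * (2 ** size_class)

def escalation_allowed(current_size_class: int,
                       oom_kind: str,
                       spent_so_far: float,
                       max_size_class: int = 3,
                       job_budget: float = 20.0) -> bool:
    """Apply the three guards above before approving an upsize."""
    proposed = current_size_class + 1
    if proposed > max_size_class:                # bounded escalation
        return False
    if oom_kind == "skew":                       # pathological task: stop
        return False                             # feeding it larger VMs
    if spent_so_far + cost_of(proposed) > job_budget:  # per-job budget
        return False
    return True

assert escalation_allowed(0, "genuine", spent_so_far=1.0)
assert not escalation_allowed(3, "genuine", spent_so_far=1.0)  # capped
```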
Composition with horizontal scaling¶
OOM-aware VM-restart is the vertical axis of two-axis adaptive autoscaling. It composes with horizontal scaling (add/remove workers) to give a complete adaptive capacity story:
- Horizontal handles parallelism / utilisation
- Vertical (OOM-aware) handles per-task memory envelope
- Together they cover the failure modes each axis alone can't (a toy policy sketch follows)
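In sketch form, the two axes key off different signals; the queue-depth heuristic below is illustrative, not the production policy:

```python
def autoscale_decision(queue_depth: int, workers: int, task_oomed: bool) -> str:
    """Toy two-axis policy: OOM drives the vertical axis,
    backlog per worker drives the horizontal one."""
    if task_oomed:
        return "vertical: upsize the affected task's VM and re-execute"
    if queue_depth > 4 * workers:         # sustained backlog: need parallelism
        return "horizontal: add workers"
    if queue_depth == 0 and workers > 1:  # idle capacity: shed workers
        return "horizontal: remove workers"
    return "hold"
```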
Outcomes¶
Customer outcomes for the combined autoscaler (Source: sources/2026-05-06-databricks-rethinking-distributed-systems-for-serverless-performance):
- CKDelta: 20 min vs 4–5 hr job completion (12–15× speedup)
- Unilever: 2–5× faster, 25% lower ops cost
- HP: 32%+ cloud savings, 36% runtime reduction
The cost savings come largely from not over-provisioning for worst-case memory — the user's cluster-sizing decision is made for the common case, and the autoscaler absorbs variance.
Anti-patterns¶
- Retry-on-same-VM-indefinitely — Spark's default behaviour without this pattern; the task repeatedly OOMs on equal-sized VMs
- Upsize-without-classification — escalates on every signal, including transient pressure → thrashing + cost blow-up
- Upsize-without-retry — scales the cluster but doesn't re-run the failed task; user still sees job failure
- Ignore-task-idempotency — applies the pattern to workloads with non-idempotent side effects → double-writes / duplicated external state
Seen in¶
- sources/2026-05-06-databricks-rethinking-distributed-systems-for-serverless-performance — First canonical wiki instance of OOM-aware VM-restart autoscaling as a named pattern. The Databricks Serverless Autoscaler is the canonical implementation. Combined with two-axis autoscaling as the vertical-axis trigger. Load-bearing for the stability-as-system-property design thesis at the memory axis.
Related¶
- systems/databricks-serverless-autoscaler — canonical production instance
- concepts/adaptive-oom-recovery — the concept this pattern canonicalises
- concepts/vertical-and-horizontal-autoscaling — the parent two-axis adaptive autoscaling
- concepts/idempotent-operations — the precondition for task re-execution
- concepts/memory-overcommit-risk — the failure mode the pattern absorbs
- concepts/graceful-degradation — the user-visible property
- concepts/stability-as-system-property — the design thesis the pattern implements at the memory axis