

OOM-aware VM-restart autoscaling

OOM-aware VM-restart autoscaling is the pattern of instrumenting the autoscaler with a task-level out-of-memory detector that triggers vertical scaling (larger VM) for the affected task and retries the task on the new VM — turning what would otherwise be a job failure into a transparently-absorbed workload variance event.

Canonical production instance (Source: sources/2026-05-06-databricks-rethinking-distributed-systems-for-serverless-performance):

"When a task encounters an out-of-memory error, the autoscaler automatically detects it, restarts the task on a larger VM, and continues the job with no manual intervention or job failure required."

Implemented in the Databricks Serverless Autoscaler.

The control loop

┌──────────────────────────┐
│   Task executes on VM_N  │
└─────────────┬────────────┘
        ┌─────▼─────┐
        │    OOM?   │
        └─────┬─────┘
           No ├───┐ Yes
    ┌─────────┘   │
    │             ▼
    │      ┌────────────────┐
    │      │ Classify OOM   │
    │      │ (genuine vs    │
    │      │  transient)    │
    │      └──────┬─────────┘
    │             │
    │      ┌──────▼─────────┐
    │      │ Provision VM   │
    │      │ (size class    │
    │      │  N+1 or more)  │
    │      └──────┬─────────┘
    │             │
    │      ┌──────▼─────────┐
    │      │ Re-execute task│
    │      │ on new VM      │
    │      └──────┬─────────┘
    │             │
    └─────────────┴──────────►  Job continues
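
A minimal Python sketch of this loop, assuming a size-class ladder and a caller-supplied task runner; `OOMError`, `SIZE_CLASSES_GB`, and `run_task` are hypothetical stand-ins, since the post doesn't describe the autoscaler's internal APIs. The classification step is omitted here and covered below:

    # Hypothetical stand-ins for platform details the post doesn't disclose.
    SIZE_CLASSES_GB = [16, 32, 64, 128]   # VM memory ladder, class 0 = default

    class OOMError(Exception):
        """Task exhausted its VM's memory."""

    def run_with_oom_upsize(run_task, size_class=0):
        """Retry run_task on progressively larger VMs until it fits or the
        ladder is exhausted. run_task(mem_gb) stands in for task execution."""
        while True:
            try:
                return run_task(SIZE_CLASSES_GB[size_class])
            except OOMError:
                if size_class + 1 >= len(SIZE_CLASSES_GB):
                    raise              # largest class still OOMs: surface the failure
                size_class += 1        # upsize one class and re-execute the task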

Why the pattern matters

The alternative, failing the job on OOM, is the traditional Spark behaviour. It's the anti-pattern because it forces users into static-stability-shaped over-provisioning: size the cluster for the worst-case workload footprint so OOMs never happen. This sacrifices the cost benefits of autoscaling.

OOM-aware VM-restart autoscaling inverts this:

  • Size for the common case at the default VM class
  • Absorb the exceptional case by scaling the specific task that needs it
  • Keep the job running: no user intervention, no retry-with-a-bigger-cluster operational burden

This is the load-bearing mechanism behind the stability-as-system-property design thesis for the memory axis specifically.

Preconditions

1. Task idempotency

The failed task must be safely re-executable without double-counting / double-writing / duplicating side effects. Spark's functional task model provides this natively for pure transformations; tasks with external side effects need idempotency enforcement at the task boundary.

For workloads that write to external systems (S3, databases, message queues), idempotency typically comes from:

  • Transactional writes (commit-on-success, rollback-on-failure)
  • Idempotent operation IDs (dedup on write)
  • Write-ahead staging + atomic swap
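
As one concrete illustration of the staging-plus-atomic-swap approach, here is a local-filesystem sketch in Python. `os.replace` is atomic on POSIX filesystems; object stores like S3 need a different commit primitive (manifest commit, multipart upload), so treat this purely as the shape of the idea:

    import json, os, tempfile

    def idempotent_write(records, final_path):
        """Write-ahead staging + atomic swap: a task retried after an OOM
        either republishes identical output or replaces a partial file,
        so re-execution never duplicates externally visible state."""
        fd, staging = tempfile.mkstemp(dir=os.path.dirname(final_path) or ".")
        with os.fdopen(fd, "w") as f:
            json.dump(records, f)          # stage the complete output first
        os.replace(staging, final_path)    # atomic publish; safe to repeat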

2. Dynamic VM provisioning

The platform must provision larger VMs on demand. Cloud-native infrastructure (AWS / Azure / GCP) provides this; fixed-capacity deployments (on-prem, bare-metal) don't.
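
A hypothetical size ladder makes the requirement concrete; the instance families below are ordinary AWS memory-optimised types chosen for illustration, not Databricks' actual mapping:

    # Illustrative ladder of (instance type, memory GiB); the real mapping
    # is an internal detail the post doesn't disclose.
    SIZE_LADDER = [
        ("r6i.xlarge",  32),    # class 0: default
        ("r6i.2xlarge", 64),    # class 1
        ("r6i.4xlarge", 128),   # class 2
        ("r6i.8xlarge", 256),   # class 3: escalation cap
    ]

    def next_size_class(current: int) -> int:
        """Step up one class, refusing to walk off the top of the ladder."""
        if current + 1 >= len(SIZE_LADDER):
            raise RuntimeError("already at the largest size class")
        return current + 1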

3. OOM classification

Not every OOM should trigger a VM upsize — some are transient memory pressure that clears without structural action. The autoscaler needs a classifier:

  • Genuine OOM (data size exceeds VM memory) → upsize
  • Transient OOM (momentary spike, GC pressure, a leak cleared by the restart) → retry on same VM
  • Skew OOM (pathological single partition) → might need re-partitioning, not just upsize

The Databricks post doesn't disclose the classifier's specifics; this is the part of the control loop that earns its complexity.
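
Since the classifier isn't public, the following sketch only shows the decision structure; the signals (`peak_heap_fraction`, `max_partition_skew`, `retry_count`) and thresholds are invented:

    from enum import Enum

    class OOMAction(Enum):
        UPSIZE = "provision larger VM and retry"
        RETRY_SAME = "retry on same size class"
        REPARTITION = "re-partition; upsizing won't fix skew"

    def classify_oom(peak_heap_fraction, max_partition_skew, retry_count):
        """Hypothetical classifier; every signal and threshold is invented."""
        if max_partition_skew > 10.0:        # one partition >>10x the median
            return OOMAction.REPARTITION     # skew OOM: bigger VMs just chase the tail
        if peak_heap_fraction < 0.9 and retry_count == 0:
            return OOMAction.RETRY_SAME      # transient pressure: heap never truly filled
        return OOMAction.UPSIZE              # genuine OOM: data exceeds VM memory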

Escalation policy

A naive implementation could escalate to the largest VM size repeatedly, blowing the cost envelope. The pattern in practice typically includes:

  • Bounded escalation — cap at N size classes above default
  • Cooldown — don't escalate further if the task is actually pathological (skew) rather than merely big
  • Per-job budget — fail the job if escalation has exceeded a cost threshold

The Databricks post doesn't disclose these specifics either, but production implementations of the pattern inevitably require them.
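
A sketch of the three guards combined, as the gate in front of each upsize decision; all fields and thresholds are invented bookkeeping, not disclosed behaviour:

    from dataclasses import dataclass

    @dataclass
    class TaskState:
        size_class: int      # current VM size class
        default_class: int   # class the job started on
        spend: float         # cumulative cost attributed to this job so far
        looks_skewed: bool   # classifier flagged pathological partition skew

    def should_escalate(s: TaskState, max_extra_classes=2, cost_budget=100.0):
        """Return True only if another upsize passes all three guards."""
        if s.size_class - s.default_class >= max_extra_classes:
            return False     # bounded escalation: cap above the default class
        if s.looks_skewed:
            return False     # cooldown: skew needs re-partitioning, not more RAM
        if s.spend > cost_budget:
            return False     # per-job budget exceeded: fail the job instead
        return True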

Composition with horizontal scaling

OOM-aware VM-restart is the vertical axis of two-axis adaptive autoscaling. It composes with horizontal scaling (add/remove workers) to give a complete adaptive capacity story:

  • Horizontal handles parallelism / utilisation
  • Vertical (OOM-aware) handles per-task memory envelope
  • Together they cover the failure modes neither axis can handle alone (sketched below)
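
The composition can be pictured as a dispatcher over the two axes; `horizontal` and `vertical` below are callbacks standing in for the two scalers (illustrative only):

    def handle_scaling_signal(kind, payload, horizontal, vertical):
        """Route each signal to the axis that can actually absorb it."""
        if kind == "utilisation":       # queue depth, CPU: add/remove workers
            return horizontal(payload)
        if kind == "task_oom":          # per-task memory envelope exceeded
            return vertical(payload)
        raise ValueError(f"unrecognised scaling signal: {kind!r}")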

Outcomes

Customer outcomes for the combined autoscaler (Source: sources/2026-05-06-databricks-rethinking-distributed-systems-for-serverless-performance):

  • CKDelta: 20 min vs 4–5 hr job completion (12–15× speedup)
  • Unilever: 2–5× faster, 25% lower ops cost
  • HP: 32%+ cloud savings, 36% runtime reduction

The cost savings come largely from not over-provisioning for worst-case memory — the user's cluster-sizing decision is made for the common case, and the autoscaler absorbs variance.
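
A back-of-envelope with invented numbers shows the shape of the saving: if 95% of tasks fit the default VM and 5% need a 4x-memory class, worst-case static sizing pays 4x for everything, while adaptive sizing pays 4x only for the tail:

    # Purely illustrative arithmetic; every number here is made up.
    common, rare = 0.95, 0.05
    static_cost   = 4.0                         # all tasks on the 4x VM
    adaptive_cost = common * 1.0 + rare * 4.0   # = 1.15 (ignoring retry overhead)
    print(f"adaptive is {static_cost / adaptive_cost:.1f}x cheaper")  # ~3.5x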

Anti-patterns

  • Retry-on-same-VM-indefinitely — Spark's default behaviour without this pattern; the task repeatedly OOMs on equal-sized VMs
  • Upsize-without-classification — escalates on every signal, including transient pressure → thrashing + cost blow-up
  • Upsize-without-retry — scales the cluster but doesn't re-run the failed task; user still sees job failure
  • Ignore-task-idempotency — applies the pattern to workloads with non-idempotent side effects → double-writes / duplicated external state
