

OOM-aware VM-restart autoscaling

OOM-aware VM-restart autoscaling is the pattern of instrumenting the autoscaler with a task-level out-of-memory detector that triggers vertical scaling (larger VM) for the affected task and retries the task on the new VM — turning what would otherwise be a job failure into a transparently-absorbed workload variance event.

Canonical production instance (Source: sources/2026-05-06-databricks-rethinking-distributed-systems-for-serverless-performance):

"When a task encounters an out-of-memory error, the autoscaler automatically detects it, restarts the task on a larger VM, and continues the job with no manual intervention or job failure required."

Implemented in the Databricks Serverless Autoscaler.

The control loop

┌──────────────────────────┐
│   Task executes on VM_N  │
└─────────────┬────────────┘
        ┌─────▼─────┐
        │    OOM?   │
        └─────┬─────┘
           No ├───┐ Yes
    ┌─────────┘   │
    │             ▼
    │      ┌────────────────┐
    │      │ Classify OOM   │
    │      │ (genuine vs    │
    │      │  transient)    │
    │      └──────┬─────────┘
    │             │
    │      ┌──────▼─────────┐
    │      │ Provision VM   │
    │      │ (size class    │
    │      │  N+1 or more)  │
    │      └──────┬─────────┘
    │             │
    │      ┌──────▼─────────┐
    │      │ Re-execute task│
    │      │ on new VM      │
    │      └──────┬─────────┘
    │             │
    └─────────────┴──────────►  Job continues
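
A minimal Python sketch of this loop, assuming a size-class ladder and a caller-supplied task runner; `OOMError`, `SIZE_CLASSES_GB`, and `run_task` are hypothetical stand-ins, since the post doesn't describe the autoscaler's internal APIs. The classification step is omitted here and covered below:

    # Hypothetical stand-ins for platform details the post doesn't disclose.
    SIZE_CLASSES_GB = [16, 32, 64, 128]   # VM memory ladder, class 0 = default

    class OOMError(Exception):
        """Task exhausted its VM's memory."""

    def run_with_oom_upsize(run_task, size_class=0):
        """Retry run_task on progressively larger VMs until it fits or the
        ladder is exhausted. run_task(mem_gb) stands in for task execution."""
        while True:
            try:
                return run_task(SIZE_CLASSES_GB[size_class])
            except OOMError:
                if size_class + 1 >= len(SIZE_CLASSES_GB):
                    raise              # largest class still OOMs: surface the failure
                size_class += 1        # upsize one class and re-execute the task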

Why the pattern matters

The alternative, failing the job on OOM, is the traditional Spark behaviour. It's the anti-pattern because it forces users into static-stability-shaped over-provisioning: size the cluster for the worst-case workload footprint so OOMs never happen. This sacrifices the cost benefits of autoscaling.

OOM-aware VM-restart autoscaling inverts this:

  • Size for the common case at the default VM class
  • Absorb the exceptional case by scaling the specific task that needs it
  • Keep the job running: no user intervention, no retry-with-a-bigger-cluster operational burden

This is the load-bearing mechanism behind the stability-as-system-property design thesis for the memory axis specifically.

Preconditions

1. Task idempotency

The failed task must be safely re-executable without double-counting / double-writing / duplicating side effects. Spark's functional task model provides this natively for pure transformations; tasks with external side effects need idempotency enforcement at the task boundary.

For workloads that write to external systems (S3, databases, message queues), idempotency typically comes from:

  • Transactional writes (commit-on-success, rollback-on-failure)
  • Idempotent operation IDs (dedup on write)
  • Write-ahead staging + atomic swap
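
As one concrete illustration of the staging-plus-atomic-swap approach, here is a local-filesystem sketch in Python. `os.replace` is atomic on POSIX filesystems; object stores like S3 need a different commit primitive (manifest commit, multipart upload), so treat this purely as the shape of the idea:

    import json, os, tempfile

    def idempotent_write(records, final_path):
        """Write-ahead staging + atomic swap: a task retried after an OOM
        either republishes identical output or replaces a partial file,
        so re-execution never duplicates externally visible state."""
        fd, staging = tempfile.mkstemp(dir=os.path.dirname(final_path) or ".")
        with os.fdopen(fd, "w") as f:
            json.dump(records, f)          # stage the complete output first
        os.replace(staging, final_path)    # atomic publish; safe to repeat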

2. Dynamic VM provisioning

The platform must provision larger VMs on demand. Cloud-native infrastructure (AWS / Azure / GCP) provides this; fixed-capacity deployments (on-prem, bare-metal) don't.
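
A hypothetical size ladder makes the requirement concrete; the instance families below are ordinary AWS memory-optimised types chosen for illustration, not Databricks' actual mapping:

    # Illustrative ladder of (instance type, memory GiB); the real mapping
    # is an internal detail the post doesn't disclose.
    SIZE_LADDER = [
        ("r6i.xlarge",  32),    # class 0: default
        ("r6i.2xlarge", 64),    # class 1
        ("r6i.4xlarge", 128),   # class 2
        ("r6i.8xlarge", 256),   # class 3: escalation cap
    ]

    def next_size_class(current: int) -> int:
        """Step up one class, refusing to walk off the top of the ladder."""
        if current + 1 >= len(SIZE_LADDER):
            raise RuntimeError("already at the largest size class")
        return current + 1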

3. OOM classification

Not every OOM should trigger a VM upsize — some are transient memory pressure that clears without structural action. The autoscaler needs a classifier:

  • Genuine OOM (data size exceeds VM memory) → upsize
  • Transient OOM (momentary spike, GC pressure, a leak cleared by the restart) → retry on same VM
  • Skew OOM (pathological single partition) → might need re-partitioning, not just upsize

The Databricks post doesn't disclose the classifier's specifics; this is the part of the control loop that earns its complexity.
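
Since the classifier isn't public, the following sketch only shows the decision structure; the signals (`peak_heap_fraction`, `max_partition_skew`, `retry_count`) and thresholds are invented:

    from enum import Enum

    class OOMAction(Enum):
        UPSIZE = "provision larger VM and retry"
        RETRY_SAME = "retry on same size class"
        REPARTITION = "re-partition; upsizing won't fix skew"

    def classify_oom(peak_heap_fraction, max_partition_skew, retry_count):
        """Hypothetical classifier; every signal and threshold is invented."""
        if max_partition_skew > 10.0:        # one partition >>10x the median
            return OOMAction.REPARTITION     # skew OOM: bigger VMs just chase the tail
        if peak_heap_fraction < 0.9 and retry_count == 0:
            return OOMAction.RETRY_SAME      # transient pressure: heap never truly filled
        return OOMAction.UPSIZE              # genuine OOM: data exceeds VM memory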

Escalation policy

A naive implementation could escalate to the largest VM size repeatedly, blowing the cost envelope. The pattern in practice typically includes:

  • Bounded escalation — cap at N size classes above default
  • Cooldown — don't escalate further if the task is actually pathological (skew) rather than merely big
  • Per-job budget — fail the job if escalation has exceeded a cost threshold

The Databricks post doesn't disclose these specifics either, but production implementations of the pattern inevitably require them.
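
A sketch of the three guards combined, as the gate in front of each upsize decision; all fields and thresholds are invented bookkeeping, not disclosed behaviour:

    from dataclasses import dataclass

    @dataclass
    class TaskState:
        size_class: int      # current VM size class
        default_class: int   # class the job started on
        spend: float         # cumulative cost attributed to this job so far
        looks_skewed: bool   # classifier flagged pathological partition skew

    def should_escalate(s: TaskState, max_extra_classes=2, cost_budget=100.0):
        """Return True only if another upsize passes all three guards."""
        if s.size_class - s.default_class >= max_extra_classes:
            return False     # bounded escalation: cap above the default class
        if s.looks_skewed:
            return False     # cooldown: skew needs re-partitioning, not more RAM
        if s.spend > cost_budget:
            return False     # per-job budget exceeded: fail the job instead
        return True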

Composition with horizontal scaling

OOM-aware VM-restart is the vertical axis of two-axis adaptive autoscaling. It composes with horizontal scaling (add/remove workers) to give a complete adaptive capacity story:

  • Horizontal handles parallelism / utilisation
  • Vertical (OOM-aware) handles per-task memory envelope
  • Together they cover the failure modes neither axis can handle alone (sketched below)
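
The composition can be pictured as a dispatcher over the two axes; `horizontal` and `vertical` below are callbacks standing in for the two scalers (illustrative only):

    def handle_scaling_signal(kind, payload, horizontal, vertical):
        """Route each signal to the axis that can actually absorb it."""
        if kind == "utilisation":       # queue depth, CPU: add/remove workers
            return horizontal(payload)
        if kind == "task_oom":          # per-task memory envelope exceeded
            return vertical(payload)
        raise ValueError(f"unrecognised scaling signal: {kind!r}")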

Outcomes

Customer outcomes for the combined autoscaler (Source: sources/2026-05-06-databricks-rethinking-distributed-systems-for-serverless-performance):

  • CKDelta: 20 min vs 4–5 hr job completion (12–15× speedup)
  • Unilever: 2–5× faster, 25% lower ops cost
  • HP: 32%+ cloud savings, 36% runtime reduction

The cost savings come largely from not over-provisioning for worst-case memory — the user's cluster-sizing decision is made for the common case, and the autoscaler absorbs variance.
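
A back-of-envelope with invented numbers shows the shape of the saving: if 95% of tasks fit the default VM and 5% need a 4x-memory class, worst-case static sizing pays 4x for everything, while adaptive sizing pays 4x only for the tail:

    # Purely illustrative arithmetic; every number here is made up.
    common, rare = 0.95, 0.05
    static_cost   = 4.0                         # all tasks on the 4x VM
    adaptive_cost = common * 1.0 + rare * 4.0   # = 1.15 (ignoring retry overhead)
    print(f"adaptive is {static_cost / adaptive_cost:.1f}x cheaper")  # ~3.5x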

Anti-patterns

  • Retry-on-same-VM-indefinitely — Spark's default behaviour without this pattern; the task repeatedly OOMs on equal-sized VMs
  • Upsize-without-classification — escalates on every signal, including transient pressure → thrashing + cost blow-up
  • Upsize-without-retry — scales the cluster but doesn't re-run the failed task; user still sees job failure
  • Ignore-task-idempotency — applies the pattern to workloads with non-idempotent side effects → double-writes / duplicated external state
