

Adaptive OOM recovery

Adaptive OOM recovery is the primitive of detecting a task-level out-of-memory error at runtime, restarting the task on a larger VM, and continuing the job — rather than surfacing the OOM to the user as a job failure requiring manual cluster retuning.

Canonical instance (Source: sources/2026-05-06-databricks-rethinking-distributed-systems-for-serverless-performance):

"When a task encounters an out-of-memory error, the autoscaler automatically detects it, restarts the task on a larger VM, and continues the job with no manual intervention or job failure required."

Implemented in the Databricks Serverless Autoscaler as part of Databricks Serverless Compute.

Why this matters

In traditional Spark (and most distributed-compute systems), an OOM is terminal at the task-retry boundary:

  • Task runs on a 64 GiB worker
  • Task OOMs due to skew / unexpected data volume
  • Spark retries the task (same partition, same data) on another 64 GiB worker → OOM again
  • After N retries, the job fails
  • User is told to rerun with a bigger cluster

This failure mode forces users to pre-provision for the worst case to avoid a mid-job OOM: the canonical anti-pattern of static stability applied to memory, where you size for the peak and pay for the peak all the time.
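
A toy sketch of this static-retry failure mode, with made-up memory figures; the only real Spark detail used is that task retries are capped by spark.task.maxFailures (default 4):

```python
MAX_TASK_FAILURES = 4        # Spark's default for spark.task.maxFailures
WORKER_MEMORY_GIB = 64       # every worker in a static cluster is the same size

def run_with_static_retries(partition_needs_gib: int) -> None:
    """Every retry lands on an identically sized worker, so an oversized partition
    fails the same way each time until the retry budget is exhausted."""
    for attempt in range(1, MAX_TASK_FAILURES + 1):
        if partition_needs_gib <= WORKER_MEMORY_GIB:
            print(f"attempt {attempt}: task fits, job continues")
            return
        print(f"attempt {attempt}: OOM on a {WORKER_MEMORY_GIB} GiB worker")
    raise RuntimeError("job failed after retries: rerun with a bigger cluster")  # what the user sees

run_with_static_retries(partition_needs_gib=80)   # skewed partition: four OOMs, then job failure
```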

Adaptive OOM recovery inverts this: the platform sizes the cluster for the common case, detects an OOM when that assumption is violated, and escalates VM size only for the tasks that need it.

The control loop

  1. Detect. A task signals OOM (via JVM heap exhaustion, the OS OOM killer, or eviction pressure from Spark's memory manager)
  2. Classify. Distinguish a genuine undersized-memory OOM from transient memory pressure (to avoid needlessly thrashing onto larger VMs)
  3. Provision. Allocate a larger VM — one size class up is typical; two or more for repeat offenders
  4. Re-execute. Spark's task-retry mechanism naturally retries the failing partition on the new VM
  5. Continue. The job completes without user intervention
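
A minimal sketch of this loop, assuming a hypothetical size ladder, event shape, and provision_vm / resubmit_task hooks; none of these names come from the source:

```python
from dataclasses import dataclass

VM_SIZE_LADDER_GIB = [64, 128, 256, 512]   # assumed size classes

@dataclass
class OomEvent:
    task_id: str
    partition_id: int
    vm_size_gib: int
    prior_oom_count: int    # how many times this partition has already OOMed

def is_genuine_oom(event: OomEvent) -> bool:
    """Step 2 placeholder; a fuller heuristic is sketched after the next paragraph."""
    return True

def next_vm_size(current_gib: int, repeat_offender: bool) -> int:
    """Step 3: one size class up is typical, two for repeat offenders, never past the ladder top."""
    steps = 2 if repeat_offender else 1
    idx = min(VM_SIZE_LADDER_GIB.index(current_gib) + steps, len(VM_SIZE_LADDER_GIB) - 1)
    return VM_SIZE_LADDER_GIB[idx]

def handle_oom(event: OomEvent, provision_vm, resubmit_task) -> None:
    # Step 1 (detect) has already happened: `event` describes the failed task.
    if not is_genuine_oom(event):
        resubmit_task(event.task_id, event.vm_size_gib)   # transient pressure: retry at the same size
        return
    bigger_gib = next_vm_size(event.vm_size_gib, event.prior_oom_count >= 1)
    provision_vm(memory_gib=bigger_gib)                   # Step 3: allocate a larger VM
    resubmit_task(event.task_id, bigger_gib)              # Step 4: the retry re-runs the same partition
    # Step 5: the job continues; no failure is surfaced to the user.
```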

Step 2 (classification) is where the autoscaler earns its complexity — a reactive scaler that upsizes on every OOM would make decisions that optimise for the outlier task at the cost of the aggregate cluster envelope.
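
One plausible heuristic for that classification step, filling in the is_genuine_oom placeholder from the sketch above; the post does not describe the actual classifier:

```python
def classify_oom(partition_oom_count: int,
                 peak_task_memory_gib: float,
                 vm_memory_gib: int,
                 headroom: float = 0.9) -> bool:
    """Return True if the OOM warrants a larger VM, False if it looks transient.
    Escalate only when the same partition has already OOMed, or when the task's
    observed peak memory is close to all the current VM class can offer."""
    repeat_offender = partition_oom_count >= 1
    clearly_undersized = peak_task_memory_gib >= headroom * vm_memory_gib
    return repeat_offender or clearly_undersized
```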

Preconditions

Adaptive OOM recovery requires two architectural preconditions:

  1. Task idempotency — the failed task must be safely re-executable on a new VM without duplicating side effects. Spark's functional task model provides this by default; systems with mutable side effects (in-place writes, external API calls) may need idempotency enforcement at the task boundary (a sketch follows this list).

  2. Dynamic VM provisioning — the platform must be able to allocate a larger VM on demand. On-prem / fixed-capacity deployments can't do this; cloud-native / containerised deployments can.
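
A minimal sketch of precondition 1 for tasks with external side effects, assuming a hypothetical object-store client with exists/put operations: key the write on a deterministic partition identity so re-execution on a larger VM overwrites or skips rather than double-writes.

```python
def idempotent_partition_write(store, partition_id: int, rows: bytes) -> None:
    key = f"output/partition={partition_id:05d}"     # deterministic, attempt-independent key
    marker = key + "/_COMMITTED"
    if store.exists(marker):                         # a previous attempt already committed
        return
    store.put(key + "/data", rows)                   # overwrite any partial output from the OOMed attempt
    store.put(marker, b"")                           # the commit marker is the only externally visible effect
```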

Composition with concepts/vertical-and-horizontal-autoscaling

Adaptive OOM recovery is the vertical axis of two-axis adaptive autoscaling. Horizontal autoscaling (add/remove workers) handles throughput and utilisation; vertical autoscaling handles the per-task memory envelope. OOM recovery is the specific trigger for the vertical axis.

Without the vertical axis, OOM recovery isn't possible — you can add more 64 GiB workers, but they won't fit the task that OOMs on 64 GiB.
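
A toy sketch of how the two triggers might compose, with assumed utilisation thresholds; only the OOM-to-vertical mapping is taken from the source:

```python
def autoscale_decision(task_oomed: bool, cluster_cpu_util: float) -> tuple[str, str]:
    if task_oomed:
        return ("vertical", "re-run the failing task on a larger VM")
    if cluster_cpu_util > 0.85:
        return ("horizontal", "add workers of the current size")
    if cluster_cpu_util < 0.30:
        return ("horizontal", "remove idle workers")
    return ("none", "hold steady")
```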

Customer outcomes

Named production outcomes for the autoscaler (Source: sources/2026-05-06-databricks-rethinking-distributed-systems-for-serverless-performance):

  • CKDelta: 20 min vs 4–5 hr job completion
  • Unilever: 2–5× faster pipelines, 25% lower ops cost
  • HP: 32%+ cloud savings, 36% runtime reduction

These span a 12–15× speedup on the high end and cost savings in the 25–32%+ range: the combined effect of not over-provisioning for worst-case memory and of avoiding the retry cost of failed jobs.
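
The high-end range is just the CKDelta baseline divided by the new runtime:

```python
speedups = [(hours * 60) / 20 for hours in (4, 5)]   # 4-5 hr baseline over a 20 min run -> [12.0, 15.0]
```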

Caveats and limits

  • Task-level granularity only. Whole-job OOMs (driver OOM) are harder; the post doesn't disclose how those are handled.
  • Non-idempotent workloads. Tasks that write to external systems without transaction semantics could double-write on restart.
  • Pathological data skew. A single mega-partition may OOM on every VM size — the autoscaler can't solve a skew problem, only absorb variance around a correctly-distributed dataset.
  • Cost amplification risk. A naive implementation could escalate to expensive VMs repeatedly; the autoscaler must cap the escalation.
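
A sketch of that escalation cap, with an assumed ladder and an assumed limit of two size classes above the starting VM; the real policy isn't disclosed:

```python
VM_SIZE_LADDER_GIB = [64, 128, 256, 512]
MAX_ESCALATIONS = 2   # assumption: at most two size classes above the starting VM

def escalate_or_fail(start_gib: int, escalations_so_far: int) -> int:
    """Return the next VM size for a re-run, or fail once the cap (or ladder top) is hit."""
    if escalations_so_far >= MAX_ESCALATIONS:
        raise RuntimeError("escalation cap reached; surface the OOM rather than keep buying larger VMs")
    idx = VM_SIZE_LADDER_GIB.index(start_gib) + escalations_so_far + 1
    if idx >= len(VM_SIZE_LADDER_GIB):
        raise RuntimeError("no larger VM class available")
    return VM_SIZE_LADDER_GIB[idx]
```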
