
Databricks Serverless Autoscaler

The Databricks Serverless Autoscaler is the adaptive cluster-sizing controller in Databricks Serverless Compute. It scales clusters both horizontally and vertically, and, as its distinguishing feature, restarts OOM-failed tasks on larger VMs rather than surfacing the failure to the user.

Canonical framing (Source: sources/2026-05-06-databricks-rethinking-distributed-systems-for-serverless-performance):

"Dynamic cluster sizing is the primary mechanism for optimizing price-performance in distributed systems, but determining the optimal amount of compute is inherently complex. The optimal configuration depends on workload characteristics, data size, and the relative importance of latency versus cost, with no single configuration working across all scenarios."

Why static autoscaling rules fail

Prior state of the art (canonicalised in the post):

"Traditional autoscaling approaches rely on static rules and reactive thresholds, which often fail to capture these nuances. As a result, clusters are frequently under or over-provisioned, leading to inefficiency, instability, or both."

Two-axis adaptive scaling

The Serverless Autoscaler departs from threshold-based scalers in two ways:

  1. Signal-driven, not threshold-driven. "By continuously analyzing workload patterns and system-wide signals, the autoscaler positions each workload on the optimal cost-performance curve." See concepts/price-performance-ratio.

  2. Horizontal + vertical. "It dynamically adjusts compute capacity by scaling horizontally and vertically as needed." See concepts/vertical-and-horizontal-autoscaling. This combination is rare — most autoscalers operate on a single axis.
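
A hypothetical sketch of how a two-axis decision could be driven by signals rather than thresholds; the signal names and decision rules below are illustrative assumptions, not Databricks internals:

```python
from dataclasses import dataclass

@dataclass
class Signals:
    cpu_util: float       # fleet-wide CPU utilization, 0..1
    mem_pressure: float   # peak per-node memory pressure, 0..1
    queued_tasks: int     # tasks waiting for a slot

def plan_scaling(s: Signals) -> str:
    # Memory pressure is per-node: adding nodes does not relieve it,
    # so a larger VM (vertical axis) is the right response.
    if s.mem_pressure > 0.9:
        return "vertical: move to a larger VM type"
    # A deep task queue with busy CPUs parallelizes well, so add
    # worker nodes (horizontal axis).
    if s.queued_tasks > 0 and s.cpu_util > 0.8:
        return "horizontal: add worker nodes"
    return "hold: current fleet sits on the cost-performance curve"

print(plan_scaling(Signals(cpu_util=0.95, mem_pressure=0.4, queued_tasks=12)))
```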

The OOM-restart primitive

The most architecturally distinctive behaviour (canonical quote):

"When a task encounters an out-of-memory error, the autoscaler automatically detects it, restarts the task on a larger VM, and continues the job with no manual intervention or job failure required."

This is concepts/adaptive-oom-recovery — first wiki canonical instance. The primitive assumes task idempotency (which Spark's task-retry model already requires). When an OOM fires:

  1. Autoscaler detects the OOM signal
  2. Selects a larger VM type
  3. Re-executes the failed task there
  4. Job continues; no user intervention

Canonicalised in patterns/oom-aware-vm-restart-autoscaling.
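
A minimal sketch of the four-step loop above, assuming a hypothetical VM ladder and task-execution helpers (the real autoscaler consumes system-wide OOM signals rather than a Python exception):

```python
VM_LADDER = ["m-small", "m-medium", "m-large", "m-xlarge"]  # ascending memory

class TaskOOMError(Exception):
    """Stand-in for the out-of-memory signal a task emits."""

def provision_vm(vm_type: str) -> str:
    return vm_type                      # stand-in for real VM provisioning

def run_with_oom_recovery(task, start_vm: str = "m-small"):
    idx = VM_LADDER.index(start_vm)
    while True:
        vm = provision_vm(VM_LADDER[idx])
        try:
            return task(vm)             # step 3: re-execute on this VM
        except TaskOOMError:            # step 1: detect the OOM signal
            if idx + 1 == len(VM_LADDER):
                raise                   # ladder exhausted: surface the failure
            idx += 1                    # step 2: select a larger VM type
            # step 4: loop continues the job with no user intervention;
            # safe because Spark's retry model already requires idempotent tasks

def demo_task(vm):                      # OOMs on anything smaller than m-large
    if vm in ("m-large", "m-xlarge"):
        return f"completed on {vm}"
    raise TaskOOMError()

print(run_with_oom_recovery(demo_task))  # -> completed on m-large
```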

Customer outcomes

Named production outcomes (Source: sources/2026-05-06-databricks-rethinking-distributed-systems-for-serverless-performance):

  • CKDelta: "jobs completing in 20 minutes that previously ran for 4–5 hours" — 12–15× end-to-end speedup
  • Unilever: "pipelines running 2–5x faster" with "operational costs down 25%"
  • HP: "cloud savings of over 32%" and "decreased combined job runtime by 36%"

Interaction with the Gateway

The autoscaler is tightly coupled to the Serverless Gateway:

  • Gateway signals the autoscaler when no cluster has headroom for a workload → autoscaler provisions additional capacity
  • Autoscaler signals the gateway when new clusters come online → gateway re-evaluates placement for queued / in-flight workloads

Together, Gateway + autoscaler form the closed-loop capacity-and-placement control plane.
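
A sketch of that closed loop; the class and method names are assumptions used purely to illustrate the two signals, not the actual interfaces:

```python
class Cluster:
    def __init__(self, slots):
        self.slots = slots
    def has_headroom(self):
        return self.slots > 0
    def assign(self, workload):
        self.slots -= 1
        return f"{workload} placed"

class Autoscaler:
    def provision(self):
        return Cluster(slots=4)         # stand-in for real capacity provisioning

class Gateway:
    def __init__(self, autoscaler):
        self.autoscaler = autoscaler
        self.clusters = []
        self.queue = []

    def place(self, workload):
        for c in self.clusters:
            if c.has_headroom():
                return c.assign(workload)
        # Signal 1 (gateway -> autoscaler): no cluster has headroom,
        # so queue the workload and request more capacity.
        self.queue.append(workload)
        self.on_cluster_online(self.autoscaler.provision())

    def on_cluster_online(self, cluster):
        # Signal 2 (autoscaler -> gateway): new capacity is online,
        # so re-evaluate placement for queued workloads.
        self.clusters.append(cluster)
        while self.queue and cluster.has_headroom():
            cluster.assign(self.queue.pop(0))

gw = Gateway(Autoscaler())
gw.place("etl-job")   # no headroom anywhere -> provision, then place
```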

Two user-exposed performance modes

The performance-mode user knob surfaces through the autoscaler:

  • Standard mode biases the autoscaler toward smaller VM sizes and tighter scaling (less cost, more variable latency)
  • Performance-Optimized mode biases toward larger VMs and looser scaling (faster startup + execution, higher cost)
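
One plausible way the knob could bias the autoscaler's internal parameters (the parameter names and mapping are assumptions; only the two modes and their trade-offs come from the source):

```python
# Hypothetical bias table keyed by the user-exposed performance mode.
AUTOSCALER_BIAS = {
    "standard": {
        "preferred_vm_size": "smaller",   # less cost, more variable latency
        "scaling": "tight",               # capacity tracks demand closely
    },
    "performance-optimized": {
        "preferred_vm_size": "larger",    # faster startup + execution
        "scaling": "loose",               # keeps warm headroom, higher cost
    },
}
```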

See parent: systems/databricks-serverless-compute.

Seen in

  • sources/2026-05-06-databricks-rethinking-distributed-systems-for-serverless-performance: First canonical wiki naming of the Databricks Serverless Autoscaler. Canonicalises the horizontal + vertical two-axis adaptive scaling primitive + the OOM-recovery-via-VM-restart primitive as first wiki instances of both. Named customer outcomes (CKDelta / Unilever / HP) establish 12–15× speedups, 25–32% cost reductions, and a 36% combined-runtime decrease as the impact envelope. Decisively contrasts with threshold-based autoscalers that surface OOM to users as job failures requiring manual cluster retuning.