
Databricks Serverless Autoscaler

The Databricks Serverless Autoscaler is the adaptive cluster-sizing controller in Databricks Serverless Compute. It scales clusters both horizontally and vertically, and, as its distinguishing feature, restarts OOM-failed tasks on larger VMs rather than surfacing the failure to the user.

Canonical framing (Source: sources/2026-05-06-databricks-rethinking-distributed-systems-for-serverless-performance):

"Dynamic cluster sizing is the primary mechanism for optimizing price-performance in distributed systems, but determining the optimal amount of compute is inherently complex. The optimal configuration depends on workload characteristics, data size, and the relative importance of latency versus cost, with no single configuration working across all scenarios."

Why static autoscaling rules fail

Prior state of the art (canonicalised in the post):

"Traditional autoscaling approaches rely on static rules and reactive thresholds, which often fail to capture these nuances. As a result, clusters are frequently under or over-provisioned, leading to inefficiency, instability, or both."

Two-axis adaptive scaling

The Serverless Autoscaler departs from threshold-based scalers in two ways:

  1. Signal-driven, not threshold-driven. "By continuously analyzing workload patterns and system-wide signals, the autoscaler positions each workload on the optimal cost-performance curve." See concepts/price-performance-ratio.

  2. Horizontal + vertical. "It dynamically adjusts compute capacity by scaling horizontally and vertically as needed." See concepts/vertical-and-horizontal-autoscaling. This combination is rare — most autoscalers operate on a single axis.
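
A hypothetical sketch of how a two-axis decision could be driven by signals rather than thresholds; the signal names and decision rules below are illustrative assumptions, not Databricks internals:

```python
from dataclasses import dataclass

@dataclass
class Signals:
    cpu_util: float       # fleet-wide CPU utilization, 0..1
    mem_pressure: float   # peak per-node memory pressure, 0..1
    queued_tasks: int     # tasks waiting for a slot

def plan_scaling(s: Signals) -> str:
    # Memory pressure is per-node: adding nodes does not relieve it,
    # so a larger VM (vertical axis) is the right response.
    if s.mem_pressure > 0.9:
        return "vertical: move to a larger VM type"
    # A deep task queue with busy CPUs parallelizes well, so add
    # worker nodes (horizontal axis).
    if s.queued_tasks > 0 and s.cpu_util > 0.8:
        return "horizontal: add worker nodes"
    return "hold: current fleet sits on the cost-performance curve"

print(plan_scaling(Signals(cpu_util=0.95, mem_pressure=0.4, queued_tasks=12)))
```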

The OOM-restart primitive

The most architecturally distinctive behaviour (canonical quote):

"When a task encounters an out-of-memory error, the autoscaler automatically detects it, restarts the task on a larger VM, and continues the job with no manual intervention or job failure required."

This is concepts/adaptive-oom-recovery — first wiki canonical instance. The primitive assumes task idempotency (which Spark's task-retry model already requires). When an OOM fires:

  1. Autoscaler detects the OOM signal
  2. Selects a larger VM type
  3. Re-executes the failed task there
  4. Job continues; no user intervention

Canonicalised in patterns/oom-aware-vm-restart-autoscaling.
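
A minimal sketch of the four-step loop above, assuming a hypothetical VM ladder and task-execution helpers (the real autoscaler consumes system-wide OOM signals rather than a Python exception):

```python
VM_LADDER = ["m-small", "m-medium", "m-large", "m-xlarge"]  # ascending memory

class TaskOOMError(Exception):
    """Stand-in for the out-of-memory signal a task emits."""

def provision_vm(vm_type: str) -> str:
    return vm_type                      # stand-in for real VM provisioning

def run_with_oom_recovery(task, start_vm: str = "m-small"):
    idx = VM_LADDER.index(start_vm)
    while True:
        vm = provision_vm(VM_LADDER[idx])
        try:
            return task(vm)             # step 3: re-execute on this VM
        except TaskOOMError:            # step 1: detect the OOM signal
            if idx + 1 == len(VM_LADDER):
                raise                   # ladder exhausted: surface the failure
            idx += 1                    # step 2: select a larger VM type
            # step 4: loop continues the job with no user intervention;
            # safe because Spark's retry model already requires idempotent tasks

def demo_task(vm):                      # OOMs on anything smaller than m-large
    if vm in ("m-large", "m-xlarge"):
        return f"completed on {vm}"
    raise TaskOOMError()

print(run_with_oom_recovery(demo_task))  # -> completed on m-large
```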

Customer outcomes

Named production outcomes (Source: sources/2026-05-06-databricks-rethinking-distributed-systems-for-serverless-performance):

  • CKDelta: "jobs completing in 20 minutes that previously ran for 4–5 hours" — 12–15× end-to-end speedup
  • Unilever: "pipelines running 2–5x faster" with "operational costs down 25%"
  • HP: "cloud savings of over 32%" and "decreased combined job runtime by 36%"

Interaction with the Gateway

The autoscaler is tightly coupled to the Serverless Gateway:

  • Gateway signals the autoscaler when no cluster has headroom for a workload → autoscaler provisions additional capacity
  • Autoscaler signals the gateway when new clusters come online → gateway re-evaluates placement for queued / in-flight workloads

Together, Gateway + autoscaler form the closed-loop capacity-and-placement control plane.
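
A sketch of that closed loop; the class and method names are assumptions used purely to illustrate the two signals, not the actual interfaces:

```python
class Cluster:
    def __init__(self, slots):
        self.slots = slots
    def has_headroom(self):
        return self.slots > 0
    def assign(self, workload):
        self.slots -= 1
        return f"{workload} placed"

class Autoscaler:
    def provision(self):
        return Cluster(slots=4)         # stand-in for real capacity provisioning

class Gateway:
    def __init__(self, autoscaler):
        self.autoscaler = autoscaler
        self.clusters = []
        self.queue = []

    def place(self, workload):
        for c in self.clusters:
            if c.has_headroom():
                return c.assign(workload)
        # Signal 1 (gateway -> autoscaler): no cluster has headroom,
        # so queue the workload and request more capacity.
        self.queue.append(workload)
        self.on_cluster_online(self.autoscaler.provision())

    def on_cluster_online(self, cluster):
        # Signal 2 (autoscaler -> gateway): new capacity is online,
        # so re-evaluate placement for queued workloads.
        self.clusters.append(cluster)
        while self.queue and cluster.has_headroom():
            cluster.assign(self.queue.pop(0))

gw = Gateway(Autoscaler())
gw.place("etl-job")   # no headroom anywhere -> provision, then place
```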

Two user-exposed performance modes

The performance-mode user knob surfaces through the autoscaler:

  • Standard mode biases the autoscaler toward smaller VM sizes and tighter scaling (less cost, more variable latency)
  • Performance-Optimized mode biases toward larger VMs and looser scaling (faster startup + execution, higher cost)
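
One plausible way the knob could bias the autoscaler's internal parameters (the parameter names and mapping are assumptions; only the two modes and their trade-offs come from the source):

```python
# Hypothetical bias table keyed by the user-exposed performance mode.
AUTOSCALER_BIAS = {
    "standard": {
        "preferred_vm_size": "smaller",   # less cost, more variable latency
        "scaling": "tight",               # capacity tracks demand closely
    },
    "performance-optimized": {
        "preferred_vm_size": "larger",    # faster startup + execution
        "scaling": "loose",               # keeps warm headroom, higher cost
    },
}
```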

See parent: systems/databricks-serverless-compute.

Seen in

  • sources/2026-05-06-databricks-rethinking-distributed-systems-for-serverless-performance: First canonical wiki naming of the Databricks Serverless Autoscaler. Canonicalises the horizontal + vertical two-axis adaptive scaling primitive + the OOM-recovery-via-VM-restart primitive as first wiki instances of both. Named customer outcomes (CKDelta / Unilever / HP) establish 12–15× speedups, 25–32% cost reductions, and a 36% combined-runtime decrease as the impact envelope. Decisively contrasts with threshold-based autoscalers that surface OOM to users as job failures requiring manual cluster retuning.