
CONCEPT Cited by 1 source

Transient job cluster

Definition

A transient job cluster is a compute cluster that is created on demand for a single job run, used exclusively by that run, and torn down when the run completes. Its counterpart is an interactive cluster or a long-lived cluster that multiple workloads share concurrently.

In the Databricks ecosystem, job clusters are the idiomatic name for this: a job run configured with a new job cluster spins up a dedicated cluster, executes the notebook / JAR, and terminates the cluster when the run finishes.
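As a rough illustration, a one-time run on a fresh job cluster can be described by a request body like the one built below. The field names (run_name, tasks, task_key, new_cluster, notebook_task) follow the Databricks Jobs API 2.1 runs/submit shape as I understand it and should be checked against the current API reference. The helper build_one_time_run, the notebook path, the runtime version string, and the node type are hypothetical placeholders, and the sketch only builds the payload without sending it.

```python
import json


def build_one_time_run(run_name, notebook_path, spark_version, node_type_id, num_workers):
    """Build a request body for a one-time run on a dedicated job cluster.

    Hypothetical helper: it only assembles a dict, it does not call Databricks.
    """
    return {
        "run_name": run_name,
        "tasks": [
            {
                "task_key": "main",
                # "new_cluster" asks for a fresh cluster for this run only;
                # a shared cluster would be referenced by "existing_cluster_id".
                "new_cluster": {
                    "spark_version": spark_version,
                    "node_type_id": node_type_id,
                    "num_workers": num_workers,
                },
                "notebook_task": {"notebook_path": notebook_path},
            }
        ],
    }


# All values below are placeholders chosen for illustration.
payload = build_one_time_run(
    run_name="preprocess-example",
    notebook_path="/Shared/example/preprocess",
    spark_version="13.3.x-scala2.12",
    node_type_id="i3.xlarge",
    num_workers=2,
)
print(json.dumps(payload, indent=2))
```

Because the cluster spec travels with the run, two runs submitted this way never share a cluster, which is the property the definition above relies on.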

Why transient

  • Failure isolation. A failure in one run's cluster (OOM, driver crash, worker loss) cannot impact other running pipelines — each pipeline run lives in its own cluster.
  • Resource isolation. Runs don't compete for executor slots, memory, or driver attention; throughput is predictable.
  • Cost efficiency. Clusters only exist while actively doing work; there's no idle cost between runs.
  • Simple configuration per run. Each pipeline run can specify its own Spark version, library dependencies, and cluster shape without coordinating with other workloads.
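The cost-efficiency point can be made concrete with a little arithmetic. The sketch below compares weekly cluster-hours for a transient cluster against an always-on cluster that serves only this one pipeline. The function name and the numbers (seven runs per week, three hours each) are assumptions for illustration, not figures from the source, and the comparison ignores startup time and any other workloads an always-on cluster might also host.

```python
HOURS_PER_WEEK = 24 * 7


def weekly_cluster_hours(runs_per_week, hours_per_run, always_on=False):
    """Cluster-hours consumed per week under each model (illustrative only)."""
    if always_on:
        # A long-lived cluster exists around the clock, busy or not.
        return HOURS_PER_WEEK
    # A transient cluster exists only while a run is executing.
    return runs_per_week * hours_per_run


transient_hours = weekly_cluster_hours(runs_per_week=7, hours_per_run=3.0)
always_on_hours = weekly_cluster_hours(runs_per_week=7, hours_per_run=3.0, always_on=True)

# Share of the always-on cluster's week spent idle for this workload.
idle_fraction = 1 - transient_hours / always_on_hours

print(transient_hours, always_on_hours, idle_fraction)
```

Under these assumed numbers the always-on cluster would sit idle for most of the week; the balance shifts if many workloads keep a shared cluster busy.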

Tradeoff

  • Startup cost — spinning up a job cluster adds minutes of latency per run. For batch pipelines (Zalando's weekly forecast, daily optimisation) this cost is amortised over a multi-hour job; for interactive / latency-sensitive work it's unacceptable.
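To see why the same startup delay is tolerable for batch work and not for interactive work, it helps to express it as a fraction of total wall-clock time. The sketch below uses assumed numbers (a 5-minute startup, a 3-hour batch job, a 30-second interactive query); none of them come from the source.

```python
def startup_overhead(startup_minutes, work_minutes):
    """Fraction of total wall-clock time spent waiting for the cluster."""
    return startup_minutes / (startup_minutes + work_minutes)


# Assumed figures for illustration only.
batch_overhead = startup_overhead(startup_minutes=5, work_minutes=180)
interactive_overhead = startup_overhead(startup_minutes=5, work_minutes=0.5)

print(f"batch: {batch_overhead:.1%}, interactive: {interactive_overhead:.1%}")
```

With these inputs the startup is a few percent of the batch run but dominates the interactive query, which matches the tradeoff described above.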

Canonical instance (Zalando ZEOS)

Zalando ZEOS's inventory-optimisation system runs PySpark data pre-processing on Databricks transient job clusters. The source states the scalability rationale explicitly:

"Every run triggers dedicated Databricks Job clusters and SageMaker processing/training jobs. This ensures robust and independent runs and resources, i.e. a failure of one execution in the Databricks job cluster does not impact a parallel execution."

The pattern pairs with Delta Lake as the intermediate storage: the transient cluster writes its output to Delta and terminates; the next stage (a transformation in a SageMaker Processing Job) reads from S3 at its own pace.

Seen in
