Transient job cluster¶
Definition¶
A transient job cluster is a compute cluster that is created on demand for a single job run, used exclusively by that run, and torn down when the run completes. The counterpart is an interactive cluster or a long-lived shared cluster that multiple workloads use concurrently.
In the Databricks ecosystem, "job clusters" is the idiomatic name for this: triggering a job run (e.g. via `databricks jobs run-now`) spins up a dedicated cluster, executes the notebook or JAR, and terminates the cluster.
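The distinction shows up directly in the job definition. A minimal sketch, assuming the Databricks Jobs API 2.1 payload shape; the job name, notebook path, node type, and cluster ID below are illustrative, not taken from the source:

```python
# Hypothetical Jobs API 2.1 payload illustrating a transient ("job") cluster:
# `new_cluster` tells the service to create a dedicated cluster for this run
# and tear it down afterwards, instead of pointing at a long-lived cluster
# via `existing_cluster_id`.
job_spec = {
    "name": "inventory-preprocessing",  # illustrative name
    "tasks": [
        {
            "task_key": "preprocess",
            "notebook_task": {"notebook_path": "/pipelines/preprocess"},
            "new_cluster": {            # transient job cluster, per run
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 8,
            },
        }
    ],
}

# A shared-cluster variant would instead reference an already-running cluster:
shared_variant = {
    "task_key": "preprocess",
    "existing_cluster_id": "0123-456789-abcdef",  # hypothetical ID
}
```

Because the cluster shape lives inside the job spec, each run can pin its own Spark version and dependencies without coordinating with any other workload.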
Why transient¶
- Failure isolation. A failure in one run's cluster (OOM, driver crash, worker loss) cannot impact other running pipelines — each pipeline run lives in its own cluster.
- Resource isolation. Runs don't compete for executor slots, memory, or driver attention; throughput is predictable.
- Cost efficiency. Clusters only exist while actively doing work; there's no idle cost between runs.
- Simple configuration per run. Each pipeline run can specify its own Spark version, library dependencies, and cluster shape without coordinating with other workloads.
Tradeoff¶
- Startup cost — spinning up a job cluster adds minutes of latency per run. For batch pipelines (Zalando's weekly forecast, daily optimisation) this cost is amortised over a multi-hour job; for interactive / latency-sensitive work it's unacceptable.
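The amortisation argument is simple arithmetic. A sketch with assumed numbers (a roughly 5-minute spin-up; the job durations are illustrative):

```python
# Back-of-envelope cost of cluster startup relative to total wall-clock time.
def startup_overhead(startup_min: float, job_min: float) -> float:
    """Fraction of the run spent waiting for the cluster to come up."""
    return startup_min / (startup_min + job_min)

# Multi-hour batch job: startup is noise (~2% of the run).
batch = startup_overhead(startup_min=5, job_min=240)

# Short interactive query: startup dominates (~71% of the run).
interactive = startup_overhead(startup_min=5, job_min=2)
```

This is why the pattern fits weekly or daily batch pipelines but not interactive, latency-sensitive work.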
Canonical instance (Zalando ZEOS)¶
Zalando ZEOS's inventory-optimisation system runs its PySpark data pre-processing on transient Databricks job clusters. The team's explicit scalability rationale:
"Every run triggers dedicated Databricks Job clusters and SageMaker processing/training jobs. This ensures robust and independent runs and resources, i.e. a failure of one execution in the Databricks job cluster does not impact a parallel execution."
This pattern pairs with Delta Lake as the intermediate storage: the transient cluster writes its output to Delta and terminates; the next stage (transformation in a SageMaker Processing Job) reads from S3 at its own pace.
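The handoff can be sketched with plain-Python stand-ins: a temp directory stands in for the Delta table on S3, and the two functions stand in for the Databricks and SageMaker stages. All names and data below are hypothetical:

```python
import json
import tempfile
from pathlib import Path

# Decoupled handoff: the transient cluster's only contract with the next
# stage is the data it leaves behind in durable storage.

def preprocessing_stage(output_dir: Path) -> None:
    """Runs on the transient job cluster; writes output, then the cluster dies."""
    records = [{"sku": "A1", "demand": 42}, {"sku": "B2", "demand": 7}]
    (output_dir / "part-0.json").write_text(json.dumps(records))
    # ...cluster terminates here; nothing held in memory survives the run.

def transformation_stage(input_dir: Path) -> list[dict]:
    """Runs later (e.g. a SageMaker Processing Job); reads at its own pace."""
    return [row
            for part in sorted(input_dir.glob("part-*.json"))
            for row in json.loads(part.read_text())]

staging = Path(tempfile.mkdtemp())
preprocessing_stage(staging)
rows = transformation_stage(staging)  # works even though stage 1 is long gone
```

The design point is that neither stage ever talks to the other's compute: durable storage is the interface, so each stage can scale, fail, and retry independently.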