Transient job cluster¶
Definition¶
A transient job cluster is a compute cluster that is created on demand for a single job run, used exclusively by that run, and torn down when the run completes. The counterpart is an interactive cluster or a long-lived shared cluster that multiple workloads use concurrently.
In the Databricks ecosystem, "job clusters" is the idiomatic name for this: triggering a job run (e.g. via `databricks jobs run-now`) spins up a dedicated cluster, executes the notebook or JAR, and terminates the cluster.
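The distinction shows up directly in the job definition. A minimal sketch, assuming the Databricks Jobs API 2.1 payload shape; the job name, notebook path, node type, and cluster ID below are illustrative, not taken from the source:

```python
# Hypothetical Jobs API 2.1 payload illustrating a transient ("job") cluster:
# `new_cluster` tells the service to create a dedicated cluster for this run
# and tear it down afterwards, instead of pointing at a long-lived cluster
# via `existing_cluster_id`.
job_spec = {
    "name": "inventory-preprocessing",  # illustrative name
    "tasks": [
        {
            "task_key": "preprocess",
            "notebook_task": {"notebook_path": "/pipelines/preprocess"},
            "new_cluster": {            # transient job cluster, per run
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 8,
            },
        }
    ],
}

# A shared-cluster variant would instead reference an already-running cluster:
shared_variant = {
    "task_key": "preprocess",
    "existing_cluster_id": "0123-456789-abcdef",  # hypothetical ID
}
```

Because the cluster shape lives inside the job spec, each run can pin its own Spark version and dependencies without coordinating with any other workload.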
Why transient¶
- Failure isolation. A failure in one run's cluster (OOM, driver crash, worker loss) cannot impact other running pipelines — each pipeline run lives in its own cluster.
- Resource isolation. Runs don't compete for executor slots, memory, or driver attention; throughput is predictable.
- Cost efficiency. Clusters only exist while actively doing work; there's no idle cost between runs.
- Simple configuration per run. Each pipeline run can specify its own Spark version, library dependencies, and cluster shape without coordinating with other workloads.
Tradeoff¶
- Startup cost — spinning up a job cluster adds minutes of latency per run. For batch pipelines (Zalando's weekly forecast, daily optimisation) this cost is amortised over a multi-hour job; for interactive / latency-sensitive work it's unacceptable.
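The amortisation argument is simple arithmetic. A sketch with assumed numbers (a roughly 5-minute spin-up; the job durations are illustrative):

```python
# Back-of-envelope cost of cluster startup relative to total wall-clock time.
def startup_overhead(startup_min: float, job_min: float) -> float:
    """Fraction of the run spent waiting for the cluster to come up."""
    return startup_min / (startup_min + job_min)

# Multi-hour batch job: startup is noise (~2% of the run).
batch = startup_overhead(startup_min=5, job_min=240)

# Short interactive query: startup dominates (~71% of the run).
interactive = startup_overhead(startup_min=5, job_min=2)
```

This is why the pattern fits weekly or daily batch pipelines but not interactive, latency-sensitive work.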
Canonical instance (Zalando ZEOS)¶
Zalando ZEOS's inventory-optimisation system runs its PySpark data pre-processing on transient Databricks job clusters. The team's explicit scalability rationale:
"Every run triggers dedicated Databricks Job clusters and SageMaker processing/training jobs. This ensures robust and independent runs and resources, i.e. a failure of one execution in the Databricks job cluster does not impact a parallel execution."
This pattern pairs with Delta Lake as the intermediate storage: the transient cluster writes its output to Delta and terminates; the next stage (transformation in a SageMaker Processing Job) reads from S3 at its own pace.
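The handoff can be sketched with plain-Python stand-ins: a temp directory stands in for the Delta table on S3, and the two functions stand in for the Databricks and SageMaker stages. All names and data below are hypothetical:

```python
import json
import tempfile
from pathlib import Path

# Decoupled handoff: the transient cluster's only contract with the next
# stage is the data it leaves behind in durable storage.

def preprocessing_stage(output_dir: Path) -> None:
    """Runs on the transient job cluster; writes output, then the cluster dies."""
    records = [{"sku": "A1", "demand": 42}, {"sku": "B2", "demand": 7}]
    (output_dir / "part-0.json").write_text(json.dumps(records))
    # ...cluster terminates here; nothing held in memory survives the run.

def transformation_stage(input_dir: Path) -> list[dict]:
    """Runs later (e.g. a SageMaker Processing Job); reads at its own pace."""
    return [row
            for part in sorted(input_dir.glob("part-*.json"))
            for row in json.loads(part.read_text())]

staging = Path(tempfile.mkdtemp())
preprocessing_stage(staging)
rows = transformation_stage(staging)  # works even though stage 1 is long gone
```

The design point is that neither stage ever talks to the other's compute: durable storage is the interface, so each stage can scale, fail, and retry independently.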