PATTERN
Transient Databricks cluster per run
Problem
Databricks clusters can be shared (interactive clusters, long-lived) or transient (job clusters, one per run). Shared clusters have two genuine benefits:
- They save on startup time (always warm).
- They share resources across concurrent jobs (packing efficiency).
But they also:
- Expose every job to every other job's failures: a bad Spark plan, a driver OOM, or a runaway notebook can take down unrelated parallel jobs.
- Make per-run environment customisation hard.
For batch ML pipelines (nightly forecasts, daily optimisation), none of the shared-cluster benefits outweigh the failure-coupling and configuration-headache costs.
Pattern
Spin up a dedicated Databricks job cluster per pipeline run. Tear it down when the run completes.
- Each run gets its own driver and executors.
- Each run's configuration (Spark version, library dependencies, instance type) can differ from any other run's.
- A failure in one run's cluster cannot affect any other run, including parallel runs of the same pipeline for different partitions or tenants.
- Cost is amortised over the run's wall-clock time; the cluster doesn't exist between runs.
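Concretely, the pattern is expressed in a Databricks job definition by giving the task a `new_cluster` block instead of an `existing_cluster_id`. A minimal sketch of a Jobs API 2.1 payload follows; the field names are the API's, but the job name, notebook path, versions, node type, and library are illustrative placeholders:

```python
# Sketch of a Databricks Jobs API 2.1 payload that provisions a transient
# job cluster per run. The specific versions, node type, notebook path,
# and library are illustrative, not prescriptive.
job_spec = {
    "name": "nightly-forecast",
    "tasks": [
        {
            "task_key": "forecast",
            "notebook_task": {"notebook_path": "/pipelines/forecast"},
            # `new_cluster` (rather than `existing_cluster_id`) is what makes
            # the cluster transient: created at run start, torn down at end.
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 4,
                "spark_conf": {"spark.sql.shuffle.partitions": "64"},
            },
            # Per-run library pinning: each run resolves its own dependencies.
            "libraries": [{"pypi": {"package": "prophet==1.1.4"}}],
        }
    ],
}
```

Because the cluster spec lives inside the job definition, two pipelines (or two runs of the same pipeline) can pin different Spark versions or instance types without coordinating with anyone.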
When to use
- Batch pipelines with predictable wall-clock time. Startup cost (~1–3 min for a modest cluster) amortises well over a multi-hour job.
- Parallel runs with failure-isolation requirements. For example, running one forecast pipeline per merchant region in parallel: a bad merchant's data should not crash the other regions.
- Per-run configuration differences. Different pipelines need different library versions, Spark configurations, or instance types.
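The amortisation argument above (and the "very short jobs" caveat below) is simple arithmetic; a quick sketch with illustrative numbers:

```python
# Startup overhead as a fraction of total wall-clock time.
# Numbers are illustrative, not measured.
startup_min = 3          # worst case of the ~1-3 min startup range
run_min = 3 * 60         # a multi-hour batch job

overhead = startup_min / (startup_min + run_min)
print(f"{overhead:.1%}")  # 1.6% -- startup is noise on a long batch job

# The same startup cost dominates a very short job:
short_run_min = 5
short_overhead = startup_min / (startup_min + short_run_min)
print(f"{short_overhead:.1%}")  # 37.5% -- here a warm shared cluster wins
```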
When to avoid
- Interactive exploration / notebooks. Startup cost per query is unacceptable.
- Streaming pipelines. Need long-lived compute.
- Very short jobs. Startup cost dominates.
Canonical instance (Zalando ZEOS)
Zalando ZEOS's inventory-optimisation pipelines use this pattern explicitly as a scalability lever:
"Every run triggers dedicated Databricks Job clusters and SageMaker processing/training jobs. This ensures robust and independent runs and resources, i.e. a failure of one execution in the Databricks job cluster does not impact a parallel execution."
The pattern is paired with Delta Lake as the stable intermediate storage: one job cluster writes to Delta, and the next stage (in a different job cluster) reads from Delta.
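That handoff can be sketched as a multi-task job in which each stage gets its own `new_cluster` and the stages share only a Delta path. This is a hypothetical illustration, not the ZEOS implementation; the paths, job name, stage names, and cluster sizes are invented, while the field names follow the Databricks Jobs API 2.1:

```python
# Sketch: two pipeline stages, each on its own transient job cluster,
# handing off data through a Delta table. Paths and names are hypothetical.
DELTA_PATH = "s3://example-bucket/pipeline/intermediate"  # assumed location

def stage(task_key, notebook, workers, depends_on=None):
    """Build one task with its own `new_cluster`, so stages are isolated:
    a crash in one stage's cluster cannot take down another stage's."""
    task = {
        "task_key": task_key,
        "notebook_task": {
            "notebook_path": notebook,
            # The Delta path is the only contract between the stages.
            "base_parameters": {"delta_path": DELTA_PATH},
        },
        "new_cluster": {
            "spark_version": "13.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": workers,
        },
    }
    if depends_on:
        task["depends_on"] = [{"task_key": depends_on}]
    return task

job_spec = {
    "name": "forecast-with-delta-handoff",
    "tasks": [
        # Stage 1 writes to Delta on a larger cluster...
        stage("prepare", "/pipelines/prepare", workers=8),
        # ...stage 2 reads from Delta on a smaller, differently sized one.
        stage("train", "/pipelines/train", workers=2, depends_on="prepare"),
    ],
}
```

Note that each stage can also be sized independently: the I/O-heavy write stage and the compute-light read stage need not share a cluster shape, which is exactly the per-run configurability the pattern buys.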