PATTERN
Transient Databricks cluster per run
Problem
Databricks clusters can be shared (interactive clusters, long-lived) or transient (job clusters, one per run). Shared clusters have two genuine benefits:
- They save on startup time (always warm).
- They share resources across concurrent jobs (packing efficiency).
But they also:
- Expose every job to every other job's failures: a bad Spark plan, a driver OOM, or a runaway notebook can take down unrelated parallel jobs.
- Make per-run environment customisation hard.
For batch ML pipelines (nightly forecasts, daily optimisation), none of the shared-cluster benefits outweigh the failure-coupling and configuration-headache costs.
Pattern
Spin up a dedicated Databricks job cluster per pipeline run. Tear it down when the run completes.
- Each run gets its own driver and executors.
- Each run's configuration (Spark version, library dependencies, instance type) can differ from any other run's.
- A failure in one run's cluster cannot affect any other run, including parallel runs of the same pipeline for different partitions or tenants.
- Cost is amortised over the run's wall-clock time; the cluster doesn't exist between runs.
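Concretely, the pattern is expressed in a Databricks job definition by giving the task a `new_cluster` block instead of an `existing_cluster_id`. A minimal sketch of a Jobs API 2.1 payload follows; the field names are the API's, but the job name, notebook path, versions, node type, and library are illustrative placeholders:

```python
# Sketch of a Databricks Jobs API 2.1 payload that provisions a transient
# job cluster per run. The specific versions, node type, notebook path,
# and library are illustrative, not prescriptive.
job_spec = {
    "name": "nightly-forecast",
    "tasks": [
        {
            "task_key": "forecast",
            "notebook_task": {"notebook_path": "/pipelines/forecast"},
            # `new_cluster` (rather than `existing_cluster_id`) is what makes
            # the cluster transient: created at run start, torn down at end.
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 4,
                "spark_conf": {"spark.sql.shuffle.partitions": "64"},
            },
            # Per-run library pinning: each run resolves its own dependencies.
            "libraries": [{"pypi": {"package": "prophet==1.1.4"}}],
        }
    ],
}
```

Because the cluster spec lives inside the job definition, two pipelines (or two runs of the same pipeline) can pin different Spark versions or instance types without coordinating with anyone.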
When to use
- Batch pipelines with predictable wall-clock time. Startup cost (~1–3 min for a modest cluster) amortises well over a multi-hour job.
- Parallel runs with failure-isolation requirements. For example, running one forecast pipeline per merchant region in parallel: a bad merchant's data should not crash the other regions.
- Per-run configuration differences. Different pipelines need different library versions, Spark configurations, or instance types.
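The amortisation argument above (and the "very short jobs" caveat below) is simple arithmetic; a quick sketch with illustrative numbers:

```python
# Startup overhead as a fraction of total wall-clock time.
# Numbers are illustrative, not measured.
startup_min = 3          # worst case of the ~1-3 min startup range
run_min = 3 * 60         # a multi-hour batch job

overhead = startup_min / (startup_min + run_min)
print(f"{overhead:.1%}")  # 1.6% -- startup is noise on a long batch job

# The same startup cost dominates a very short job:
short_run_min = 5
short_overhead = startup_min / (startup_min + short_run_min)
print(f"{short_overhead:.1%}")  # 37.5% -- here a warm shared cluster wins
```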
When to avoid
- Interactive exploration / notebooks. Startup cost per query is unacceptable.
- Streaming pipelines. Need long-lived compute.
- Very short jobs. Startup cost dominates.
Canonical instance (Zalando ZEOS)
Zalando ZEOS's inventory-optimisation pipelines use this pattern explicitly as a scalability lever:
"Every run triggers dedicated Databricks Job clusters and SageMaker processing/training jobs. This ensures robust and independent runs and resources, i.e. a failure of one execution in the Databricks job cluster does not impact a parallel execution."
The pattern is paired with Delta Lake as the stable intermediate storage: one job cluster writes to Delta, and the next stage (in a different job cluster) reads from Delta.
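That handoff can be sketched as a multi-task job in which each stage gets its own `new_cluster` and the stages share only a Delta path. This is a hypothetical illustration, not the ZEOS implementation; the paths, job name, stage names, and cluster sizes are invented, while the field names follow the Databricks Jobs API 2.1:

```python
# Sketch: two pipeline stages, each on its own transient job cluster,
# handing off data through a Delta table. Paths and names are hypothetical.
DELTA_PATH = "s3://example-bucket/pipeline/intermediate"  # assumed location

def stage(task_key, notebook, workers, depends_on=None):
    """Build one task with its own `new_cluster`, so stages are isolated:
    a crash in one stage's cluster cannot take down another stage's."""
    task = {
        "task_key": task_key,
        "notebook_task": {
            "notebook_path": notebook,
            # The Delta path is the only contract between the stages.
            "base_parameters": {"delta_path": DELTA_PATH},
        },
        "new_cluster": {
            "spark_version": "13.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": workers,
        },
    }
    if depends_on:
        task["depends_on"] = [{"task_key": depends_on}]
    return task

job_spec = {
    "name": "forecast-with-delta-handoff",
    "tasks": [
        # Stage 1 writes to Delta on a larger cluster...
        stage("prepare", "/pipelines/prepare", workers=8),
        # ...stage 2 reads from Delta on a smaller, differently sized one.
        stage("train", "/pipelines/train", workers=2, depends_on="prepare"),
    ],
}
```

Note that each stage can also be sized independently: the I/O-heavy write stage and the compute-light read stage need not share a cluster shape, which is exactly the per-run configurability the pattern buys.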