
PATTERN

Transient Databricks cluster per run

Problem

Databricks clusters can be shared (interactive clusters, long-lived) or transient (job clusters, one per run). Shared clusters:

  • Save on startup time (always warm).
  • Share resources across concurrent jobs (packing efficiency).
  • Expose every job to every other job's failures: a bad Spark plan, a driver OOM, or a runaway notebook can take down unrelated jobs running in parallel.
  • Make per-run environment customisation hard.

For batch ML pipelines (nightly forecasts, daily optimisation), none of the shared-cluster benefits outweigh the failure-coupling and configuration-headache costs.

Pattern

Spin up a dedicated Databricks job cluster per pipeline run. Tear it down when the run completes.

  • Each run gets its own driver + executors.
  • Each run's configuration (Spark version, library dependencies, instance type) can differ from any other run's.
  • A failure in one run's cluster cannot affect any other run — including parallel runs of the same pipeline for different partitions / tenants.
  • Cost is amortised over the run's wall-clock time; the cluster doesn't exist between runs, so it incurs no idle cost.
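The pattern maps directly onto the Databricks Jobs API: a task that specifies `new_cluster` (rather than `existing_cluster_id`) gets a cluster created for that run and terminated when the run finishes. A minimal sketch of such a payload follows; the field names match the public Jobs API 2.1, but the job name, notebook path, Spark version, and instance type are placeholders.

```python
def job_cluster_payload(job_name: str, notebook_path: str) -> dict:
    """Build a jobs/create payload that provisions a transient cluster per run."""
    return {
        "name": job_name,
        "tasks": [
            {
                "task_key": "run_pipeline",
                "notebook_task": {"notebook_path": notebook_path},
                # "new_cluster" (not "existing_cluster_id") is what makes the
                # cluster transient: created for this run, torn down after it.
                "new_cluster": {
                    "spark_version": "14.3.x-scala2.12",  # placeholder
                    "node_type_id": "i3.xlarge",          # placeholder
                    "num_workers": 4,
                },
            }
        ],
    }

payload = job_cluster_payload("nightly-forecast", "/pipelines/forecast")
# POST this to /api/2.1/jobs/create on your workspace to register the job.
```

Because every run carries its own `new_cluster` block, two pipelines (or two parallel runs) can specify different Spark versions, libraries, or instance types without coordinating.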

When to use

  • Batch pipelines with predictable wall-clock time. Startup cost (~1–3 min for a modest cluster) amortises well over a multi-hour job.
  • Parallel runs with failure isolation requirements. E.g. running one forecast pipeline per merchant region in parallel, where one merchant's bad data must not crash the other regions' runs.
  • Per-run configuration differences. Different pipelines need different library versions, Spark configurations, or instance types.

When to avoid

  • Interactive exploration / notebooks. Startup cost per query is unacceptable.
  • Streaming pipelines. Need long-lived compute.
  • Very short jobs. Startup cost dominates.
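The "startup cost dominates" boundary is easy to check with arithmetic. A minimal sketch, using the ~1–3 min startup figure cited above; the thresholds are illustrative, not Databricks guidance:

```python
def startup_overhead(startup_min: float, run_min: float) -> float:
    """Fraction of total wall-clock time spent waiting for the cluster."""
    return startup_min / (startup_min + run_min)

# Multi-hour batch job: 2 min startup on a 180 min run is ~1% overhead.
batch = startup_overhead(2, 180)

# Very short job: 2 min startup on a 3 min run is 40% overhead -- the
# case where a shared (warm) cluster may be the better trade.
short = startup_overhead(2, 3)
```

The same calculation explains why interactive notebooks sit on the "avoid" list: a per-query startup cost pushes overhead toward 100%.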

Canonical instance (Zalando ZEOS)

Zalando ZEOS's inventory-optimisation pipelines explicitly invoke this pattern as a scalability lever:

"Every run triggers dedicated Databricks Job clusters and SageMaker processing/training jobs. This ensures robust and independent runs and resources, i.e. a failure of one execution in the Databricks job cluster does not impact a parallel execution."

The pattern is paired with Delta Lake as stable intermediate storage: one job cluster writes to Delta, the next (a different job cluster) reads from Delta.
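The Delta handoff between two transient clusters can be sketched in PySpark. This is a fragment, not a runnable program: it assumes a Databricks-style runtime with Delta Lake available, a live `SparkSession`, a DataFrame to write, and a placeholder table path.

```python
from pyspark.sql import DataFrame, SparkSession

TABLE = "s3://bucket/pipelines/forecast/features"  # placeholder path

def write_stage(df: DataFrame) -> None:
    # Runs in job cluster A; the cluster terminates after this completes.
    df.write.format("delta").mode("overwrite").save(TABLE)

def read_stage(spark: SparkSession) -> DataFrame:
    # Runs later (or in parallel) in job cluster B. Delta's transaction
    # log gives a consistent snapshot even though cluster A is gone.
    return spark.read.format("delta").load(TABLE)
```

The storage, not the cluster, carries state across runs, which is what lets the compute be fully disposable.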
