
PATTERN Cited by 1 source

Workflow-orchestrated pipeline provisioning

Summary

When a platform needs to provision multi-step data-plane infrastructure repeatedly — each instance composed of N precise configuration steps that must all succeed or roll back together — express the provisioning flow as a durable workflow (e.g. Temporal) of modular, retryable tasks, rather than as a runbook executed by an engineer or a shell script invoked by CI.

Problem

Complex multi-step provisioning runbooks produce operational load that grows exponentially as they are replicated across many pipelines × data centers × teams:

  • Steps fail independently; partial-success states require manual forensic triage.
  • Idempotency is not free — each step must be safe to re-run.
  • Long-running steps (deploy instances, wait for a slot to attach) strain shell-based drivers.
  • Error handling, retry, and rollback logic tend to be copy-pasted across teams and then drift.
  • There's no visible audit trail of which step in which instance is currently running.

Datadog, 2025-11-04:

"When replicated across many pipelines and data centers, the operational load grew exponentially." (Source: sources/2025-11-04-datadog-replication-redefined-multi-tenant-cdc-platform)

Solution

Decompose the provisioning runbook into modular, reliable tasks, then stitch them together into higher-level orchestrations using a durable-workflow engine.

Datadog's framing:

"Using Temporal workflows, we broke the provisioning process into modular, reliable tasks — then stitched them together into higher-level orchestrations. This made it easy for teams to create, manage, and experiment with new replication pipelines without getting bogged down in manual, error-prone steps." (Source: sources/2025-11-04-datadog-replication-redefined-multi-tenant-cdc-platform)

Concretely, for a Postgres-to-Kafka-to-Elasticsearch CDC pipeline, the 7-step manual runbook (enable wal_level=logical, create Postgres users, create publications + slots, deploy Debezium, create Kafka topics, set up heartbeat tables, configure sink connector) becomes 7+ small Temporal activities, composed into a single ProvisionReplicationPipeline workflow.
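The decomposition can be sketched in plain Python (not actual Temporal SDK code — activity registration, retry policies, and timeouts are elided, and all step and parameter names are hypothetical): each runbook step becomes one small activity function, and the workflow is just their ordered composition.

```python
def provision_replication_pipeline(pipeline_id: str) -> list[str]:
    """The workflow: an ordered composition of small activities, one
    per runbook step. In Temporal, each call would be an activity
    invocation with its own retry policy and timeout."""
    steps = [
        enable_logical_replication,     # wal_level=logical
        create_postgres_users,
        create_publications_and_slots,
        deploy_debezium,
        create_kafka_topics,
        set_up_heartbeat_tables,
        configure_sink_connector,
    ]
    completed = []
    for step in steps:
        completed.append(step(pipeline_id))  # each step retryable in isolation
    return completed

# Stub activities; the real versions would perform the side effects.
def enable_logical_replication(p): return f"{p}: wal_level=logical"
def create_postgres_users(p): return f"{p}: users"
def create_publications_and_slots(p): return f"{p}: publication+slot"
def deploy_debezium(p): return f"{p}: debezium"
def create_kafka_topics(p): return f"{p}: topics"
def set_up_heartbeat_tables(p): return f"{p}: heartbeat"
def configure_sink_connector(p): return f"{p}: sink"
```

The point of the shape is that the workflow body carries no error handling at all — retry, timeout, and rollback live on the activities, so the orchestration stays readable as the spec.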

Durable-execution properties the pattern inherits:

  • Retry per activity (configurable backoff, max attempts).
  • Replay — a workflow worker can crash and resume from the event history without re-running completed activities.
  • Timers — long waits (for a slot to reach a certain LSN, for a sink to catch up) encoded as first-class sleep nodes.
  • Compensations — rollback branches for partial-failure cleanup (drop the slot, remove the publication, delete the topic).
  • Audit trail — the event history of every provision instance is queryable.
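Two of these properties — per-activity retry with backoff, and compensations on partial failure — can be sketched without any workflow engine (a real Temporal worker gets them from activity retry policies and the event history instead; all names below are illustrative):

```python
import time

def run_with_retry(fn, max_attempts=3, base_delay=0.01):
    """Retry an activity with exponential backoff, re-raising on the
    final attempt (a stand-in for an engine's retry policy)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))

def provision(steps):
    """steps is a list of (do, undo) pairs. Run each `do` with retry;
    on an unrecoverable failure, run the `undo` compensations of every
    completed step in reverse order, then re-raise."""
    completed = []
    try:
        for do, undo in steps:
            run_with_retry(do)
            completed.append(undo)
    except Exception:
        for undo in reversed(completed):
            undo()  # e.g. drop the slot, remove the publication
        raise
```

This is the saga shape: a failed provision unwinds to a clean state rather than leaving a half-configured instance behind.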

Benefits

  • Linear operational load — adding a new pipeline is adding a new workflow execution, not a new runbook.
  • Self-service — teams can trigger provisioning workflows without platform-team intervention.
  • Safe experimentation — a failed provision is a failed workflow, not a half-configured Postgres instance.
  • Consistency — every pipeline is set up identically because the workflow definition is the spec.
  • Observability — the workflow engine's UI / metrics surface per-step progress, retry counts, and failures.

Contrast with alternatives

| Approach | Retry | Rollback | Long wait | Audit | Drift across teams |
| --- | --- | --- | --- | --- | --- |
| Manual runbook | No | No | No | No | High |
| Shell script / Ansible | Ad hoc | Ad hoc | Brittle | Minimal | High |
| CI job | Limited | Limited | Bad fit | Yes | Medium |
| Workflow engine (Temporal / Cadence / Airflow) | Yes | Yes | First-class | Yes | Low |

The workflow-engine row is the only one that gets all five properties without bespoke implementation per team.

Generalisation beyond CDC

The pattern applies wherever provisioning has:

  • Many tenants × many instances, so every per-instance cost is multiplied.
  • Multi-step configuration with independent failure modes.
  • Long-running async operations (wait for a resource to become ready).
  • Rollback semantics that matter (partial success is not acceptable).

Canonical instances across the wiki include infrastructure provisioning in general, cloud-resource lifecycle management, multi-step migrations between storage systems, and — as documented here — CDC pipeline provisioning.

Caveats

  • Introduces a dependency on the workflow engine itself — the engine must be operated as a reliable platform service (or consumed as a hosted product).
  • Workflow-vs-activity boundary design matters: too granular and the event history bloats; too coarse and you lose retry/compensation granularity.
  • Some provisioning side effects are not idempotent (creating a user, issuing a secret) — activities must be written with explicit idempotency keys or compensating actions.
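The idempotency-key caveat can be sketched as follows (an in-memory dict stands in for a durable key store, and the class and method names are hypothetical):

```python
class IdempotentActivity:
    """Guard a non-idempotent side effect (here, user creation) with an
    idempotency key: a retried or replayed call with the same key
    returns the recorded result instead of repeating the effect."""

    def __init__(self):
        self._completed = {}   # idempotency_key -> recorded result
        self.side_effects = 0  # how many times the real effect ran

    def create_user(self, idempotency_key: str, username: str) -> str:
        if idempotency_key in self._completed:
            return self._completed[idempotency_key]  # safe re-run: no-op
        self.side_effects += 1  # the real CREATE USER would happen here
        result = f"created:{username}"
        self._completed[idempotency_key] = result
        return result
```

Deriving the key from the workflow and step identity (e.g. `"<workflow-id>/create-user"`) makes every retry of the same step hit the same key, so the activity becomes safe under the engine's at-least-once execution.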

Seen in

  • sources/2025-11-04-datadog-replication-redefined-multi-tenant-cdc-platform — Datadog uses Temporal workflows to automate the 7-step manual CDC-pipeline provisioning runbook (enable logical replication, create users, create publications + slots, deploy Debezium, create topics, set up heartbeat tables, configure sink connectors). Canonical wiki instance of the pattern; the retrospective explicitly names automation "a foundational principle" and the exponential-load-growth motivation.