Workflow-orchestrated pipeline provisioning¶
Summary¶
When a platform needs to provision multi-step data-plane infrastructure repeatedly — each instance composed of N precise configuration steps that must all succeed or roll back together — express the provisioning flow as a durable workflow (e.g. Temporal) of modular, retryable tasks, rather than as a runbook executed by an engineer or a shell script invoked by CI.
Problem¶
Complex multi-step provisioning runbooks produce exponential operational load as they scale out across many pipelines × data centres × teams:
- Steps fail independently; partial-success states require manual forensic triage.
- Idempotency is not free — each step must be safe to re-run.
- Long-running steps (deploy instances, wait for a slot to attach) strain shell-based drivers.
- Error handling, retry, and rollback logic tend to be copy-pasted + drifted across teams.
- There's no visible audit trail of which step in which instance is currently running.
Datadog, 2025-11-04:
"When replicated across many pipelines and data centers, the operational load grew exponentially." (Source: sources/2025-11-04-datadog-replication-redefined-multi-tenant-cdc-platform)
Solution¶
Decompose the provisioning runbook into modular, reliable tasks, then stitch them together into higher-level orchestrations using a durable-workflow engine.
Datadog's framing:
"Using Temporal workflows, we broke the provisioning process into modular, reliable tasks — then stitched them together into higher-level orchestrations. This made it easy for teams to create, manage, and experiment with new replication pipelines without getting bogged down in manual, error-prone steps." (Source: sources/2025-11-04-datadog-replication-redefined-multi-tenant-cdc-platform)
Concretely, for a Postgres-to-Kafka-to-Elasticsearch CDC pipeline, the 7-step manual runbook (enable wal_level=logical, create Postgres users, create publications + slots, deploy Debezium, create Kafka topics, set up heartbeat tables, configure the sink connector) becomes seven or more small Temporal activities, composed into a single ProvisionReplicationPipeline workflow.
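The decomposition can be sketched in plain Python (this is not Temporal SDK code — function names, the dict-based state, and the step bodies are all illustrative; real Temporal code would mark these with @activity.defn and @workflow.defn from the temporalio SDK):

```python
# Each runbook step becomes a small, independently retryable "activity";
# the workflow function composes them in order. Step bodies here just
# record state -- stand-ins for the real Postgres/Kafka/sink calls.

def enable_logical_replication(state): state["wal_level"] = "logical"
def create_replication_user(state):    state["user"] = "debezium"
def create_publication_and_slot(state): state["slot"] = "cdc_slot"
def deploy_debezium(state):            state["connector"] = "deployed"
def create_kafka_topics(state):        state["topics"] = ["cdc.events"]
def set_up_heartbeat_table(state):     state["heartbeat"] = True
def configure_sink_connector(state):   state["sink"] = "elasticsearch"

ACTIVITIES = [
    enable_logical_replication,
    create_replication_user,
    create_publication_and_slot,
    deploy_debezium,
    create_kafka_topics,
    set_up_heartbeat_table,
    configure_sink_connector,
]

def provision_replication_pipeline(state):
    """Workflow: run every activity in order; each one is a separate
    retry unit, not one monolithic script."""
    for activity in ACTIVITIES:
        activity(state)
    return state

state = provision_replication_pipeline({})
```

The point of the shape is the boundary: each function is a unit the engine can retry, time out, or compensate individually, while the composing function is the single auditable definition of "a pipeline".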
Durable-execution properties the pattern inherits:
- Retry per activity (configurable backoff, max attempts).
- Replay — a workflow worker can crash and resume from the event history without re-running completed activities.
- Timers — long waits (for a slot to reach a certain LSN, for a sink to catch up) encoded as first-class sleep nodes.
- Compensations — rollback branches for partial-failure cleanup (drop the slot, remove the publication, delete the topic).
- Audit trail — the event history of every provision instance is queryable.
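The retry and compensation properties can be shown engine-agnostically with a minimal saga-style loop (a hand-rolled illustration, not the Temporal API; attempt counts, backoff values, and step names are assumptions):

```python
import time

def run_with_retry(activity, state, max_attempts=3, backoff=0.01):
    """Per-activity retry with exponential backoff -- what a workflow
    engine provides via a configurable retry policy."""
    for attempt in range(1, max_attempts + 1):
        try:
            return activity(state)
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(backoff * 2 ** (attempt - 1))

def run_saga(steps, state):
    """Run (activity, compensation) pairs; on failure, undo completed
    steps in reverse order (drop the slot, remove the publication...)."""
    done = []
    try:
        for activity, compensate in steps:
            run_with_retry(activity, state)
            done.append(compensate)
    except Exception:
        for compensate in reversed(done):
            compensate(state)
        raise

# Demo: the second step always fails, so the first step is rolled back.
def create_slot(s):    s["slot"] = "cdc_slot"
def drop_slot(s):      s["slot"] = None
def failing_deploy(s): raise RuntimeError("deploy failed")

state = {"slot": None}
try:
    run_saga([(create_slot, drop_slot),
              (failing_deploy, lambda s: None)], state)
except RuntimeError:
    pass  # provision failed, but rollback ran: the slot was dropped
```

What the sketch cannot show is replay: a real durable-workflow engine persists the event history, so a crashed worker resumes mid-saga without re-running the completed activities.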
Benefits¶
- Linear operational load — adding a pipeline means starting another workflow execution, not writing another runbook.
- Self-service — teams can trigger provisioning workflows without platform-team intervention.
- Safe experimentation — a failed provision is a failed workflow, not a half-configured Postgres instance.
- Consistency — every pipeline is set up identically because the workflow definition is the spec.
- Observability — the workflow engine's UI / metrics surface per-step progress, retry counts, and failures.
Contrast with alternatives¶
| Approach | Retry | Rollback | Long wait | Audit | Drift across teams |
|---|---|---|---|---|---|
| Manual runbook | No | No | — | No | High |
| Shell script / Ansible | Ad hoc | Ad hoc | Brittle | Minimal | High |
| CI job | Limited | Limited | Bad fit | Yes | Medium |
| Workflow engine (Temporal / Cadence / Airflow) | Yes | Yes | First-class | Yes | Low |
The workflow-engine row is the only one that gets all five properties without bespoke implementation per team.
Generalisation beyond CDC¶
The pattern applies wherever provisioning has:
- A multiplicative cost structure — many tenants × many instances, so per-instance provisioning effort compounds.
- Multi-step configuration with independent failure modes.
- Long-running async operations (wait for a resource to become ready).
- Rollback semantics that matter (partial success is not acceptable).
Canonical instances across the wiki include infrastructure provisioning in general, cloud-resource lifecycle management, multi-step migrations between storage systems, and — as documented here — CDC pipeline provisioning.
Caveats¶
- Introduces a dependency on the workflow engine itself — the engine must be operated as a reliable platform service (or consumed as a hosted product).
- Workflow-vs-activity boundary design matters: too granular and the event history bloats; too coarse and you lose retry/compensation granularity.
- Some provisioning side effects are not idempotent (creating a user, issuing a secret) — activities must be written with explicit idempotency keys or compensating actions.
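One common way to handle the last caveat is an explicit idempotency key checked before the side effect runs, so a retried activity becomes a no-op (a sketch; the in-memory key store and the effect are illustrative — a real implementation would use a durable store):

```python
_completed = set()  # in practice: a durable store, not process memory

def run_once(idempotency_key, effect):
    """Execute the side effect only if this key has not run before.
    The key is recorded only after success, so a failed attempt is
    still retried, while a successful one is never repeated."""
    if idempotency_key in _completed:
        return "skipped"
    result = effect()
    _completed.add(idempotency_key)
    return result

# A retried "create user" activity must not issue a second secret.
issued = []
run_once("pipeline-42/create-user", lambda: issued.append("secret-1"))
run_once("pipeline-42/create-user", lambda: issued.append("secret-2"))
```

The key should identify the logical operation (workflow ID plus step), not the attempt, so that retries of the same step dedupe while distinct pipelines do not.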
Seen in¶
- sources/2025-11-04-datadog-replication-redefined-multi-tenant-cdc-platform — Datadog uses Temporal workflows to automate the 7-step manual CDC-pipeline provisioning runbook (enable logical replication, create users, create publications + slots, deploy Debezium, create topics, set up heartbeat tables, configure sink connectors). Canonical wiki instance of the pattern; the retrospective explicitly names automation "a foundational principle" and the exponential-load-growth motivation.
Related¶
- systems/temporal — the workflow engine Datadog used.
- concepts/durable-execution — the broader concept Temporal realises.
- patterns/managed-replication-platform — the full platform shape this provisioning-automation layer fits into.
- patterns/debezium-kafka-connect-cdc-pipeline — the transport backbone whose provisioning this pattern automates.