Workflow-orchestrated pipeline provisioning¶
Summary¶
When a platform needs to provision multi-step data-plane infrastructure repeatedly — each instance composed of N precise configuration steps that must all succeed or roll back together — express the provisioning flow as a durable workflow (e.g. Temporal) of modular, retryable tasks, rather than as a runbook executed by an engineer or a shell script invoked by CI.
Problem¶
Complex multi-step provisioning runbooks produce exponential operational load as they scale out across many pipelines × data centres × teams:
- Steps fail independently; partial-success states require manual forensic triage.
- Idempotency is not free — each step must be safe to re-run.
- Long-running steps (deploy instances, wait for a slot to attach) strain shell-based drivers.
- Error handling, retry, and rollback logic tend to be copy-pasted + drifted across teams.
- There's no visible audit trail of which step in which instance is currently running.
Datadog, 2025-11-04:
"When replicated across many pipelines and data centers, the operational load grew exponentially." (Source: sources/2025-11-04-datadog-replication-redefined-multi-tenant-cdc-platform)
Solution¶
Decompose the provisioning runbook into modular, reliable tasks, then stitch them together into higher-level orchestrations using a durable-workflow engine.
Datadog's framing:
"Using Temporal workflows, we broke the provisioning process into modular, reliable tasks — then stitched them together into higher-level orchestrations. This made it easy for teams to create, manage, and experiment with new replication pipelines without getting bogged down in manual, error-prone steps." (Source: sources/2025-11-04-datadog-replication-redefined-multi-tenant-cdc-platform)
Concretely, for a Postgres-to-Kafka-to-Elasticsearch CDC pipeline, the 7-step manual runbook (enable wal_level=logical, create Postgres users, create publications + slots, deploy Debezium, create Kafka topics, set up heartbeat tables, configure the sink connector) becomes seven or more small Temporal activities, composed into a single ProvisionReplicationPipeline workflow.
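The decomposition can be sketched in plain Python (this is not Temporal SDK code — function names, the dict-based state, and the step bodies are all illustrative; real Temporal code would mark these with @activity.defn and @workflow.defn from the temporalio SDK):

```python
# Each runbook step becomes a small, independently retryable "activity";
# the workflow function composes them in order. Step bodies here just
# record state -- stand-ins for the real Postgres/Kafka/sink calls.

def enable_logical_replication(state): state["wal_level"] = "logical"
def create_replication_user(state):    state["user"] = "debezium"
def create_publication_and_slot(state): state["slot"] = "cdc_slot"
def deploy_debezium(state):            state["connector"] = "deployed"
def create_kafka_topics(state):        state["topics"] = ["cdc.events"]
def set_up_heartbeat_table(state):     state["heartbeat"] = True
def configure_sink_connector(state):   state["sink"] = "elasticsearch"

ACTIVITIES = [
    enable_logical_replication,
    create_replication_user,
    create_publication_and_slot,
    deploy_debezium,
    create_kafka_topics,
    set_up_heartbeat_table,
    configure_sink_connector,
]

def provision_replication_pipeline(state):
    """Workflow: run every activity in order; each one is a separate
    retry unit, not one monolithic script."""
    for activity in ACTIVITIES:
        activity(state)
    return state

state = provision_replication_pipeline({})
```

The point of the shape is the boundary: each function is a unit the engine can retry, time out, or compensate individually, while the composing function is the single auditable definition of "a pipeline".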
Durable-execution properties the pattern inherits:
- Retry per activity (configurable backoff, max attempts).
- Replay — a workflow worker can crash and resume from the event history without re-running completed activities.
- Timers — long waits (for a slot to reach a certain LSN, for a sink to catch up) encoded as first-class sleep nodes.
- Compensations — rollback branches for partial-failure cleanup (drop the slot, remove the publication, delete the topic).
- Audit trail — the event history of every provision instance is queryable.
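The retry and compensation properties can be shown engine-agnostically with a minimal saga-style loop (a hand-rolled illustration, not the Temporal API; attempt counts, backoff values, and step names are assumptions):

```python
import time

def run_with_retry(activity, state, max_attempts=3, backoff=0.01):
    """Per-activity retry with exponential backoff -- what a workflow
    engine provides via a configurable retry policy."""
    for attempt in range(1, max_attempts + 1):
        try:
            return activity(state)
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(backoff * 2 ** (attempt - 1))

def run_saga(steps, state):
    """Run (activity, compensation) pairs; on failure, undo completed
    steps in reverse order (drop the slot, remove the publication...)."""
    done = []
    try:
        for activity, compensate in steps:
            run_with_retry(activity, state)
            done.append(compensate)
    except Exception:
        for compensate in reversed(done):
            compensate(state)
        raise

# Demo: the second step always fails, so the first step is rolled back.
def create_slot(s):    s["slot"] = "cdc_slot"
def drop_slot(s):      s["slot"] = None
def failing_deploy(s): raise RuntimeError("deploy failed")

state = {"slot": None}
try:
    run_saga([(create_slot, drop_slot),
              (failing_deploy, lambda s: None)], state)
except RuntimeError:
    pass  # provision failed, but rollback ran: the slot was dropped
```

What the sketch cannot show is replay: a real durable-workflow engine persists the event history, so a crashed worker resumes mid-saga without re-running the completed activities.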
Benefits¶
- Linear operational load — adding a pipeline means starting another workflow execution, not writing another runbook.
- Self-service — teams can trigger provisioning workflows without platform-team intervention.
- Safe experimentation — a failed provision is a failed workflow, not a half-configured Postgres instance.
- Consistency — every pipeline is set up identically because the workflow definition is the spec.
- Observability — the workflow engine's UI / metrics surface per-step progress, retry counts, and failures.
Contrast with alternatives¶
| Approach | Retry | Rollback | Long wait | Audit | Drift across teams |
|---|---|---|---|---|---|
| Manual runbook | No | No | — | No | High |
| Shell script / Ansible | Ad hoc | Ad hoc | Brittle | Minimal | High |
| CI job | Limited | Limited | Bad fit | Yes | Medium |
| Workflow engine (Temporal / Cadence / Airflow) | Yes | Yes | First-class | Yes | Low |
The workflow-engine row is the only one that gets all five properties without bespoke implementation per team.
Generalisation beyond CDC¶
The pattern applies wherever provisioning has:
- A multiplicative cost structure — many tenants × many instances, so per-instance provisioning effort compounds.
- Multi-step configuration with independent failure modes.
- Long-running async operations (wait for a resource to become ready).
- Rollback semantics that matter (partial success is not acceptable).
Canonical instances across the wiki include infrastructure provisioning in general, cloud-resource lifecycle management, multi-step migrations between storage systems, and — as documented here — CDC pipeline provisioning.
Caveats¶
- Introduces a dependency on the workflow engine itself — the engine must be operated as a reliable platform service (or consumed as a hosted product).
- Workflow-vs-activity boundary design matters: too granular and the event history bloats; too coarse and you lose retry/compensation granularity.
- Some provisioning side effects are not idempotent (creating a user, issuing a secret) — activities must be written with explicit idempotency keys or compensating actions.
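One common way to handle the last caveat is an explicit idempotency key checked before the side effect runs, so a retried activity becomes a no-op (a sketch; the in-memory key store and the effect are illustrative — a real implementation would use a durable store):

```python
_completed = set()  # in practice: a durable store, not process memory

def run_once(idempotency_key, effect):
    """Execute the side effect only if this key has not run before.
    The key is recorded only after success, so a failed attempt is
    still retried, while a successful one is never repeated."""
    if idempotency_key in _completed:
        return "skipped"
    result = effect()
    _completed.add(idempotency_key)
    return result

# A retried "create user" activity must not issue a second secret.
issued = []
run_once("pipeline-42/create-user", lambda: issued.append("secret-1"))
run_once("pipeline-42/create-user", lambda: issued.append("secret-2"))
```

The key should identify the logical operation (workflow ID plus step), not the attempt, so that retries of the same step dedupe while distinct pipelines do not.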
Seen in¶
- sources/2025-11-04-datadog-replication-redefined-multi-tenant-cdc-platform — Datadog uses Temporal workflows to automate the 7-step manual CDC-pipeline provisioning runbook (enable logical replication, create users, create publications + slots, deploy Debezium, create topics, set up heartbeat tables, configure sink connectors). Canonical wiki instance of the pattern; the retrospective explicitly names automation "a foundational principle" and the exponential-load-growth motivation.
Related¶
- systems/temporal — the workflow engine Datadog used.
- concepts/durable-execution — the broader concept Temporal realises.
- patterns/managed-replication-platform — the full platform shape this provisioning-automation layer fits into.
- patterns/debezium-kafka-connect-cdc-pipeline — the transport backbone whose provisioning this pattern automates.