PATTERN Cited by 1 source
Managed replication platform¶
Summary¶
Instead of each team hand-assembling point-to-point pipelines between a source database and whatever downstream system they need (search, lake, analytics, another database, another region), a central platform team owns a managed data-replication platform: a unified service that provisions, operates, and customises change-data-capture pipelines on behalf of its internal tenants.
The pattern is the CDC-layer analogue of the hosted-Kafka / hosted-search / hosted-LLM-gateway moves in other platforms: pull a domain-specific piece of plumbing that everyone needs out of each team's codebase and run it as a product.
Problem it solves¶
Without a managed platform, every team building a CDC pipeline reassembles the same stack independently — enable logical replication, create users + publications + slots, deploy Debezium, create topics, set up heartbeat tables, configure sinks, monitor lag, handle schema migrations. Datadog names the cost shape:
"When replicated across many pipelines and data centers, the operational load grew exponentially." (Source: sources/2025-11-04-datadog-replication-redefined-multi-tenant-cdc-platform)
Failure modes that accumulate across hand-built pipelines:
- Inconsistent setup (one pipeline's slot lacks a heartbeat table → WAL bloat on the primary → outage).
- Schema migration breakage (one team's `SET NOT NULL` coincidentally breaks another team's consumer).
- Duplicated enrichment logic (every team that wants to add a timestamp or a tenant-id field re-implements it).
- No unified monitoring (each team's pipeline has its own metrics, its own runbook, its own on-call).
- Reinvented solutions to the same 7-step Postgres-to-Kafka runbook.
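The repeated runbook is concrete enough to sketch. A minimal illustration, assuming Debezium's Postgres connector running under Kafka Connect; every name here (team, slot, publication, heartbeat table) and the SQL itself are illustrative, not Datadog's actual provisioning code:

```python
# Sketch of the per-team provisioning steps that a managed platform absorbs.
# All identifiers (users, slots, publications, heartbeat table) are illustrative.

def provisioning_sql(team: str) -> list[str]:
    """SQL each team re-issues by hand: replication user, publication, heartbeat table."""
    return [
        f"CREATE USER cdc_{team} WITH REPLICATION PASSWORD '<secret>';",
        f"CREATE PUBLICATION {team}_pub FOR TABLE {team}.events;",
        # Heartbeat table: periodic writes keep the slot advancing on quiet
        # databases, preventing WAL bloat on the primary.
        f"CREATE TABLE {team}.cdc_heartbeat (id int PRIMARY KEY, ts timestamptz);",
    ]

def debezium_connector_config(team: str, db_host: str) -> dict:
    """Kafka Connect connector config each team re-writes; the keys are
    standard Debezium Postgres connector properties."""
    return {
        "name": f"{team}-postgres-cdc",
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "database.hostname": db_host,
            "database.dbname": team,
            "plugin.name": "pgoutput",
            "publication.name": f"{team}_pub",
            "slot.name": f"{team}_slot",
            # Without a heartbeat, an idle source lets WAL pile up behind the slot.
            "heartbeat.interval.ms": "10000",
            "topic.prefix": team,
        },
    }
```

Each hand-built pipeline re-derives both halves of this (plus topics, sinks, and lag monitoring); the platform does it once, as a product.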
Forces¶
- Speed of onboarding: a new team should be able to stand up a pipeline in hours, not weeks.
- Consistency of operation: lag / backpressure / schema-compat / failover semantics identical across tenants.
- Customisability per tenant: not every downstream system wants the same record shape, so per-tenant overrides are first-class, not a fork.
- Defence in depth against schema drift: pre-deploy analysis of migration SQL + runtime registry enforcement.
- Shared operational economy: the platform team absorbs upgrade / patching / tuning cost once for the whole company.
Solution shape (Datadog)¶
Datadog's platform combines five internal patterns:
- patterns/debezium-kafka-connect-cdc-pipeline — the open-source CDC backbone (Postgres logical replication / Cassandra commit log → Debezium source connector → Kafka → Kafka Connect sink connector → destination system).
- patterns/workflow-orchestrated-pipeline-provisioning — Temporal workflows decompose the provisioning runbook into modular reliable tasks stitched into higher-level orchestrations.
- patterns/schema-validation-before-deploy — an automated schema-management validation system analyses migration SQL before it's applied, blocking pipeline-breaking changes like `ALTER TABLE ... ALTER COLUMN ... SET NOT NULL`.
- patterns/schema-registry-backward-compat — a multi-tenant Kafka Schema Registry in backward-compat mode, integrated with source + sink connectors, catches runtime schema mismatches.
- patterns/connector-transformations-plus-enrichment-api — Kafka Connect single-message transforms for per-tenant shape customisation at the transport layer, plus a centralised enrichment API on top of the search platform for shared derived-field logic.
Underpinning all five: an explicit choice of asynchronous replication as the foundation, trading strict consistency for scalability, availability, and throughput.
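The pre-deploy schema gate can be sketched as a check over migration SQL. Datadog's system is not described at this level of detail, and a production gate would parse the SQL rather than pattern-match it; the blocked statements below mirror the `SET NOT NULL` example in the text:

```python
import re

# Naive sketch of a pre-deploy gate over migration SQL (regex matching is an
# assumption; a real gate would use a SQL parser). Each pattern is a change
# that can break downstream consumers of already-emitted records.
BREAKING_PATTERNS = [
    # Adding NOT NULL to an existing column invalidates old null-bearing records.
    re.compile(r"ALTER\s+TABLE\s+\S+\s+ALTER\s+COLUMN\s+\S+\s+SET\s+NOT\s+NULL", re.I),
    re.compile(r"ALTER\s+TABLE\s+\S+\s+DROP\s+COLUMN", re.I),
    re.compile(r"ALTER\s+TABLE\s+\S+\s+RENAME\s+COLUMN", re.I),
]

def check_migration(sql: str) -> list[str]:
    """Return the pipeline-breaking statements found; empty means deployable."""
    return [m.group(0) for p in BREAKING_PATTERNS for m in p.finditer(sql)]
```

A migration that only adds nullable columns passes; the gate blocks the deploy before the change ever reaches a replication slot.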
Result¶
- Original Postgres-to-search pipeline generalised into Postgres-to-Postgres (Orgstore unwinding + backups), Postgres-to-Iceberg (analytics), Cassandra-to-X (source generalisation), and cross-region Kafka replication (data locality + resilience for Datadog On-Call).
- Search query latency down by up to 87% on the motivating use case; page load cut from ~30 s to ~1 s (up to 97%).
- Replication lag ~500 ms.
- Teams focus on innovation rather than repetitive pipeline plumbing, per Datadog's retrospective.
Caveats¶
- The platform requires a platform-team commitment — it's not the right answer at 1-2 pipelines; it becomes the right answer when the hand-built-pipeline operational load dominates.
- Async replication means every sink is eventually consistent; workloads that require same-transaction visibility across source + replica need a different substrate.
- Backward-compat schema registry constrains schema evolution to additive changes + optional-field removals; breaking changes require coordinated rollouts.
- Enrichment API centralises derivation logic, but also becomes a shared-bottleneck service — its availability budget must match the downstream ingestion budget.
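The backward-compat constraint can be made precise with a toy model: the new schema must be able to read data written with the old one, so fields may be removed, but any added field needs a default (Avro's backward-compatibility rule). The `{name: has_default}` representation below is a simplification, not a real registry API:

```python
# Toy model of backward compatibility as a schema registry enforces it.
# Schemas are modelled as {field_name: has_default}; real registries compare
# full Avro/Protobuf/JSON schemas, so this is only the shape of the rule.

def is_backward_compatible(old_fields: dict[str, bool],
                           new_fields: dict[str, bool]) -> bool:
    """New readers must handle old records: every field added in the new
    schema needs a default so records lacking it still deserialise."""
    added = set(new_fields) - set(old_fields)
    return all(new_fields[f] for f in added)
```

Removing a field passes (new readers simply ignore it in old records); adding a field without a default is the breaking change that forces a coordinated rollout.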
Seen in¶
- sources/2025-11-04-datadog-replication-redefined-multi-tenant-cdc-platform — canonical worked example: Datadog's managed multi-tenant CDC replication platform, seeded by a Postgres-to-search pipeline (Metrics Summary page, p90 7 s → 1 s) and generalised into five sink classes. All five internal patterns above are exercised in this one platform.
Related¶
- patterns/debezium-kafka-connect-cdc-pipeline — transport backbone.
- patterns/workflow-orchestrated-pipeline-provisioning — provisioning layer.
- patterns/schema-validation-before-deploy — offline schema-evolution gate.
- patterns/schema-registry-backward-compat — runtime schema-evolution gate.
- patterns/connector-transformations-plus-enrichment-api — per-tenant customisation surfaces.
- concepts/change-data-capture — the class of replication this platform manages.
- concepts/asynchronous-replication — the consistency posture chosen as the foundation.
- companies/datadog — the platform's operator.