PATTERN Cited by 1 source
Managed replication platform¶
Summary¶
Instead of each team hand-assembling point-to-point pipelines between a source database and whatever downstream system they need (search, lake, analytics, another database, another region), a central platform team owns a managed data-replication platform: a unified service that provisions, operates, and customises change-data-capture pipelines on behalf of its internal tenants.
The pattern is the CDC-layer analogue of the hosted-Kafka / hosted-search / hosted-LLM-gateway moves in other platforms: pull a domain-specific piece of plumbing that everyone needs out of each team's codebase and run it as a product.
Problem it solves¶
Without a managed platform, every team building a CDC pipeline reassembles the same stack independently — enable logical replication, create users + publications + slots, deploy Debezium, create topics, set up heartbeat tables, configure sinks, monitor lag, handle schema migrations. Datadog names the cost shape:
"When replicated across many pipelines and data centers, the operational load grew exponentially." (Source: sources/2025-11-04-datadog-replication-redefined-multi-tenant-cdc-platform)
Failure modes that accumulate across hand-built pipelines:
- Inconsistent setup (one pipeline's slot lacks a heartbeat table → WAL bloat on the primary → outage).
- Schema migration breakage (one team's `SET NOT NULL` coincidentally breaks another team's consumer).
- Duplicated enrichment logic (every team that wants to add a timestamp or a tenant-id field re-implements it).
- No unified monitoring (each team's pipeline has its own metrics, its own runbook, its own on-call).
- Reinvented solutions to the same 7-step Postgres-to-Kafka runbook.
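The repeated runbook is concrete enough to sketch. A minimal illustration, assuming Debezium's Postgres connector running under Kafka Connect; every name here (team, slot, publication, heartbeat table) and the SQL itself are illustrative, not Datadog's actual provisioning code:

```python
# Sketch of the per-team provisioning steps that a managed platform absorbs.
# All identifiers (users, slots, publications, heartbeat table) are illustrative.

def provisioning_sql(team: str) -> list[str]:
    """SQL each team re-issues by hand: replication user, publication, heartbeat table."""
    return [
        f"CREATE USER cdc_{team} WITH REPLICATION PASSWORD '<secret>';",
        f"CREATE PUBLICATION {team}_pub FOR TABLE {team}.events;",
        # Heartbeat table: periodic writes keep the slot advancing on quiet
        # databases, preventing WAL bloat on the primary.
        f"CREATE TABLE {team}.cdc_heartbeat (id int PRIMARY KEY, ts timestamptz);",
    ]

def debezium_connector_config(team: str, db_host: str) -> dict:
    """Kafka Connect connector config each team re-writes; the keys are
    standard Debezium Postgres connector properties."""
    return {
        "name": f"{team}-postgres-cdc",
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "database.hostname": db_host,
            "database.dbname": team,
            "plugin.name": "pgoutput",
            "publication.name": f"{team}_pub",
            "slot.name": f"{team}_slot",
            # Without a heartbeat, an idle source lets WAL pile up behind the slot.
            "heartbeat.interval.ms": "10000",
            "topic.prefix": team,
        },
    }
```

Each hand-built pipeline re-derives both halves of this (plus topics, sinks, and lag monitoring); the platform does it once, as a product.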
Forces¶
- Speed of onboarding: a new team should be able to stand up a pipeline in hours, not weeks.
- Consistency of operation: lag / backpressure / schema-compat / failover semantics identical across tenants.
- Customisability per tenant: not every downstream system wants the same record shape, so per-tenant overrides are first-class, not a fork.
- Defence in depth against schema drift: pre-deploy analysis of migration SQL + runtime registry enforcement.
- Shared operational economy: the platform team absorbs upgrade / patching / tuning cost once for the whole company.
Solution shape (Datadog)¶
Datadog's platform combines five internal patterns:
- patterns/debezium-kafka-connect-cdc-pipeline — the open-source CDC backbone (Postgres logical replication / Cassandra commit log → Debezium source connector → Kafka → Kafka Connect sink connector → destination system).
- patterns/workflow-orchestrated-pipeline-provisioning — Temporal workflows decompose the provisioning runbook into modular reliable tasks stitched into higher-level orchestrations.
- patterns/schema-validation-before-deploy — an automated schema-management validation system analyses migration SQL before it's applied, blocking pipeline-breaking changes like `ALTER TABLE ... ALTER COLUMN ... SET NOT NULL`.
- patterns/schema-registry-backward-compat — a multi-tenant Kafka Schema Registry in backward-compat mode, integrated with source + sink connectors, catches runtime schema mismatches.
- patterns/connector-transformations-plus-enrichment-api — Kafka Connect single-message transforms for per-tenant shape customisation at the transport layer, plus a centralised enrichment API on top of the search platform for shared derived-field logic.
Underpinning all five: an explicit choice of asynchronous replication as the foundation, trading strict consistency for scalability, availability, and throughput.
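The pre-deploy schema gate can be sketched as a check over migration SQL. Datadog's system is not described at this level of detail, and a production gate would parse the SQL rather than pattern-match it; the blocked statements below mirror the `SET NOT NULL` example in the text:

```python
import re

# Naive sketch of a pre-deploy gate over migration SQL (regex matching is an
# assumption; a real gate would use a SQL parser). Each pattern is a change
# that can break downstream consumers of already-emitted records.
BREAKING_PATTERNS = [
    # Adding NOT NULL to an existing column invalidates old null-bearing records.
    re.compile(r"ALTER\s+TABLE\s+\S+\s+ALTER\s+COLUMN\s+\S+\s+SET\s+NOT\s+NULL", re.I),
    re.compile(r"ALTER\s+TABLE\s+\S+\s+DROP\s+COLUMN", re.I),
    re.compile(r"ALTER\s+TABLE\s+\S+\s+RENAME\s+COLUMN", re.I),
]

def check_migration(sql: str) -> list[str]:
    """Return the pipeline-breaking statements found; empty means deployable."""
    return [m.group(0) for p in BREAKING_PATTERNS for m in p.finditer(sql)]
```

A migration that only adds nullable columns passes; the gate blocks the deploy before the change ever reaches a replication slot.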
Result¶
- Original Postgres-to-search pipeline generalised into Postgres-to-Postgres (Orgstore unwinding + backups), Postgres-to-Iceberg (analytics), Cassandra-to-X (source generalisation), and cross-region Kafka replication (data locality + resilience for Datadog On-Call).
- Search query latency down by up to 87% on the motivating use case; page load cut from ~30 s to ~1 s (up to 97%).
- Replication lag ~500 ms.
- Teams focus on innovation rather than repetitive pipeline plumbing, per Datadog's retrospective.
Caveats¶
- The platform requires a platform-team commitment — it's not the right answer at 1-2 pipelines; it becomes the right answer when the hand-built-pipeline operational load dominates.
- Async replication means every sink is eventually consistent; workloads that require same-transaction visibility across source + replica need a different substrate.
- Backward-compat schema registry constrains schema evolution to additive changes + optional-field removals; breaking changes require coordinated rollouts.
- Enrichment API centralises derivation logic, but also becomes a shared-bottleneck service — its availability budget must match the downstream ingestion budget.
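The backward-compat constraint can be made precise with a toy model: the new schema must be able to read data written with the old one, so fields may be removed, but any added field needs a default (Avro's backward-compatibility rule). The `{name: has_default}` representation below is a simplification, not a real registry API:

```python
# Toy model of backward compatibility as a schema registry enforces it.
# Schemas are modelled as {field_name: has_default}; real registries compare
# full Avro/Protobuf/JSON schemas, so this is only the shape of the rule.

def is_backward_compatible(old_fields: dict[str, bool],
                           new_fields: dict[str, bool]) -> bool:
    """New readers must handle old records: every field added in the new
    schema needs a default so records lacking it still deserialise."""
    added = set(new_fields) - set(old_fields)
    return all(new_fields[f] for f in added)
```

Removing a field passes (new readers simply ignore it in old records); adding a field without a default is the breaking change that forces a coordinated rollout.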
Seen in¶
- sources/2025-11-04-datadog-replication-redefined-multi-tenant-cdc-platform — canonical worked example: Datadog's managed multi-tenant CDC replication platform, seeded by a Postgres-to-search pipeline (Metrics Summary page, p90 7 s → 1 s) and generalised into five sink classes. All five internal patterns above are exercised in this one platform.
Related¶
- patterns/debezium-kafka-connect-cdc-pipeline — transport backbone.
- patterns/workflow-orchestrated-pipeline-provisioning — provisioning layer.
- patterns/schema-validation-before-deploy — offline schema-evolution gate.
- patterns/schema-registry-backward-compat — runtime schema-evolution gate.
- patterns/connector-transformations-plus-enrichment-api — per-tenant customisation surfaces.
- concepts/change-data-capture — the class of replication this platform manages.
- concepts/asynchronous-replication — the consistency posture chosen as the foundation.
- companies/datadog — the platform's operator.