
PATTERN

Managed replication platform

Summary

Instead of each team hand-assembling point-to-point pipelines between a source database and whatever downstream system they need (search, lake, analytics, another database, another region), a central platform team owns a managed data-replication platform: a unified service that provisions, operates, and customises change-data-capture pipelines on behalf of its internal tenants.

The pattern is the CDC-layer analogue of the hosted-Kafka / hosted-search / hosted-LLM-gateway moves in other platforms: pull a domain-specific piece of plumbing that everyone needs out of each team's codebase and run it as a product.

Problem it solves

Without a managed platform, every team building a CDC pipeline reassembles the same stack independently — enable logical replication, create users + publications + slots, deploy Debezium, create topics, set up heartbeat tables, configure sinks, monitor lag, handle schema migrations. Datadog describes the cost curve:

"When replicated across many pipelines and data centers, the operational load grew exponentially." (Source: sources/2025-11-04-datadog-replication-redefined-multi-tenant-cdc-platform)

Failure modes that accumulate across hand-built pipelines:

  • Inconsistent setup (one pipeline's slot lacks a heartbeat table → WAL bloat on the primary → outage).
  • Schema migration breakage (one team's SET NOT NULL coincidentally breaks another team's consumer).
  • Duplicated enrichment logic (every team that wants to add a timestamp or a tenant-id field re-implements it).
  • No unified monitoring (each team's pipeline has its own metrics, its own runbook, its own on-call).
  • The same 7-step Postgres-to-Kafka runbook reinvented from scratch by every team.
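The repeated runbook largely reduces to a hand-written Debezium source-connector config that every team recreates. A sketch of what each team ends up maintaining — all names and values here are illustrative, not Datadog's actual settings:

```python
# Illustrative Debezium Postgres source connector config -- the kind of
# payload each team hand-writes and posts to Kafka Connect when assembling
# its own pipeline. Hostnames, slot names, and topics are hypothetical.
connector_config = {
    "name": "orders-db-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "orders-db.internal",
        "database.dbname": "orders",
        "plugin.name": "pgoutput",            # Postgres logical decoding plugin
        "publication.name": "orders_pub",     # must be created by hand in the DB
        "slot.name": "orders_slot",           # one replication slot per pipeline
        # Without a heartbeat, a quiet database never advances the slot's
        # confirmed LSN and WAL accumulates on the primary -- the outage
        # mode in the first bullet above.
        "heartbeat.interval.ms": "10000",
        "heartbeat.action.query": "INSERT INTO cdc_heartbeat (ts) VALUES (now())",
        "topic.prefix": "cdc.orders",
    },
}

def required_keys_present(cfg: dict) -> list[str]:
    """Return the hand-setup keys a reviewer has to check on every new pipeline."""
    needed = ["publication.name", "slot.name", "heartbeat.interval.ms"]
    return [k for k in needed if k not in cfg["config"]]
```

The point of the managed platform is that this checklist runs once, in one place, instead of in every team's repo.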

Forces

  • Speed of onboarding: a new team should be able to stand up a pipeline in hours, not weeks.
  • Consistency of operation: lag / backpressure / schema-compat / failover semantics identical across tenants.
  • Customisability per tenant: not every downstream system wants the same record shape, so per-tenant overrides are first-class, not a fork.
  • Defence in depth against schema drift: pre-deploy analysis of migration SQL + runtime registry enforcement.
  • Shared operational economy: the platform team absorbs upgrade / patching / tuning cost once for the whole company.
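The "per-tenant overrides are first-class, not a fork" force can be made concrete with a minimal sketch: tenants register only the fields they want changed, and the platform layers them over one shared default shape. Tenant names and fields below are invented for illustration.

```python
# Per-tenant record-shape overrides layered over one shared default.
# Tenants supply only deltas; nobody forks the pipeline code.
# Field names and tenants are entirely illustrative.
DEFAULT_SHAPE = {"include_tombstones": False, "timestamp_field": "updated_at"}

TENANT_OVERRIDES = {
    "search-team": {"include_tombstones": True},          # search wants deletes
    "analytics-team": {"timestamp_field": "event_time"},  # lake keys on event time
}

def shape_for(tenant: str) -> dict:
    """Effective config: shared defaults with the tenant's overrides applied."""
    return {**DEFAULT_SHAPE, **TENANT_OVERRIDES.get(tenant, {})}
```

A tenant with no overrides gets the platform default unchanged, which is what keeps upgrades a one-team cost.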

Solution shape (Datadog)

Datadog's platform combines five internal patterns:

  1. patterns/debezium-kafka-connect-cdc-pipeline — the open-source CDC backbone (Postgres logical replication / Cassandra commit log → Debezium source connector → Kafka → Kafka Connect sink connector → destination system).
  2. patterns/workflow-orchestrated-pipeline-provisioning — Temporal workflows decompose the provisioning runbook into modular, reliable tasks stitched into higher-level orchestrations.
  3. patterns/schema-validation-before-deploy — an automated schema-management validation system analyses migration SQL before it's applied, blocking pipeline-breaking changes like ALTER TABLE ... ALTER COLUMN ... SET NOT NULL.
  4. patterns/schema-registry-backward-compat — a multi-tenant Kafka Schema Registry in backward-compat mode, integrated with source + sink connectors, catches runtime schema mismatches.
  5. patterns/connector-transformations-plus-enrichment-api — Kafka Connect single-message transforms for per-tenant shape customisation at the transport layer, plus a centralised enrichment API on top of the search platform for shared derived-field logic.
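Pattern 2's core idea — the runbook decomposed into small retryable tasks stitched into an orchestration — can be sketched without the Temporal SDK as plain functions plus a retry wrapper. The real system would use durable Temporal activities and workflows; everything below (step names, bodies) is illustrative.

```python
import time

def with_retries(task, *args, attempts: int = 3, delay_s: float = 0.0):
    """Run one provisioning task with retries -- a stand-in for what a
    Temporal activity provides (durable retries, timeouts, history)."""
    for attempt in range(1, attempts + 1):
        try:
            return task(*args)
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(delay_s)

# Each runbook step is its own idempotent task (illustrative bodies).
def create_publication(db: str) -> str:
    return f"publication created on {db}"

def create_replication_slot(db: str) -> str:
    return f"slot created on {db}"

def deploy_connector(name: str) -> str:
    return f"connector {name} deployed"

def provision_pipeline(db: str, name: str) -> list[str]:
    """The 'workflow': steps stitched in order, each independently retried."""
    return [
        with_retries(create_publication, db),
        with_retries(create_replication_slot, db),
        with_retries(deploy_connector, name),
    ]
```

The payoff of the decomposition is that each step can be retried, timed out, or reused in a different higher-level orchestration (Postgres-to-search vs Postgres-to-Iceberg) without rewriting the runbook.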

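A toy version of pattern 3's pre-deploy check, assuming a simple regex screen over migration SQL — a production analyser would parse the SQL properly rather than pattern-match:

```python
import re

# Statements that break downstream consumers if applied without coordination.
# The patterns are illustrative; a real checker would parse the SQL.
BREAKING = [
    re.compile(r"ALTER\s+TABLE\s+\S+\s+ALTER\s+COLUMN\s+\S+\s+SET\s+NOT\s+NULL", re.I),
    re.compile(r"ALTER\s+TABLE\s+\S+\s+DROP\s+COLUMN", re.I),
]

def check_migration(sql: str) -> list[str]:
    """Return the breaking patterns found; deploy is blocked if non-empty."""
    return [p.pattern for p in BREAKING if p.search(sql)]
```

Purely additive migrations pass; the two shapes above get blocked before they ever reach the primary.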
Underpinning all five: an explicit choice of asynchronous replication as the foundation, trading strict consistency for scalability, availability, and throughput.
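Pattern 4's registry enforcement reduces, in spirit, to a compatibility predicate over the old and new schemas. A toy version — plain dicts standing in for Avro schemas, allowing only additive defaulted fields and removal of defaulted fields, as a real backward-compat mode's rules are richer than this:

```python
# Toy backward-compatibility rule: evolution is limited to adding fields
# with defaults and removing optional (defaulted) fields. Schemas here are
# plain dicts, not real Avro/registry objects.
def compatible(old: dict, new: dict) -> bool:
    for name, field in new.items():
        if name not in old and "default" not in field:
            return False  # new required field: existing records can't satisfy it
        if name in old and field["type"] != old[name]["type"]:
            return False  # type changes are treated as breaking
    for name, field in old.items():
        if name not in new and "default" not in field:
            return False  # removing a required field breaks consumers
    return True
```

Running this check at both connector registration and record publish time is what turns schema drift from a runtime outage into a rejected deploy.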

Result

  • Original Postgres-to-search pipeline generalised into Postgres-to-Postgres (Orgstore unwinding + backups), Postgres-to-Iceberg (analytics), Cassandra-to-X (source generalisation), and cross-region Kafka replication (data locality + resilience for Datadog On-Call).
  • Search query latency dropped by up to 87% on the motivating use case; page load fell from ~30 s to ~1 s (up to 97%).
  • Replication lag ~500 ms.
  • Teams focus on innovation rather than repetitive pipeline plumbing, per Datadog's retrospective.

Caveats

  • The platform requires a platform-team commitment — it's not the right answer at 1-2 pipelines; it becomes the right answer when the hand-built-pipeline operational load dominates.
  • Async replication means every sink is eventually consistent; workloads that require same-transaction visibility across source + replica need a different substrate.
  • Backward-compat schema registry constrains schema evolution to additive changes + optional-field removals; breaking changes require coordinated rollouts.
  • Enrichment API centralises derivation logic, but also becomes a shared-bottleneck service — its availability budget must match the downstream ingestion budget.
