Replication redefined: How we built a low-latency, multi-tenant data replication platform¶
Datadog Engineering retrospective (2025-11-04) on building the internal managed data-replication platform that powers Postgres-to-Elasticsearch, Postgres-to-Postgres, Postgres-to-Iceberg, Cassandra-to-X, and cross-region Kafka flows across the company. The narrative starts with a single Metrics Summary page slow-query incident on a shared Postgres and ends with a generalised multi-tenant CDC platform provisioned through Temporal workflows, schema-evolution-safe via a backward-compat Kafka Schema Registry, and customisable per tenant through Kafka Connect single-message transforms + a standardised enrichment API.
Summary¶
Datadog's Metrics Summary page — joining ~82K active metrics against
~817K metric configurations per org per query on a shared Postgres —
had grown to p90 ~7 s page latency with every facet adjustment
triggering another expensive join. Rather than push Postgres up the
index/vertical-scaling curve, Datadog concluded that real-time
search + faceted filtering is a fundamentally different workload
from OLTP and rerouted those queries to a dedicated search platform,
with data dynamically denormalised during replication from
Postgres into search-engine documents. Page load times dropped by
up to 97% (~30 s → ~1 s) with replication lag ~500 ms. That
one pipeline (Debezium → Kafka
→ sink connector into the search platform) became the seed of a
managed multi-tenant replication platform. Four architectural
pillars: (1) automate pipeline provisioning with Temporal
workflows (modular reliable tasks stitched into higher-level
orchestrations, replacing the ~7-step manual runbook per pipeline);
(2) asynchronous replication as the foundation, chosen
explicitly over synchronous to prioritise scalability, availability,
and throughput over strict consistency; (3) schema-evolution
safety via a two-part solution — an internal automated schema
management validation system that analyses migration SQL before
it's applied (blocking e.g. ALTER TABLE ... ALTER COLUMN ... SET
NOT NULL that would break in-flight messages with null fields) +
a multi-tenant Kafka Schema Registry configured for
backward compatibility so new schemas must still work for older
consumers; (4) pipeline customisations via Kafka Connect
single-message transforms (dynamic topic renaming, column type
conversion, composite primary keys by field concatenation,
add/drop columns) + a standardised enrichment API sitting atop
the search platform for derived-field / metadata augmentation. One
pipeline grew into Postgres-to-Elasticsearch, Postgres-to-Postgres
(to unwind Datadog's shared monolithic database), Postgres-to-Iceberg
for event-driven analytics,
Cassandra replication, and cross-region Kafka replication for
Datadog On-Call
data locality and resilience.
Key takeaways¶
- Search workloads are not OLTP workloads — decoupling them via replication was a 97% latency win. The Metrics Summary page (82K metrics × 817K configurations joined per query) went from p90 ~7 s → ~1 s (~30 s → ~1 s cited as the up-to bound) after Datadog rerouted search + facet queries to a dedicated search platform populated by replication-time denormalisation. Replication lag held around 500 ms — the asynchronous-replication premise held at production scale. This is a canonical worked example of the database-and-search duality problem solved in the opposite direction from MongoDB Atlas's consolidation pitch: split by workload rather than merge, at the cost of running a CDC platform. (Source: sources/2025-11-04-datadog-replication-redefined-multi-tenant-cdc-platform)
- Manual pipeline provisioning has exponential operational cost
at scale — Temporal workflows collapse it to modular composable
tasks. Each pipeline required 7 precise repeatable steps:
enable Postgres logical replication (set
wal_leveltological), create + configure Postgres users with right permissions, establish replication objects (publishers + slots), deploy Debezium instances capturing changes to Kafka, create Kafka topics mapped correctly to each Debezium instance, set up heartbeat tables for WAL retention + monitoring, configure sink connectors from Kafka into search. "Replicated across many pipelines and data centers, the operational load grew exponentially." Datadog made automation a foundational principle via Temporal workflows — decomposing provisioning into modular reliable tasks stitched into higher-level orchestrations. Teams could create, manage, and experiment with new pipelines without manual error-prone steps. patterns/workflow-orchestrated-pipeline-provisioning names the general shape. (Source: sources/2025-11-04-datadog-replication-redefined-multi-tenant-cdc-platform) - Asynchronous replication was a deliberate choice — scalability and availability over strict consistency. Synchronous replication acknowledges a write only after all replicas confirm receipt; guarantees strong consistency but introduces latency + operational complexity "especially at scale and across distributed environments." Asynchronous replication lets the primary ack immediately, replicating afterward — "inherently more scalable and resilient in large-scale, high-throughput environments like Datadog's — it decouples application performance from network latency and replica responsiveness." Datadog's stated priorities favour scalability over strict consistency, so asynchronous replication became the platform foundation. The trade-off is accepted minor data lag during failures. (Source: sources/2025-11-04-datadog-replication-redefined-multi-tenant-cdc-platform)
- Debezium + Kafka Connect is the canonical open-source CDC pipeline backbone. Debezium performs CDC from databases (Postgres, Cassandra, etc.); Kafka Connect provides the scalable fault-tolerant data-movement plane between systems. Together they form the core of Datadog's managed replication platform, enabling flexible reliable extensible data sharing across the company. Source connectors pull changes into Kafka topics; sink connectors move Kafka records into downstream systems (search, Iceberg, another Postgres, Cassandra, cross-region Kafka). (Source: sources/2025-11-04-datadog-replication-redefined-multi-tenant-cdc-platform)
- Schema evolution is the hard problem in async CDC — Datadog
solves it in two halves: validate-before-apply and
backward-compat registry. Half one: an internal automated
schema management validation system analyses schema migration
SQL before it's applied to the database to catch
pipeline-breaking changes. Concrete example given: block
ALTER TABLE ... ALTER COLUMN ... SET NOT NULL because not all in-flight messages are guaranteed to populate that column — a consumer receiving a null would fail. Validation checks approve most changes without manual intervention; breaking changes trigger direct coordination with the team for a safe rollout. Half two: a multi-tenant Kafka Schema Registry integrated with source + sink connectors, configured for backward compatibility — "new schemas must still allow older consumers to read data without errors." In practice, this limits schema changes to safe operations (adding optional fields, removing existing ones). When Debezium captures an updated schema, it serialises data in Avro format and pushes both the data and the schema update to the Kafka topic + Schema Registry; the registry compares the new schema against the stored one and accepts or rejects based on the compatibility mode. Custom Kafka consumers outside the platform rely on the same registry, so schema compatibility protects internal and external consumers. (Source: sources/2025-11-04-datadog-replication-redefined-multi-tenant-cdc-platform)
- Kafka Connect single-message transforms + enrichment API = per-tenant pipeline customisation without forking the pipeline. "One-size-fits-all pipeline didn't scale well with growing requirements." Some teams needed filtered/denormalised data, others custom enrichment logic, others to reshape record structure pre-storage. Datadog designed the pipeline as modular, adaptable, customisable via two mechanisms: (a) Kafka Connect single-message transforms — dynamically change topic names, transform column types, generate composite primary keys by concatenating fields, add/drop columns at the message level, without changes at the source. Where out-of-the-box transforms fell short, Datadog maintained custom forks to introduce Datadog-specific logic and optimisations.
(b) A standardised enrichment API sitting on top of the search platform, providing a central way to request enrichments during or after ingestion. By centralising enrichment logic, Datadog avoided duplicating it across individual pipelines — teams keep consistency while reducing ingestion-flow complexity. (Source: sources/2025-11-04-datadog-replication-redefined-multi-tenant-cdc-platform)
- A single-purpose Postgres-to-search pipeline generalised into a multi-product CDC platform. The original use case was sending SQL→search updates. The generalised platform now powers: Postgres-to-Postgres replication (to unwind the shared monolithic database + back up Orgstore); Postgres-to-Iceberg for scalable event-driven analytics; Cassandra replication (sourcing beyond SQL); cross-region Kafka replication improving data locality and resilience for Datadog On-Call. The five architectural decisions Datadog retrospectively names as load-bearing: moved search workloads off Postgres (join-heavy-query latency ↓ up to 87% in the post's summary bullet; the body's ~30 s → ~1 s number frames it as ↓97% end-to-end page-load), asynchronous replication for availability + throughput, automated pipeline provisioning with Temporal, schema-compat via validation + Schema Registry, per-tenant customisation via transforms + enrichment API. (Source: sources/2025-11-04-datadog-replication-redefined-multi-tenant-cdc-platform)
Architectural pillars extracted¶
1. Async CDC pipeline backbone¶
```
Source DB (Postgres / Cassandra)
        │  logical replication / CDC
        ▼
Debezium source connector ───▶ Kafka topic (Avro-serialised)
        │                            │
        ▼                            ▼
      Kafka Schema Registry (backward-compat)
                     │
                     ▼
      Sink connector (Kafka Connect)
        + single-message transforms
        + custom fork logic
                     │
                     ▼
      Elasticsearch / Postgres / Iceberg /
      Cassandra / another Kafka (cross-region)
                     │
                     ▼
      Enrichment API (derived fields, metadata)
```
Numbers stated in the post:
- Replication lag: ~500 ms on the Metrics Summary pipeline.
- Page latency: p90 ~7 s → ~1 s (up to 97% improvement; 87% cited in the summary list for search-query latency specifically).
- Scale of the motivating query: 82K active metrics × 817K metric configurations joined per org per query.
- Threshold at which warning signs became clear: 50,000 metrics per org across multiple orgs.
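The sink side of the backbone above can be sketched in a few lines: a Debezium-style change event (which really does carry `op`, `before`, and `after` fields) is flattened into a denormalised search document at replication time, so the search engine never runs the 82K × 817K join itself. The field names and the `config_lookup` here are illustrative stand-ins, not Datadog's actual schema.

```python
# Sketch: flattening a Debezium-style change event into a denormalised
# search document. Debezium's envelope fields (op/before/after) are real;
# metric_name / org_id / config_lookup are illustrative, not Datadog's.

def to_search_doc(change_event, config_lookup):
    """Turn one CDC event into a denormalised document for the search engine."""
    op = change_event["op"]          # "c" create, "u" update, "d" delete
    if op == "d":
        return None                  # deletes become document removals downstream
    row = change_event["after"]      # post-image of the changed row
    # Denormalise during replication: attach the metric's configurations
    # here, so facet queries need no join at read time.
    return {
        "metric_name": row["metric_name"],
        "org_id": row["org_id"],
        "configurations": config_lookup.get(row["metric_name"], []),
    }

event = {"op": "c", "after": {"metric_name": "cpu.user", "org_id": 42}}
doc = to_search_doc(event, {"cpu.user": [{"type": "gauge"}]})
```

The design point is that the join cost is paid once per change on the write path (at ~500 ms lag) instead of on every page load.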
2. Provisioning-time automation via Temporal¶
The post names the 7-step manual runbook the platform replaced:
- Enable logical replication on Postgres (wal_level=logical).
- Create + configure Postgres users with correct permissions.
- Establish replication objects (publishers + slots).
- Deploy Debezium instances capturing changes into Kafka.
- Create Kafka topics + map each Debezium instance correctly.
- Set up heartbeat tables to address WAL retention + monitoring.
- Configure sink connectors to move data from Kafka to destination.
Each becomes a modular Temporal task (durable, retryable, replayable); higher-level provisioning orchestrations stitch them together. The post explicitly frames this against the exponential growth of operational load when the manual runbook is repeated across many pipelines × data centres.
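The orchestration shape can be sketched as below. This is plain Python modelling the durable-task idea, not the Temporal SDK: Temporal would make each step a retried, replayable activity inside a workflow, with state persisted server-side. Step names mirror the 7-step runbook; the return values are placeholders, not any real internal API.

```python
# Sketch of workflow-orchestrated provisioning: each runbook step as a
# small retryable task, stitched into one higher-level orchestration.
# The retry loop only stands in for Temporal's retry policy.
import time

def with_retries(task, attempts=3, backoff_s=0.0):
    """Run a task, retrying on failure — a stand-in for a Temporal retry policy."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(backoff_s)

def provision_pipeline(steps):
    """Higher-level orchestration: run each modular step in order."""
    return [(name, with_retries(task)) for name, task in steps]

steps = [
    ("enable_logical_replication", lambda: "wal_level=logical"),
    ("create_replication_user",    lambda: "user ok"),
    ("create_publication_and_slot", lambda: "slot ok"),
    ("deploy_debezium",            lambda: "connector ok"),
    ("create_kafka_topics",        lambda: "topics ok"),
    ("create_heartbeat_table",     lambda: "heartbeat ok"),
    ("configure_sink_connector",   lambda: "sink ok"),
]
report = provision_pipeline(steps)
```

Because each step is its own task, a failure midway retries only that step instead of restarting the whole runbook — the property that makes provisioning cost grow linearly rather than exponentially with pipeline count.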
3. Schema-evolution safety — two layers¶
| Layer | Mechanism | What it catches |
|---|---|---|
| Pre-deploy (offline) | Automated schema management validation analysing migration SQL | Structural-breaking changes like SET NOT NULL on a column that in-flight messages might not populate |
| Runtime (online) | Multi-tenant Kafka Schema Registry in backward-compat mode; Avro-serialised data + schema update pushed together | Schema-incompatible updates rejected at registry; only additive or subtractive-of-optional changes allowed |
Composition is defence in depth: the offline check catches changes before they ever reach production; the runtime check catches the residual class that slips through.
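Both layers can be sketched minimally. The regex and the compatibility rule below are simplified illustrations — Datadog's validator analyses full migration SQL and the Schema Registry runs Avro's complete resolution rules — but they capture the two checks: reject a migration that adds NOT NULL to an existing column, and reject a new schema that adds a required field old records cannot satisfy.

```python
# Sketch of the two schema-safety layers (simplified, not Datadog's code).
import re

# Offline layer: the concrete blocking example from the post.
BLOCKING = re.compile(
    r"ALTER\s+TABLE\s+\S+\s+ALTER\s+COLUMN\s+\S+\s+SET\s+NOT\s+NULL",
    re.IGNORECASE,
)

def validate_migration(sql):
    """Reject migration SQL that would break in-flight messages."""
    return BLOCKING.search(sql) is None

# Runtime layer: a toy model of BACKWARD compatibility. Fields map
# name -> has_default; Avro's real rules are richer than this.
def is_backward_compatible(old_fields, new_fields):
    """New schema must still read data written with the old schema."""
    for name, has_default in new_fields.items():
        if name not in old_fields and not has_default:
            return False   # old records lack this field and there is no default
    return True            # removed fields are fine: the new reader ignores them
```

Under this model, adding an optional field or removing an existing one passes (the "safe operations" the post names), while adding a required field or the SET NOT NULL migration is rejected.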
4. Per-tenant customisation surfaces¶
| Lever | Scope | Examples given |
|---|---|---|
| Kafka Connect single-message transforms | Per-connector | Dynamic topic renaming, column type conversion, composite primary keys via field concat, column add/drop |
| Custom connector forks | Per-advanced-use-case | Datadog-specific logic and optimisations where OSS transforms fell short |
| Enrichment API on top of search platform | Per-tenant, central | Derived fields, metadata augmentation during/after ingestion |
The centralised enrichment API is the explicit anti-duplication move: "By centralising enrichment logic, we avoided duplicating it across individual pipelines." Same motivation as the central-proxy-choke-point pattern in AI-gateway contexts.
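The four example operations in the transform row can be shown on a single record. Real Kafka Connect SMTs are Java classes configured per connector; this Python stand-in only illustrates the message-level shape of the same four operations, with an invented record layout.

```python
# Sketch of the single-message-transform surface: the four example
# operations from the post applied to one record. Record shape and
# field names are illustrative, not a Kafka Connect API.

def transform(record):
    rec = dict(record)
    # (1) Dynamic topic renaming.
    rec["topic"] = f"replicated.{rec['topic']}"
    value = dict(rec["value"])
    # (2) Composite primary key by field concatenation.
    value["pk"] = f"{value['org_id']}:{value['metric_name']}"
    # (3) Column type conversion (here: epoch-millis string -> int seconds).
    value["created_at"] = int(value["created_at"]) // 1000
    # (4) Drop a column the sink should not store.
    value.pop("internal_flags", None)
    rec["value"] = value
    return rec

out = transform({
    "topic": "metrics.summary",
    "value": {"org_id": 42, "metric_name": "cpu.user",
              "created_at": "1730716800000", "internal_flags": 7},
})
```

The point of the mechanism is that all of this happens in flight, per connector, "without changes at the source" — each tenant gets its shape without forking the pipeline.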
Use cases unlocked¶
The original single-purpose pipeline generalised to:
- Postgres → Postgres — unwinding the shared monolithic database; reliable Orgstore backups.
- Postgres → Iceberg — scalable event-driven analytics. Links out to Apache Flink CDC's Iceberg connector.
- Cassandra replication — source-type generalisation beyond SQL; wiki expands this to Cassandra-to-X pipelines.
- Cross-region Kafka replication — data locality + resilience for Datadog On-Call.
Numbers and operational details¶
- p90 Metrics Summary page latency: ~7 s on shared Postgres before re-architecture.
- Page load improvement post-replication: up to 97% (~30 s → ~1 s); summary list cites 87% latency reduction for the search-query component specifically.
- Replication lag: ~500 ms.
- Data-shape driving the problem: joining 82,000 active metrics × 817,000 metric configurations for every query.
- Threshold at which scaling pressure became obvious: 50,000 metrics per org across multiple orgs.
- Manual provisioning step count: 7 discrete steps per pipeline, multiplied across pipelines and data centres for exponential operational load.
- Kafka Schema Registry compatibility mode: backward (new schemas work for older consumers).
- Serialisation format: Avro.
Caveats / what the post does not quantify¶
- No throughput numbers for Debezium → Kafka or Kafka Connect sinks; the post is written at architecture-pillar resolution.
- No operational cost of running the Schema Registry or Temporal cluster at Datadog scale disclosed.
- No detail on custom connector fork semantics — we know Datadog maintains forks and has "Datadog-specific logic and optimisations" but no specifics on the divergences.
- No failure-mode retrospective on what did not work when the platform scaled — no incident-review numbers or revert examples; the post is forward-narrative.
- Consistency trade-off narrated qualitatively — "minor data lag during failures" is the only concrete downside of async replication named; no data-loss / reordering / duplication semantics for sinks (which specific sinks are at-least-once vs exactly-once at Datadog).
- "Up to 87% latency reduction" in the summary list vs. the body's ~30 s → ~1 s (97%) page-load number — plausibly the 87% is the search-engine-query latency specifically and 97% is end-to-end page-load, but the post does not reconcile them explicitly.
- Enrichment API contract not documented — we know it's standardised and sits atop the search platform, but request shape / failure-mode / latency budget not disclosed.
- No number on how many pipelines / tenants currently run on the platform in 2025.
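The 87%-vs-97% ambiguity above can at least be arithmetic-checked against the post's own numbers — the p90 figures roughly match the summary's bullet and the page-load figures match the body's claim, which supports (but does not prove) the query-path-vs-end-to-end reading:

```python
# Plausibility arithmetic for the two latency-reduction figures.
search_query = (7 - 1) / 7     # p90 ~7 s -> ~1 s  => ~0.857
page_load = (30 - 1) / 30      # ~30 s -> ~1 s     => ~0.967
# ~86% and ~97%: roughly consistent with "up to 87%" for the search-query
# path and "up to 97%" for end-to-end page load, if both are up-to bounds.
```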
Source¶
- Original: https://www.datadoghq.com/blog/engineering/cdc-replication-search/
- Raw markdown:
raw/datadog/2025-11-04-replication-redefined-how-we-built-a-low-latency-multi-tenan-b2aec321.md
Related¶
- companies/datadog — company page.
- concepts/change-data-capture — CDC as the upstream substrate; this source is Datadog's canonical operational instance alongside the table-format-compaction instances the concept page already carries.
- concepts/asynchronous-replication — the platform's foundational choice.
- concepts/schema-evolution — the hard problem in async CDC.
- concepts/wal-write-ahead-logging — Postgres logical replication ≡ WAL streaming under wal_level=logical.
- concepts/synchronization-tax — the class of cost Datadog paid to split database + search; Cars24 on MongoDB Atlas paid the opposite class to consolidate them.
- systems/debezium — CDC source connectors.
- systems/kafka-connect — scalable fault-tolerant data-movement plane; single-message transforms as the per-tenant customisation surface.
- systems/kafka-schema-registry — backward-compat enforcement for in-flight schemas.
- systems/temporal — workflow orchestration that replaced the 7-step manual pipeline runbook.
- systems/apache-iceberg — Postgres→Iceberg is one of the derivative pipelines unlocked.
- patterns/managed-replication-platform — named pattern for Datadog's full shape.
- patterns/debezium-kafka-connect-cdc-pipeline — the OSS backbone.
- patterns/workflow-orchestrated-pipeline-provisioning — the provisioning-automation layer.
- patterns/schema-validation-before-deploy — offline pipeline-breaking-change detection.
- patterns/schema-registry-backward-compat — runtime schema-incompat protection.
- patterns/connector-transformations-plus-enrichment-api — the two-axis per-tenant customisation surface.
- patterns/consolidate-database-and-search — the opposite-direction answer to the same database-and-search problem Datadog solved by splitting.