
CONCEPT Cited by 1 source

Asynchronous replication

Definition

Asynchronous replication is a replication posture in which the primary (source) system acknowledges a write to the client before secondary (replica) systems have applied it. The secondaries catch up afterward via a replication stream (a WAL tail, binlog, CDC topic, etc.), so consumers of replicas observe eventual consistency: a replica reflects the primary's state after a bounded lag, not its state at the same instant.

Contrast with synchronous replication, which acknowledges a write only after all (or a quorum of) replicas have applied it — stronger consistency at the cost of write-path latency and availability coupling to replica health.
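The difference in acknowledgement timing can be sketched with a toy in-memory model. This is an illustration only: `Primary`, `Replica`, `replicate_pending`, and the key `dashboard:42` are hypothetical names, a `deque` stands in for the replication stream, and the synchronous mode waits for all replicas rather than a quorum.

```python
from collections import deque


class Replica:
    """Toy replica: applies replicated writes, in order, to a local dict."""

    def __init__(self):
        self.state = {}
        self.applied_seq = 0  # sequence number of the last write applied

    def apply(self, seq, key, value):
        self.state[key] = value
        self.applied_seq = seq


class Primary:
    """Toy primary that acknowledges writes synchronously or asynchronously."""

    def __init__(self, replicas, synchronous=False):
        self.replicas = replicas
        self.synchronous = synchronous
        self.state = {}
        self.seq = 0
        self.stream = deque()  # stand-in for a WAL tail, binlog, or CDC topic

    def write(self, key, value):
        """Apply a write locally and return its sequence number as the ack."""
        self.seq += 1
        self.state[key] = value
        self.stream.append((self.seq, key, value))
        if self.synchronous:
            # Synchronous: every replica applies the write before the ack.
            self.replicate_pending()
        # Asynchronous: the ack is returned while the write is still queued.
        return self.seq

    def replicate_pending(self):
        """Ship queued writes to every replica, oldest first."""
        while self.stream:
            entry = self.stream.popleft()
            for replica in self.replicas:
                replica.apply(*entry)

    def lag(self, replica):
        """Number of acknowledged writes the replica has not yet applied."""
        return self.seq - replica.applied_seq


replica = Replica()
primary = Primary([replica], synchronous=False)
primary.write("dashboard:42", "v1")          # acknowledged immediately
print(replica.state, primary.lag(replica))   # replica is still empty, one write behind
primary.replicate_pending()                  # replica catches up afterward
print(replica.state, primary.lag(replica))   # replica matches the primary, lag is zero
```

Constructing the primary with `synchronous=True` makes `write` return only after the replica has applied the change, which is the coupling the contrast above describes.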

The trade-off, as Datadog framed it

Datadog's 2025-11-04 retrospective lays out the choice explicitly:

"Synchronous replication writes data to both the primary and replica systems at the same time, guaranteeing strong consistency — every write is acknowledged only after all replicas confirm receipt. This approach is ideal when real-time accuracy is critical, but it introduces significant latency and operational complexity, especially at scale and across distributed environments.

By contrast, asynchronous replication allows the primary system to acknowledge writes immediately, with data replicated to secondary systems afterward. This method is inherently more scalable and resilient in large-scale, high-throughput environments like Datadog's — it decouples application performance from network latency and replica responsiveness. While asynchronous replication can introduce minor data lag during failures, it enables robust, always-on data movement across thousands of services without bottlenecking on consistency guarantees." (Source: sources/2025-11-04-datadog-replication-redefined-multi-tenant-cdc-platform)

The trade-off, named flatly:

| Axis | Synchronous | Asynchronous |
| --- | --- | --- |
| Consistency | Strong | Eventual (bounded lag) |
| Write-path latency | Coupled to slowest replica | Decoupled (primary acks immediately) |
| Availability | Coupled to replica health | Primary is independent |
| Scalability across distributed environments | Poor (network RTT dominates) | Good |
| Failure mode | Writes unavailable if a replica is unavailable | Bounded data lag during a replica outage |
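The write-path latency row can be made concrete with a small back-of-the-envelope model. It is a sketch under simplifying assumptions: replicas are contacted in parallel, acknowledgement cost is one round trip per replica, and the function name `ack_latency_ms` and all millisecond figures are hypothetical, not measurements from Datadog.

```python
def ack_latency_ms(local_commit_ms, replica_rtts_ms, mode="async", quorum=None):
    """Estimate client-visible write latency under a simplified model.

    async: the primary acknowledges after its own commit.
    sync:  the primary also waits for `quorum` replica confirmations
           (all replicas when quorum is None), gathered in parallel.
    """
    if mode == "async":
        return local_commit_ms
    if mode != "sync":
        raise ValueError(f"unknown mode: {mode!r}")
    if not replica_rtts_ms:
        return local_commit_ms
    needed = len(replica_rtts_ms) if quorum is None else quorum
    if not 1 <= needed <= len(replica_rtts_ms):
        raise ValueError("quorum must be between 1 and the number of replicas")
    # The ack waits for the `needed`-th fastest replica to confirm.
    return local_commit_ms + sorted(replica_rtts_ms)[needed - 1]


# Hypothetical topology: two nearby replicas and one cross-region replica.
rtts = [1, 5, 80]
print(ack_latency_ms(2, rtts, mode="async"))           # local commit only
print(ack_latency_ms(2, rtts, mode="sync"))            # waits for the slowest replica
print(ack_latency_ms(2, rtts, mode="sync", quorum=2))  # waits for the second fastest
```

Under this model a single distant replica dominates the all-replica synchronous case, while the asynchronous figure does not depend on replica round trips at all.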

When to pick async

Datadog's stated priorities — "favouring scalability over strict consistency" — led them to pick async as the foundation of the entire managed replication platform, accepting minor data lag during failures as the bearable cost.

Async is the right default when:

  • Replicas are read-optimised workloads that can tolerate bounded staleness (search indexes, analytics lakes, dashboards).
  • The replica topology is cross-region or cross-cluster, making synchronous-write RTT unacceptable.
  • Throughput at the primary is the dominant optimisation target.
  • The workload has a natural CDC / streaming shape — Postgres logical replication → Debezium → Kafka is an async pipeline by construction.
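The "tolerate bounded staleness" criterion in the first bullet can be expressed as a per-read staleness budget. The sketch below is one hypothetical way to apply it; `choose_read_target` and the budget values are assumptions for illustration, not part of Datadog's platform.

```python
def choose_read_target(replica_lag_ms, max_staleness_ms):
    """Route a read to the replica when its lag fits the caller's staleness budget."""
    if replica_lag_ms < 0 or max_staleness_ms < 0:
        raise ValueError("lag and staleness budget must be non-negative")
    return "replica" if replica_lag_ms <= max_staleness_ms else "primary"


# A dashboard that tolerates two seconds of staleness can read from the replica.
print(choose_read_target(replica_lag_ms=500, max_staleness_ms=2000))
# A read that needs the latest committed state falls back to the primary.
print(choose_read_target(replica_lag_ms=500, max_staleness_ms=0))
```

Workloads whose budget is routinely smaller than the observed lag are a poor fit for asynchronous replicas, since most of their reads would land on the primary anyway.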

Numbers from Datadog

Replication lag on the Metrics Summary Postgres-to-search pipeline stayed around 500 ms — the canonical operating point for async CDC replication in Datadog's platform, good enough for a search experience where up-to-the-second consistency is not required.
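Replication lag for a single change is commonly computed as the time the replica applied it minus the time the primary committed it. The sketch below shows that arithmetic. The function name and timestamps are hypothetical, and how Datadog actually measures lag is not described in the source.

```python
from datetime import datetime, timedelta, timezone


def replication_lag_ms(committed_at, applied_at):
    """Lag of one change in milliseconds: replica apply time minus primary commit time."""
    return (applied_at - committed_at) / timedelta(milliseconds=1)


# Hypothetical change committed on the primary and applied downstream 500 ms later.
committed = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
applied = committed + timedelta(milliseconds=500)
print(replication_lag_ms(committed, applied))
```

When the two timestamps come from different machines, clock skew between them is folded into the result, so small lag values should be read with that in mind.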

Seen in
