CONCEPT
Asynchronous replication¶
Definition¶
Asynchronous replication is a replication posture in which the primary (source) system acknowledges a write to the client before secondary (replica) systems have applied it. The secondaries catch up afterward via a replication stream (a WAL tail, binlog, CDC topic, etc.), and consumers of replicas observe eventual consistency: they trail the primary by some bounded lag rather than reflecting its current state.
Contrast with synchronous replication, which acknowledges a write only after all (or a quorum of) replicas have applied it — stronger consistency at the cost of write-path latency and availability coupling to replica health.
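The acknowledgment difference is the whole distinction, so it is worth seeing in miniature. Below is a toy sketch (not any real database's implementation): the async primary acknowledges as soon as the write is in its own log and on the replication stream, while the sync primary blocks until every replica has applied it. The class and method names are illustrative only.

```python
import queue
import threading
import time

class Replica:
    def __init__(self):
        self.log = []

    def apply(self, record):
        time.sleep(0.01)           # simulated apply latency
        self.log.append(record)

class AsyncPrimary:
    """Acknowledges before replicas apply; a background thread ships the stream."""
    def __init__(self, replicas):
        self.log = []
        self.stream = queue.Queue()   # stands in for the WAL tail / binlog / CDC topic
        self.replicas = replicas
        threading.Thread(target=self._ship, daemon=True).start()

    def _ship(self):
        while True:
            record = self.stream.get()
            for r in self.replicas:
                r.apply(record)       # replicas catch up after the ack
            self.stream.task_done()

    def write(self, record):
        self.log.append(record)       # durable on the primary
        self.stream.put(record)       # replication happens after this returns
        return "ack"                  # client unblocked immediately

class SyncPrimary:
    """Acknowledges only after every replica has applied the write."""
    def __init__(self, replicas):
        self.log = []
        self.replicas = replicas

    def write(self, record):
        self.log.append(record)
        for r in self.replicas:
            r.apply(record)           # write latency coupled to the slowest replica
        return "ack"
```

A reader hitting an async replica between the primary's ack and the replica's apply sees stale data; that window is the replication lag the table below prices in.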
The trade-off, as Datadog framed it¶
Datadog's 2025-11-04 retrospective lays out the choice explicitly:
"Synchronous replication writes data to both the primary and replica systems at the same time, guaranteeing strong consistency — every write is acknowledged only after all replicas confirm receipt. This approach is ideal when real-time accuracy is critical, but it introduces significant latency and operational complexity, especially at scale and across distributed environments.
By contrast, asynchronous replication allows the primary system to acknowledge writes immediately, with data replicated to secondary systems afterward. This method is inherently more scalable and resilient in large-scale, high-throughput environments like Datadog's — it decouples application performance from network latency and replica responsiveness. While asynchronous replication can introduce minor data lag during failures, it enables robust, always-on data movement across thousands of services without bottlenecking on consistency guarantees." (Source: sources/2025-11-04-datadog-replication-redefined-multi-tenant-cdc-platform)
The trade-off, named flatly:
| Axis | Synchronous | Asynchronous |
|---|---|---|
| Consistency | Strong | Eventual (bounded lag) |
| Write-path latency | Coupled to slowest replica | Decoupled (primary acks immediately) |
| Availability | Coupled to replica health | Primary is independent |
| Scalability across distributed environments | Poor (network RTT dominates) | Good |
| Failure mode | Write unavailable if replica unavailable | Bounded data lag during replica outage |
When to pick async¶
Datadog's stated priorities — "favouring scalability over strict consistency" — led them to pick async as the foundation of the entire managed replication platform, accepting minor data lag during failures as the bearable cost.
Async is the right default when:
- Replicas are read-optimised workloads that can tolerate bounded staleness (search indexes, analytics lakes, dashboards).
- The replica topology is cross-region or cross-cluster, making synchronous-write RTT unacceptable.
- Throughput at primary is the dominant optimisation target.
- The workload has a natural CDC / streaming shape — Postgres logical replication → Debezium → Kafka is an async pipeline by construction.
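For the last bullet, the "async by construction" shape is concrete: Debezium tails Postgres logical replication and publishes change events to Kafka, with no path back to the primary's commit. A minimal Kafka Connect connector registration might look like the fragment below; hostnames, names, and the table list are placeholders, and the property set assumes Debezium 2.x (`topic.prefix` replaced the older `database.server.name`).

```json
{
  "name": "app-orders-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "db.example.internal",
    "database.port": "5432",
    "database.user": "replicator",
    "database.dbname": "app",
    "plugin.name": "pgoutput",
    "topic.prefix": "app",
    "table.include.list": "public.orders"
  }
}
```

Nothing in this pipeline waits on a downstream consumer before the primary commits, which is exactly the async posture.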
Numbers from Datadog¶
Replication lag on the Metrics Summary Postgres-to-search pipeline stayed around 500 ms — the canonical operating point for async CDC replication in Datadog's platform, good enough for a search experience where up-to-the-second consistency is not required.
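That ~500 ms figure is an event-time measurement: how far behind the primary's commit a consumer is when it applies a change. One common way to compute it (a sketch, not Datadog's code) uses the source commit timestamp that Debezium puts in each event envelope as `source.ts_ms`:

```python
from datetime import datetime, timezone, timedelta

def replication_lag(record: dict) -> timedelta:
    """Lag = consumer's wall clock minus the source commit time.

    `record` is assumed to be a Debezium-style envelope whose
    `source.ts_ms` field carries the commit timestamp in epoch millis.
    """
    commit_time = datetime.fromtimestamp(record["source"]["ts_ms"] / 1000,
                                         tz=timezone.utc)
    return datetime.now(tz=timezone.utc) - commit_time
```

Tracked as a percentile over the stream, this is the number an async pipeline's "bounded lag" promise is judged against.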
Related¶
- concepts/change-data-capture — CDC is async by design; the replication stream is where the lag lives.
- concepts/eventual-consistency — the consistency model replica readers are left with under async replication.
- concepts/strong-consistency — the trade-off partner.
- concepts/synchronization-tax — related cost class: async replication between a primary DB and a separate search engine introduces invalidation + reconciliation complexity; Datadog pays this tax in exchange for 97% page-latency reduction.
- concepts/wal-write-ahead-logging — Postgres WAL is the substrate under logical replication, which is the async replication primitive Debezium tails.
- systems/kafka — at-least-once async message transport is Kafka's default.
Seen in¶
- sources/2025-11-04-datadog-replication-redefined-multi-tenant-cdc-platform — Datadog chose async replication as the foundation of their managed multi-tenant CDC platform; explicit priority statement ("favouring scalability over strict consistency"); quoted trade-off analysis above.