CONCEPT Cited by 1 source
Asynchronous replication
Definition
Asynchronous replication is a replication posture in which the primary (source) system acknowledges a write to the client before secondary (replica) systems have applied it. The secondaries catch up afterward via a replication stream (a WAL tail, binlog, CDC topic, etc.), and consumers of replicas observe eventual consistency: replica state trails the primary by some replication lag rather than matching it at every instant.
Contrast with synchronous replication, which acknowledges a write only after all (or a quorum of) replicas have applied it. That buys stronger consistency at the cost of higher write-path latency and write availability that depends on replica health.
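The difference in acknowledgement order can be illustrated with a toy in-memory model. This is a minimal sketch, not any real database's protocol: the class names (`Replica`, `AsyncPrimary`, `SyncPrimary`), the `ship` method, and the use of a `deque` as the replication stream are all hypothetical, chosen only to show where the ack happens relative to replica apply.

```python
from collections import deque


class Replica:
    """Applies changes in order from a replication stream."""

    def __init__(self):
        self.data = {}
        self.applied_seq = 0  # sequence number of the last change applied

    def apply(self, seq, key, value):
        self.data[key] = value
        self.applied_seq = seq


class AsyncPrimary:
    """Acknowledges a write once it is applied locally and queued for shipping."""

    def __init__(self, replica):
        self.data = {}
        self.seq = 0
        self.replica = replica
        self.stream = deque()  # stands in for a WAL tail, binlog, or CDC topic

    def write(self, key, value):
        self.seq += 1
        self.data[key] = value
        self.stream.append((self.seq, key, value))
        return self.seq  # ack: the replica has not necessarily seen this yet

    def ship(self, max_changes=1):
        """Deliver up to max_changes queued changes to the replica."""
        for _ in range(min(max_changes, len(self.stream))):
            self.replica.apply(*self.stream.popleft())

    def lag(self):
        """Changes acknowledged to clients but not yet applied on the replica."""
        return self.seq - self.replica.applied_seq


class SyncPrimary(AsyncPrimary):
    """Acknowledges a write only after the replica has applied it."""

    def write(self, key, value):
        seq = super().write(key, value)
        self.ship(max_changes=len(self.stream))  # wait for the replica first
        return seq


replica = Replica()
primary = AsyncPrimary(replica)
primary.write("a", 1)
primary.write("a", 2)
print(primary.lag(), replica.data)  # 2 {}
primary.ship(2)
print(primary.lag(), replica.data)  # 0 {'a': 2}
```

Between the two `print` calls, both writes have been acknowledged but exist only on the primary. That window is the replication lag, and it is also the set of writes that would be lost if the primary failed before shipping them.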
The trade-off, as Datadog framed it
Datadog's 2025-11-04 retrospective lays out the choice explicitly:
"Synchronous replication writes data to both the primary and replica systems at the same time, guaranteeing strong consistency — every write is acknowledged only after all replicas confirm receipt. This approach is ideal when real-time accuracy is critical, but it introduces significant latency and operational complexity, especially at scale and across distributed environments.
By contrast, asynchronous replication allows the primary system to acknowledge writes immediately, with data replicated to secondary systems afterward. This method is inherently more scalable and resilient in large-scale, high-throughput environments like Datadog's — it decouples application performance from network latency and replica responsiveness. While asynchronous replication can introduce minor data lag during failures, it enables robust, always-on data movement across thousands of services without bottlenecking on consistency guarantees." (Source: sources/2025-11-04-datadog-replication-redefined-multi-tenant-cdc-platform)
The trade-off, named flatly:
| Axis | Synchronous | Asynchronous |
|---|---|---|
| Consistency | Strong | Eventual (bounded lag) |
| Write-path latency | Coupled to slowest replica | Decoupled (primary acks immediately) |
| Availability | Coupled to replica health | Primary is independent |
| Scalability across distributed environments | Poor (network RTT dominates) | Good |
| Failure mode | Writes unavailable if a required replica is unavailable | Data lag during replica outage; the primary keeps accepting writes |
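
The write-path latency row can be made concrete with a simplified model. The sketch below assumes replicas are contacted in parallel and ignores replica apply time, so the synchronous wait is just the round trip to the slowest replica whose ack is required. The function names and the RTT values are illustrative assumptions, not measurements from Datadog.

```python
def sync_ack_latency_ms(local_commit_ms, replica_rtts_ms, required_acks=None):
    """Ack latency when the primary waits for `required_acks` replicas.

    Replicas are contacted in parallel, so the wait equals the RTT of the
    required_acks-th fastest replica. Defaults to waiting for all replicas.
    """
    if required_acks is None:
        required_acks = len(replica_rtts_ms)
    if not 0 < required_acks <= len(replica_rtts_ms):
        raise ValueError("required_acks must be between 1 and the replica count")
    return local_commit_ms + sorted(replica_rtts_ms)[required_acks - 1]


def async_ack_latency_ms(local_commit_ms, replica_rtts_ms):
    """Ack latency when the primary does not wait for any replica."""
    return local_commit_ms


def max_serial_commits_per_sec(ack_latency_ms):
    """Upper bound on commits per second for one connection committing serially."""
    return 1000.0 / ack_latency_ms


# Hypothetical topology: two nearby replicas and one cross-region replica.
rtts = [2, 3, 60]
print(sync_ack_latency_ms(1, rtts))                    # 61 (wait for all)
print(sync_ack_latency_ms(1, rtts, required_acks=2))   # 4  (wait for a quorum of 2)
print(async_ack_latency_ms(1, rtts))                   # 1
print(round(max_serial_commits_per_sec(61), 1))        # 16.4
```

Under these assumptions a single distant replica dominates the all-replica synchronous path, while the asynchronous path is unaffected by replica distance.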
When to pick async
Datadog's stated priority, "favouring scalability over strict consistency", led them to build the entire managed replication platform on async, accepting minor data lag during failures as the cost.
Async is the right default when:
- Replicas are read-optimised workloads that can tolerate bounded staleness (search indexes, analytics lakes, dashboards).
- The replica topology is cross-region or cross-cluster, making synchronous-write RTT unacceptable.
- Throughput at the primary is the dominant optimisation target.
- The workload has a natural CDC / streaming shape — Postgres logical replication → Debezium → Kafka is an async pipeline by construction.
Numbers from Datadog
Replication lag on the Metrics Summary Postgres-to-search pipeline stayed around 500 ms — the canonical operating point for async CDC replication in Datadog's platform, good enough for a search experience where up-to-the-second consistency is not required.
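A lag figure of this kind is typically derived by comparing when a change was committed at the source with when it became visible at the sink. The sketch below is one hypothetical way to compute it: the function names, the nearest-rank percentile, and the sample timestamps are all assumptions for illustration, not Datadog's instrumentation or data.

```python
import math


def replication_lag_ms(committed_at_ms, applied_at_ms):
    """Lag for one change: sink apply time minus source commit time.

    Clamped at zero because clock skew between hosts can make the raw
    difference slightly negative.
    """
    return max(0, applied_at_ms - committed_at_ms)


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Hypothetical (commit, apply) timestamp pairs in epoch milliseconds.
samples = [(1000, 1450), (2000, 2520), (3000, 3480), (4000, 4900)]
lags = [replication_lag_ms(c, a) for c, a in samples]

print(lags)                  # [450, 520, 480, 900]
print(percentile(lags, 50))  # 480
print(percentile(lags, 99))  # 900
```

Tracking a high percentile alongside the median is a common choice for lag, since a pipeline that sits near its typical value most of the time can still have long-tail stalls that matter to replica readers.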
Related
- concepts/change-data-capture: CDC is async by design; the replication stream is where the lag lives.
- concepts/eventual-consistency: the consistency model that replica readers accept in exchange for async replication.
- concepts/strong-consistency: the trade-off partner.
- concepts/synchronization-tax: a related cost class. Async replication between a primary DB and a separate search engine introduces invalidation and reconciliation complexity; Datadog pays this tax in exchange for a 97% page-latency reduction.
- concepts/wal-write-ahead-logging: Postgres WAL is the substrate under logical replication, which is the async replication primitive Debezium tails.
- systems/kafka: at-least-once async message transport is Kafka's default.
Seen in
- Brian Morrison II (PlanetScale, 2023-11-15) describes async as MySQL's default replication mode and gives the operational rule that cross-region replication should be async, because semi-sync's per-write RTT cost is prohibitive at 60ms+ inter-region latency: "replicating across regions should be done in asynchronous mode so as to not cause unnecessary delay for the application making requests." Cross-AZ latency (single-digit ms per AWS docs) keeps semi-sync viable within a region; cross-region latency (60ms+ per cloudping.co) forces async. Canonicalised as patterns/async-replication-for-cross-region. The same source also states the no-validation property of async: "transactions will be sent to the source and then read by each replica and processed independently. There is no validation from the source that any replica in the environment processes the transaction." A commit ack from the primary says nothing about replica state; acknowledged transactions live only on the primary until the replica catches up.
- sources/2025-11-04-datadog-replication-redefined-multi-tenant-cdc-platform: Datadog chose async replication as the foundation of their managed multi-tenant CDC platform, with an explicit priority statement ("favouring scalability over strict consistency") and the trade-off analysis quoted above.