CONCEPT
WAL replication¶
WAL replication is the technique of shipping a primary's write-ahead log (WAL) to one or more replicas, which apply it to reproduce the primary's logical state. It is the replication substrate under HBase primary-standby inter-cluster replication, MySQL binlog replication, PostgreSQL streaming replication, and many other systems.
Definition¶
Every mutation on the primary is first appended to the WAL (for local crash recovery — see concepts/wal-write-ahead-logging). The same WAL record is then streamed to a replica, which replays it to apply the mutation. Because the WAL is the authoritative order-of-operations log, replicas that apply it deterministically converge on the same state.
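The append-then-stream-then-replay flow can be sketched in a few lines. This is a minimal illustration with a made-up record format and an in-memory "stream", not any real system's wire protocol:

```python
# Sketch of WAL shipping and replay. The record shape (op, key, value),
# the apply() semantics, and the in-memory WAL list are illustrative
# assumptions, not a real system's on-disk or wire format.

def apply(state, record):
    """Apply one WAL record to a key-value state, deterministically."""
    op, key, value = record
    if op == "put":
        state[key] = value
    elif op == "delete":
        state.pop(key, None)

# Primary: every mutation is appended to the WAL before it is applied.
wal = []
primary = {}
for record in [("put", "a", 1), ("put", "b", 2), ("delete", "a", None)]:
    wal.append(record)          # durability: WAL append comes first
    apply(primary, record)

# Replica: replays the same records in the same order.
replica = {}
for record in wal:              # in reality, streamed over the network
    apply(replica, record)

assert replica == primary       # deterministic replay converges
```

Because both sides run the same deterministic `apply` over the same totally ordered log, the final assert holds regardless of when the replica catches up.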
Two common variants:
- Intra-cluster WAL replication — replicas inside a single cluster share the same WAL stream for high availability (e.g. MySQL binlog replication between a primary and its read replicas).
- Inter-cluster WAL replication — the WAL stream crosses a cluster boundary for disaster recovery or geographic distribution. The canonical Pinterest HBase shape (patterns/primary-standby-wal-replication) uses this — primary cluster and standby cluster each have their own intra-cluster replication but are kept in sync with each other via WAL shipping (Source: sources/2024-05-14-pinterest-hbase-deprecation-at-pinterest).
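The two variants differ in where the consumer sits, not in the mechanism: one WAL stream can be fanned out to both an in-cluster replica and a remote standby. A sketch under assumed names (`Replica`, a synchronous fan-out loop; real shippers are asynchronous):

```python
# Sketch of one WAL stream fanned out to an intra-cluster replica
# and an inter-cluster standby. Class and variable names are
# illustrative assumptions; real shipping is asynchronous.

class Replica:
    def __init__(self):
        self.state = {}

    def replay(self, record):
        op, key, value = record
        if op == "put":
            self.state[key] = value
        elif op == "delete":
            self.state.pop(key, None)

local_replica = Replica()       # same cluster: high availability
remote_standby = Replica()      # other cluster: disaster recovery
sinks = [local_replica, remote_standby]

wal = [("put", "x", 1), ("put", "y", 2)]
for record in wal:
    for sink in sinks:          # one log, many consumers
        sink.replay(record)

assert local_replica.state == remote_standby.state
```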
Why it works¶
- Same order on every replica. The WAL is a totally-ordered per-primary log; replaying it in order gives a deterministic result given deterministic operators.
- Minimal overhead on the primary. The primary already wrote the WAL for its own durability; sending it to the replica is mostly network work.
- Natural point-in-time-recovery primitive. The WAL can be archived for backup and replayed to reconstruct state at any timestamp.
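The point-in-time-recovery property falls out of the log structure directly: replaying an archived WAL up to a target timestamp reconstructs the state as of that moment. A sketch with assumed timestamped records:

```python
# Sketch of point-in-time recovery from an archived WAL: replay
# records in order, stopping at a target timestamp. The record
# shape (ts, op, key, value) is an illustrative assumption.

def recover(wal, until_ts):
    """Rebuild state as of until_ts by replaying the WAL in order."""
    state = {}
    for ts, op, key, value in wal:
        if ts > until_ts:
            break               # stop at the recovery point
        if op == "put":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
    return state

archived_wal = [
    (100, "put", "k", "v1"),
    (200, "put", "k", "v2"),
    (300, "delete", "k", None),
]

assert recover(archived_wal, 250) == {"k": "v2"}   # state as of t=250
assert recover(archived_wal, 350) == {}            # state after the delete
```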
Tradeoffs¶
- Replicas can lag. If the WAL shipping channel is slow or the replica is overloaded, the replica drifts behind the primary — the root of most replication lag incidents.
- Bad writes are replicated faithfully. WAL replication replays the log as-is, so application-level mistakes and corrupt writes propagate to every replica. It offers no protection against logical data corruption.
- Large transactions can create replay hotspots. A single large transaction on the primary arrives as one batch at the replica, which must apply it atomically, stalling replay of everything queued behind it.
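Replication lag, the first tradeoff above, is typically measured as the gap between the primary's latest log position and the replica's last applied position. A sketch where log sequence numbers (LSNs) are plain counters and the threshold is an assumed value:

```python
# Sketch of a replication-lag health check. LSNs as plain integer
# counters and the threshold value are illustrative assumptions.

primary_last_lsn = 1042         # head of the primary's WAL
replica_applied_lsn = 1015      # last record the replica has replayed

lag_records = primary_last_lsn - replica_applied_lsn

LAG_THRESHOLD = 100             # hypothetical alerting threshold
replication_healthy = lag_records < LAG_THRESHOLD
```

Real systems expose the same idea natively (e.g. byte or LSN offsets in PostgreSQL, seconds-behind metrics in MySQL); alerting on this gap is the core observability signal for WAL replication health.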
Seen in¶
- sources/2024-05-14-pinterest-hbase-deprecation-at-pinterest — canonical inter-cluster WAL replication instance on the wiki: Pinterest's HBase primary + standby clusters are kept in sync via WAL shipping between cluster boundaries; each cluster is also three-way replicated internally. This is the mechanism behind the 6-replica-per-record cost structure (see concepts/replica-cost-tradeoff).
Related¶
- concepts/wal-write-ahead-logging — the local-durability primitive this is layered on.
- concepts/binlog-replication — MySQL's concrete instantiation.
- concepts/logical-replication — PostgreSQL's decoded-WAL variant.
- concepts/primary-standby-failover — the failover model that rides on WAL replication.
- concepts/replication-lag — the observability dimension that matters most for WAL replication health.
- patterns/primary-standby-wal-replication — the two-cluster deployment pattern this concept underpins.
- systems/hbase — the canonical substrate at Pinterest.