CONCEPT
WAL replication¶
WAL replication is the technique of shipping a primary's write-ahead log (WAL) to one or more replicas, which apply it to reproduce the primary's logical state. It is the replication substrate under HBase primary-standby inter-cluster replication, MySQL binlog replication, PostgreSQL streaming replication, and many other systems.
Definition¶
Every mutation on the primary is first appended to the WAL (for local crash recovery — see concepts/wal-write-ahead-logging). The same WAL record is then streamed to a replica, which replays it to apply the mutation. Because the WAL is the authoritative order-of-operations log, replicas that apply it deterministically converge on the same state.
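The append-then-stream-then-replay flow can be sketched in a few lines. This is a minimal illustration with a made-up record format and an in-memory "stream", not any real system's wire protocol:

```python
# Sketch of WAL shipping and replay. The record shape (op, key, value),
# the apply() semantics, and the in-memory WAL list are illustrative
# assumptions, not a real system's on-disk or wire format.

def apply(state, record):
    """Apply one WAL record to a key-value state, deterministically."""
    op, key, value = record
    if op == "put":
        state[key] = value
    elif op == "delete":
        state.pop(key, None)

# Primary: every mutation is appended to the WAL before it is applied.
wal = []
primary = {}
for record in [("put", "a", 1), ("put", "b", 2), ("delete", "a", None)]:
    wal.append(record)          # durability: WAL append comes first
    apply(primary, record)

# Replica: replays the same records in the same order.
replica = {}
for record in wal:              # in reality, streamed over the network
    apply(replica, record)

assert replica == primary       # deterministic replay converges
```

Because both sides run the same deterministic `apply` over the same totally ordered log, the final assert holds regardless of when the replica catches up.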
Two common variants:
- Intra-cluster WAL replication — replicas inside a single cluster share the same WAL stream for high availability (e.g. MySQL binlog replication between a primary and its read replicas).
- Inter-cluster WAL replication — the WAL stream crosses a cluster boundary for disaster recovery or geographic distribution. The canonical Pinterest HBase shape (patterns/primary-standby-wal-replication) uses this — primary cluster and standby cluster each have their own intra-cluster replication but are kept in sync with each other via WAL shipping (Source: sources/2024-05-14-pinterest-hbase-deprecation-at-pinterest).
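The two variants differ in where the consumer sits, not in the mechanism: one WAL stream can be fanned out to both an in-cluster replica and a remote standby. A sketch under assumed names (`Replica`, a synchronous fan-out loop; real shippers are asynchronous):

```python
# Sketch of one WAL stream fanned out to an intra-cluster replica
# and an inter-cluster standby. Class and variable names are
# illustrative assumptions; real shipping is asynchronous.

class Replica:
    def __init__(self):
        self.state = {}

    def replay(self, record):
        op, key, value = record
        if op == "put":
            self.state[key] = value
        elif op == "delete":
            self.state.pop(key, None)

local_replica = Replica()       # same cluster: high availability
remote_standby = Replica()      # other cluster: disaster recovery
sinks = [local_replica, remote_standby]

wal = [("put", "x", 1), ("put", "y", 2)]
for record in wal:
    for sink in sinks:          # one log, many consumers
        sink.replay(record)

assert local_replica.state == remote_standby.state
```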
Why it works¶
- Same order on every replica. The WAL is a totally-ordered per-primary log; replaying it in order gives a deterministic result given deterministic operators.
- Minimal overhead on the primary. The primary already wrote the WAL for its own durability; sending it to the replica is mostly network work.
- Natural point-in-time-recovery primitive. The WAL can be archived for backup and replayed to reconstruct state at any timestamp.
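The point-in-time-recovery property falls out of the log structure directly: replaying an archived WAL up to a target timestamp reconstructs the state as of that moment. A sketch with assumed timestamped records:

```python
# Sketch of point-in-time recovery from an archived WAL: replay
# records in order, stopping at a target timestamp. The record
# shape (ts, op, key, value) is an illustrative assumption.

def recover(wal, until_ts):
    """Rebuild state as of until_ts by replaying the WAL in order."""
    state = {}
    for ts, op, key, value in wal:
        if ts > until_ts:
            break               # stop at the recovery point
        if op == "put":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
    return state

archived_wal = [
    (100, "put", "k", "v1"),
    (200, "put", "k", "v2"),
    (300, "delete", "k", None),
]

assert recover(archived_wal, 250) == {"k": "v2"}   # state as of t=250
assert recover(archived_wal, 350) == {}            # state after the delete
```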
Tradeoffs¶
- Replicas can lag. If the WAL shipping channel is slow or the replica is overloaded, the replica drifts behind the primary — the root of most replication lag incidents.
- Bad writes are replicated faithfully. WAL replication replays the log as-is, so application-level mistakes and corrupt writes propagate to every replica. It offers no protection against logical data corruption.
- Large transactions can create replay hotspots. A single large transaction on the primary arrives as one batch at the replica, which must apply it atomically, stalling replay of everything queued behind it.
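Replication lag, the first tradeoff above, is typically measured as the gap between the primary's latest log position and the replica's last applied position. A sketch where log sequence numbers (LSNs) are plain counters and the threshold is an assumed value:

```python
# Sketch of a replication-lag health check. LSNs as plain integer
# counters and the threshold value are illustrative assumptions.

primary_last_lsn = 1042         # head of the primary's WAL
replica_applied_lsn = 1015      # last record the replica has replayed

lag_records = primary_last_lsn - replica_applied_lsn

LAG_THRESHOLD = 100             # hypothetical alerting threshold
replication_healthy = lag_records < LAG_THRESHOLD
```

Real systems expose the same idea natively (e.g. byte or LSN offsets in PostgreSQL, seconds-behind metrics in MySQL); alerting on this gap is the core observability signal for WAL replication health.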
Seen in¶
- sources/2024-05-14-pinterest-hbase-deprecation-at-pinterest — canonical inter-cluster WAL replication instance on the wiki: Pinterest's HBase primary + standby clusters are kept in sync via WAL shipping between cluster boundaries; each cluster is also three-way replicated internally. This is the mechanism behind the 6-replica-per-record cost structure (see concepts/replica-cost-tradeoff).
Related¶
- concepts/wal-write-ahead-logging — the local-durability primitive this is layered on.
- concepts/binlog-replication — MySQL's concrete instantiation.
- concepts/logical-replication — PostgreSQL's decoded-WAL variant.
- concepts/primary-standby-failover — the failover model that rides on WAL replication.
- concepts/replication-lag — the observability dimension that matters most for WAL replication health.
- patterns/primary-standby-wal-replication — the two-cluster deployment pattern this concept underpins.
- systems/hbase — the canonical substrate at Pinterest.