Skip to content

CONCEPT Cited by 1 source

Anti-entropy repair pause

Definition

Pausing anti-entropy repairs during a rolling upgrade is a pre-flight discipline specific to datastores with an explicit repair / reconciliation pass (most prominently Cassandra's nodetool repair, which uses Merkle trees to reconcile replica state). The repair process is paused for the duration of the upgrade and re-enabled in post-flight.

Why pause

Two reasons stack:

  1. Repair + mixed-version traffic is untested ground. Anti-entropy traffic during a mixed-version cluster has to flow across nodes speaking slightly different internal protocols. Pausing repair avoids amplifying load + protocol surface during the window the cluster is already unusual.
  2. Repairs move substantial data (streaming repair of replica deltas). During an upgrade the fleet is already under extra load (restarts, re-gossip, init-container image pulls) — concurrent repair just adds contention to a topology that's already in flux.

Pre-flight / post-flight bookends

The discipline is canonically a two-step bookend:

  • Pre-flight: pause anti-entropy repairs for the duration of the upgrade.
  • Post-flight: re-enable anti-entropy repairs on the now-homogeneous fleet.

Seen in

  • sources/2026-04-07-yelp-zero-downtime-cassandra-4x-upgrade — canonical wiki Seen-in. Yelp's pre-flight list includes "Ensuring that anti-entropy repairs (processes that synchronize data across all nodes in a Cassandra cluster to maintain consistency) for the cluster remain paused during the upgrade"; post-flight includes "Re-enabling anti-entropy repairs on the cluster." Additional Cassandra 4.1 win documented: the CASSANDRA-9143 fix restores the usability of incremental repairs that were broken under 3.11 — so post-upgrade the repair workload itself becomes cheaper.
Last updated · 476 distilled / 1,218 read