PATTERN Cited by 1 source

post-flight upgrade stages¶

Intent¶

Decompose a datastore / fleet upgrade into three explicit stages (pre-flight, flight, post-flight), each with a well-defined state-transition contract. This decomposition makes the upgrade scriptable end-to-end, separates reversible operations from committing ones, and provides natural checkpoint boundaries for human confirmation.

Problem¶

An upgrade runbook written as a flat sequence of steps mixes:

Reversible setup (pausing repairs, verifying backups).
Cluster-state-changing operations (node restarts, version flips).
Reversible cleanup (re-enabling repairs, notifying stakeholders).

Mixing these makes it unclear which steps are safe to retry, which are committing, and where a failed upgrade should resume from. It also makes checkpointing hard — is this the reversible setup phase, or have we crossed the commit boundary?

Solution¶

Three stages, each a distinct contract:

Pre-flight — reversible setup, all-or-nothing gate¶

Prepare the cluster for upgrade. Everything here is reversible and cheap; the stage is a gate — if any step fails, abort the upgrade with no cluster-state commitment.

Typical pre-flight steps (from Yelp's canonical instance):

Communicate with relevant stakeholders.
Ensure schema versions are fully in agreement across the cluster (see concepts/schema-disagreement).
Disable user-initiated schema changes for the duration.
Verify a full backup exists.
Pause anti-entropy repairs.
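
The all-or-nothing gate can be sketched as follows. This is a minimal illustration, not Yelp's implementation: `Step`, `PreflightAborted`, and `run_preflight` are hypothetical names, and each step is assumed to come with an `undo` that is safe to call after a successful `apply`.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """Hypothetical pre-flight step: an action plus its inverse."""
    name: str
    apply: Callable[[], None]  # raises on failure
    undo: Callable[[], None]   # assumed safe after a successful apply


class PreflightAborted(Exception):
    """Raised when the gate fails; completed steps have been undone."""


def run_preflight(steps: List[Step]) -> List[str]:
    """Run every step, or undo the completed ones and abort."""
    done: List[Step] = []
    for step in steps:
        try:
            step.apply()
        except Exception as exc:
            # Roll back in reverse order so the cluster is left as found.
            for prev in reversed(done):
                prev.undo()
            raise PreflightAborted(f"{step.name} failed: {exc}") from exc
        done.append(step)
    return [s.name for s in done]
```

Because nothing in this stage commits cluster state, a failure anywhere leaves the operator free to fix the cause and rerun the whole gate from the top.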

Flight — the upgrade itself, one unit at a time¶

The committing stage: cluster state changes here, and the fleet passes through the mixed-version state. Each unit-sized step is small and individually reversible, but the fleet as a whole is mid-upgrade until the stage completes.

Typical flight substructure (Yelp Cassandra):

Upgrade one node at a time within one data center.
Introduce the new-version-compatible proxy alongside the old once one node is upgraded.
Keep the last old-version node on the old version until the old-version proxy pool can be drained.
Stop the old-version proxy pool.
Upgrade the last node to complete the stage.
Repeat per data center in sequence.
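
The ordering above can be expressed as a plan generator. This is a sketch under stated assumptions: `flight_plan` and the step strings are hypothetical, node and data center names are placeholders, and each data center is assumed to have at least two nodes so that one node can stay on the old version while the old proxy pool drains.

```python
from typing import Dict, List, Tuple


def flight_plan(datacenters: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Return ordered (data center, step) pairs for the flight stage."""
    plan: List[Tuple[str, str]] = []
    # Dicts preserve insertion order, so data centers are handled in sequence.
    for dc, nodes in datacenters.items():
        if len(nodes) < 2:
            raise ValueError(f"{dc}: need at least two nodes for this ordering")
        first, *middle, last = nodes
        plan.append((dc, f"upgrade node {first}"))
        plan.append((dc, "start new-version proxy pool"))
        for node in middle:
            plan.append((dc, f"upgrade node {node}"))
        # The last node stays on the old version until the old pool is gone.
        plan.append((dc, "drain and stop old-version proxy pool"))
        plan.append((dc, f"upgrade node {last}"))
    return plan
```

Generating the plan up front, rather than deciding the next step on the fly, gives the operator a list to review before the first committing action runs.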

Post-flight — reversible cleanup, return to steady-state¶

Restore the pre-upgrade disciplines on the now-homogeneous cluster. Reversible / re-runnable.

Typical post-flight steps:

Re-enable anti-entropy repairs.
Re-enable user-initiated schema changes.
Notify stakeholders.
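
One way to make the restore steps re-runnable is to check state before changing it. The sketch below models the cluster as a plain dictionary of flags, which is an assumption made for illustration; `run_postflight` and the flag names are hypothetical, and stakeholder notification is left out because it is not a cluster-state change.

```python
from typing import Dict, List

# Hypothetical flag names for the disciplines paused during pre-flight.
RESTORE_FLAGS = ("repairs_enabled", "schema_changes_enabled")


def run_postflight(cluster: Dict[str, bool]) -> List[str]:
    """Re-enable paused disciplines; return the flags actually changed."""
    changed: List[str] = []
    for flag in RESTORE_FLAGS:
        # Only touch what is still disabled, so a rerun is a no-op.
        if not cluster.get(flag, False):
            cluster[flag] = True
            changed.append(flag)
    return changed
```

A second invocation returns an empty list, which is the property that makes it safe to rerun this stage after a partial failure.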

Structure¶

PRE-FLIGHT (gate)       →  FLIGHT (commit)           →  POST-FLIGHT (restore)
──────────────────────     ────────────────────────     ────────────────────────
pause repairs              upgrade node 1               re-enable repairs
disable schema changes     spin up new proxy            re-enable schema changes
verify backup              upgrade all but last node    notify stakeholders
schema agreement           drain old proxy
communicate                upgrade last node
[reversible]               [committing]                 [reversible]

Why this decomposition helps¶

Checkpoint boundaries are natural. The boundary between pre-flight and flight is the last fully reversible point. The boundary between flight and post-flight marks "the cluster is on the new version."
Scripting is easier. Each stage has a uniform entry / exit contract; the stages can be separate entry points for the script.
Human-confirmation dial maps cleanly. Auto-proceed is safe in pre-flight (reversible) and post-flight (reversible). Flight is where per-step confirmation earns its keep — see concepts/checkpointed-automation-script.
Reusable across datastores. The same structure applies to upgrades of other gossip-based or Paxos-based datastores, not just Cassandra.
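
The confirmation dial can be sketched as a per-stage policy. The names below (`AUTO`, `CONFIRM`, `DEFAULT_POLICY`, `run_stage`) are hypothetical, and the `confirm` callback stands in for whatever prompt mechanism a real script would use.

```python
from typing import Callable, List, Mapping, Tuple

AUTO, CONFIRM = "auto", "confirm"

# Assumed default: auto-proceed in the reversible stages, confirm in flight.
DEFAULT_POLICY: Mapping[str, str] = {
    "pre-flight": AUTO,
    "flight": CONFIRM,
    "post-flight": AUTO,
}


def run_stage(
    stage: str,
    steps: List[Tuple[str, Callable[[], None]]],
    confirm: Callable[[str], bool],
    policy: Mapping[str, str] = DEFAULT_POLICY,
) -> List[str]:
    """Run one stage's steps in order; return the names that executed."""
    executed: List[str] = []
    for name, action in steps:
        if policy[stage] == CONFIRM and not confirm(f"{stage}: run '{name}'?"):
            break  # operator declined; stop at this checkpoint
        action()
        executed.append(name)
    return executed
```

Because each stage is its own entry point, a rerun after a declined step can start from the same stage with the remaining steps instead of replaying the whole runbook.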

Seen in¶

sources/2026-04-07-yelp-zero-downtime-cassandra-4x-upgrade: the canonical instance. Yelp's Cassandra 3.11 → 4.1 upgrade across more than 1,000 nodes. Verbatim stage list:
"The pre-flight stage prepares the Cassandra cluster for the upgrade" — 5 steps listed.
"The actual upgrade is carried out during the flight stage, one DC at a time, in sequence" — 6 flight steps covering node rolls + Stargate fleet fan-in-fan-out.
"The post-flight operation involved re-enabling anti-entropy repairs ... [and] allowing user-initiated schema changes on the cluster."

concepts/rolling-upgrade — the upgrade idiom this pattern disciplines.
concepts/checkpointed-automation-script — the per-stage execution mode.
concepts/anti-entropy-repair-pause — canonical pre-flight / post-flight bookend.
concepts/mixed-version-cluster — the cluster state of the flight stage.
systems/apache-cassandra — canonical instance.