PATTERN Cited by 1 source
Pre-flight / flight / post-flight upgrade stages¶
Intent¶
Decompose a datastore or fleet upgrade into three explicit stages (pre-flight, flight, post-flight), each with a well-defined state-transition contract. This decomposition makes the upgrade scriptable end-to-end, draws a clear line between reversible and committing operations, and provides natural checkpoint boundaries for human confirmation.
Problem¶
An upgrade runbook written as a flat sequence of steps mixes:
- Reversible setup (pausing repairs, verifying backups).
- Cluster-state-changing operations (node restarts, version flips).
- Reversible cleanup (re-enabling repairs, notifying stakeholders).
Mixing these makes it unclear which steps are safe to retry, which are committing, and where a failed upgrade should resume from. It also makes checkpointing hard: at any given step, it is not obvious whether the run is still in reversible setup or has already crossed the commit boundary.
Solution¶
Three stages, each a distinct contract:
Pre-flight — reversible setup, all-or-nothing gate¶
Prepare the cluster for upgrade. Everything here is reversible and cheap; the stage is a gate — if any step fails, abort the upgrade with no cluster-state commitment.
Typical pre-flight steps (from Yelp's canonical instance):
- Communicate with relevant stakeholders.
- Ensure schema versions are fully in agreement across the cluster (see concepts/schema-disagreement).
- Disable user-initiated schema changes for the duration.
- Verify a full backup exists.
- Pause anti-entropy repairs.
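The all-or-nothing gate can be sketched as a list of steps that each carry their own undo action. This is a minimal illustration, not Yelp's implementation: `PreflightStep`, `PreflightAborted`, and `run_preflight` are hypothetical names, and the `apply` / `undo` callables stand in for whatever tooling actually pauses repairs or verifies backups.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreflightStep:
    """Hypothetical pre-flight step: an action plus the action that reverses it."""
    name: str
    apply: Callable[[], None]  # raises an exception on failure
    undo: Callable[[], None]   # restores the prior state; a no-op for pure checks


class PreflightAborted(Exception):
    """Raised when the gate fails; completed steps have already been undone."""


def run_preflight(steps: List[PreflightStep]) -> List[str]:
    """Run every step in order; on any failure, undo completed steps in reverse."""
    done: List[PreflightStep] = []
    for step in steps:
        try:
            step.apply()
        except Exception as exc:
            # Roll back in reverse order so the cluster is left as it was found.
            for finished in reversed(done):
                finished.undo()
            raise PreflightAborted(f"{step.name} failed: {exc}") from exc
        done.append(step)
    return [s.name for s in done]
```

Because a failed gate unwinds itself, the caller only has to distinguish two outcomes: the list of completed step names (safe to enter flight) or `PreflightAborted` (nothing was committed).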
Flight — the upgrade itself, one unit at a time¶
This is the committing stage: cluster state changes, and the cluster passes through the mixed-version state here. Each unit-sized step is small and individually reversible, but the fleet as a whole is mid-upgrade until the stage completes.
Typical flight substructure (Yelp Cassandra):
- Upgrade one node at a time within one data center.
- Introduce the new-version-compatible proxy alongside the old once one node is upgraded.
- Keep the last old-version node on the old version until the old-version proxy pool can be drained.
- Stop the old-version proxy pool.
- Upgrade the last node to complete the stage.
- Repeat per data center in sequence.
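The ordering in the list above can be made explicit by generating the flight plan as data before executing anything. The sketch below is an assumption-laden illustration: `flight_plan` and the operation names (`upgrade_node`, `start_new_proxy`, `stop_old_proxy`) are hypothetical, and it assumes one proxy pool per data center and at least two nodes per data center.

```python
from typing import Dict, Iterator, List, Tuple

Action = Tuple[str, str, str]  # (datacenter, operation, target)


def flight_plan(datacenters: Dict[str, List[str]]) -> Iterator[Action]:
    """Yield flight actions in order, one data center at a time.

    Operation names are hypothetical labels, not real commands.
    """
    for dc, nodes in datacenters.items():  # dicts preserve insertion order
        if len(nodes) < 2:
            raise ValueError(f"{dc}: need at least two nodes to hold one back")
        first, *middle, last = nodes
        yield (dc, "upgrade_node", first)
        # A new-version node now exists, so the new proxy has something to talk to.
        yield (dc, "start_new_proxy", dc)
        for node in middle:
            yield (dc, "upgrade_node", node)
        # The last old-version node stays up until the old proxy pool is stopped.
        yield (dc, "stop_old_proxy", dc)
        yield (dc, "upgrade_node", last)
```

Producing the plan separately from running it lets an operator review the full sequence up front, and gives a resume point (an index into the plan) if the stage is interrupted.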
Post-flight — reversible cleanup, return to steady-state¶
Restore the disciplines that pre-flight suspended, now on a homogeneous cluster. Every step is reversible and safe to re-run.
Typical post-flight steps:
- Re-enable anti-entropy repairs.
- Re-enable user-initiated schema changes.
- Notify stakeholders.
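"Re-runnable" here amounts to idempotence: each step sets a desired state rather than toggling it, so a second run changes nothing. The following sketch assumes the cluster's settings can be modeled as boolean flags; `ensure`, `run_postflight`, and the flag names are hypothetical.

```python
from typing import Callable, Dict, List

ClusterFlags = Dict[str, bool]


def ensure(flags: ClusterFlags, key: str, wanted: bool) -> bool:
    """Set flags[key] to wanted; return True only if a change was made."""
    if flags.get(key) == wanted:
        return False
    flags[key] = wanted
    return True


def run_postflight(flags: ClusterFlags, notify: Callable[[str], None]) -> List[str]:
    """Restore steady state; return the names of the steps that did work."""
    changed: List[str] = []
    for key in ("repairs_enabled", "schema_changes_enabled"):
        if ensure(flags, key, True):
            changed.append(key)
    # Record the notification as state too, so a re-run does not send it twice.
    if not flags.get("stakeholders_notified", False):
        notify("upgrade complete")
        flags["stakeholders_notified"] = True
        changed.append("stakeholders_notified")
    return changed
```

A run that was interrupted halfway can simply be started again: steps already in the desired state are skipped, and the returned list shows what the re-run actually did.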
Structure¶
PRE-FLIGHT (gate)            FLIGHT (commit)                POST-FLIGHT (restore)
─────────────────            ───────────────                ─────────────────────
pause repairs            →   upgrade node 1             →   re-enable repairs
disable schema changes   →   spin up new proxy          →   re-enable schema changes
verify backup            →   upgrade all but last node
schema agreement         →   drain old proxy
communicate              →   upgrade last node
                                                            notify stakeholders
[reversible]                 [committing]                   [reversible]
Why this decomposition helps¶
- Checkpoint boundaries are natural. The boundary between pre-flight and flight is the last fully reversible point. The boundary between flight and post-flight marks the point at which the whole cluster is on the new version.
- Scripting is easier. Each stage has a uniform entry / exit contract; the stages can be separate entry points for the script.
- Human-confirmation dial maps cleanly. Auto-proceed is safe in pre-flight (reversible) and post-flight (reversible). Flight is where per-step confirmation earns its keep — see concepts/checkpointed-automation-script.
- Reusable across datastores. The same structure applies to other gossip-based or Paxos-based datastore upgrades, not just Cassandra.
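The scripting and confirmation points above can be combined in one driver: each stage is a separate entry point, and only the flight stage asks for per-step confirmation. This is a hedged sketch under simple assumptions; `Stage`, `NEEDS_CONFIRMATION`, and `run_upgrade` are hypothetical names, and `confirm` stands in for however the operator is actually prompted.

```python
from enum import Enum
from typing import Callable, Dict, List, Optional, Tuple

Step = Tuple[str, Callable[[], None]]  # (step name, action)


class Stage(Enum):
    PRE_FLIGHT = "pre-flight"
    FLIGHT = "flight"
    POST_FLIGHT = "post-flight"


# Assumed policy: auto-proceed in the reversible stages, confirm each flight step.
NEEDS_CONFIRMATION = {
    Stage.PRE_FLIGHT: False,
    Stage.FLIGHT: True,
    Stage.POST_FLIGHT: False,
}


def run_upgrade(
    steps: Dict[Stage, List[Step]],
    confirm: Callable[[str], bool],
    start_at: Stage = Stage.PRE_FLIGHT,
) -> Optional[Stage]:
    """Run stages in order from start_at.

    Returns the stage at which the operator halted, or None if all completed.
    """
    ordered = list(Stage)  # Enum iteration follows definition order
    for stage in ordered[ordered.index(start_at):]:
        for name, action in steps.get(stage, []):
            if NEEDS_CONFIRMATION[stage] and not confirm(f"{stage.value}: {name}?"):
                return stage
            action()
    return None
```

A halted run returns the stage to resume from, so a later invocation can pass that value as `start_at` (a real script would also need to remember which step within the stage it reached).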
Seen in¶
- sources/2026-04-07-yelp-zero-downtime-cassandra-4x-upgrade — canonical wiki Seen-in. Yelp's > 1,000-node Cassandra 3.11 → 4.1 upgrade. Verbatim stage list:
- "The pre-flight stage prepares the Cassandra cluster for the upgrade" — 5 steps listed.
- "The actual upgrade is carried out during the flight stage, one DC at a time, in sequence" — 6 flight steps covering node rolls + Stargate fleet fan-in-fan-out.
- "The post-flight operation involved re-enabling anti-entropy repairs ... [and] allowing user-initiated schema changes on the cluster."
Related¶
- concepts/rolling-upgrade — the upgrade idiom this pattern disciplines.
- concepts/checkpointed-automation-script — the per-stage execution mode.
- concepts/anti-entropy-repair-pause — canonical pre-flight / post-flight bookend.
- concepts/mixed-version-cluster — the cluster state of the flight stage.
- systems/apache-cassandra — canonical instance.