PATTERN Cited by 1 source

Pre-flight / flight / post-flight upgrade stages

Intent

Decompose a datastore or fleet upgrade into three explicit stages (pre-flight, flight, post-flight), each with a well-defined state-transition contract. This decomposition makes the upgrade scriptable end-to-end, separates reversible operations from committing ones, and provides natural checkpoint boundaries for human confirmation.

Problem

An upgrade runbook written as a flat sequence of steps mixes:

  • Reversible setup (pausing repairs, verifying backups).
  • Cluster-state-changing operations (node restarts, version flips).
  • Reversible cleanup (re-enabling repairs, notifying stakeholders).

Mixing these makes it unclear which steps are safe to retry, which are committing, and where a failed upgrade should resume from. It also makes checkpointing hard: at any given step, is the operator still in the reversible setup phase, or has the commit boundary already been crossed?

Solution

Three stages, each a distinct contract:

Pre-flight — reversible setup, all-or-nothing gate

Prepare the cluster for upgrade. Everything here is reversible and cheap; the stage is a gate — if any step fails, abort the upgrade with no cluster-state commitment.

Typical pre-flight steps (from Yelp's canonical instance):

  • Communicate with relevant stakeholders.
  • Ensure schema versions are fully in agreement across the cluster (see concepts/schema-disagreement).
  • Disable user-initiated schema changes for the duration.
  • Verify a full backup exists.
  • Pause anti-entropy repairs.
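
The all-or-nothing gate can be sketched in a few lines of Python. This is a minimal illustration, not Yelp's implementation: the `Step` triple, `run_preflight`, and the stand-in actions are all hypothetical names, and the "actions" only record what a real script would do.

```python
from typing import Callable, List, Tuple

# (name, apply, undo); every name and callable here is hypothetical.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_preflight(steps: List[Step]) -> bool:
    """Apply steps in order. On the first failure, undo the steps that
    already succeeded (most recent first) and return False."""
    done: List[Step] = []
    for step in steps:
        name, apply, _ = step
        try:
            apply()
        except Exception as exc:
            print(f"pre-flight step {name!r} failed ({exc}); rolling back")
            for _, _, undo in reversed(done):
                undo()
            return False
        done.append(step)
    return True


# Demo with stand-in actions that only record what they would do.
log: List[str] = []


def missing_backup() -> None:
    raise RuntimeError("no full backup found")


steps: List[Step] = [
    ("disable schema changes",
     lambda: log.append("schema changes disabled"),
     lambda: log.append("schema changes re-enabled")),
    ("pause repairs",
     lambda: log.append("repairs paused"),
     lambda: log.append("repairs resumed")),
    ("verify backup", missing_backup, lambda: None),
]

gate_open = run_preflight(steps)
```

Because the backup check fails, the two earlier steps are undone in reverse order and the gate stays closed, so the flight stage would never start.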

Flight — the upgrade itself, one unit at a time

This is the committing stage: cluster state changes, and the cluster passes through the mixed-version state. Each unit-sized step is small and individually reversible, but the fleet as a whole is mid-upgrade until the stage completes.

Typical flight substructure (Yelp Cassandra):

  • Upgrade one node at a time within one data center.
  • Introduce the new-version-compatible proxy alongside the old once one node is upgraded.
  • Keep the last old-version node on the old version until the old-version proxy pool can be drained.
  • Stop the old-version proxy pool.
  • Upgrade the last node to complete the stage.
  • Repeat per data center in sequence.
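
The ordering above can be expressed as a plan generator. The sketch below only produces the sequence of actions as strings; the function names, node names, and data center names are hypothetical, and it assumes each data center has at least two nodes. Draining and stopping the old-version proxy pool are folded into one action for brevity.

```python
from typing import Dict, List


def flight_plan_for_dc(dc: str, nodes: List[str]) -> List[str]:
    """Order the flight actions for one data center."""
    if len(nodes) < 2:
        raise ValueError("sketch assumes at least two nodes per data center")
    first, *middle, last = nodes
    plan = [f"{dc}: upgrade {first}",
            f"{dc}: start new-version proxy pool"]
    # Every node except the last is upgraded while the old proxy pool
    # still has one old-version node to talk to.
    plan += [f"{dc}: upgrade {node}" for node in middle]
    plan += [f"{dc}: drain and stop old-version proxy pool",
             f"{dc}: upgrade {last}"]
    return plan


def flight_plan(fleet: Dict[str, List[str]]) -> List[str]:
    """Concatenate per-DC plans, one data center after another."""
    plan: List[str] = []
    for dc, nodes in fleet.items():  # dicts preserve insertion order
        plan += flight_plan_for_dc(dc, nodes)
    return plan


plan = flight_plan({"dc1": ["n1", "n2", "n3"], "dc2": ["n4", "n5"]})
```

Generating the plan separately from executing it keeps the ordering rule testable without touching a cluster.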

Post-flight — reversible cleanup, return to steady-state

Restore the pre-upgrade operational disciplines on the now-homogeneous cluster. Every step in this stage is reversible and safe to re-run.

Typical post-flight steps:

  • Re-enable anti-entropy repairs.
  • Re-enable user-initiated schema changes.
  • Notify stakeholders.
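
One way to make post-flight re-runnable is to have each step check the current state before acting. The sketch below models cluster state as a plain dictionary of flags; the flag names and `run_postflight` are hypothetical stand-ins for whatever a real script would query and change.

```python
from typing import Dict, List

# Hypothetical flags describing the steady state to restore.
STEADY_STATE_FLAGS = (
    "repairs_enabled",
    "schema_changes_enabled",
    "stakeholders_notified",
)


def run_postflight(cluster: Dict[str, bool]) -> List[str]:
    """Set every flag that is not yet in its steady-state value and
    return the names of the flags that were changed."""
    changed: List[str] = []
    for flag in STEADY_STATE_FLAGS:
        if not cluster.get(flag, False):
            cluster[flag] = True
            changed.append(flag)
    return changed


cluster = {"repairs_enabled": False, "schema_changes_enabled": False}
first_run = run_postflight(cluster)
second_run = run_postflight(cluster)  # nothing left to do
```

The second run changes nothing, which is the property that makes it safe to re-run the stage after an interruption.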

Structure

PRE-FLIGHT (gate)        FLIGHT (commit)                 POST-FLIGHT (restore)
──────────────────────   ─────────────────────────       ────────────────────────
pause repairs          → upgrade node 1              →   re-enable repairs
disable schema changes → spin up new proxy           →   re-enable schema changes
verify backup          → upgrade all but last node       notify stakeholders
schema agreement       → drain old proxy
communicate            → upgrade last node
    [reversible]             [committing]                    [reversible]

Why this decomposition helps

  • Checkpoint boundaries are natural. The boundary between pre-flight and flight is the last point at which the upgrade can be abandoned with no cluster-state commitment. The boundary between flight and post-flight marks "the cluster is on the new version."
  • Scripting is easier. Each stage has a uniform entry / exit contract; the stages can be separate entry points for the script.
  • Human-confirmation dial maps cleanly. Auto-proceed is safe in pre-flight (reversible) and post-flight (reversible). Flight is where per-step confirmation earns its keep — see concepts/checkpointed-automation-script.
  • Reusable across datastores. The same structure applies to other gossip-based or Paxos-based datastore upgrades, not just Cassandra.
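
The separate entry points and the confirmation dial can be sketched together. In this hypothetical example, `STAGES`, `run_stage`, and the step names are illustrative only: reversible stages auto-proceed, while the committing stage asks a `confirm` callback before every step and stops at the first refusal.

```python
from typing import Callable, Dict, List

# Hypothetical stage table; "committing" marks where confirmation applies.
STAGES: Dict[str, dict] = {
    "pre-flight": {"committing": False,
                   "steps": ["pause repairs", "verify backup"]},
    "flight": {"committing": True,
               "steps": ["upgrade node 1", "drain old proxy",
                         "upgrade last node"]},
    "post-flight": {"committing": False,
                    "steps": ["re-enable repairs", "notify stakeholders"]},
}


def run_stage(stage: str, confirm: Callable[[str], bool]) -> List[str]:
    """Run one stage; return the steps that were executed."""
    spec = STAGES[stage]
    executed: List[str] = []
    for step in spec["steps"]:
        # Only committing stages consult the operator before each step.
        if spec["committing"] and not confirm(step):
            break
        executed.append(step)  # a real script would perform the step here
    return executed


def never(step: str) -> bool:
    return False


pre = run_stage("pre-flight", never)
flight = run_stage("flight", lambda step: step != "drain old proxy")
```

Here `pre` contains every pre-flight step even though `never` refuses everything, while `flight` stops before the step the operator declined.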

Seen in

  • sources/2026-04-07-yelp-zero-downtime-cassandra-4x-upgrade: the canonical instance. Yelp's > 1,000-node Cassandra 3.11 → 4.1 upgrade. Verbatim stage list:
      • "The pre-flight stage prepares the Cassandra cluster for the upgrade" (5 steps listed).
      • "The actual upgrade is carried out during the flight stage, one DC at a time, in sequence" (6 flight steps covering node rolls + Stargate fleet fan-in-fan-out).
      • "The post-flight operation involved re-enabling anti-entropy repairs ... [and] allowing user-initiated schema changes on the cluster."