
PATTERN Cited by 1 source

Version-specific images per Git branch

Intent

During a major version migration of a fleet-wide system, ship both old and new major versions as parallel, independently deployable artifacts by publishing version-specific images from dedicated Git branches, and select which image each instance runs at bootstrap via an environment variable.

This lets the migration proceed cluster-by-cluster, pause indefinitely at any point, and roll clusters forward or back independently of client code or of each other.

Problem

A major-version bump of a core system (datastore, service framework, runtime) creates a hard block if there is only one current image:

  • Every cluster must be upgraded before any feature that needs new-version support can ship.
  • Rollback reverts the whole fleet.
  • Client teams are dragged into the migration window because their dependency version floats with the platform.

Adding a dependency on client teams "increases project complexity, potentially leading to a long tail of migrations" (Yelp's phrasing, Source: sources/2026-04-07-yelp-zero-downtime-cassandra-4x-upgrade).

Solution

  • Maintain a Git branch per major version (e.g. cassandra-3.11, cassandra-4.1).
  • Apply fixes to the appropriate branch(es); during the upgrade window, any 3.11 hotfix must also be ported to 4.1.
  • Publish per-branch Docker images (or equivalent artifact).
  • Select the image per cluster (or per instance) at bootstrap via a version-specific environment variable read by the deploy tooling.

The migration now proceeds per cluster:

  1. Flip the env var on cluster N to point at the new version.
  2. Deploy restarts cluster N's nodes one at a time with the new image.
  3. Cluster N+1 stays on the old version until its turn.

Rollback is symmetric: flip the env var back, restart.
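The per-cluster flow and its rollback can be sketched as a small simulation. This is a minimal illustration, not Yelp's deploy tooling: the Cluster class, set_version, and rolling_restart are hypothetical names, and the "restart" only updates an in-memory record of which image each node runs.

```python
from dataclasses import dataclass, field

OLD, NEW = "cassandra-3.11", "cassandra-4.1"  # branch names from the example above


@dataclass
class Cluster:
    """Hypothetical stand-in for one cluster and its configuration."""
    name: str
    version_tag: str = OLD                      # plays the role of the env var
    nodes: dict = field(default_factory=dict)   # node name -> image tag it runs


def rolling_restart(cluster: Cluster) -> None:
    """Restart nodes one at a time; each picks up the tag currently configured."""
    for node in sorted(cluster.nodes):
        # A real deploy would drain the node and wait for it to rejoin here.
        cluster.nodes[node] = cluster.version_tag


def set_version(cluster: Cluster, tag: str) -> None:
    """Flip the cluster's version tag, then roll its nodes onto that image."""
    cluster.version_tag = tag
    rolling_restart(cluster)


if __name__ == "__main__":
    a = Cluster("a", nodes={"a-n0": OLD, "a-n1": OLD})
    b = Cluster("b", nodes={"b-n0": OLD})
    set_version(a, NEW)   # cluster a upgrades; b is untouched
    print(a.nodes, b.nodes)
    set_version(a, OLD)   # rollback uses the same path with the old tag
    print(a.nodes, b.nodes)
```

Upgrade and rollback share one code path, which is what makes the rollback symmetric: only the value of the tag differs.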

Structure

Git repo
├── branch: cassandra-3.11
│   └── Dockerfile builds cassandra-3.11:latest image
└── branch: cassandra-4.1
    └── Dockerfile builds cassandra-4.1:latest image
Cluster configuration
└── environment variable CASSANDRA_VERSION_TAG=cassandra-3.11 | cassandra-4.1
Deploy tooling
└── pulls $CASSANDRA_VERSION_TAG at bootstrap time
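
The bootstrap-time selection in the layout above might look like the following sketch. The variable name CASSANDRA_VERSION_TAG and the `<tag>:latest` image naming are taken from that layout; the allow-list, the resolve_image helper, and the fallback to the old version are assumptions made for illustration, not details from the source.

```python
import os

# Hypothetical allow-list: one entry per maintained branch.
SUPPORTED_TAGS = ("cassandra-3.11", "cassandra-4.1")


def resolve_image(env=None, default="cassandra-3.11"):
    """Return the image reference an instance should pull at bootstrap.

    Assumption: instances with no explicit setting stay on the old version.
    """
    env = os.environ if env is None else env
    tag = env.get("CASSANDRA_VERSION_TAG", default)
    if tag not in SUPPORTED_TAGS:
        # Fail fast rather than pull an image no branch publishes.
        raise ValueError(f"unsupported CASSANDRA_VERSION_TAG: {tag!r}")
    return f"{tag}:latest"


if __name__ == "__main__":
    print(resolve_image({"CASSANDRA_VERSION_TAG": "cassandra-4.1"}))
    print(resolve_image({}))
```

Validating the tag against an explicit list keeps a typo in cluster configuration from turning into a failed image pull partway through a rolling restart.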

Trade-offs

  • Critical-fix duplication: during the upgrade window, any fix to the old branch must also land on the new branch. Yelp's rationale: "critical fixes for Cassandra 3.11 [were] expected to be rare" during the window, so the overhead was acceptable.
  • Artifact sprawl: more images to build, test, and store. Mitigate by keeping the window bounded.
  • Doubled test matrix: for the duration of the upgrade, every client combination must be tested against both versions.

Seen in

  • sources/2026-04-07-yelp-zero-downtime-cassandra-4x-upgrade: the canonical Seen-in for this pattern. Yelp ran Cassandra 3.11 and 4.1 images side by side across its fleet of more than 1,000 nodes during the upgrade window. Direct quote: "we achieved this by publishing version-specific Cassandra images from dedicated Git branches. The appropriate Cassandra image was selected at bootstrap time via version-specific environment variables." Avoiding "hard-blocking ourselves during the upgrade" was an explicit core principle.