PATTERN

Dual-run version-specific proxies

Intent

During a major-version upgrade of a datastore whose data proxy is pinned to a specific major version of the datastore, run two parallel proxy fleets — one per major version — registered under the same service-mesh alias so clients see a single endpoint. Drain the old proxy fleet only after all but one of the datastore nodes has moved to the new version.

Problem

Some data proxies can't span a major version boundary because they fetch schema from the datastore at startup (or continuously), and the schema-fetch mechanism changed between versions. The canonical example is Stargate: the 3.11 Stargate cannot pull schema from a 4.1 Cassandra node because Cassandra 4.1's MigrationCoordinator behaves differently.

If there's only one proxy fleet, you're forced into one of:

  • Upgrade proxy first, datastore second — but then the proxy can't talk to any datastore nodes until the datastore is fully upgraded.
  • Upgrade datastore first, proxy second — but then the proxy can't talk to any datastore nodes once the first node flips.

Either direction breaks production traffic.

Solution

Run two proxy fleets simultaneously during the upgrade window:

  • Old-version proxy fleet — pinned to the old datastore major version; seed list points at an old-version node.
  • New-version proxy fleet — pinned to the new datastore major version; seed list points at a new-version node.
  • Single service-mesh alias fronts both fleets so clients see one endpoint; the mesh can route to either instance.
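The single-alias idea can be sketched with a toy resolver: one logical name resolves to backends drawn from either fleet, so clients never change their endpoint. This is an illustrative round-robin resolver, not any real service-mesh API; the alias name matches the diagram below, but the backend names and addresses are assumptions.

```python
# Toy resolver: one alias fronts backends from both proxy fleets.
# Not a real mesh API -- just the routing behavior clients observe.
import itertools

BACKENDS = {
    "cassandra-gateway": [
        ("stargate-v3", "10.0.0.11:8082"),   # old-version fleet (hypothetical address)
        ("stargate-v4", "10.0.0.21:8082"),   # new-version fleet (hypothetical address)
    ],
}
_cycles = {alias: itertools.cycle(b) for alias, b in BACKENDS.items()}

def resolve(alias):
    """Round-robin across all backends registered under the alias."""
    return next(_cycles[alias])

# Successive lookups alternate between the two fleets:
picks = [resolve("cassandra-gateway")[0] for _ in range(4)]
# -> ['stargate-v3', 'stargate-v4', 'stargate-v3', 'stargate-v4']
```

In a real mesh the selection would also account for health checks and draining, but the client-visible contract is the same: one name, either fleet.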

As the datastore upgrade progresses through the flight stage:

  1. Upgrade one datastore node to new version.
  2. Spin up the new-version proxy fleet — now both proxy fleets are running.
  3. Monitor per-fleet, per-keyspace p99 latency and error rate to catch regressions early.
  4. Upgrade the remaining datastore nodes except the last one — the last node is deliberately held on the old version so the old-version proxy fleet's seed list still points at a live old-version node.
  5. Drain the old-version proxy fleet. No more old-version proxies.
  6. Upgrade the last datastore node.
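The sequence above can be simulated to check its key invariant: at every step, each running proxy fleet must have at least one datastore node on its pinned major version (otherwise that fleet cannot fetch schema on startup). This is a sketch with made-up cluster state, not orchestration code.

```python
# Simulate the flight-stage sequence and assert the seed invariant:
# every running proxy fleet needs >= 1 node on its pinned major version.
OLD, NEW = "3.11", "4.1"

def assert_seeds_ok(nodes, fleets):
    for version in fleets:
        assert version in nodes, f"fleet {version} has no seed node"

nodes = [OLD] * 5          # start: all nodes on the old major version
fleets = {OLD}             # only the old-version proxy fleet is running
assert_seeds_ok(nodes, fleets)

nodes[0] = NEW             # step 1: upgrade one datastore node
fleets.add(NEW)            # step 2: spin up the new-version fleet
assert_seeds_ok(nodes, fleets)

for i in range(1, 4):      # step 4: upgrade all but the last node
    nodes[i] = NEW
    assert_seeds_ok(nodes, fleets)   # last node still seeds the old fleet

fleets.discard(OLD)        # step 5: drain the old-version fleet
nodes[4] = NEW             # step 6: only now can the last node flip
assert_seeds_ok(nodes, fleets)
```

Reordering steps 5 and 6 (flipping the last node before draining) makes the invariant fail, which is exactly the non-obvious gate called out in the trade-offs.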

Structure

  Clients
  Service-mesh alias: "cassandra-gateway"
     ├────────► Stargate fleet v3.11  ──► seed = Cassandra-3.11 node
     └────────► Stargate fleet v4.1   ──► seed = Cassandra-4.1 node

  Cassandra cluster (rolling through flight stage):
    [3.11, 3.11, 3.11, 3.11, 3.11]       ← start
    [4.1,  3.11, 3.11, 3.11, 3.11]       ← one node flipped, new proxy spins up
    [4.1,  4.1,  4.1,  4.1,  3.11]       ← deliberately hold one 3.11 node
                                           drain v3.11 proxy fleet now
    [4.1,  4.1,  4.1,  4.1,  4.1 ]       ← all flipped
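Step 3 of the sequence calls for per-fleet, per-keyspace p99 monitoring. A minimal sketch of that check, with made-up latency samples and an assumed threshold (real numbers would come from the metrics pipeline):

```python
# Per-(fleet, keyspace) p99 regression check. Samples and threshold
# are illustrative assumptions, not Yelp's actual values.
import math

def p99(samples):
    s = sorted(samples)
    return s[min(len(s) - 1, math.ceil(0.99 * len(s)) - 1)]

latencies_ms = {                       # (fleet, keyspace) -> samples
    ("stargate-3.11", "reviews"): [3, 4, 4, 5, 40],
    ("stargate-4.1",  "reviews"): [3, 3, 4, 5, 90],
}
THRESHOLD_MS = 50

regressions = [
    key for key, samples in latencies_ms.items()
    if p99(samples) > THRESHOLD_MS
]
# flags the new fleet's keyspace: [('stargate-4.1', 'reviews')]
```

Keeping the fleet in the metric key is the point: a regression confined to the new-version fleet is invisible in an aggregate p99 while the old fleet still serves most traffic.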

Test-coverage implication

Running two proxy fleets simultaneously means clients can hit either path depending on mesh routing. Acceptance-test coverage must exercise both paths to catch API-surface deltas across the major versions. Yelp explicitly "expanded our acceptance test coverage across all services" during this upgrade for exactly this reason.
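One way to guarantee both paths are exercised is to parameterize the acceptance suite over explicit per-fleet endpoints rather than relying on mesh routing to eventually hit each fleet. The endpoints and check names below are illustrative assumptions:

```python
# Run the same acceptance checks against each proxy fleet explicitly,
# so API-surface deltas between major versions can't hide behind routing.
FLEET_ENDPOINTS = {
    "stargate-3.11": "http://stargate-v3.internal:8082",  # hypothetical
    "stargate-4.1":  "http://stargate-v4.internal:8082",  # hypothetical
}

def run_acceptance_suite(endpoint):
    # Placeholder for real API calls (auth, CRUD, schema reads) against
    # the given endpoint; returns (check_name, passed) pairs.
    return [("read_row", True), ("write_row", True)]

results = {
    fleet: run_acceptance_suite(url)
    for fleet, url in FLEET_ENDPOINTS.items()
}
failures = {
    fleet: [name for name, ok in checks if not ok]
    for fleet, checks in results.items()
}
assert all(not f for f in failures.values()), failures
```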

Trade-offs

  • Double proxy footprint during the upgrade window.
  • Mesh routing must be stable — clients must not see spurious identity differences between the two fleets.
  • Seed-list discipline is load-bearing — each fleet's seed pool must point at the matching datastore version.
  • The last-node-on-old-version gate is non-obvious and easy to miss when automating the upgrade.
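The seed-list discipline trade-off lends itself to a preflight check: before (re)starting a proxy fleet, verify every seed in its list reports the fleet's pinned major version. The node inventory below is invented; in practice the versions would come from gossip or `nodetool` output.

```python
# Preflight: every seed must be on the fleet's pinned major version.
def major(version):
    return ".".join(version.split(".")[:2])   # "4.1.3" -> "4.1"

def validate_seed_list(pinned_major, seeds, node_versions):
    bad = [s for s in seeds if major(node_versions[s]) != pinned_major]
    if bad:
        raise ValueError(f"seeds {bad} not on major {pinned_major}")

# Hypothetical inventory mid-upgrade:
node_versions = {"db-1": "4.1.3", "db-2": "3.11.16", "db-3": "4.1.3"}

validate_seed_list("4.1", ["db-1", "db-3"], node_versions)   # passes
try:
    validate_seed_list("4.1", ["db-2"], node_versions)       # mismatched seed
except ValueError as e:
    error = str(e)
```

Wiring this check into the automation also guards the last-node gate: draining the old fleet too early, or flipping the last node too early, surfaces as a seed-validation failure instead of a silent startup hang.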

Seen in

  • sources/2026-04-07-yelp-zero-downtime-cassandra-4x-upgrade — canonical wiki Seen-in. Yelp's Cassandra 3.11 → 4.1 upgrade across > 1,000 nodes. Direct quote: "Ultimately, we opted for version-specific Stargate instances, each relying on the corresponding version of the Cassandra persistence layer. During this process, we ensured that the seed list of the proxy always pointed to a Cassandra node running the matching major version." The kept-on-old-version last node gate is named explicitly in the flight-stage sequence diagram.