PATTERN

Dual-run version-specific proxies

Intent

During a major-version upgrade of a datastore whose data proxy is pinned to a specific major version of the datastore, run two parallel proxy fleets — one per major version — registered under the same service-mesh alias so clients see a single endpoint. Drain the old proxy fleet only after all but one of the datastore nodes has moved to the new version.

Problem

Some data proxies can't span a major version boundary because they fetch schema from the datastore at startup (or continuously), and the schema-fetch mechanism changed between versions. The canonical example is Stargate: the 3.11 Stargate cannot pull schema from a 4.1 Cassandra node because Cassandra 4.1's MigrationCoordinator behaves differently.

If there's only one proxy fleet, you're forced into one of:

  • Upgrade proxy first, datastore second — but then the proxy can't talk to any datastore nodes until the datastore is fully upgraded.
  • Upgrade datastore first, proxy second — but then the proxy can't talk to any datastore nodes once the first node flips.

Either direction breaks production traffic.

Solution

Run two proxy fleets simultaneously during the upgrade window:

  • Old-version proxy fleet — pinned to the old datastore major version; seed list points at an old-version node.
  • New-version proxy fleet — pinned to the new datastore major version; seed list points at a new-version node.
  • Single service-mesh alias fronts both fleets so clients see one endpoint; the mesh can route to either instance.
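The single-alias idea can be sketched with a toy resolver: one logical name resolves to backends drawn from either fleet, so clients never change their endpoint. This is an illustrative round-robin resolver, not any real service-mesh API; the alias name matches the diagram below, but the backend names and addresses are assumptions.

```python
# Toy resolver: one alias fronts backends from both proxy fleets.
# Not a real mesh API -- just the routing behavior clients observe.
import itertools

BACKENDS = {
    "cassandra-gateway": [
        ("stargate-v3", "10.0.0.11:8082"),   # old-version fleet (hypothetical address)
        ("stargate-v4", "10.0.0.21:8082"),   # new-version fleet (hypothetical address)
    ],
}
_cycles = {alias: itertools.cycle(b) for alias, b in BACKENDS.items()}

def resolve(alias):
    """Round-robin across all backends registered under the alias."""
    return next(_cycles[alias])

# Successive lookups alternate between the two fleets:
picks = [resolve("cassandra-gateway")[0] for _ in range(4)]
# -> ['stargate-v3', 'stargate-v4', 'stargate-v3', 'stargate-v4']
```

In a real mesh the selection would also account for health checks and draining, but the client-visible contract is the same: one name, either fleet.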

As the datastore upgrade progresses through the flight stage:

  1. Upgrade one datastore node to new version.
  2. Spin up the new-version proxy fleet — now both proxy fleets are running.
  3. Monitor per-fleet, per-keyspace p99 latency and error rate to catch regressions early.
  4. Upgrade the remaining datastore nodes except the last one — the last node is deliberately held on the old version so the old-version proxy fleet's seed list still points at a live old-version node.
  5. Drain the old-version proxy fleet. No more old-version proxies.
  6. Upgrade the last datastore node.
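The sequence above can be simulated to check its key invariant: at every step, each running proxy fleet must have at least one datastore node on its pinned major version (otherwise that fleet cannot fetch schema on startup). This is a sketch with made-up cluster state, not orchestration code.

```python
# Simulate the flight-stage sequence and assert the seed invariant:
# every running proxy fleet needs >= 1 node on its pinned major version.
OLD, NEW = "3.11", "4.1"

def assert_seeds_ok(nodes, fleets):
    for version in fleets:
        assert version in nodes, f"fleet {version} has no seed node"

nodes = [OLD] * 5          # start: all nodes on the old major version
fleets = {OLD}             # only the old-version proxy fleet is running
assert_seeds_ok(nodes, fleets)

nodes[0] = NEW             # step 1: upgrade one datastore node
fleets.add(NEW)            # step 2: spin up the new-version fleet
assert_seeds_ok(nodes, fleets)

for i in range(1, 4):      # step 4: upgrade all but the last node
    nodes[i] = NEW
    assert_seeds_ok(nodes, fleets)   # last node still seeds the old fleet

fleets.discard(OLD)        # step 5: drain the old-version fleet
nodes[4] = NEW             # step 6: only now can the last node flip
assert_seeds_ok(nodes, fleets)
```

Reordering steps 5 and 6 (flipping the last node before draining) makes the invariant fail, which is exactly the non-obvious gate called out in the trade-offs.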

Structure

  Clients
  Service-mesh alias: "cassandra-gateway"
     ├────────► Stargate fleet v3.11  ──► seed = Cassandra-3.11 node
     └────────► Stargate fleet v4.1   ──► seed = Cassandra-4.1 node

  Cassandra cluster (rolling through flight stage):
    [3.11, 3.11, 3.11, 3.11, 3.11]       ← start
    [4.1,  3.11, 3.11, 3.11, 3.11]       ← one node flipped, new proxy spins up
    [4.1,  4.1,  4.1,  4.1,  3.11]       ← deliberately hold one 3.11 node
                                           drain v3.11 proxy fleet now
    [4.1,  4.1,  4.1,  4.1,  4.1 ]       ← all flipped
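Step 3 of the sequence calls for per-fleet, per-keyspace p99 monitoring. A minimal sketch of that check, with made-up latency samples and an assumed threshold (real numbers would come from the metrics pipeline):

```python
# Per-(fleet, keyspace) p99 regression check. Samples and threshold
# are illustrative assumptions, not Yelp's actual values.
import math

def p99(samples):
    s = sorted(samples)
    return s[min(len(s) - 1, math.ceil(0.99 * len(s)) - 1)]

latencies_ms = {                       # (fleet, keyspace) -> samples
    ("stargate-3.11", "reviews"): [3, 4, 4, 5, 40],
    ("stargate-4.1",  "reviews"): [3, 3, 4, 5, 90],
}
THRESHOLD_MS = 50

regressions = [
    key for key, samples in latencies_ms.items()
    if p99(samples) > THRESHOLD_MS
]
# flags the new fleet's keyspace: [('stargate-4.1', 'reviews')]
```

Keeping the fleet in the metric key is the point: a regression confined to the new-version fleet is invisible in an aggregate p99 while the old fleet still serves most traffic.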

Test-coverage implication

Running two proxy fleets simultaneously means clients can hit either path depending on mesh routing. Acceptance-test coverage must exercise both paths to catch API-surface deltas across the major versions. Yelp explicitly "expanded our acceptance test coverage across all services" during this upgrade for exactly this reason.
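One way to guarantee both paths are exercised is to parameterize the acceptance suite over explicit per-fleet endpoints rather than relying on mesh routing to eventually hit each fleet. The endpoints and check names below are illustrative assumptions:

```python
# Run the same acceptance checks against each proxy fleet explicitly,
# so API-surface deltas between major versions can't hide behind routing.
FLEET_ENDPOINTS = {
    "stargate-3.11": "http://stargate-v3.internal:8082",  # hypothetical
    "stargate-4.1":  "http://stargate-v4.internal:8082",  # hypothetical
}

def run_acceptance_suite(endpoint):
    # Placeholder for real API calls (auth, CRUD, schema reads) against
    # the given endpoint; returns (check_name, passed) pairs.
    return [("read_row", True), ("write_row", True)]

results = {
    fleet: run_acceptance_suite(url)
    for fleet, url in FLEET_ENDPOINTS.items()
}
failures = {
    fleet: [name for name, ok in checks if not ok]
    for fleet, checks in results.items()
}
assert all(not f for f in failures.values()), failures
```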

Trade-offs

  • Double proxy footprint during the upgrade window.
  • Mesh routing must be stable — clients must not see spurious identity differences between the two fleets.
  • Seed-list discipline is load-bearing — each fleet's seed pool must point at the matching datastore version.
  • The last-node-on-old-version gate is non-obvious and easy to miss when automating the upgrade.
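The seed-list discipline trade-off lends itself to a preflight check: before (re)starting a proxy fleet, verify every seed in its list reports the fleet's pinned major version. The node inventory below is invented; in practice the versions would come from gossip or `nodetool` output.

```python
# Preflight: every seed must be on the fleet's pinned major version.
def major(version):
    return ".".join(version.split(".")[:2])   # "4.1.3" -> "4.1"

def validate_seed_list(pinned_major, seeds, node_versions):
    bad = [s for s in seeds if major(node_versions[s]) != pinned_major]
    if bad:
        raise ValueError(f"seeds {bad} not on major {pinned_major}")

# Hypothetical inventory mid-upgrade:
node_versions = {"db-1": "4.1.3", "db-2": "3.11.16", "db-3": "4.1.3"}

validate_seed_list("4.1", ["db-1", "db-3"], node_versions)   # passes
try:
    validate_seed_list("4.1", ["db-2"], node_versions)       # mismatched seed
except ValueError as e:
    error = str(e)
```

Wiring this check into the automation also guards the last-node gate: draining the old fleet too early, or flipping the last node too early, surfaces as a seed-validation failure instead of a silent startup hang.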

Seen in

  • sources/2026-04-07-yelp-zero-downtime-cassandra-4x-upgrade — canonical wiki Seen-in. Yelp's Cassandra 3.11 → 4.1 upgrade across > 1,000 nodes. Direct quote: "Ultimately, we opted for version-specific Stargate instances, each relying on the corresponding version of the Cassandra persistence layer. During this process, we ensured that the seed list of the proxy always pointed to a Cassandra node running the matching major version." The kept-on-old-version last node gate is named explicitly in the flight-stage sequence diagram.