Skip to content

CONCEPT Cited by 1 source

Performance regression from mid-upgrade state

Definition

A performance regression from mid-upgrade state is a latency / throughput degradation observed during a rolling upgrade while the cluster is in a mixed-version state — and that resolves on its own once every node has been flipped to the new version.

The critical operational property: the regression is a signal of cluster state, not a bug to fix forward. The only durable remediation is to complete the upgrade.

Why this happens

During a mixed-version rolling upgrade:

  • Inter-node RPC falls back to a protocol subset both versions can negotiate — typically slower than either pure state.
  • Query routing is imbalanced — new-version nodes may be faster per request but there are fewer of them; the mix of fast and slow replicas per keyspace changes the tail.
  • Internal caches / prepared statements are cold on newly- restarted nodes for a window after each node comes back.
  • Ecosystem proxies may be running in two versioned fleets (see patterns/dual-run-version-specific-proxies) with per-version tail latencies that mix unhelpfully.

Why naming it matters

Without this concept, mid-upgrade latency spikes look like a regression in the new version — which invites the wrong response ("roll back") instead of the right response ("finish the upgrade"). Separating the two failure modes requires the operator to have the mental model that:

Mid-upgrade latency = cluster-state regression → finish the upgrade.

Post-upgrade latency = new-version regression → investigate and possibly roll back.

Seen in

  • sources/2026-04-07-yelp-zero-downtime-cassandra-4x-upgrade — canonical wiki Seen-in. Verbatim: "In some cases, we also observed elevated latency while the Cassandra cluster contained a mix of 3.11 and 4.1 nodes. This was transient and resolved once all nodes were upgraded." Yelp's benchmarking dashboards — established pre-migration — were the instrumentation that let them distinguish the transient mixed-version regression from the genuine Stargate 2.x regression on range/multi-partition queries (which did require rollback to Stargate 1.x).

Contrast — genuine version regression

The Yelp ingest also names a genuine post-upgrade regression: Stargate 2.x regressed on range and multi-partition queries, not a mid-upgrade artefact. That was identified by observing that the degradation persisted after all nodes were upgraded — the diagnostic test for telling the two apart.

Last updated · 476 distilled / 1,218 read