CONCEPT Cited by 1 source
Performance regression from mid-upgrade state
Definition
A performance regression from mid-upgrade state is a latency / throughput degradation observed during a rolling upgrade while the cluster is in a mixed-version state — and that resolves on its own once every node has been flipped to the new version.
The critical operational property: the regression is a signal of cluster state, not a bug to fix forward. The only durable remediation is to complete the upgrade.
Why this happens
During a mixed-version rolling upgrade:
- Inter-node RPC falls back to a protocol subset both versions can negotiate — typically slower than either pure state.
- Query routing is imbalanced — new-version nodes may be faster per request but there are fewer of them; the mix of fast and slow replicas per keyspace changes the tail.
- Internal caches / prepared statements are cold on newly restarted nodes for a window after each node comes back.
- Ecosystem proxies may be running in two versioned fleets (see patterns/dual-run-version-specific-proxies) with per-version tail latencies that mix unhelpfully.
Why naming it matters
Without this concept, mid-upgrade latency spikes look like a regression in the new version — which invites the wrong response ("roll back") instead of the right response ("finish the upgrade"). Separating the two failure modes requires the operator to have the mental model that:
Mid-upgrade latency = cluster-state regression → finish the upgrade.
Post-upgrade latency = new-version regression → investigate and possibly roll back.
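The two-way rule above can be written down as a small decision function. This is a minimal sketch, not something taken from the Yelp source: the names `Verdict` and `classify_latency`, the per-node version list, and the `tolerance` threshold are all hypothetical, and a cluster reporting a single version is assumed to be fully upgraded.

```python
from enum import Enum


class Verdict(Enum):
    HEALTHY = "healthy"
    FINISH_UPGRADE = "finish the upgrade"
    INVESTIGATE = "investigate and possibly roll back"


def classify_latency(node_versions, p99_ms, baseline_p99_ms, tolerance=1.2):
    """Map (cluster state, latency) to an operator response.

    node_versions: one version string per node (hypothetical input shape).
    tolerance: hypothetical threshold; p99 above baseline * tolerance
    counts as degraded.
    Assumes the upgrade has started, so a uniform cluster is fully upgraded.
    """
    versions = set(node_versions)
    if not versions:
        raise ValueError("node_versions must not be empty")

    degraded = p99_ms > baseline_p99_ms * tolerance
    if not degraded:
        return Verdict.HEALTHY
    if len(versions) > 1:
        # Mixed-version cluster: treat as a cluster-state regression.
        return Verdict.FINISH_UPGRADE
    # Single version everywhere: treat as a new-version regression.
    return Verdict.INVESTIGATE
```

For example, `classify_latency(["3.11", "4.1", "4.1"], 30.0, 10.0)` yields `Verdict.FINISH_UPGRADE`, while the same latency on an all-4.1 cluster yields `Verdict.INVESTIGATE`. The sketch depends on having a pre-migration baseline to compare against.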
Seen in
- sources/2026-04-07-yelp-zero-downtime-cassandra-4x-upgrade — canonical wiki Seen-in. Verbatim: "In some cases, we also observed elevated latency while the Cassandra cluster contained a mix of 3.11 and 4.1 nodes. This was transient and resolved once all nodes were upgraded." Yelp's benchmarking dashboards — established pre-migration — were the instrumentation that let them distinguish the transient mixed-version regression from the genuine Stargate 2.x regression on range/multi-partition queries (which did require rollback to Stargate 1.x).
Contrast: genuine version regression
The Yelp ingest also names a genuine post-upgrade regression, as opposed to a mid-upgrade artefact: Stargate 2.x regressed on range and multi-partition queries. It was identified because the degradation persisted after all nodes were upgraded, which is the diagnostic test for telling the two apart.
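The persistence test can be sketched over a latency time series. This is an illustrative sketch under stated assumptions, not Yelp's tooling: the function name `regression_persists`, the `(timestamp, latency_ms)` sample shape, the use of the median, and the `tolerance` threshold are all hypothetical.

```python
from statistics import median


def regression_persists(samples, upgrade_done_at, baseline_ms, tolerance=1.2):
    """Return True if latency is still elevated after the last node upgraded.

    samples: list of (timestamp, latency_ms) pairs (hypothetical shape).
    upgrade_done_at: timestamp at which every node ran the new version.
    baseline_ms: latency recorded before the migration began.
    """
    # Only samples taken after the cluster left the mixed-version state count.
    post = [latency for ts, latency in samples if ts >= upgrade_done_at]
    if not post:
        raise ValueError("no samples after upgrade completion")
    return median(post) > baseline_ms * tolerance
```

A `False` result is consistent with a transient mixed-version regression. A `True` result points at the new version itself and warrants investigation.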
Related
- concepts/mixed-version-cluster — the broader cluster state this regression is a property of.
- concepts/rolling-upgrade — the upgrade idiom.
- concepts/observability-before-migration — the dashboard discipline required to tell transient from genuine regressions.
- systems/apache-cassandra — canonical user context.