Skip to content

PATTERN Cited by 1 source

Production qualification criteria upfront

Intent

Before a fleet-wide upgrade / migration begins, write down the criteria the post-upgrade state must meet — latency, throughput, functional equivalence, security posture, rollback readiness, observability sufficiency — and treat the criteria as a go / no-go gate for each cluster.

Problem

Without upfront criteria, the go/no-go decision gets made under pressure during the upgrade. The shape of the argument becomes:

"Latency is 7% higher than before, but that's probably within noise, right? We're 70% through the fleet — do we really want to roll back now?"

Under time pressure, each individual argument to proceed looks reasonable; the cumulative effect is a silent erosion of what "successful upgrade" meant when the project was approved.

Solution

Write the criteria down once, up front, make them specific and measurable, and evaluate every cluster against them before moving to the next.

Yelp's Cassandra 4.x upgrade canonicalises six criteria:

  1. Performance — no degradation in latency, throughput, uptime, resource utilization, or corresponding SLOs.
  2. Functional regressions — no breaking API changes or bugs (caught by expanded acceptance-test coverage across both major versions).
  3. Security posture — intact during and after the upgrade.
  4. Deployment + rollback plans — comprehensive plans must exist before the upgrade starts.
  5. Observability sufficient to track the progress of upgrades (see concepts/observability-before-migration).
  6. All interacting components operational — every component that talks to the upgraded datastore must remain fully operational, reliable, and functionally correct.

Verbatim from the Yelp post: "We developed the following production qualification criteria."

Structure

The criteria are authored as a checklist:

Qualification criteria (Cassandra 4.x)
[ ] Performance: p99/p50/throughput within ±5% of pre-upgrade
[ ] Functional: no failing acceptance tests cross-version
[ ] Security: no regression in cert rotation, auth, TLS
[ ] Deployment plan + rollback plan reviewed and signed off
[ ] Observability: per-version dashboards + alerts exist
[ ] Components: Stargate, CDC, Sink, Spark Connector, Pushplan
                each pass their own sub-check

Per-cluster, the checklist is run both pre-upgrade (is the setup ready?) and post-upgrade (did we actually meet the bar?) — and only then is the cluster marked done.

Why upfront matters

  • Decouples criteria-authoring from criteria-evaluation. Authored calmly, evaluated under pressure.
  • Keeps the bar visible — everyone looking at the upgrade sees the same success definition.
  • Forces the rollback plan to exist before it's needed. "Comprehensive deployment and rollback plans must be in place" is a criterion, not a wish.
  • Binds observability to completion. If you can't observe it, you can't say it met the criteria — forces instrumentation to land before the upgrade does.

Seen in

Last updated · 476 distilled / 1,218 read