Yelp — Zero-Downtime Upgrade: Yelp's Cassandra 4.x Upgrade Story¶
Summary¶
Yelp Engineering post (2026-04-07) from the Database Reliability
Engineering team on upgrading more than a thousand
Cassandra nodes from 3.11 to 4.1
with zero downtime. The cluster fleet runs on
Kubernetes via Yelp's Cassandra operator;
the upgrade rolls through in place rather than via a new data
center. Core structural move: version-specific Cassandra images
published from dedicated Git branches, selected at bootstrap
via environment variables, so 3.11 and 4.1 can coexist across the
fleet and individual clusters can be rolled forward (or back)
independently of client code. The upgrade of each cluster is
driven by a checkpointed automation script that orchestrates
pre-flight (communications, schema-agreement check, disable
schema changes, verify backup, pause anti-entropy repairs),
flight (one-node-at-a-time rolling upgrade with an interim
4.1-compatible Stargate
proxy running alongside the 3.11 Stargate until the last node
flips), and post-flight (re-enable repairs, re-enable schema
changes, notify stakeholders). Load-bearing compatibility work
on two ecosystem components: (1) Stargate needed
version-specific instances pinned to matching Cassandra major
versions because Cassandra 4.1's MigrationCoordinator broke
cross-major schema pulls (CASSANDRA-19244-adjacent behaviour);
(2) the in-house Cassandra
Source Connector (CDC → Kafka via Yelp's
data pipeline) was not forward-compatible due to CDC commit
logs being emitted on mutation in 4.x (previously only on
flush, CASSANDRA-12148)
and the 4.1 codebase refactor; Yelp kept the DataPipeline
Materializer sub-component backward-compatible and shipped it
ahead of the upgrade, while the CDC Publisher was upgraded in
lockstep with each node. Lessons canonicalised: a mixed-version
cluster produces transient elevated latency that self-resolves
once all nodes flip; a Stargate 2.x performance regression on
range-queries / multi-partition queries forced Yelp to downgrade
to Stargate 1.x — detected by their benchmarking dashboards in
non-production; post-upgrade schema disagreement on
CDC-enabled clusters resolved by making dummy schema changes
from multiple nodes to force convergence. Reported wins: up to
58% p99 latency reduction on key clusters; faster gossip
convergence + node-restart times via consistent smaller seed
lists enabled by CASSANDRA-14190;
Java 8 → 11; hot-reloadable SSL certs (CEP-9);
usable incremental repairs (CASSANDRA-9143
fix); the guardrails framework; denylisting for
noisy-neighbour partitions; path now
open to Cassandra 5 (ACID transactions, vector search).
Key takeaways¶
- Version-specific Cassandra images from dedicated Git branches, selected at bootstrap via environment variables, are the core "no hard block" lever. Yelp ships both 3.11 and 4.1 images and picks one per cluster at boot — "the appropriate Cassandra image was selected at bootstrap time via version-specific environment variables." The overhead of also shipping any 3.11 hotfix to the 4.1 branch was accepted because it preserved independent rollout + rollback per cluster, and "critical fixes for Cassandra 3.11 [were] expected to be rare" during the upgrade window. (Source: sources/2026-04-07-yelp-zero-downtime-cassandra-4x-upgrade)
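The bootstrap-time selection can be sketched as a tiny entrypoint shim. The registry, image tags, and the `CASSANDRA_VERSION` variable name below are invented for illustration; the post only says the image is "selected at bootstrap time via version-specific environment variables".

```shell
#!/bin/sh
# Hypothetical sketch: map a per-cluster version env var to the image to boot.
# Registry, tag names, and the env var itself are illustrative, not Yelp's.
select_cassandra_image() {
  case "${CASSANDRA_VERSION:-}" in
    3.11) echo "registry.internal/cassandra:3.11-yelp" ;;  # built from the 3.11 branch
    4.1)  echo "registry.internal/cassandra:4.1-yelp"  ;;  # built from the 4.1 branch
    *)    echo "unsupported CASSANDRA_VERSION: ${CASSANDRA_VERSION:-unset}" >&2
          return 1 ;;
  esac
}

CASSANDRA_VERSION=4.1
select_cassandra_image
```

Rolling a cluster forward or back is then just flipping the variable and restarting nodes; no client change and no image rebuild is needed.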
- In-place rolling upgrade beat new-DC for this fleet. DataStax recommends rolling restart over "creating a new data center (DC) on 3.11, upgrading its nodes, and redirecting the traffic to the new DC." Yelp considered new-DC for EBS right-sizing + standardising DC configs + easier rollback but rejected it on three load-bearing grounds: weeks-long streaming time, an EACH_QUORUM downgrade during dual-DC operation causing eventual-consistency windows, and 2× node cost for the duration. "We opted for an in-place upgrade to reduce time and cost." Canonical wiki contrast for in-place vs new-DC database upgrade. (Source: sources/2026-04-07-yelp-zero-downtime-cassandra-4x-upgrade)
- Init containers resolve the "new IP + new version at once" gossip bootstrap problem. Because Cassandra pods get a new IP on restart, flipping version and IP simultaneously was breaking initial gossip communication (CASSANDRA-19244). Yelp's fix: a Kubernetes init container first starts the old 3.11 node on the new pod/IP, gossips the IP change into the ring, and only then does the container flip to 4.1. This sequences the two changes (IP + version) into two distinct gossip-observable events. Presented externally as "Upgrading Cassandra on Kubernetes" at KubeCon 2025. Canonical wiki instance of concepts/init-container-ip-gossip-pre-migration. (Source: sources/2026-04-07-yelp-zero-downtime-cassandra-4x-upgrade)
- Version-specific Stargate instances fan out around the MigrationCoordinator schema-pull behaviour change. The 3.11 Stargate can't pull schema from a 4.1 node; Yelp ran two Stargate fleets simultaneously (a 3.11-persistence Stargate + a 4.1-persistence Stargate), registered under the same service-mesh namespace so clients saw one endpoint. Each Stargate's seed list was "always pointed to a Cassandra node running the matching major version." During the flight stage the last 3.11 Cassandra node is deliberately kept so the 3.11 Stargate pool can still pull schema at startup; only after the 3.11 Stargates are drained is the last 3.11 node rolled to 4.1. Acceptance-test coverage was expanded to guard against breaking API changes across the two Stargate fleets. Canonical wiki instance of patterns/dual-run-version-specific-proxies. (Source: sources/2026-04-07-yelp-zero-downtime-cassandra-4x-upgrade)
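The "seed list always points at a matching major version" rule can be sketched as a filter over a node inventory. The input format (`<ip> <version>` lines) and the function name are assumptions for illustration, not Yelp's tooling.

```shell
# Hypothetical sketch: given "<ip> <cassandra-version>" lines on stdin, emit
# only the nodes whose major.minor matches the Stargate fleet being seeded,
# so a 3.11 Stargate never tries to pull schema from a 4.1 node.
seeds_for_version() {
  want="$1"                            # e.g. "4.1"
  while read -r ip ver; do
    case "$ver" in
      "$want"|"$want".*) echo "$ip" ;; # match both "4.1" and "4.1.x"
    esac
  done
}

printf '10.0.0.1 3.11.16\n10.0.0.2 4.1.3\n10.0.0.3 4.1.3\n' | seeds_for_version 4.1
```

During the flight stage this is why the last 3.11 node matters: until the 3.11 Stargates are drained, the filter for `3.11` must return at least one node.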
- CDC commit-log semantics changed in 4.x — forward-incompatible for the Cassandra Source Connector. Pre-4.x, CDC commit logs were written on flush; in 4.x they're written on mutation (CASSANDRA-12148). This broke Yelp's Cassandra Source Connector — a two-component system (DataPipeline Materializer + CDC Publisher) that reads CDC commit logs and publishes into the Yelp data-pipeline abstraction over Kafka. Yelp split the rollout: the Materializer was made backward-compatible with both 3.11 and 4.1 and shipped fleet-wide before any Cassandra upgrade started; the CDC Publisher — which runs as a separate container co-located with the Cassandra node — was upgraded in lockstep with each Cassandra node. They also switched the schema-update handling from the Cassandra driver's Schema Change Listener to actively detecting schema changes as commit logs are processed — this "simplified the CDC Publisher" even though it was not strictly required. (Source: sources/2026-04-07-yelp-zero-downtime-cassandra-4x-upgrade)
- Production qualification criteria fixed upfront — not invented mid-upgrade. Six named criteria: (1) no performance degradation (latency / throughput / uptime / resource / SLO); (2) no functional regression (API compat); (3) security posture preserved; (4) deployment + rollback plans exist; (5) observability sufficient to track progress; (6) every Cassandra-interacting component remains operational. Verbatim: "We developed the following production qualification criteria." Canonical wiki instance of patterns/production-qualification-criteria-upfront applied to a datastore upgrade. (Source: sources/2026-04-07-yelp-zero-downtime-cassandra-4x-upgrade)
- Benchmark in your own environment before trusting published numbers. Yelp spun up identical-resource 3.11 and 4.1 clusters with their own production workload profiles, benchmarked, and measured "4% improvement in 99th percentile latencies and nearly 11% improvement in mean latency … [and] more than 11% improvement in request throughput" — consistent with the DataStax whitepaper but measured against their own data model and query mix. The final reported real-world gain was much larger (up to 58% p99 reduction on key clusters) — which is precisely why you benchmark in your own environment. Canonical wiki instance of patterns/benchmark-in-own-environment-before-upgrade. (Source: sources/2026-04-07-yelp-zero-downtime-cassandra-4x-upgrade)
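As a worked example of the arithmetic behind such deltas (the raw latencies below are invented, chosen only so the result matches the reported 4% p99 figure):

```shell
# Illustrative arithmetic only: a 4% p99 improvement means the 4.1 p99 is
# 96% of the 3.11 p99. The millisecond values are made up for demonstration.
awk 'BEGIN {
  old_p99 = 25.0; new_p99 = 24.0   # hypothetical 3.11 vs 4.1 p99, in ms
  printf "p99 improvement: %.0f%%\n", (old_p99 - new_p99) / old_p99 * 100
}'
```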
- Automation with a checkpoint / confirmation mode tuned risk to confidence. The upgrade driver is "a script that executes various kubectl and CLI commands, creates pull requests, and performs other workflow steps. The script can run in auto-proceed mode or pause for confirmation from an engineer after each step." Auto-proceed for low-risk clusters; per-step confirmation for critical clusters. Canonical wiki instance of concepts/checkpointed-automation-script. (Source: sources/2026-04-07-yelp-zero-downtime-cassandra-4x-upgrade)
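The auto-proceed / pause-for-confirmation split can be sketched as a gate wrapped around each workflow step. The `CONFIRM_EACH_STEP` flag, step names, and placeholder commands are hypothetical; the source only describes the two modes.

```shell
# Hypothetical sketch of a checkpointed step runner: in auto-proceed mode
# steps run straight through; otherwise an engineer confirms each one.
run_step() {
  step_name="$1"; shift
  if [ "${CONFIRM_EACH_STEP:-no}" = "yes" ]; then
    printf 'proceed with "%s"? [y/N] ' "$step_name"
    read -r answer
    [ "$answer" = "y" ] || { echo "stopped before: $step_name" >&2; exit 1; }
  fi
  echo "running: $step_name"
  "$@"   # the actual kubectl / nodetool / PR-creating command goes here
}

CONFIRM_EACH_STEP=no                         # low-risk cluster: auto-proceed
run_step "pause anti-entropy repairs" true   # `true` stands in for real steps
run_step "upgrade node cassandra-0"   true
```

Flipping `CONFIRM_EACH_STEP=yes` for a critical cluster turns the same script into a per-step checklist without changing the step definitions.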
- Stargate 2.x regressed on range / multi-partition queries — caught in non-prod. "Some specific use cases such as range queries and multi-partition queries were found to be slower. After extensive debugging, we identified the performance regression as being introduced by Stargate 2.x. Downgrading to version 1.x resolved the issue." Detected in non-production via detailed observability dashboards — the kind of early signal that validates observability-before-migration at datastore scale. (Source: sources/2026-04-07-yelp-zero-downtime-cassandra-4x-upgrade)
- Transient mixed-version latency resolved on its own once all nodes flipped. "In some cases, we also observed elevated latency while the Cassandra cluster contained a mix of 3.11 and 4.1 nodes. This was transient and resolved once all nodes were upgraded." Canonical wiki instance of concepts/performance-regression-from-mid-upgrade-state — the operational lesson: a mixed-version cluster is a real steady-state the operator must monitor, but the regression signature is the cluster state, not a fix-forward bug. (Source: sources/2026-04-07-yelp-zero-downtime-cassandra-4x-upgrade)
- Post-upgrade schema disagreement on CDC-enabled clusters resolved by dummy multi-node schema changes. "On some Cassandra clusters with CDC enabled, we observed schema disagreement after all the nodes in the cluster were upgraded. While the root cause of this issue is not fully understood, we found that making dummy schema changes from multiple nodes after the upgrade led to gradual schema convergence. This approach served as an effective remediation." Recorded on the wiki at schema disagreement. (Source: sources/2026-04-07-yelp-zero-downtime-cassandra-4x-upgrade)
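A dry-run sketch of the remediation: emit a harmless schema change against the same table from several different nodes. The keyspace/table name and the choice of a comment-only ALTER (a real CQL change that bumps the schema version without touching data) are illustrative guesses, not Yelp's actual commands.

```shell
# Print (rather than execute) one dummy schema change per coordinator node;
# piping the output to sh would run them. ks.health_check is hypothetical.
emit_dummy_schema_changes() {
  for node in "$@"; do
    echo "cqlsh $node -e \"ALTER TABLE ks.health_check WITH comment = 'converge via $node'\""
  done
}

emit_dummy_schema_changes 10.0.0.1 10.0.0.2 10.0.0.3
```

Issuing the change from multiple nodes matters because the point is to force each node to re-announce and re-pull schema, nudging gossip-driven convergence.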
- Wins beyond latency. Faster + more stable node restarts via non-disruptive seed-list reload (CASSANDRA-14190), letting the fleet keep "a consistent and smaller seed list within each Cassandra cluster" → faster gossip convergence on topology changes. Java 8 → 11. Hot-reloadable SSL certs via standard key management (CEP-9). Usable incremental repairs (CASSANDRA-9143 fix). Denylisting partitions for noisy-neighbour mitigation. Guardrails framework for tunable warning + error thresholds. Post-upgrade path to Cassandra 5 (ACID transactions, vector search). (Source: sources/2026-04-07-yelp-zero-downtime-cassandra-4x-upgrade)
Operational numbers¶
- > 1,000 Cassandra nodes upgraded (Yelp fleet-wide).
- Cassandra 3.11 → 4.1 — single major upgrade; unblocks path to 5.
- Java 8 → 11 — runtime upgrade bundled.
- 4% p99 latency improvement and ~11% mean-latency improvement in Yelp's own benchmarks (3.11 vs 4.1 on identical-resource clusters with production-like workload).
- > 11% request-throughput improvement in the same benchmark.
- Up to 58% p99 latency reduction measured in production on key clusters post-upgrade.
- 34% faster streaming operations (expected from public 4.1 benchmarks).
- 21–60% lower p99 latency (expected from public 4.1 benchmarks).
- 0 downtime, 0 incidents, 0 client-code changes — the seamless-upgrade principle met.
Architecture¶
Components Yelp had to make 4.1-compatible¶
- Cassandra nodes — upgraded via init-container-sequenced rolling upgrade.
- Stargate — open-source token-aware low-latency proxy that generates a GraphQL schema per keyspace from the CQL schema; run as two version-specific fleets during the upgrade window, with a service-mesh alias.
- Cassandra Source Connector — Yelp in-house CDC → data pipeline system; two components (DataPipeline Materializer shipped ahead; CDC Publisher rolled with each node).
- Cassandra Sink Connector — publishes from data pipeline into Cassandra (compat work was lighter weight).
- Spark Cassandra Connector — direct Spark ↔ Cassandra for ML pipelines.
- Pushplan Automation — Yelp's declarative schema-change system.
- Ad-hoc CQL tool, schema-change service, backup/restore tool — also made 4.1-compatible.
Upgrade stages (automated end-to-end)¶
Three stages per cluster, one DC at a time, in sequence — canonicalised in patterns/pre-flight-flight-post-flight-upgrade-stages.
Pre-flight (prepare the cluster):
- Communicate to relevant stakeholders.
- Ensure schema versions fully agree across the cluster (canonical pre-upgrade gate — see concepts/schema-disagreement).
- Disable user-initiated schema changes for the duration.
- Verify a full backup exists.
- Pause anti-entropy repairs for the duration.
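The pre-flight schema-agreement check can be approximated by counting distinct schema versions in `nodetool describecluster` output. The parsing below assumes the "Schema versions:" section is the last section of the output (true for stock nodetool, but an assumption here); Yelp's actual gate tooling isn't described in the post.

```shell
# Reads `nodetool describecluster`-style text on stdin and succeeds only if
# exactly one schema-version line follows the "Schema versions:" marker.
schema_versions_agree() {
  n=$(awk '/Schema versions:/ { in_section = 1; next }
           in_section && NF   { count++ }
           END                { print count + 0 }')
  [ "$n" -eq 1 ]
}

# Agreement: one schema version shared by all nodes.
schema_versions_agree <<'EOF' && echo "schema agreed"
Schema versions:
	6f9f8d1e-1111-2222-3333-444455556666: [10.0.0.1, 10.0.0.2, 10.0.0.3]
EOF
```

Running the same check post-upgrade is what surfaced the CDC-cluster schema disagreement described under the key takeaways.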
Flight (one DC at a time, one node at a time):
- Start: cluster on 3.11.
- Upgrade one Cassandra node to 4.1 — its co-located CDC Publisher is updated in the same pod.
- Introduce the 4.1-compatible Stargate alongside the 3.11 Stargate; both under the same service-mesh namespace; monitor p99 latency + errors per keyspace per version.
- Roll the remaining nodes to 4.1 except the last one — kept on 3.11 so the 3.11 Stargate pool can still pull schema at startup.
- Stop the 3.11 Stargate pool.
- Upgrade the last 3.11 node to complete flight.
Post-flight:
- Re-enable anti-entropy repairs.
- Re-enable user-initiated schema changes.
- Notify stakeholders of completion.
Init-container gossip sequence¶
Pod restart → new IP assigned
↓
Init container: start 3.11 Cassandra on new IP
→ gossip IP change into the ring with old version
↓
Flip container image to 4.1
→ gossip version change with stable IP
This sequences the two simultaneous changes (IP and version)
into two distinct gossip-observable events, avoiding
CASSANDRA-19244.
Caveats¶
- Single cluster fleet, single workload profile. Numbers (4% / 11% / 58%) are specific to Yelp's data model, query mix, and EBS / compute class. Other fleets will see different deltas.
- Root cause of post-upgrade schema disagreement on CDC-enabled clusters not fully understood — Yelp names the remediation ("dummy schema changes from multiple nodes") without a root-cause diagnosis.
- Stargate 2.x regression specifics not fully enumerated — "range queries and multi-partition queries" is the only category given; the exact query shapes + reproducers aren't in the post.
- Stargate runs in two flavours during the upgrade window, with a service-mesh alias — requires acceptance-test coverage across both to catch breaking API deltas. Yelp expanded test coverage; the specific suite isn't disclosed.
- Cassandra 3.11 hotfix burden (accepted: rare during the upgrade window) — acknowledged as overhead of patterns/version-specific-images-per-git-branch on a critical datastore.
- In-place vs new-DC trade-off is fleet-specific. Yelp rejected new-DC on cost + time + eventual-consistency grounds; a smaller fleet with a bigger EBS-right-sizing win or a regulatory reason to preserve consistency via DC-level cutover could still choose new-DC.
Extracted entities¶
Systems (named Cassandra-ecosystem components): systems/apache-cassandra, systems/kubernetes, systems/stargate-cassandra-proxy, systems/cassandra-source-connector, systems/kubernetes-init-containers, systems/yelp-pushplan-automation, systems/spark-cassandra-connector.
Concepts: concepts/rolling-upgrade, concepts/mixed-version-cluster, concepts/cassandra-cdc-commit-log, concepts/anti-entropy-repair-pause, concepts/schema-disagreement, concepts/init-container-ip-gossip-pre-migration, concepts/checkpointed-automation-script, concepts/in-place-vs-new-dc-upgrade, concepts/observability-before-migration, concepts/performance-regression-from-mid-upgrade-state.
Patterns: patterns/version-specific-images-per-git-branch, patterns/pre-flight-flight-post-flight-upgrade-stages, patterns/dual-run-version-specific-proxies, patterns/production-qualification-criteria-upfront, patterns/benchmark-in-own-environment-before-upgrade.
Source¶
- Original: https://engineeringblog.yelp.com/2026/04/zero-downtime-upgrade-yelp-cassandra-upgrade-story.html
- Raw markdown:
raw/yelp/2026-04-07-zero-downtime-upgrade-yelps-cassandra-4x-upgrade-story-0c1ae7ce.md
Related¶
- systems/apache-cassandra — the target datastore; wiki's canonical Cassandra page; this ingest adds the first-party operational Cassandra-upgrade Seen-in.
- systems/kubernetes — the orchestration substrate; Cassandra operator + init-container pattern.
- concepts/rolling-upgrade — the architectural choice; this ingest is the first datastore-tier rolling-upgrade Seen-in on the wiki outside the PlanetScale/Vitess framing.
- concepts/schema-evolution — the pre-flight schema-agreement gate and post-upgrade disagreement case belong here.
- companies/yelp — sixth Yelp ingest; opens the datastore-platform / Cassandra-upgrade axis.
- patterns/upstream-the-fix — Yelp's mode here is not upstream: they kept patches and infrastructure glue internal, absorbing the 3.11 hotfix burden for upgrade independence. Sibling contrast.