CONCEPT Cited by 1 source

Reindex-based cluster upgrade

Definition

A reindex-based cluster upgrade is a major-version upgrade strategy for stateful datastores. Instead of upgrading the live cluster node by node (a rolling upgrade), the operator provisions a fresh cluster on the new version, repopulates it from the old cluster (snapshot restore, live-write shadowing, and a data-stream reset), validates it against live traffic in A/B shadow mode, and then flips routing to cut over.

It is the datastore-tier instance of Blue/Green deployment: two full environments maintained for the migration window, a scripted cutover, and the old side kept alive briefly as a rollback candidate.

Mechanism ingredients

  1. Fresh cluster provisioned on the new version. Separate resource pool, separate endpoints, separate storage. No mixed-version state is ever present.
  2. Initial data transfer — usually snapshot restore from the old cluster's latest backup into the new cluster. For Elasticsearch this is file-level Lucene segment restore from object storage; fast compared to streaming-based replication.
  3. Live-write shadow via ingress-layer traffic duplication (see concepts/traffic-shadowing-via-ingress). From the moment shadowing is enabled (point B) onward, writes land on both clusters.
  4. Gap closure for the interval between snapshot time (point A) and shadow-enable time (point B), typically by resetting an upstream data-stream consumer offset to just before A so that events between A and B replay into the new cluster. Without this step the new cluster would be missing every write made between A and B.
  5. Read-side shadow — once write convergence is verified, mirror query traffic to the new cluster too for A/B comparison (latency, error rate, result parity).
  6. Cutover by flipping routing to the new cluster. Old cluster stays warm for rollback.
  7. Tear-down of the old cluster after a verification window.
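
Steps 2 to 4 can be illustrated with a small in-memory simulation. This is a minimal sketch, not the procedure from any cited source: `Event`, `Cluster`, `build_new_cluster`, and the `margin` parameter are hypothetical names, the stream is modeled as a Python list whose index is the offset, and writes are assumed to be version-checked upserts so that replaying an event twice or out of order is harmless.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    doc_id: str
    version: int  # assumed to increase with every update of the same document
    body: str


@dataclass
class Cluster:
    docs: dict = field(default_factory=dict)  # doc_id -> (version, body)

    def apply(self, event: Event) -> None:
        # Version-checked upsert: an older or duplicate event never overwrites newer state.
        current = self.docs.get(event.doc_id)
        if current is None or event.version > current[0]:
            self.docs[event.doc_id] = (event.version, event.body)


def build_new_cluster(stream, a, b, close_gap, margin=1):
    """Populate a fresh cluster from a stream (list position = offset).

    a: offset at snapshot time; the snapshot holds events [0, a).
    b: offset at which write shadowing was enabled; events from b on are dual-written.
    """
    new = Cluster()
    for event in stream[:a]:   # step 2: snapshot restore
        new.apply(event)
    for event in stream[b:]:   # step 3: live-write shadow
        new.apply(event)
    if close_gap:              # step 4: replay from just before A up to B
        for event in stream[max(0, a - margin):b]:
            new.apply(event)
    return new


stream = [
    Event("shoe-1", 1, "red"),     # offset 0, in the snapshot
    Event("shirt-7", 1, "blue"),   # offset 1, in the snapshot
    Event("shoe-1", 2, "black"),   # offset 2, in the gap
    Event("hat-3", 1, "green"),    # offset 3, in the gap
    Event("shirt-7", 2, "white"),  # offset 4, dual-written
]

old = Cluster()
for event in stream:
    old.apply(event)

without_replay = build_new_cluster(stream, a=2, b=4, close_gap=False)
with_replay = build_new_cluster(stream, a=2, b=4, close_gap=True)

print(with_replay.docs == old.docs)    # True
print("hat-3" in without_replay.docs)  # False
```

Replaying from slightly before A re-delivers some events the snapshot already contains, which is why the sketch relies on version-checked writes; a datastore without such a check would need a different way to make the replay safe.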

Trade-off vs rolling upgrade

See concepts/in-place-vs-new-dc-upgrade for the full trade-off table. Short version: reindex Blue/Green accepts roughly 2× resource cost during the migration window in exchange for a clean rollback and no mixed-version risk.
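
As a rough worked example of that cost, the extra spend is the duplicate fleet multiplied by how long both sides stay up. The function name and the fleet numbers below are hypothetical and only illustrate the arithmetic.

```python
def extra_node_hours(nodes_per_cluster: int, clusters: int, window_hours: float) -> float:
    """Node-hours consumed by the duplicate fleet while both sides run."""
    return nodes_per_cluster * clusters * window_hours


# Hypothetical fleet: 3 clusters of 10 nodes, both sides kept for 72 hours.
print(extra_node_hours(10, 3, 72))  # 2160
```

The doubling is bounded by the migration window, so shortening the verification period before tear-down reduces the overhead directly.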

Seen in

  • sources/2023-11-19-zalando-migrating-from-elasticsearch-7-to-8 — Canonical wiki instance. Zalando's Search & Browse department chose reindex Blue/Green over Elastic's recommended rolling upgrade for their multi-cluster Elasticsearch 7.17→8.x migration (28 per-country-language catalogs, multi-terabyte each). Rejected rolling upgrade explicitly: "during this time, the cluster would be in a mixed state, with some nodes being upgraded and some not, with relocating shards, and in general in a degraded state... If we faced data loss, we'd have no choice but to go with restoring the data from snapshots and then resetting the input streams to bring the data up to date." The chosen reindex-based path gave "almost instantaneous" rollback via routing flip. Every mechanism ingredient was explicitly named: Skipper teeLoopback for intake shadow, snapshot-restore for initial transfer, stream-reset for gap closure, per-endpoint A/B dashboards for verification, per-cluster routing flip for cutover.

When reindex Blue/Green is the right choice

Per the Zalando vs Yelp Cassandra contradiction discussed on sources/2023-11-19-zalando-migrating-from-elasticsearch-7-to-8, reindex Blue/Green fits when all of the following hold:

  • Cheap snapshot-restore (Lucene segments to S3, not row-level streaming).
  • Per-shard or per-tenant cluster topology (migration units bounded independently).
  • Acceptable fleet-doubling cost for the migration window.
  • No strong single-cluster consistency guarantee (e.g. no EACH_QUORUM analogue) that would break across two clusters.

If any of these four conditions does not hold, a rolling upgrade is usually preferred.
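
The rule of thumb above can be written down as a small checklist. This is a sketch with hypothetical names (`MigrationContext`, `preferred_strategy`); it encodes only the four conditions listed here and is not a substitute for the full trade-off analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MigrationContext:
    cheap_snapshot_restore: bool            # e.g. segment files restored from object storage
    independent_migration_units: bool       # per-shard or per-tenant clusters
    fleet_doubling_affordable: bool         # for the migration window only
    needs_single_cluster_consistency: bool  # e.g. an EACH_QUORUM-style guarantee


def preferred_strategy(ctx: MigrationContext) -> str:
    # All four conditions must hold for reindex Blue/Green to be the default pick.
    conditions = [
        ctx.cheap_snapshot_restore,
        ctx.independent_migration_units,
        ctx.fleet_doubling_affordable,
        not ctx.needs_single_cluster_consistency,
    ]
    return "reindex blue/green" if all(conditions) else "rolling upgrade"


print(preferred_strategy(MigrationContext(True, True, True, False)))  # reindex blue/green
print(preferred_strategy(MigrationContext(True, True, True, True)))   # rolling upgrade
```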
