
PATTERN Cited by 1 source

Shadow traffic + reindex Blue/Green for stateful-datastore major upgrade

Problem

Upgrading a stateful datastore (search engine, database, key-value store) across a major version boundary where:

  • The cluster holds terabytes-to-petabytes of data per shard/cluster.
  • A node-by-node rolling upgrade passes through a mixed-version state that can last hours to days, and a failure in that state risks data loss (forcing a snapshot-restore plus stream-reset for every affected index).
  • Rollback during a rolling upgrade is "reverse-rolling" — slow and not atomic.
  • The budget can absorb a 2× cluster footprint for the duration of the migration window.

Solution

Provision a fresh cluster on the new major version; populate it via snapshot-restore, a live-write shadow, and a data-stream reset; verify data convergence on side-by-side dashboards; shadow query traffic for A/B comparison; then flip routing to cut over.
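The two verification signals named here (index-size delta trending to zero, result-set overlap between clusters) can be sketched as pure functions. The function names and data shapes below are illustrative, not Zalando's tooling:

```python
def index_size_delta(old_doc_counts, new_doc_counts):
    """Per-index document-count gap between old and new clusters.

    Convergence means every delta trends to 0 as shadow writes
    and the stream replay catch the new cluster up.
    """
    return {index: old_doc_counts[index] - new_doc_counts.get(index, 0)
            for index in old_doc_counts}


def result_overlap(old_hits, new_hits):
    """Jaccard overlap of the two clusters' result sets for one query.

    1.0 means identical result sets; ranking stability would be
    checked separately on the ordered lists.
    """
    old_ids, new_ids = set(old_hits), set(new_hits)
    if not old_ids and not new_ids:
        return 1.0
    return len(old_ids & new_ids) / len(old_ids | new_ids)
```

In practice both would be fed from the side-by-side dashboards and sampled shadow-query responses rather than called directly.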

The full procedure

Canonical instance from Zalando's Elasticsearch 7.17→8.x migration:

  1. Deploy fresh target cluster on the new major version.
  2. Set up monitoring with side-by-side panels per endpoint (latency, error rate, resource use, index sizes + delta).
  3. Create index templates on the new cluster before any data arrives — otherwise bulk writes will land without a template. In Zalando's case, templates are pushed by application restart, so template-creation endpoints are shadowed too. See patterns/index-template-shadow-before-data-shadow.
  4. Restore data from the latest snapshot. Call the snapshot's point in time A.
  5. Enable intake / write-side shadow on the ingress layer; call the moment shadowing starts B. From B onward the new cluster receives every live write. Implementation: patterns/teeloopback-intake-shadowing.
  6. Close the [A, B] gap by resetting upstream data streams to a point just before A. The events between A and B replay into the new cluster.
  7. Enable shadow query traffic. New cluster serves queries in parallel with the old one; responses compared for parity and performance.
  8. Verify — data convergence (index-size delta → 0), query parity (result-set overlap, ranking stability), latency budget (new cluster ≤ old cluster on p99), no new error classes.
  9. Switch live traffic — gradually increase live % on new, decrease on old. Rollback at any step = flip routing back.
  10. Tear down the old cluster resources after a verification window.
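Step 5's write-side shadow amounts to duplicating each indexing request at the ingress layer: the old cluster stays authoritative, and shadow failures must never surface to the writer. A minimal sketch, with a hypothetical `write`-capable cluster client (this is not Zalando's implementation, which does the tee inside the ingress proxy):

```python
class ShadowingIngress:
    """Tee every write to the old (authoritative) and new (shadow) cluster."""

    def __init__(self, old_cluster, new_cluster):
        self.old = old_cluster
        self.new = new_cluster
        self.shadow_errors = 0  # surfaced on dashboards, never to callers

    def write(self, index, doc):
        # The caller sees only the authoritative cluster's result.
        result = self.old.write(index, doc)
        try:
            self.new.write(index, doc)  # fire-and-forget shadow copy
        except Exception:
            self.shadow_errors += 1  # shadow failures are observed, not raised
        return result
```

The same tee applied to template-creation endpoints is what step 3 relies on.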

Per-cluster vs per-fleet

Zalando's 28-cluster-per-country topology let them migrate each country independently: the procedure matured on the lowest-stakes cluster first, then propagated across the fleet. Single-cluster fleets don't have this luxury and must validate on a clone or staging cluster.

Trade-offs

Compared with a node-by-node rolling upgrade, axis by axis:

  • Fleet cost during migration: +100 % (two full clusters run side by side); rolling needs no second cluster.
  • Total migration time: hours (snapshot + catch-up + verify) vs days-to-weeks node-by-node.
  • Rollback shape: routing flip (seconds) vs reverse rolling (hours).
  • Mixed-version risk: none; no cluster ever runs mixed versions.
  • Application coordination: minimal (a single ingress change).
  • Datastores where only expensive streaming replication is available: a poor fit; the pattern depends on snapshot-restore being the faster initial transfer.
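The "routing flip" rollback shape comes from step 9's gradual traffic shift living entirely in the router. A common way to sketch it is deterministic hash bucketing, so a given request stays on the same cluster as the percentage ramps; the function and its parameters are illustrative, not the source's mechanism:

```python
import hashlib


def route(request_id, new_cluster_pct):
    """Route a request to 'old' or 'new' by hashing its id into 100 buckets.

    Ramping up is raising new_cluster_pct; full rollback is setting it
    back to 0, which takes effect on the very next request.
    """
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return "new" if bucket < new_cluster_pct else "old"
```

Hash bucketing is one choice among several (random sampling, per-tenant flags); its advantage here is that the same request id never flaps between clusters mid-ramp.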

Canonical instance

sources/2023-11-19-zalando-migrating-from-elasticsearch-7-to-8 — Zalando's Search & Browse department ran this pattern across 28 per-country-language Elasticsearch catalogs. Every mechanism ingredient is named and the verbatim Skipper RouteGroup YAML for the intake-shadow and template-shadow routes is included. The ingredient that makes this pattern work at Zalando's scale is Skipper's teeLoopback filter, which makes ingress-layer traffic duplication trivial.
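The source contains the verbatim RouteGroup YAML; as a rough illustration of the mechanism only (routes, paths, and hostnames below are placeholders, not the source's configuration), Skipper's teeLoopback filter duplicates a matched request back into route lookup, where a route guarded by the Tee predicate catches the copy:

```
// live route: serve from the old cluster and loop a shadow copy back into routing
intake: Path("/catalog/_bulk")
  -> teeLoopback("intake-shadow")
  -> "https://old-cluster.internal";

// shadow route: matches only the duplicated request, sends it to the new cluster
intakeShadow: Tee("intake-shadow") && Path("/catalog/_bulk")
  -> "https://new-cluster.internal";
```

The shadow route's response is discarded, which is what makes the duplication invisible to writers.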

When not to use this

  • Single-cluster datastore with strong single-cluster consistency semantics (e.g. Cassandra's EACH_QUORUM, see Yelp Cassandra 3.11→4.1) — downgrading consistency across two clusters for the migration window is worse than mixed-version risk.
  • Streaming-based replication only — if your datastore can't snapshot-restore cheaply, the initial data transfer is as slow as rolling upgrade's relocation.
  • Fleet cost can't double — usually the binding constraint for hyperscale deployments.

Gaps in the wiki record

  • No published numbers on cost delta, migration duration, or incident count during a Zalando cluster cutover.
  • The stream-reset step's upstream technology (Kafka? Nakadi?) and reset semantics are not named.
  • No disclosure of the staleness window during shadow-query mode.