PATTERN Cited by 1 source
Shadow traffic + reindex Blue/Green for stateful-datastore major upgrade¶
Problem¶
Upgrading a stateful datastore (search engine, database, key-value store) across a major version boundary where:
- The cluster holds terabytes-to-petabytes of data per shard/cluster.
- Node-by-node rolling upgrade passes through a mixed-version state that can take hours-to-days and risks data loss if something goes wrong (forcing snapshot-restore + stream-reset for every affected index).
- Rollback during a rolling upgrade is "reverse-rolling" — slow and not atomic.
- There is acceptable budget for a 2× cluster footprint during the migration window.
Solution¶
Provision a fresh cluster on the new major version, populate it by snapshot-restore + live-write shadow + data-stream reset, verify data convergence on side-by-side dashboards, shadow query traffic for A/B comparison, then flip routing to cut over.
The full procedure¶
Canonical instance from Zalando's Elasticsearch 7.17→8.x migration:
- Deploy fresh target cluster on the new major version.
- Set up monitoring with side-by-side panels per endpoint (latency, error rate, resource use, index sizes + delta).
- Create index templates on the new cluster before any data arrives — otherwise bulk writes will land without a template. In Zalando's case, templates are pushed by application restart, so template-creation endpoints are shadowed too. See patterns/index-template-shadow-before-data-shadow.
- Restore data from the latest snapshot. Time A = snapshot time.
- Enable intake / write-side shadow on the ingress layer. Time B = shadow-enable time. The new cluster now receives all live writes. Implementation: patterns/teeloopback-intake-shadowing.
- Close the [A, B] gap by resetting upstream data streams to a point just before A. The events between A and B replay into the new cluster.
- Enable shadow query traffic. The new cluster serves queries in parallel with the old one; responses are compared for parity and performance.
- Verify — data convergence (index-size delta → 0), query parity (result-set overlap, ranking stability), latency budget (new cluster ≤ old cluster on p99), no new error classes.
- Switch live traffic — gradually increase live % on new, decrease on old. Rollback at any step = flip routing back.
- Tear down the old cluster resources after a verification window.
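The verify step above can be sketched as a handful of checks. This is an illustrative sketch, not anything from the Zalando write-up: the function names, the 1 % size-delta tolerance, and the 0.95 overlap threshold are all assumptions.

```python
# Hypothetical verification gates for the cutover decision. Thresholds are
# illustrative assumptions, not published Zalando values.

def index_size_delta(old_bytes: dict, new_bytes: dict) -> dict:
    """Relative per-index size difference; convergence means every delta -> 0."""
    return {
        name: abs(old_bytes[name] - new_bytes.get(name, 0)) / max(old_bytes[name], 1)
        for name in old_bytes
    }

def result_overlap(old_hits: list, new_hits: list, k: int = 10) -> float:
    """Jaccard overlap of the top-k document ids from each cluster (query parity)."""
    a, b = set(old_hits[:k]), set(new_hits[:k])
    return len(a & b) / len(a | b) if a | b else 1.0

def ready_to_cut_over(deltas: dict, overlaps: list,
                      p99_old_ms: float, p99_new_ms: float) -> bool:
    return (
        all(d < 0.01 for d in deltas.values())   # data convergence: delta -> 0
        and min(overlaps) > 0.95                 # query parity across sampled queries
        and p99_new_ms <= p99_old_ms             # latency budget: new <= old on p99
    )
```

Ranking stability would need an order-sensitive metric (e.g. comparing hit positions, not just set overlap); the Jaccard check here only covers result-set membership.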
Per-cluster vs per-fleet¶
Zalando's 28-cluster-per-country topology let them migrate each country independently: the procedure matured on the lowest-stakes cluster first, then propagated across the fleet. Single-cluster fleets don't have this luxury and must validate on a clone or staging cluster.
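The lowest-stakes-first sequencing can be expressed as a one-liner. The cluster names and the "stakes" score (e.g. traffic share) are made up for the example; nothing in the source specifies how Zalando ranked its clusters.

```python
# Illustrative fleet sequencing: mature the runbook on the cheapest failure
# domain first, then propagate. Scores are hypothetical traffic shares.

def rollout_order(clusters: dict) -> list:
    """Sort clusters by ascending stakes so the procedure is proven
    on the lowest-stakes cluster before it reaches the big ones."""
    return sorted(clusters, key=clusters.get)

order = rollout_order({"catalog-de": 0.30, "catalog-mt": 0.01, "catalog-fr": 0.15})
# "catalog-mt" is migrated first, "catalog-de" last
```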
Trade-offs¶
| Axis | vs Rolling upgrade |
|---|---|
| Fleet cost during migration | +100 % (2 full clusters) |
| Total migration time | Hours (snapshot + catch-up + verify) vs days-to-weeks (node-by-node) |
| Rollback shape | Routing flip (seconds) vs reverse rolling (hours) |
| Mixed-version risk | None |
| Application coordination | Minimal (ingress change) |
| Suitable for snapshot-capable datastores | Yes (snapshot-restore does the bulk transfer; streaming-only datastores lose the speed advantage) |
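The "routing flip (seconds)" rollback shape in the table is worth making concrete: rollback is a single weight change at the ingress, not a node-by-node downgrade. A minimal sketch, assuming a router that picks a backend per request (the class and backend names are illustrative, not Skipper's API):

```python
import random

# Hypothetical ingress-side traffic split. Cutover = raise the new cluster's
# share; rollback = set it back to zero. Both are single assignments.

class WeightedRouter:
    def __init__(self, new_cluster_share: float = 0.0):
        self.new_cluster_share = new_cluster_share  # 0.0 = all old, 1.0 = all new

    def pick(self) -> str:
        """Choose a backend for one request according to the current split."""
        return "new" if random.random() < self.new_cluster_share else "old"

    def rollback(self) -> None:
        # The whole rollback is this one assignment.
        self.new_cluster_share = 0.0

router = WeightedRouter(0.1)    # start with 10 % of live traffic on the new cluster
router.new_cluster_share = 1.0  # full cutover
router.rollback()               # instant flip back to the old cluster
```

Contrast with reverse-rolling: there, rollback replays the node-by-node upgrade in reverse and takes hours.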
Canonical instance¶
sources/2023-11-19-zalando-migrating-from-elasticsearch-7-to-8 — Zalando's Search & Browse department ran this pattern across 28 per-country-language Elasticsearch catalogs. Every mechanism ingredient is named, and the verbatim Skipper RouteGroup YAML for the intake-shadow and template-shadow routes is included. The ingredient that makes this pattern work at Zalando's scale is Skipper's teeLoopback filter, which makes ingress-layer traffic duplication trivial.
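The tee semantics that make the intake shadow safe can be sketched in a few lines. This is not Skipper's implementation, only an illustration of the contract: the client is answered from the primary backend, a copy of the request goes to the shadow backend, and the shadow's response (or failure) is discarded.

```python
import threading

# Illustration of tee/shadow semantics (not Skipper code): duplicate each
# request to the shadow backend on a fire-and-forget thread so that shadow
# latency or errors can never affect the live response path.

def tee(request, primary, shadow):
    def shadow_call():
        try:
            shadow(request)   # response ignored by design
        except Exception:
            pass              # shadow failures must not leak to the client
    threading.Thread(target=shadow_call, daemon=True).start()
    return primary(request)   # only the primary response reaches the client
```

In the pattern's terms, `primary` is the old cluster and `shadow` is the new one during the migration window; flipping the two roles is the cutover.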
Related patterns¶
- patterns/blue-green-database-deployment — the Aurora-family database-tier instance; differs in that Aurora handles the write sync internally via copy-on-write + binlog, not ingress-layer shadow.
- patterns/teeloopback-intake-shadowing — the Skipper ingress primitive used in step 5.
- patterns/index-template-shadow-before-data-shadow — the sequencing discipline for step 3.
- patterns/routing-error-duplicate-recovery — the recovery playbook when step 9's routing flip is mis-configured.
When not to use this¶
- Single-cluster datastore with strong single-cluster consistency semantics (e.g. Cassandra's EACH_QUORUM, see Yelp Cassandra 3.11→4.1) — downgrading consistency across two clusters for the migration window is worse than mixed-version risk.
- Streaming-based replication only — if your datastore can't snapshot-restore cheaply, the initial data transfer is as slow as rolling upgrade's relocation.
- Fleet cost can't double — usually the binding constraint for hyperscale deployments.
Gaps in the wiki record¶
- No published numbers on cost delta, migration duration, or incident count during a Zalando cluster cutover.
- The stream-reset step's upstream technology (Kafka? Nakadi?) and reset semantics are not named.
- No disclosure of the staleness window during shadow-query mode.