
PATTERN Cited by 1 source

Shadow traffic + reindex Blue/Green for stateful-datastore major upgrade

Problem

Upgrading a stateful datastore (search engine, database, key-value store) across a major version boundary where:

  • The cluster holds terabytes-to-petabytes of data per shard/cluster.
  • A node-by-node rolling upgrade passes through a mixed-version state that can last hours to days, and a failure in that state risks data loss (forcing a snapshot-restore plus stream-reset for every affected index).
  • Rollback during a rolling upgrade is "reverse-rolling" — slow and not atomic.
  • The budget can absorb a 2× cluster footprint for the duration of the migration window.

Solution

Provision a fresh cluster on the new major version; populate it via snapshot-restore, a live-write shadow, and a data-stream reset; verify data convergence on side-by-side dashboards; shadow query traffic for A/B comparison; then flip routing to cut over.
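The two verification signals named here (index-size delta trending to zero, result-set overlap between clusters) can be sketched as pure functions. The function names and data shapes below are illustrative, not Zalando's tooling:

```python
def index_size_delta(old_doc_counts, new_doc_counts):
    """Per-index document-count gap between old and new clusters.

    Convergence means every delta trends to 0 as shadow writes
    and the stream replay catch the new cluster up.
    """
    return {index: old_doc_counts[index] - new_doc_counts.get(index, 0)
            for index in old_doc_counts}


def result_overlap(old_hits, new_hits):
    """Jaccard overlap of the two clusters' result sets for one query.

    1.0 means identical result sets; ranking stability would be
    checked separately on the ordered lists.
    """
    old_ids, new_ids = set(old_hits), set(new_hits)
    if not old_ids and not new_ids:
        return 1.0
    return len(old_ids & new_ids) / len(old_ids | new_ids)
```

In practice both would be fed from the side-by-side dashboards and sampled shadow-query responses rather than called directly.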

The full procedure

Canonical instance from Zalando's Elasticsearch 7.17→8.x migration:

  1. Deploy fresh target cluster on the new major version.
  2. Set up monitoring with side-by-side panels per endpoint (latency, error rate, resource use, index sizes + delta).
  3. Create index templates on the new cluster before any data arrives — otherwise bulk writes will land without a template. In Zalando's case, templates are pushed by application restart, so template-creation endpoints are shadowed too. See patterns/index-template-shadow-before-data-shadow.
  4. Restore data from the latest snapshot. Call the snapshot's point in time A.
  5. Enable intake / write-side shadow on the ingress layer; call the moment shadowing starts B. From B onward the new cluster receives every live write. Implementation: patterns/teeloopback-intake-shadowing.
  6. Close the [A, B] gap by resetting upstream data streams to a point just before A. The events between A and B replay into the new cluster.
  7. Enable shadow query traffic. New cluster serves queries in parallel with the old one; responses compared for parity and performance.
  8. Verify — data convergence (index-size delta → 0), query parity (result-set overlap, ranking stability), latency budget (new cluster ≤ old cluster on p99), no new error classes.
  9. Switch live traffic — gradually increase live % on new, decrease on old. Rollback at any step = flip routing back.
  10. Tear down the old cluster resources after a verification window.
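Step 5's write-side shadow amounts to duplicating each indexing request at the ingress layer: the old cluster stays authoritative, and shadow failures must never surface to the writer. A minimal sketch, with a hypothetical `write`-capable cluster client (this is not Zalando's implementation, which does the tee inside the ingress proxy):

```python
class ShadowingIngress:
    """Tee every write to the old (authoritative) and new (shadow) cluster."""

    def __init__(self, old_cluster, new_cluster):
        self.old = old_cluster
        self.new = new_cluster
        self.shadow_errors = 0  # surfaced on dashboards, never to callers

    def write(self, index, doc):
        # The caller sees only the authoritative cluster's result.
        result = self.old.write(index, doc)
        try:
            self.new.write(index, doc)  # fire-and-forget shadow copy
        except Exception:
            self.shadow_errors += 1  # shadow failures are observed, not raised
        return result
```

The same tee applied to template-creation endpoints is what step 3 relies on.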

Per-cluster vs per-fleet

Zalando's 28-cluster-per-country topology let them migrate each country independently: the procedure matured on the lowest-stakes cluster first, then propagated across the fleet. Single-cluster fleets don't have this luxury and must validate on a clone or staging cluster.

Trade-offs

Compared with a node-by-node rolling upgrade, axis by axis:

  • Fleet cost during migration: +100 % (two full clusters run side by side); rolling needs no second cluster.
  • Total migration time: hours (snapshot + catch-up + verify) vs days-to-weeks node-by-node.
  • Rollback shape: routing flip (seconds) vs reverse rolling (hours).
  • Mixed-version risk: none; no cluster ever runs mixed versions.
  • Application coordination: minimal (a single ingress change).
  • Datastores where only expensive streaming replication is available: a poor fit; the pattern depends on snapshot-restore being the faster initial transfer.
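The "routing flip" rollback shape comes from step 9's gradual traffic shift living entirely in the router. A common way to sketch it is deterministic hash bucketing, so a given request stays on the same cluster as the percentage ramps; the function and its parameters are illustrative, not the source's mechanism:

```python
import hashlib


def route(request_id, new_cluster_pct):
    """Route a request to 'old' or 'new' by hashing its id into 100 buckets.

    Ramping up is raising new_cluster_pct; full rollback is setting it
    back to 0, which takes effect on the very next request.
    """
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return "new" if bucket < new_cluster_pct else "old"
```

Hash bucketing is one choice among several (random sampling, per-tenant flags); its advantage here is that the same request id never flaps between clusters mid-ramp.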

Canonical instance

sources/2023-11-19-zalando-migrating-from-elasticsearch-7-to-8 — Zalando's Search & Browse department ran this pattern across 28 per-country-language Elasticsearch catalogs. Every mechanism ingredient is named and the verbatim Skipper RouteGroup YAML for the intake-shadow and template-shadow routes is included. The ingredient that makes this pattern work at Zalando's scale is Skipper's teeLoopback filter, which makes ingress-layer traffic duplication trivial.
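The source contains the verbatim RouteGroup YAML; as a rough illustration of the mechanism only (routes, paths, and hostnames below are placeholders, not the source's configuration), Skipper's teeLoopback filter duplicates a matched request back into route lookup, where a route guarded by the Tee predicate catches the copy:

```
// live route: serve from the old cluster and loop a shadow copy back into routing
intake: Path("/catalog/_bulk")
  -> teeLoopback("intake-shadow")
  -> "https://old-cluster.internal";

// shadow route: matches only the duplicated request, sends it to the new cluster
intakeShadow: Tee("intake-shadow") && Path("/catalog/_bulk")
  -> "https://new-cluster.internal";
```

The shadow route's response is discarded, which is what makes the duplication invisible to writers.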

When not to use this

  • Single-cluster datastore with strong single-cluster consistency semantics (e.g. Cassandra's EACH_QUORUM, see Yelp Cassandra 3.11→4.1) — downgrading consistency across two clusters for the migration window is worse than mixed-version risk.
  • Streaming-based replication only — if your datastore can't snapshot-restore cheaply, the initial data transfer is as slow as rolling upgrade's relocation.
  • Fleet cost can't double — usually the binding constraint for hyperscale deployments.

Gaps in the wiki record

  • No published numbers on cost delta, migration duration, or incident count during a Zalando cluster cutover.
  • The stream-reset step's upstream technology (Kafka? Nakadi?) and reset semantics are not named.
  • No disclosure of the staleness window during shadow-query mode.