
ZALANDO 2023-11-19 Tier 2


Migrating From Elasticsearch 7.17 to Elasticsearch 8.x: Pitfalls and Learnings

Summary

Zalando's Search & Browse department ran a Blue/Green cluster migration to upgrade their multi-cluster Elasticsearch deployment from 7.17 to 8.x across 28 per-country-language catalogs. They explicitly rejected the in-place rolling upgrade Elastic recommends and instead provisioned fresh ES8 clusters, restored from snapshot, shadowed intake traffic via Skipper teeLoopback routes, reconciled the snapshot-to-shadow data gap by resetting input data streams, shadowed query traffic for A/B comparison against the live ES7 fleet, then flipped routing per-cluster. A three-classed Testcontainers test matrix (NewClient+OldES / OldClient+NewES / NewClient+NewES) caught several client-library incompatibilities before cluster cutover. The primary motivation was the Elasticsearch 8.0 approximate kNN (ANN) search feature — their vector-search workload was on brute-force exact kNN. The post is explicit that approximate kNN DSL was missing from the unofficial Scala elastic4s client, driving the Scala→Java client migration as a prerequisite for the cluster upgrade itself.

Key takeaways

  1. Reindex / Blue-Green over rolling upgrade for a stateful multi-cluster datastore at scale. Zalando explicitly rejected Elastic's recommended rolling upgrade (data nodes first → other non-master → master) because their clusters span "terabytes of data" and a long mixed-version state during node-by-node upgrade with relocating shards felt "too dangerous" — data loss would force snapshot restore + stream reset for every index. Blue/Green gave them a nearly-instantaneous routing-based rollback path and a side-by-side A/B comparison window. (Direct in-place vs new-DC upgrade contrast with Yelp's Cassandra 3.11→4.1 decision to go in-place; see Contradiction below.) (Source: this post)

  2. Traffic shadowing via Skipper teeLoopback + Tee filter as the data-reconciliation primitive. The Blue/Green procedure needs ES8 to receive both the snapshot restore and the live writes in flight during the window between snapshot time (point A) and shadow-enablement time (point B). Zalando captured the /_bulk and /_alias/{index}_write endpoints via a two-rule Skipper RouteGroup: rule 1 on the old backend applies teeLoopback("intake_shadow") to duplicate the request; rule 2 matches the Tee("intake_shadow") predicate and routes to the new backend. A Weight(2) hack prevents the routes from colliding with Traffic() routes. The A→B gap was closed by resetting data streams to just before the snapshot. (Source: this post, with verbatim RouteGroup YAML)
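
    The two-rule shape described above can be sketched as a Skipper RouteGroup (service names, ports, and metadata are hypothetical; the post contains the verbatim YAML):

    ```yaml
    apiVersion: zalando.org/v1
    kind: RouteGroup
    metadata:
      name: es-intake-shadow
    spec:
      backends:
        - name: es7-live        # live (old) cluster ingest path
          type: service
          serviceName: es7-intake
          servicePort: 9200
        - name: es8-shadow      # new cluster receiving duplicated writes
          type: service
          serviceName: es8-intake
          servicePort: 9200
      routes:
        # Rule 1: serve /_bulk from the live backend, duplicating each
        # request into the loopback under the name "intake_shadow".
        - pathSubtree: /_bulk
          predicates:
            - Weight(2)          # keeps these routes from colliding with Traffic() routes
          filters:
            - teeLoopback("intake_shadow")
          backends:
            - backendName: es7-live
        # Rule 2: the looped-back duplicate matches the Tee() predicate
        # and is routed to the shadow backend.
        - pathSubtree: /_bulk
          predicates:
            - Tee("intake_shadow")
            - Weight(2)
          backends:
            - backendName: es8-shadow
    ```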

  3. Shadow mapping/template traffic before shadowing bulk writes. The intake shadow alone wouldn't work if index templates weren't already present on ES8 — new indices created on ES7 first would hit ES8 with no template. Zalando captured /:index/_mapping and /_template/* paths in the RouteGroup too, so every template write from an application restart lands on both clusters. This is the template-first-data-second principle: patterns/index-template-shadow-before-data-shadow. (Source: this post)
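
    In the same RouteGroup, the template capture is just more route pairs in the same teeLoopback/Tee shape (paths per the post; backend names are hypothetical, and the exact wildcard syntax may differ):

    ```yaml
      routes:
        # Capture template/mapping writes so that indices created later
        # on ES8 already have their templates registered.
        - path: /:index/_mapping
          predicates:
            - Weight(2)
          filters:
            - teeLoopback("intake_shadow")
          backends:
            - backendName: es7-live
        - path: /:index/_mapping
          predicates:
            - Tee("intake_shadow")
            - Weight(2)
          backends:
            - backendName: es8-shadow
        # the /_template/* endpoints get an identical route pair
    ```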

  4. Three-classed Testcontainers parity matrix catches version-mix regressions before cluster cutover. Because Zalando assumed they'd have to run their application against both ES7 and ES8 simultaneously during the multi-week migration, they explicitly tested three combinations with Testcontainers pinning Elasticsearch to three versions (7.17.9, 8.6.2, 8.8.2): NewClientWithOldElasticTest, OldClientWithNewElasticTest, NewClientWithNewElasticTest. They skipped OldClientWithOldElasticTest because they already had it in production. Each class exercised index create / doc create / kNN vector mapping / kNN vector index / kNN search / index delete / client close. (Source: this post, with verbatim Scala ESContainers helper)
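
    A minimal data sketch of the matrix (class names and suite steps are from the post; the real implementation is a Scala Testcontainers helper, so the structure here is hypothetical):

    ```python
    # Parity matrix from the post: three (client, server) combinations,
    # skipping (old client, old server) because production already covers it.
    PARITY_MATRIX = {
        "NewClientWithOldElasticTest": ("elasticsearch-java", "7.17.9"),
        "OldClientWithNewElasticTest": ("hlrc-7.17", "8.6.2"),
        "NewClientWithNewElasticTest": ("elasticsearch-java", "8.8.2"),
    }

    # Each class runs the same minimal usage surface the application
    # actually exercises, not exhaustive API coverage.
    SUITE = [
        "create index",
        "create document",
        "map kNN vector field",
        "index kNN vector",
        "kNN search",
        "delete index",
        "close client",
    ]

    def test_plan():
        """Yield (test_class, client, server, step) for every run in the matrix."""
        for cls, (client, server) in PARITY_MATRIX.items():
            for step in SUITE:
                yield cls, client, server, step
    ```
    
    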

  5. Compatibility mode (7.x HLRC against 8.x cluster) as gradual client-migration enabler. Elastic provides a compatibility mode that lets the 7.17 High Level REST Client talk to an ES8 cluster by sending Accept: application/vnd.elasticsearch+json; compatible-with=7. Zalando used this to decouple the cluster upgrade from the client upgrade: cluster upgrades first (all at once via Blue/Green), application gradually migrates from HLRC to the Elasticsearch Java API Client afterwards. Without this primitive, the 443k-LOC Origami rewrite would be a prerequisite, not a parallel track. (Source: this post)
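
    On the wire, compatibility mode is plain content negotiation: the 7.17 client stamps the versioned media type on both headers (index name and host hypothetical):

    ```http
    GET /articles/_search HTTP/1.1
    Host: es8.search.example.internal:9200
    Accept: application/vnd.elasticsearch+json; compatible-with=7
    Content-Type: application/vnd.elasticsearch+json; compatible-with=7
    ```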

  6. Approximate kNN DSL as the single load-bearing feature driving the client-library choice. The unofficial Scala elastic4s client was structurally attractive (type-safe DSL, Scala futures, Option-returning APIs) but lagged ES release cadence (ES was at 8.7 when elastic4s was at 8.5.4) and crucially had no DSL for kNN search. A kNN query could still be sent as raw JSON, but "it was not a pretty option." Since approximate-kNN was the main reason for the upgrade, this was a fatal gap — Zalando chose the official Java client. This is a canonical instance of a single must-have API driving a Scala→Java rewrite decision in a large (443k LOC / 846 file) Scala codebase. (Source: this post)
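
    For reference, the ES 8.x approximate-kNN search body that had no elastic4s DSL at the time; sending it as raw JSON was the "not a pretty option" (field name, vector, and sizes illustrative; body of a POST /{index}/_search request):

    ```json
    {
      "knn": {
        "field": "embedding",
        "query_vector": [0.12, -0.41, 0.88],
        "k": 10,
        "num_candidates": 100
      }
    }
    ```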

  7. Nested-sort path field as an ES7→ES8 undocumented-behavior break. A sort clause of the form {"trace.origami.timestamp": {"order": "desc"}} — where trace.origami.timestamp is a nested field — worked in ES7 despite the docs saying nested.path is required. ES8 actually enforces the docs: query_shard_exception: it is mandatory to set the [nested] context on the nested sort field. Fix is to add "nested": {"path": "trace"}. Elastic confirmed on the forum this was a bug in ES7 that was fixed in ES8. Canonical instance of fix-in-upgrade breaks existing queries. (Source: this post, with Elastic forum link)
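
    The break and its fix, as request-body sketches (field names from the post). ES7 accepted this sort on a nested field even though the docs required a nested path:

    ```json
    { "sort": [ { "trace.origami.timestamp": { "order": "desc" } } ] }
    ```

    ES8 rejects it with query_shard_exception; the fix sets the nested context explicitly:

    ```json
    { "sort": [ { "trace.origami.timestamp": { "order": "desc", "nested": { "path": "trace" } } } ] }
    ```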

  8. Four named ES8 setting / query breaks discovered via parity-testing. (a) _type field removed from search response → strip from test resource JSONs; (b) is_write_index no longer accepts null when creating alias → set explicitly; (c) range query with numeric upper/lower bounds on date/epoch_second fields changed behaviour — Elastic team declared this a feature, not a bug; fix is to stringify the bounds before sending; (d) action.destructive_requires_name defaults to true in ES8 (was false in ES7) → e2e tests that drop all indexes by wildcard crash; test-only cluster setting update added. (Source: this post)
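
    Sketches of fixes (c) and (d) (field name and bound values illustrative). For (c), the bounds are sent as strings rather than numbers:

    ```json
    { "range": { "activated_at": { "gte": "1688169600", "lte": "1690848000" } } }
    ```

    For (d), the test-only cluster-settings update (body of PUT /_cluster/settings) that restores wildcard index deletion:

    ```json
    { "persistent": { "action.destructive_requires_name": false } }
    ```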

  9. Routing-error cluster-name-collision incident produced a duplicated-product outage. During per-country cutover, a routing mistake caused one extra index to be created automatically under the same alias while the original still existed, so queries through the alias hit two indices with duplicate content. "It's better to show the product twice than not to show it at all" — fixed by dropping the stray index. Canonical small-scale instance of why Blue/Green cutover routing must be scripted + reviewed. (Source: this post)

  10. Per-cluster procedure is Blue/Green at scale, with Lightstep + Grafana dashboards for data-sync verification. Dashboards display both clusters side-by-side per endpoint (latency, error rate, CPU, memory, index sizes and the delta between them). Index-size delta is the primary data-sync SLI — "we could see if say restoring from the snapshot was indeed successful and if the follow-up of shadow intake and stream resetting was resulting in data converging in the end." (Source: this post)

Load-bearing operational numbers

  • 28 domains (country × language combos), each with its own catalog, each requiring an independent cluster migration.
  • ~450,000 items shown on the German women's root page (Zalando notes "this is just how many items at most get scanned to show you the first page" — actual catalog is "much higher").
  • Multiple terabytes of data per cluster → the shard-relocation cost of a rolling upgrade made the risk unacceptable.
  • 443k lines of code in 846 files in Origami (the Zalando Core Search API application) — the application that directly issues ES queries.
  • 3 Elasticsearch versions pinned in Testcontainers: 7.17.9, 8.6.2, 8.8.2.
  • 3 client/cluster combinations tested (not 4 — Old+Old was already covered in production).

Systems extracted

  • Elasticsearch (pre-existing; new Seen-in for 7→8 upgrade at multi-cluster scale)
  • Elasticsearch Java API Client (new) — the official successor to the deprecated High Level REST Client; what Zalando migrated Origami to.
  • Origami (new) — Zalando's Scala-based Search Core API application that coordinates search-intelligence microservices and speaks to Elasticsearch. 443k LOC, 846 files.
  • elastic4s (new) — unofficial Scala client for Elasticsearch by sksamuel, stub page; rejected-alternative on this axis.
  • Skipper (pre-existing; new Seen-in for teeLoopback traffic shadowing in a cluster-migration context)
  • Testcontainers (pre-existing; new Seen-in for multi-version parity-matrix testing)
  • Lightstep (new stub) — distributed-tracing platform used for the side-by-side cluster comparison streams.
  • Grafana (pre-existing; new Seen-in)

Concepts extracted

  • Reindex-based cluster upgrade (new) — the alternative to in-place rolling upgrade for stateful datastores; provision new cluster, restore from snapshot, catch up via shadow writes, verify, flip routing.
  • Approximate vs exact kNN (new) — the Elastic-named distinction: script_score brute-force exact kNN scans every matching doc (accurate, slow, doesn't scale) vs approximate kNN with HNSW index (lower latency, slower indexing, imperfect recall).
  • Elasticsearch compatibility mode (new) — the 7.17 HLRC with REST API compatibility header set can talk to an 8.x cluster; decouples cluster upgrade from client upgrade.
  • Traffic shadowing via ingress (new) — mirror production traffic to a second backend from the ingress layer using a loopback filter + predicate pair; duplicate-request rather than client-side dual-write.
  • Nested-sort path requirement (new) — ES docs always required nested.path for sort on a nested field; ES7 incorrectly accepted absent path, ES8 enforces; concrete instance of query_shard_exception on upgrade.
  • Undocumented-behavior drift on version upgrade (new) — queries that rely on accepted-but-undocumented behavior break at the major-version boundary when the implementation catches up to the spec. ES7→ES8 nested-sort is the canonical instance.
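
The exact-kNN side of the distinction, as Elastic documents it: a script_score brute-force scan over every matching doc (field name and vector illustrative; body of a _search request):

```json
{
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": {
        "source": "cosineSimilarity(params.query_vector, 'embedding') + 1.0",
        "params": { "query_vector": [0.12, -0.41, 0.88] }
      }
    }
  }
}
```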

Patterns extracted

  • Shadow traffic + reindex Blue/Green for stateful-datastore major upgrade (new) — deploy fresh cluster, restore from snapshot, set up intake shadow via ingress-layer loopback, reset data streams to close the snapshot-to-shadow gap, then shadow query traffic for A/B comparison, then flip routing. Canonical instance at Zalando ES 7→8.
  • Compatibility-mode client transition (new) — upgrade the cluster first (leaving the old client talking to the new cluster via a compat header), then migrate the client gradually. Zalando ES is the canonical instance; generalisable to any datastore with a versioned REST API and a compat header.
  • Multi-version Testcontainers parity suite (new) — parameterise the Testcontainers image-version tag, run the same behavioural test suite against the N combinations of (client version × server version), skip the (old, old) combination since it's production. The test suite is the minimal usage surface the application actually exercises (not exhaustive API coverage). Canonical instance at Zalando ES 7→8 (3 combinations).
  • Skipper teeLoopback intake shadowing (new) — ingress-layer two-rule pattern: rule 1 matches the request on the live backend and adds teeLoopback("name") filter to duplicate; rule 2 matches Tee("name") predicate and routes to the shadow backend. A Weight hack is required to prevent collision with Traffic() routes. Canonical Skipper instance beyond the 2021-03-01 load-test conductor usage.
  • Template-shadow before data-shadow (new) — on an ingress-mediated cluster shadow, also capture template-creation endpoints (/:index/_mapping, /_template/*) so that when applications restart and push templates, both clusters receive them. Otherwise intake will arrive at the shadow cluster with no template registered.
  • Routing-error duplicate-content recovery (new) — during per-cluster cutover, if an unintended index gets created behind the same alias as the original, queries through the alias return each document twice; fix is to drop the mistakenly-created index. Explicit Zalando stance: "better to show twice than not at all."

Contradiction: reindex Blue/Green vs in-place rolling upgrade

Zalando chose reindex Blue/Green for ES 7→8. Yelp chose in-place rolling upgrade for Cassandra 3.11→4.1. Both decisions were explicit and grounded; they point in opposite directions on in-place vs new-DC upgrade.

| Axis | Yelp (Cassandra 3.11→4.1) | Zalando (ES 7.17→8.x) |
|---|---|---|
| Choice | In-place rolling | Reindex Blue/Green |
| Fleet-doubling cost | Rejected as prohibitive | Accepted |
| Streaming time | Rejected as "weeks" | Snapshot-restore minutes + stream catch-up |
| Consistency during transition | EACH_QUORUM preserved → chose this | Not applicable — separate clusters |
| Rollback shape | Reverse rolling | Flip routing ("almost instantaneous") |
| Mid-upgrade state | Mixed version in one cluster | Two clusters, one version each |

Both are right. The load-bearing differences are:

  • Cassandra is a single-cluster system with a strong consistency model (EACH_QUORUM) that survives mixed-version state via gossip and well-understood protocol compatibility. ES's mixed-version state during rolling upgrade involves shard relocations (data movement) rather than pure leader-election + protocol negotiation — the cost and risk profile is different.
  • ES's snapshot-restore is cheap and file-level (Lucene segment snapshots to object storage, O(minutes) to restore); Cassandra streaming is row-level at petabyte scale ("weeks").
  • Zalando's per-country cluster topology already gives them 28 independent migration units — they can validate the procedure on a small cluster first and keep cost per unit bounded. Yelp's Cassandra fleet is a single global cluster per service.

Decision rule: if the datastore has cheap snapshots + per-country or per-shard cluster topology + short acceptable streaming time, prefer reindex Blue/Green. If the datastore has strong-consistency guarantees that only hold within a single cluster + petabyte-scale streaming time, prefer in-place rolling.

Caveats

  • No SLO / latency / QPS numbers. Zalando does not disclose the query QPS, p99 / p999 latency budget, or indexing throughput of their ES deployment. "Hundreds of milliseconds" is the only latency reference (Skipper static-site pages, not ES).
  • The rolling-upgrade risk assessment is Zalando's mental model, not measured. They didn't actually attempt a rolling upgrade and time it; they rejected it based on cluster size and risk profile.
  • No downstream consumer impact described. "Services that use ES" is named but the number is not; failure-mode analysis of those downstream services under the routing flip is not disclosed.
  • Index template drift during the migration window is handled reactively. The template-shadow catches templates pushed by application restarts; templates updated imperatively outside the application path would be missed. Not discussed.
  • The stream-reset mechanism (the A-to-B gap closure) is described at a high level but the specific stream technology (Kafka? Nakadi?) and reset semantics (consumer group offset reset? full reprocessing?) are not named.
  • Team-size, migration duration, and cluster count ingested vs total are all undisclosed. "One of our larger clusters" is the only scale signal beyond "terabytes" and "28 domains."
  • The routing-error duplicate incident is described as "quickly fixed" but the discovery latency and customer-facing duration are not disclosed.
  • The elastic4s rejection is dated mid-2023. The library has since shipped kNN DSL in later versions; the post's client-choice conclusion is time-bound.
