
PATTERN Cited by 1 source

Split cluster by market for load isolation

The pattern

Incident-time (not steady-state): when an Elasticsearch cluster is saturated and the cluster serves multiple market groups (country subsets), split the markets into separate clusters using Elasticsearch's node allocation settings to fence which shards live on which nodes. The smaller / less-hot market gets moved to a new cluster; the larger market stays on the original substrate. Once split, the problem localises to whichever cluster remains hot — which is both an isolation win (the uninvolved market stops suffering collateral damage) and a diagnostic wedge (the split proves the problem is traffic-localised, not infrastructure-global).

Canonical wiki anchor: sources/2025-12-16-zalando-the-day-our-own-queries-dosed-us-inside-zalando-search.

"As an experiment, it was decided to split the markets into two separate clusters, to see if that would help alleviate the load or isolate the problem. If the relocated market would not have issues, it could indicate that the problem was related to the specific queries or infrastructure associated with the remaining market. The market split would be done by using the node allocation settings in Elasticsearch, which allow for controlling which nodes hold which shards. By specifying different node groups for the two markets, the data could be effectively split."

How node allocation settings do the work

Elasticsearch's shard allocation awareness and per-index allocation filters (index.routing.allocation.include.*, *.exclude.*, *.require.*) let operators pin indices to a subset of nodes labelled with attributes of their choosing (e.g. node.attr.market=EU, node.attr.market=DE). The mechanism is steady-state infrastructure — the pattern's contribution is using it as an incident-response primitive:

  1. Add / label a fresh node group with node.attr.market=<smaller-market>.
  2. Set the smaller market's index allocation settings to require.market=<smaller-market>.
  3. ES relocates the smaller market's shards to the new nodes.
  4. The original nodes end up serving only the larger market's indices.
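The four steps above can be sketched as standard Elasticsearch REST calls. A minimal sketch — the index name (`products-de`), attribute value (`de`), and host are illustrative placeholders, not Zalando's actual names:

```shell
# 1. Start new data nodes carrying a custom attribute.
#    In each new node's elasticsearch.yml (hypothetical value):
#    node.attr.market: de

# 2. Pin the smaller market's index to those nodes via a
#    per-index allocation filter:
curl -X PUT "localhost:9200/products-de/_settings" \
  -H 'Content-Type: application/json' -d '
{
  "index.routing.allocation.require.market": "de"
}'

# 3. Elasticsearch relocates the shards automatically; watch progress:
curl "localhost:9200/_cat/recovery/products-de?active_only=true&v"

# 4. Confirm the shards now live only on market=de nodes:
curl "localhost:9200/_cat/shards/products-de?v&h=index,shard,prirep,node"
```

The same filter family supports `include` and `exclude` variants; `require` is the strictest, so shards cannot land anywhere outside the labelled group.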

The migration is online — queries continue to be served throughout the shard relocation. The cost is the shard-transfer bandwidth (which is why the Zalando 5-lever load-shed includes reducing shard replicas and throttling ingestion to cut the relocation footprint).
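The relocation bandwidth can also be capped explicitly. The recovery throttle below is a standard dynamic Elasticsearch cluster setting; the value is illustrative, not taken from the incident:

```shell
# Cap inter-node shard-recovery bandwidth cluster-wide so the
# relocation does not starve query traffic on an already-hot cluster.
curl -X PUT "localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' -d '
{
  "transient": {
    "indices.recovery.max_bytes_per_sec": "100mb"
  }
}'
```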

Interaction with the steady-state market-group primitive

Zalando's steady-state architecture uses market groups for serving APIs like PRAPI — multiple instances, each on a disjoint country subset, with dynamic traffic-shifting. This pattern is the incident-time analogue of the same primitive, applied to the search storage tier, using Elasticsearch's internal shard-placement mechanism instead of a routing-layer fleet.

| Axis | Market-group serving API | Market-split cluster (this pattern) |
| --- | --- | --- |
| Lifecycle | Steady-state, designed in | Incident-time response |
| Separation mechanism | Traffic routing between instances | Shard allocation across node groups |
| Lead time | Pre-provisioned | Minutes-to-hours (relocation-bandwidth limited) |
| Reversibility | Routing-layer re-point | Relocate shards back / merge clusters |
| Blast-radius discipline | Country-level | Country-level (same axis) |
| Typical trigger | Test traffic, canary, maintenance | Cluster saturation, pathological traffic |

Why this beat the rest of the immediate-mitigation playbook

The Zalando on-call responder ran the full first-line playbook:

  1. Apply longer cache expirations.
  2. Disable non-critical requests.
  3. Apply lower cluster-wide query-termination thresholds.
  4. Scale out coordinator nodes.
  5. Scale out data nodes.

"These actions, however, have not provided even a temporary relief." All five either add capacity horizontally or lengthen grace periods — they are resource-adding, not isolation-adding. The pathological per-query cost absorbed every added capacity unit as fast as it came online.

The market split is structural — it removes a fixed fraction of load from the remaining cluster, proportional to the departing market's share, and gives the team a clean partition: if the relocated market runs healthily on the new nodes, the topology is exonerated and the pathology is localised to the remaining market's traffic.

When not to use this pattern

  • When the blast axis isn't geographical. If the bad caller doesn't correlate with a specific market, splitting markets doesn't help — the pathological traffic distributes across both resulting clusters.
  • When shard relocation bandwidth is itself the bottleneck. On an already-saturated cluster, sending hundreds of gigabytes of shard data across the cluster to a new node group may aggravate the incident before it alleviates it. Zalando addressed this by reducing replicas first (halving the shard-transfer footprint) and throttling ingestion (no new writes competing for CPU during relocation).
  • When the clusters share the ingest pipeline upstream and the upstream is what's pathological. Splitting the storage doesn't help if the misbehaving caller is fanning out to both clusters' upstream.
  • When there is no fine-grained node-allocation discipline in place. This pattern assumes indices are already labelled / allocation-filterable per market, or that they can be re-labelled cheaply.

Companion load-shedding levers Zalando rolled in parallel

Along with the market split, Zalando's team rolled out a 5-lever cocktail — the split succeeds because it's part of a coordinated response, not a solo action:

  1. Reduce shard replicas (storage-side) — to shrink the relocation footprint.
  2. Throttle ingestion to a full stop (storage-side) — no writes during relocation.
  3. Split the markets (storage-side) — this pattern.
  4. Presentation-layer control plane (app-side) — turn off non-critical calls, reduce parallel queries per request, lengthen caches for hot queries.
  5. Search-steering down-sampling (app-side) — sample fewer requests into heavier ML-model integrations and promotion-enrichment flows, fall back to simpler ranking.
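Levers 1 and 2 map onto standard index-level settings. A hedged sketch — the index pattern is a placeholder, and blocking writes at the index is only one way to halt ingestion (the source does not specify whether Zalando stopped it here or further upstream):

```shell
# Lever 1: drop the replica count to shrink the shard-transfer
# footprint before relocation begins.
curl -X PUT "localhost:9200/products-*/_settings" \
  -H 'Content-Type: application/json' -d '
{ "index.number_of_replicas": 0 }'

# Lever 2: block writes so ingestion stops competing for CPU and I/O
# during relocation.
curl -X PUT "localhost:9200/products-*/_settings" \
  -H 'Content-Type: application/json' -d '
{ "index.blocks.write": true }'
```

Both settings are dynamic and reversible once the split completes (restore the replica count, clear the write block).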

Levers 4 and 5 are instances of load shedding at the presentation boundary — app-layer cost reduction applied upstream of the storage tier, coordinated with the storage-tier isolation move.

Seen in

  • sources/2025-12-16-zalando-the-day-our-own-queries-dosed-us-inside-zalando-search — canonical wiki instance. Zalando's Search & Browse team split the two largest markets (which had been co-tenant on a single ES cluster) by moving the smaller market to a freshly provisioned node group via node.attr.market allocation filters. The split simultaneously (a) relieved the saturated cluster of the smaller market's load, (b) proved the pathology was localised to the larger market's traffic (and therefore to a client-specific issue), and (c) set up the trace-altitude root-cause investigation that identified the internal self-inflicted-DoS caller via systems/lightstep.