PATTERN
Split cluster by market for load isolation¶
The pattern¶
Incident-time (not steady-state): when a saturated Elasticsearch cluster serves multiple market groups (country subsets), split the markets into separate clusters using Elasticsearch's node allocation settings to fence which shards live on which nodes. The smaller / less-hot market gets moved to a new cluster; the larger market stays on the original substrate. Once split, the problem localises to whichever cluster remains hot — which is both an isolation win (the uninvolved market stops suffering collateral damage) and a diagnostic wedge (the split proves the problem is traffic-localised, not infrastructure-global).
Canonical wiki anchor: sources/2025-12-16-zalando-the-day-our-own-queries-dosed-us-inside-zalando-search.
"As an experiment, it was decided to split the markets into two separate clusters, to see if that would help alleviate the load or isolate the problem. If the relocated market would not have issues, it could indicate that the problem was related to the specific queries or infrastructure associated with the remaining market. The market split would be done by using the node allocation settings in Elasticsearch, which allow for controlling which nodes hold which shards. By specifying different node groups for the two markets, the data could be effectively split."
How node allocation settings do the work¶
Elasticsearch's shard allocation awareness and per-index allocation filters (`index.routing.allocation.include.*`, `*.exclude.*`, `*.require.*`) let operators pin indices to a subset of nodes labelled with attributes of their choosing (e.g. `node.attr.market=EU`, `node.attr.market=DE`). The mechanism is steady-state infrastructure; the pattern's contribution is using it as an incident-response primitive:
- Add or label a fresh node group with `node.attr.market=<smaller-market>`.
- Set the smaller market's index allocation settings to `require.market=<smaller-market>`.
- ES relocates the smaller market's shards to the new nodes.
- The original nodes end up serving only the larger market's indices.
The migration is online — queries continue to be served throughout the shard relocation. The cost is the shard-transfer bandwidth (which is why the Zalando 5-lever load-shed includes reducing shard replicas and throttling ingestion to cut the relocation footprint).
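The steps above boil down to one dynamic index setting. A minimal sketch of the payload an operator would send as `PUT /<index>/_settings` — the `market` attribute key and the `nl` market code are illustrative assumptions, not from the source:

```python
import json

# Assumption: freshly provisioned data nodes were started with
# `node.attr.market: nl` in elasticsearch.yml, giving them a custom
# "market" attribute that allocation filters can match on.
smaller_market = "nl"  # illustrative market code

# Per-index dynamic setting (PUT /<smaller-market-index>/_settings).
# `index.routing.allocation.require.market` pins the index's shards to
# nodes whose market attribute equals the value; Elasticsearch then
# relocates shards online until the constraint is satisfied.
pin_to_new_nodes = {
    "index.routing.allocation.require.market": smaller_market
}

request_body = json.dumps(pin_to_new_nodes, indent=2)
print(request_body)
```

Setting the filter back to `null` lets the shards relocate to their original nodes, which is the reversibility lever noted in the comparison table.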
Interaction with the steady-state market-group primitive¶
Zalando's steady-state architecture uses market groups for serving APIs like PRAPI — multiple instances, each on a disjoint country subset, with dynamic traffic-shifting. This pattern is the incident-time analogue of the same primitive, applied to the search storage tier, using Elasticsearch's internal shard-placement mechanism instead of a routing-layer fleet.
| Axis | Market-group serving API | Market-split cluster (this pattern) |
|---|---|---|
| Lifecycle | Steady-state, designed in | Incident-time response |
| Separation mechanism | Traffic routing between instances | Shard allocation across node groups |
| Lead time | Pre-provisioned | Minutes-to-hours (relocation bandwidth limited) |
| Reversibility | Routing-layer re-point | Relocate shards back / merge clusters |
| Blast-radius discipline | Country-level | Country-level (same axis) |
| Typical trigger | Test traffic, canary, maintenance | Cluster saturation, pathological traffic |
Why this beat the rest of the immediate-mitigation playbook¶
The Zalando on-call responder ran the full first-line playbook:
- Apply longer cache expirations.
- Disable non-critical requests.
- Apply lower cluster-wide query-termination thresholds.
- Scale out coordinator nodes.
- Scale out data nodes.
"These actions, however, have not provided even a temporary relief." All five are horizontally adding capacity or lengthening grace periods — they are resource-adding, not isolation-adding. The pathological per-query cost absorbed every added capacity unit as fast as it came online.
The market split is structural — it reduces the load on the remaining cluster by a fixed fraction proportional to the removed market's share, and gives the team a partition in which to run peer-tested healthy traffic to prove the topology is not itself the problem.
When not to use this pattern¶
- When the blast axis isn't geographical. If the bad caller doesn't correlate with a specific market, splitting markets doesn't help — the pathological traffic distributes across both resulting clusters.
- When shard relocation bandwidth is itself the bottleneck. On an already-saturated cluster, sending hundreds of gigabytes of shard data across the cluster to a new node group may aggravate the incident before it alleviates it. Zalando addressed this by reducing replicas first (halving the shard-transfer footprint) and throttling ingestion (no new writes competing for CPU during relocation).
- When the clusters share the ingest pipeline upstream and the upstream is what's pathological. Splitting the storage doesn't help if the misbehaving caller is fanning out to both clusters' upstream.
- When there is no fine-grained node-allocation discipline in place. This pattern assumes indices are already labelled / allocation-filterable per market, or that they can be re-labelled cheaply.
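On the relocation-bandwidth caveat: Elasticsearch exposes a standard cluster-wide knob, `indices.recovery.max_bytes_per_sec`, that caps per-node shard-recovery/relocation throughput. The Zalando write-up does not mention this lever; the sketch below only illustrates how an operator could keep a mid-incident relocation from starving query traffic (the `20mb` value is an arbitrary example):

```python
import json

# Transient cluster setting (PUT /_cluster/settings); transient settings
# do not survive a full cluster restart, which suits an incident-time
# change. Lowering the cap trades slower relocation for less contention
# with live queries on an already-hot cluster.
throttle_relocation = {
    "transient": {
        "indices.recovery.max_bytes_per_sec": "20mb"
    }
}

print(json.dumps(throttle_relocation, indent=2))
```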
Companion load-shedding levers Zalando rolled in parallel¶
Along with the market split, Zalando's team rolled out a 5-lever cocktail — the split succeeds because it's part of a coordinated response, not a solo action:
- Reduce shard replicas (storage-side) — to shrink the relocation footprint.
- Throttle ingestion to a full stop (storage-side) — no writes during relocation.
- Split the markets (storage-side) — this pattern.
- Presentation-layer control plane (app-side) — turn off non-critical calls, reduce parallel queries per request, lengthen caches for hot queries.
- Search-steering down-sampling (app-side) — sample fewer requests into heavier ML-model integrations and promotion-enrichment flows, fall back to simpler ranking.
Levers 4 and 5 are instances of load shedding at the presentation boundary: app-layer cost reduction applied before requests reach the storage tier, coordinated with the storage-tier isolation move.
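Levers 1 and 2 (the storage-side levers) map onto standard per-index dynamic settings. A minimal sketch of the payloads, assuming the indices started with one replica; the write-up does not name the exact mechanism used to stop ingestion, and the per-index write block shown here is one common way to do it:

```python
import json

# Lever 1: drop replicas (PUT /<index>/_settings). Going from 1 replica
# to 0 halves the shard data that has to move during relocation.
reduce_replicas = {"index.number_of_replicas": 0}

# Lever 2: halt ingestion (PUT /<index>/_settings). `index.blocks.write`
# rejects writes while still serving reads, so no indexing work competes
# for CPU and I/O while shards relocate.
block_writes = {"index.blocks.write": True}

print(json.dumps({**reduce_replicas, **block_writes}, indent=2))
```

Both settings are dynamic, so they can be reverted once the relocation completes and the incident is contained.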
Seen in¶
- sources/2025-12-16-zalando-the-day-our-own-queries-dosed-us-inside-zalando-search — canonical wiki instance. Zalando's Search & Browse team split the two largest markets (which had been co-tenant on a single ES cluster) by moving the smaller market to a freshly provisioned node group via `node.attr.market` allocation filters. The split simultaneously (a) relieved the saturated cluster of the smaller market's load, (b) proved the pathology was localised to the larger market's traffic (and therefore to a client-specific issue), and (c) set up the trace-altitude root-cause investigation that identified the internal self-inflicted-DoS caller via systems/lightstep.
Related¶
- concepts/market-group-country-isolation — steady-state sibling primitive at the serving-API tier
- concepts/blast-radius — the discipline the pattern enacts
- concepts/cell-based-architecture — broader family
- concepts/self-inflicted-dos — the failure class this pattern contains
- patterns/market-group-isolation-for-serving-api — steady-state sibling pattern at the serving-API tier
- patterns/cell-based-architecture-for-blast-radius-reduction
- systems/elasticsearch