Zalando — The Day Our Own Queries DoS’ed Us: Inside Zalando Search¶
Summary¶
Zalando's Search & Browse team publishes a production-incident
retrospective on a Sunday-afternoon Elasticsearch meltdown in which
the root cause was a self-inflicted denial of service: a legit
internal application, triggered by an automated maintenance workload
plus a processing-logic bug, started issuing 20–100 requests/s of
faceting queries that performed aggregations on a high-cardinality
field (SKU, the unique product identifier). Volume was "peanuts"
by the cluster's normal standards (thousands of req/s), but each
query's per-shard scatter/gather
cost on a high-cardinality terms aggregation was so heavy that a
small, steady stream was enough to pin coordinator-node CPU and
starve the search thread pool — the exact signature of a DoS.
The incident surfaced load-bearing observability gaps (no per-client
slow-query attribution, no identifier on aggregation queries linking
back to the calling service) and anchored a follow-up program of
application-side query limiting with dynamically adjustable
thresholds, per-client slow-query dashboards via the `X-Opaque-Id`
request header, and cluster-wide `search.max_buckets` guardrails.
The closing metaphor is the clinical-diagnostics
aphorism "when you hear hoofbeats, think horses, not zebras" —
with the qualifier "sometimes, when you hear hoofbeats, it might
just be a zebra": the team was looking for high read load / high
write load / infrastructure problems, and missed a low-volume,
high-per-query-cost internal caller flying under their
volume-based monitoring.
Key takeaways¶
- Self-inflicted DoS is a real failure mode and is invisible to volume-based monitoring. The pathological client was sending 20–100 req/s of legitimate-looking Elasticsearch queries against a cluster that routinely handles "thousands of requests per second"; the per-client traffic monitoring that existed "was just too low to attract any attention; it was simply flying under the radar." The damage came not from rate but from per-query cost: high-cardinality faceting aggregations on the SKU field. Canonicalised as concepts/self-inflicted-dos — one internal application + automated trigger + pathological query shape = outage. (Source: sources/2025-12-16-zalando-the-day-our-own-queries-dosed-us-inside-zalando-search.)
- High-cardinality terms aggregations are a DoS vector in
Elasticsearch. Zalando's load-bearing theory section explains
the mechanics: faceting queries scatter to all relevant shard
copies, each shard runs the `terms` aggregation across all
matching docs, and the coordinator gathers and reduces the partial
results. Parallel collectors (the ES 8.12+ `search_worker` pool) do
not help here: high-cardinality terms aggregations "are not executed
with those parallel collectors; they simply ran as very heavy work
on the `search` pool, consuming a lot of CPU and memory. A small
number of such pathological facet queries was enough to keep the
cluster 'hot' and to starve normal traffic, which is exactly what a
DoS looks like in practice." Canonicalised as
concepts/high-cardinality-aggregation-overload.
- Elasticsearch's soft guardrails exist but were not configured
defensively here. The incident names the three soft guardrails
Elasticsearch ships: (1) `search.max_buckets` — caps how many
aggregation buckets a single request can create, defending against
unbounded-cardinality aggregations. (2) `max_result_window`
(index-level) — "make sure no single request can ask for a 'scroll
the universe'-sized result set." (3) Adaptive Replica Selection
(ARS) — the coordinator picks the "best" shard copy based on past
response times and search-thread-pool queue size. One of Zalando's
follow-up runbooks is "applying cluster-wide settings like
search.max_buckets to limit the size of aggregations on the whole
cluster at once." (A configuration sketch for these guardrails
follows this list.)
- The immediate mitigations didn't help until the market was split.
The on-call responder's standard playbook — longer cache
expirations, disabling non-critical requests, lower cluster-wide
query-termination thresholds, scaling out coordinator and data
nodes — "have not provided even a temporary relief. The cluster
remained under significant strain." The breakthrough mitigation was
structural: splitting the two largest markets into separate
clusters, using Elasticsearch's node allocation settings to fence
which shards live on which nodes, so the problem could be localised
to the market whose client was misbehaving. Canonicalised as
patterns/split-cluster-by-market-for-load-isolation — an
incident-time instance of the steady-state
concepts/market-group-country-isolation design primitive Zalando
uses in PRAPI.
- The 5-lever load-shedding cocktail. In parallel with the market split, the team rolled out five load-shedding actions, half on the ES cluster and half on the application side: (1) reduce shard replicas — "so there would be fewer shards to relocate once we started splitting the markets"; (2) throttle ingestion down to a full stop — no writes during relocation; (3) split the markets — smaller market to new cluster, larger market stays on the original; (4) presentation-layer controls — turn off non-critical calls, reduce parallel-queries-per-request, increase cache effectiveness for hot queries + filter combinations; (5) search steering down-sampling — "sampling fewer requests into some heavier ML model integrations and promotion-enrichment flows, falling back to simpler ranking where needed." The pattern is that presentation layers (Catalog API, Search API, NER query builder) are used as a control plane to reduce load on downstream Elasticsearch — i.e. load shedding at the presentation boundary, not inside Elasticsearch.
- Cluster self-recovery without root cause is the dangerous mid-state. "At some point in the evening, the cluster started to recover. The CPU usage began to drop, and the query response times improved. The cluster returned to a stable state." But: "the root cause of the issue was still not understood, so the team continued to investigate. No one was satisfied with just having the cluster back up; they needed to know what had caused the problem in the first place. The incident could resurface at any time if the underlying issue was not addressed." Canonical anti-complacency stance — a green dashboard without a named root cause is latent, not closed. Structurally aligned with the postmortem corpus as institutional memory ethos Zalando SRE publishes elsewhere (sources/2025-09-24-zalando-dead-ends-or-data-goldmines-ai-powered-postmortem-analysis).
- Lightstep notebook + trace exploration closed the loop. "An exploratory analysis of traces in a Lightstep notebook detected an unusual traffic pattern from one of our internal applications. Further investigation revealed that the application was sending 50 times more queries than usual, and it matched the incident timeline exactly." The diagnostic win came from trace-altitude, not metric-altitude: the per-client fan-out pattern was legible in traces but not in the aggregated cluster metrics. See systems/lightstep.
- Five reasons the bug hid. The retrospective enumerates
why standard monitoring missed this:
(1) queries were syntactically valid — the NER / business
layer couldn't reject them;
(2) the calling service was legit and not new — no change
window or deploy correlation;
(3) volume was too low to flag (20–100 req/s vs thousands);
(4) slow-query log existed but wasn't analysed per-client —
the SKU-facet queries were indistinguishable from legitimate
user facets;
(5) no client identifier on the queries — the slow-query log
captured slow queries but not who sent them, because the
application did not propagate an
`X-Opaque-Id` header. This list is canonical evidence that
per-client slow-query attribution is a load-bearing operational
capability, not a nice-to-have. Canonicalised as
patterns/per-client-slow-query-dashboard.
- Follow-up program: rate-limit by client type, not by overall
volume. "We need to think how we can split and isolate workloads
better, applying rate limiting based on the type of the client
traffic. Not all clients should be equal, and we might need a more
granular access control." Combined with: "We introduced
application-side query limiting with dynamically adjustable
thresholds, to prevent queries that would try to scan or aggregate
too much data." Canonicalised as
patterns/application-side-query-limit-with-dynamic-threshold. (A
minimal sketch of both controls follows this list.)
- The zebra lesson. The closing framing is the
clinical-diagnostics aphorism:
"When you hear hoofbeats, think of horses, not zebras. Because horses are common, and zebras are rare. But in our case, it happened to be a zebra. […] We were looking for common causes of Elasticsearch performance issues: high read load, high write load, misconfigurations, infrastructure issues. We were not expecting a self-inflicted DoS attack from an internal application. So keep in mind: sometimes, when you hear hoofbeats, it might just be a zebra." Canonicalised as concepts/zebra-not-horse-heuristic — an investigator's bias-checker when common hypotheses fail to explain the data.
The catalog-search architecture (as described in the post)¶
A single user search traverses roughly five layers; each adds its own caching and enrichment. The post describes them bottom-up:
┌──────────────────────────────────────────────────────────┐
│ Catalog API (presentation) │
│ - fans out 1 request → N queries to Base Search │
│ - A/B-test-aware (cohort-specific query shapes) │
│ - owns redirect decisions (SRP vs landing page) │
│ - issues a SEPARATE facets query for filters │
│ - caches popular queries / filter combinations │
└──────────────────────┬───────────────────────────────────┘
│
┌──────────────────────┴───────────────────────────────────┐
│ NER Query Builder (full-text search query builder) │
│ - consumes user intent (raw text + filters) │
│ - Named-Entity-Recognition → implicit filter promotion │
│ - builds the Elasticsearch query │
│ - tags result set "sparse" → hand-off to neural matcher │
│ - separately hits Base Search for PRODUCT COUNTS │
│ to decide whether to narrow without risking 0 results │
│ - caches popular queries / filter combinations │
└──────────────────────┬───────────────────────────────────┘
│
┌──────────────────────┴───────────────────────────────────┐
│ Search API (wraps Base Search) │
│ - integrates with Algorithm Gateway │
│ (user-action data + rules engine + ML re-ranking) │
│ - integrates with Promotions Bidding Service │
│ (sponsored content blended with organic results) │
└──────────────────────┬───────────────────────────────────┘
│
┌──────────────────────┴───────────────────────────────────┐
│ Base Search (Elasticsearch cluster) │
│ - dedicated coordinator nodes │
│ (another caching layer for results + aggregations) │
│ - data nodes running the actual per-shard query work │
│ - lexical matching + vector search for initial candidates│
└──────────────────────────────────────────────────────────┘
Under normal conditions, facet queries (brand / size / colour / price-bucket aggregations) are well-behaved and benefit from the two cache layers (presentation and coordinator-node). Under load, "a pathological pattern in just this one type of query — facets — can put disproportionate pressure on Elasticsearch and its coordinator nodes, while everything above simply sees 'search is slow' and 'filters are broken'." Hot-path isolation is the design promise this incident violated.
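To make "pathological" concrete, here is a minimal sketch of the query shape under discussion, assuming a keyword field named sku (the post identifies it only as SKU, the unique product identifier); index name, filter, and sizes are illustrative:

```python
# Not Zalando's actual query; field and index names are assumptions.
facet_request = {
    "size": 0,  # pure facets query: no hits, only aggregation buckets
    "query": {"bool": {"filter": [{"term": {"brand": "acme"}}]}},
    "aggs": {
        "sku_facet": {
            # terms on a unique-per-document field: the bucket count
            # approaches the number of matching documents, so every
            # shard builds a huge partial result and the coordinator
            # must merge all of them.
            "terms": {"field": "sku", "size": 10_000}
        }
    },
}
# es.search(index="products", **facet_request)
```

At 20–100 such requests per second, each fan-out multiplies across shards, which is how a low request rate translates into coordinator saturation.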
The Elasticsearch thread-pool mechanics (the technical core)¶
| Thread pool | Role | Who uses it |
|---|---|---|
| `search` | Per-shard query and aggregation execution | All shard-level query work, including heavy `terms` aggregations |
| `search_coordination` | Lighter orchestration on the coordinator: merging partial results, reductions, final response | Every request, on the coordinator node |
| `search_worker` (ES 8.12+) | Parallel collectors: intra-shard work sliced across segments to reduce latency | Some aggregations and query types, but NOT high-cardinality `terms` aggregations |
The post's load-bearing technical observation: "Our incident,
however, was driven by high-cardinality terms aggregations, which
are not executed with those parallel collectors; they simply
ran as very heavy work on the `search` pool, consuming a lot of
CPU and memory." Filling the `search` queue produces request
rejections; a small number of pathological facet queries was
"enough to keep the cluster 'hot' and to starve normal traffic" —
the DoS signature.
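A hedged sketch of spotting that signature from the outside, assuming the Python elasticsearch client; the column choice is illustrative:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Watch active threads, queue depth, and rejections per search pool.
print(es.cat.thread_pool(
    thread_pool_patterns="search,search_coordination,search_worker",
    h="node_name,name,active,queue,rejected",
    v=True,
))
```

A saturated `search` pool shows `active` pinned at pool size, `queue` at its cap, and a climbing `rejected` counter while overall QPS stays unremarkable: the low-rate, high-cost signature described above.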
Operational numbers (from the post)¶
| Quantity | Value |
|---|---|
| Normal cluster QPS | "thousands of requests per second" |
| Pathological client QPS | 20–100 req/s |
| Pathological client vs baseline multiplier | 50× its normal volume |
| Affected cluster scope | 2 of Zalando's largest markets (the single shared cluster serving both) |
| Other market-group clusters | unaffected (market-group isolation held as designed) |
| Client profile | Internal application, legit, not new, pre-existing in monitoring |
| Query shape | Faceting (terms aggregation) on SKU — unique-product-identifier high-cardinality field |
| Trigger | Automated maintenance workload + processing-logic bug in the client |
| Mitigation that worked | Splitting the two markets into separate clusters via node allocation settings + 5-lever application-side load shedding |
| Root-cause detection venue | Lightstep-trace exploratory notebook |
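A minimal sketch of the fencing mechanism named in the mitigation row, assuming Elasticsearch shard-allocation filtering keyed on a hypothetical market node attribute; the post does not disclose the exact allocation settings used:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Lever 1: fewer replicas first, so fewer shards need relocating.
es.indices.put_settings(
    index="products-market-a",  # hypothetical per-market index name
    settings={"index.number_of_replicas": 1},
)

# Lever 3: pin one market's indices to nodes tagged market=a
# (node.attr.market: a in elasticsearch.yml); the allocator then
# relocates those shards away from the shared nodes.
es.indices.put_settings(
    index="products-market-a",
    settings={"index.routing.allocation.require.market": "a"},
)
```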
Caveats¶
- The affected-cluster CPU number, queue-rejection rate, and customer-facing impact numbers (how many searches failed, for how long) are not disclosed. The only customer-impact evidence is app-review quotes; the incident duration runs from a "seemingly ordinary Sunday" to "at some point in the evening" — hours-scale, not quantified to the minute.
- The exact Elasticsearch version is not named. The post references
"Starting with 8.12" for the `search_worker` pool, placing Zalando
on ES ≥ 8.12 at the time of writing, but whether the cluster was on
8.12-or-later during the incident is not stated.
- The client application that originated the pathological queries is
named only as "an internal application" — not identified by service
name or team, consistent with how Zalando protects internal service
names across the blog.
- The post does not quantify how much each follow-up action
contributed to the resilience envelope (per-client dashboards vs the
`search.max_buckets` guardrail vs the app-side limiter). The
retrospective describes the actions but not their A/B-attributed
effect.
- The 50× baseline-multiplier claim is a Lightstep-notebook-observed
number after the fact; it was not visible in per-client monitoring
at the time because that monitoring was volume-gated.
Source¶
- Original: https://engineering.zalando.com/posts/2025/12/we-hacked-ourselves-so-you-dont-have-to.html
- Raw markdown:
raw/zalando/2025-12-16-the-day-our-own-queries-dosed-us-inside-zalando-search-8093b840.md
Related¶
- systems/elasticsearch — the DoS'd substrate
- systems/zalando-catalog-search — the composite system-of-systems described in the post
- systems/zalando-base-search — the Elasticsearch cluster with coordinator nodes + data nodes
- systems/zalando-catalog-api — presentation layer that fans 1→N queries
- systems/zalando-search-api — wraps Base Search, blends Algorithm Gateway + Promotions Bidding
- systems/zalando-ner-query-builder — NER-driven query builder that also hits Base Search for product counts
- systems/lightstep — the trace-exploration tool that closed the root-cause loop
- concepts/self-inflicted-dos — the core failure mode
- concepts/high-cardinality-aggregation-overload — the per-query mechanism
- concepts/adaptive-replica-selection-elasticsearch — the ARS shard-selection primitive
- concepts/x-opaque-id-client-attribution — the observability primitive the team added
- concepts/zebra-not-horse-heuristic — the debugging mental model
- concepts/scatter-gather-query — the execution shape facets inherit
- concepts/market-group-country-isolation — the steady-state isolation primitive that the incident-time market-split instantiates
- concepts/blast-radius — what market-group isolation bounds
- concepts/tail-latency-spike-during-queueing — the user-visible symptom layer
- concepts/load-shedding-at-ingestion — the parent concept the app-side mitigations instantiate
- patterns/split-cluster-by-market-for-load-isolation — incident-time isolation via node-allocation settings
- patterns/application-side-query-limit-with-dynamic-threshold — the app-side query-cost guardrail
- patterns/per-client-slow-query-dashboard — X-Opaque-Id-keyed per-caller slow-query attribution
- patterns/cluster-wide-aggregation-guardrail — `search.max_buckets` cluster-level limit
- companies/zalando