Zalando — The Day Our Own Queries DoS’ed Us: Inside Zalando Search¶
Summary¶
Zalando's Search & Browse team publishes a production-incident
retrospective on a Sunday-afternoon Elasticsearch meltdown in which
the root cause was a self-inflicted denial of service: a legit
internal application, triggered by an automated maintenance workload
plus a processing-logic bug, started issuing 20–100 requests/s of
faceting queries that performed aggregations on a high-cardinality
field (SKU, the unique product identifier). Volume was "peanuts"
by the cluster's normal standards (thousands of req/s), but each
query's per-shard scatter/gather
cost on a high-cardinality terms aggregation was so heavy that a
small, steady stream was enough to pin coordinator-node CPU and
starve the search thread pool — the exact signature of a DoS.
The incident surfaced load-bearing observability gaps (no per-client
slow-query attribution, no identifier on aggregation queries linking
back to the calling service) and anchored a follow-up program of
application-side query limiting with dynamically adjustable
thresholds, per-client slow-query dashboards via the `X-Opaque-Id`
request header, and cluster-wide `search.max_buckets` guardrails.
The closing metaphor is the clinical-diagnostics
aphorism "when you hear hoofbeats, think horses, not zebras" —
with the qualifier "sometimes, when you hear hoofbeats, it might
just be a zebra": the team was looking for high read load / high
write load / infrastructure problems, and missed a low-volume,
high-per-query-cost internal caller flying under their
volume-based monitoring.
Key takeaways¶
- Self-inflicted DoS is a real failure mode and is invisible to volume-based monitoring. The pathological client was sending 20–100 req/s of legitimate-looking Elasticsearch queries against a cluster that routinely handles "thousands of requests per second"; the per-client traffic monitoring that existed "was just too low to attract any attention; it was simply flying under the radar." The damage came not from rate but from per-query cost: high-cardinality faceting aggregations on the SKU field. Canonicalised as concepts/self-inflicted-dos — one internal application + automated trigger + pathological query shape = outage. (Source: sources/2025-12-16-zalando-the-day-our-own-queries-dosed-us-inside-zalando-search.)
- High-cardinality terms aggregations are a DoS vector in
Elasticsearch. Zalando's load-bearing theory section explains
the mechanics: faceting queries scatter to all relevant shard
copies, each shard runs the `terms` aggregation across all
matching docs, and the coordinator gathers and reduces the partial
results. Parallel collectors (the ES 8.12+ `search_worker` pool) do
not help here: high-cardinality terms aggregations "are not executed
with those parallel collectors; they simply ran as very heavy work
on the `search` pool, consuming a lot of CPU and memory. A small
number of such pathological facet queries was enough to keep the
cluster 'hot' and to starve normal traffic, which is exactly what a
DoS looks like in practice." Canonicalised as
concepts/high-cardinality-aggregation-overload.
- Elasticsearch's soft guardrails exist but were not configured
defensively here. The incident names the three soft guardrails
Elasticsearch ships: (1) `search.max_buckets` — caps how many
aggregation buckets a single request can create, defending against
unbounded-cardinality aggregations. (2) `max_result_window`
(index-level) — "make sure no single request can ask for a 'scroll
the universe'-sized result set." (3) Adaptive Replica Selection
(ARS) — the coordinator picks the "best" shard copy based on past
response times and search-thread-pool queue size. One of Zalando's
follow-up runbooks is "applying cluster-wide settings like
search.max_buckets to limit the size of aggregations on the whole
cluster at once." (A configuration sketch for these guardrails
follows this list.)
- The immediate mitigations didn't help until the market was split.
The on-call responder's standard playbook — longer cache
expirations, disabling non-critical requests, lower cluster-wide
query-termination thresholds, scaling out coordinator and data
nodes — "have not provided even a temporary relief. The cluster
remained under significant strain." The breakthrough mitigation was
structural: splitting the two largest markets into separate
clusters, using Elasticsearch's node allocation settings to fence
which shards live on which nodes, so the problem could be localised
to the market whose client was misbehaving. Canonicalised as
patterns/split-cluster-by-market-for-load-isolation — an
incident-time instance of the steady-state
concepts/market-group-country-isolation design primitive Zalando
uses in PRAPI.
- The 5-lever load-shedding cocktail. In parallel with the market split, the team rolled out five load-shedding actions, half on the ES cluster and half on the application side: (1) reduce shard replicas — "so there would be fewer shards to relocate once we started splitting the markets"; (2) throttle ingestion down to a full stop — no writes during relocation; (3) split the markets — smaller market to new cluster, larger market stays on the original; (4) presentation-layer controls — turn off non-critical calls, reduce parallel-queries-per-request, increase cache effectiveness for hot queries + filter combinations; (5) search steering down-sampling — "sampling fewer requests into some heavier ML model integrations and promotion-enrichment flows, falling back to simpler ranking where needed." The pattern is that presentation layers (Catalog API, Search API, NER query builder) are used as a control plane to reduce load on downstream Elasticsearch — i.e. load shedding at the presentation boundary, not inside Elasticsearch.
- Cluster self-recovery without root cause is the dangerous mid-state. "At some point in the evening, the cluster started to recover. The CPU usage began to drop, and the query response times improved. The cluster returned to a stable state." But: "the root cause of the issue was still not understood, so the team continued to investigate. No one was satisfied with just having the cluster back up; they needed to know what had caused the problem in the first place. The incident could resurface at any time if the underlying issue was not addressed." Canonical anti-complacency stance — a green dashboard without a named root cause is latent, not closed. Structurally aligned with the postmortem corpus as institutional memory ethos Zalando SRE publishes elsewhere (sources/2025-09-24-zalando-dead-ends-or-data-goldmines-ai-powered-postmortem-analysis).
- Lightstep notebook + trace exploration closed the loop. "An exploratory analysis of traces in a Lightstep notebook detected an unusual traffic pattern from one of our internal applications. Further investigation revealed that the application was sending 50 times more queries than usual, and it matched the incident timeline exactly." The diagnostic win came from trace-altitude, not metric-altitude: the per-client fan-out pattern was legible in traces but not in the aggregated cluster metrics. See systems/lightstep.
- Five reasons the bug hid. The retrospective enumerates
why standard monitoring missed this:
(1) queries were syntactically valid — the NER / business
layer couldn't reject them;
(2) the calling service was legit and not new — no change
window or deploy correlation;
(3) volume was too low to flag (20–100 req/s vs thousands);
(4) slow-query log existed but wasn't analysed per-client —
the SKU-facet queries were indistinguishable from legitimate
user facets;
(5) no client identifier on the queries — the slow-query log
captured slow queries but not who sent them, because the
application did not propagate an
`X-Opaque-Id` header. This list is canonical evidence that
per-client slow-query attribution is a load-bearing operational
capability, not a nice-to-have. Canonicalised as
patterns/per-client-slow-query-dashboard.
- Follow-up program: rate-limit by client type, not by overall
volume. "We need to think how we can split and isolate workloads
better, applying rate limiting based on the type of the client
traffic. Not all clients should be equal, and we might need a more
granular access control." Combined with: "We introduced
application-side query limiting with dynamically adjustable
thresholds, to prevent queries that would try to scan or aggregate
too much data." Canonicalised as
patterns/application-side-query-limit-with-dynamic-threshold. (A
minimal sketch of both controls follows this list.)
- The zebra lesson. The closing framing is the
clinical-diagnostics aphorism:
"When you hear hoofbeats, think of horses, not zebras. Because horses are common, and zebras are rare. But in our case, it happened to be a zebra. […] We were looking for common causes of Elasticsearch performance issues: high read load, high write load, misconfigurations, infrastructure issues. We were not expecting a self-inflicted DoS attack from an internal application. So keep in mind: sometimes, when you hear hoofbeats, it might just be a zebra." Canonicalised as concepts/zebra-not-horse-heuristic — an investigator's bias-checker when common hypotheses fail to explain the data.
The catalog-search architecture (as described in the post)¶
A single user search traverses roughly five layers; each adds its own caching and enrichment. The post describes them bottom-up:
┌──────────────────────────────────────────────────────────┐
│ Catalog API (presentation) │
│ - fans out 1 request → N queries to Base Search │
│ - A/B-test-aware (cohort-specific query shapes) │
│ - owns redirect decisions (SRP vs landing page) │
│ - issues a SEPARATE facets query for filters │
│ - caches popular queries / filter combinations │
└──────────────────────┬───────────────────────────────────┘
│
┌──────────────────────┴───────────────────────────────────┐
│ NER Query Builder (full-text search query builder) │
│ - consumes user intent (raw text + filters) │
│ - Named-Entity-Recognition → implicit filter promotion │
│ - builds the Elasticsearch query │
│ - tags result set "sparse" → hand-off to neural matcher │
│ - separately hits Base Search for PRODUCT COUNTS │
│ to decide whether to narrow without risking 0 results │
│ - caches popular queries / filter combinations │
└──────────────────────┬───────────────────────────────────┘
│
┌──────────────────────┴───────────────────────────────────┐
│ Search API (wraps Base Search) │
│ - integrates with Algorithm Gateway │
│ (user-action data + rules engine + ML re-ranking) │
│ - integrates with Promotions Bidding Service │
│ (sponsored content blended with organic results) │
└──────────────────────┬───────────────────────────────────┘
│
┌──────────────────────┴───────────────────────────────────┐
│ Base Search (Elasticsearch cluster) │
│ - dedicated coordinator nodes │
│ (another caching layer for results + aggregations) │
│ - data nodes running the actual per-shard query work │
│ - lexical matching + vector search for initial candidates│
└──────────────────────────────────────────────────────────┘
Under normal conditions, facet queries (brand / size / colour / price-bucket aggregations) are well-behaved and benefit from the two cache layers (presentation and coordinator-node). Under load, "a pathological pattern in just this one type of query — facets — can put disproportionate pressure on Elasticsearch and its coordinator nodes, while everything above simply sees 'search is slow' and 'filters are broken'." Hot-path isolation is the design promise this incident violated.
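To make "pathological" concrete, here is a minimal sketch of the query shape under discussion, assuming a keyword field named sku (the post identifies it only as SKU, the unique product identifier); index name, filter, and sizes are illustrative:

```python
# Not Zalando's actual query; field and index names are assumptions.
facet_request = {
    "size": 0,  # pure facets query: no hits, only aggregation buckets
    "query": {"bool": {"filter": [{"term": {"brand": "acme"}}]}},
    "aggs": {
        "sku_facet": {
            # terms on a unique-per-document field: the bucket count
            # approaches the number of matching documents, so every
            # shard builds a huge partial result and the coordinator
            # must merge all of them.
            "terms": {"field": "sku", "size": 10_000}
        }
    },
}
# es.search(index="products", **facet_request)
```

At 20–100 such requests per second, each fan-out multiplies across shards, which is how a low request rate translates into coordinator saturation.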
The Elasticsearch thread-pool mechanics (the technical core)¶
| Thread pool | Role | Who uses it |
|---|---|---|
| `search` | Per-shard query and aggregation execution | All shard-level query work, including heavy `terms` aggregations |
| `search_coordination` | Lighter orchestration on the coordinator: merging partial results, reductions, final response | Every request, on the coordinator node |
| `search_worker` (ES 8.12+) | Parallel collectors: intra-shard work sliced across segments to reduce latency | Some aggregations and query types, but NOT high-cardinality `terms` aggregations |
The post's load-bearing technical observation: "Our incident,
however, was driven by high-cardinality terms aggregations, which
are not executed with those parallel collectors; they simply
ran as very heavy work on the `search` pool, consuming a lot of
CPU and memory." Filling the `search` queue produces request
rejections; a small number of pathological facet queries was
"enough to keep the cluster 'hot' and to starve normal traffic" —
the DoS signature.
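A hedged sketch of spotting that signature from the outside, assuming the Python elasticsearch client; the column choice is illustrative:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Watch active threads, queue depth, and rejections per search pool.
print(es.cat.thread_pool(
    thread_pool_patterns="search,search_coordination,search_worker",
    h="node_name,name,active,queue,rejected",
    v=True,
))
```

A saturated `search` pool shows `active` pinned at pool size, `queue` at its cap, and a climbing `rejected` counter while overall QPS stays unremarkable: the low-rate, high-cost signature described above.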
Operational numbers (from the post)¶
| Quantity | Value |
|---|---|
| Normal cluster QPS | "thousands of requests per second" |
| Pathological client QPS | 20–100 req/s |
| Pathological client vs baseline multiplier | 50× its normal volume |
| Affected cluster scope | 2 of Zalando's largest markets (the single shared cluster serving both) |
| Other market-group clusters | unaffected (market-group isolation held as designed) |
| Client profile | Internal application, legit, not new, pre-existing in monitoring |
| Query shape | Faceting (terms aggregation) on SKU — unique-product-identifier high-cardinality field |
| Trigger | Automated maintenance workload + processing-logic bug in the client |
| Mitigation that worked | Splitting the two markets into separate clusters via node allocation settings + 5-lever application-side load shedding |
| Root-cause detection venue | Lightstep-trace exploratory notebook |
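A minimal sketch of the fencing mechanism named in the mitigation row, assuming Elasticsearch shard-allocation filtering keyed on a hypothetical market node attribute; the post does not disclose the exact allocation settings used:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Lever 1: fewer replicas first, so fewer shards need relocating.
es.indices.put_settings(
    index="products-market-a",  # hypothetical per-market index name
    settings={"index.number_of_replicas": 1},
)

# Lever 3: pin one market's indices to nodes tagged market=a
# (node.attr.market: a in elasticsearch.yml); the allocator then
# relocates those shards away from the shared nodes.
es.indices.put_settings(
    index="products-market-a",
    settings={"index.routing.allocation.require.market": "a"},
)
```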
Caveats¶
- The affected-cluster CPU number, queue-rejection rate, and customer-facing impact numbers (how many searches failed, for how long) are not disclosed. The only customer-impact evidence is app-review quotes; the incident duration runs from a "seemingly ordinary Sunday" to "at some point in the evening" — hours-scale, not quantified to the minute.
- The exact Elasticsearch version is not named. The post references
"Starting with 8.12" for the `search_worker` pool, placing Zalando
on ES ≥ 8.12 at the time of writing, but whether the cluster was on
8.12-or-later during the incident is not stated.
- The client application that originated the pathological queries is
named only as "an internal application" — not identified by service
name or team, consistent with how Zalando protects internal service
names across the blog.
- The post does not quantify how much each follow-up action
contributed to the resilience envelope (per-client dashboards vs the
`search.max_buckets` guardrail vs the app-side limiter). The
retrospective describes the actions but not their A/B-attributed
effect.
- The 50× baseline-multiplier claim is a Lightstep-notebook-observed
number after the fact; it was not visible in per-client monitoring
at the time because that monitoring was volume-gated.
Source¶
- Original: https://engineering.zalando.com/posts/2025/12/we-hacked-ourselves-so-you-dont-have-to.html
- Raw markdown:
raw/zalando/2025-12-16-the-day-our-own-queries-dosed-us-inside-zalando-search-8093b840.md
Related¶
- systems/elasticsearch — the DoS'd substrate
- systems/zalando-catalog-search — the composite system-of-systems described in the post
- systems/zalando-base-search — the Elasticsearch cluster with coordinator nodes + data nodes
- systems/zalando-catalog-api — presentation layer that fans 1→N queries
- systems/zalando-search-api — wraps Base Search, blends Algorithm Gateway + Promotions Bidding
- systems/zalando-ner-query-builder — NER-driven query builder that also hits Base Search for product counts
- systems/lightstep — the trace-exploration tool that closed the root-cause loop
- concepts/self-inflicted-dos — the core failure mode
- concepts/high-cardinality-aggregation-overload — the per-query mechanism
- concepts/adaptive-replica-selection-elasticsearch — the ARS shard-selection primitive
- concepts/x-opaque-id-client-attribution — the observability primitive the team added
- concepts/zebra-not-horse-heuristic — the debugging mental model
- concepts/scatter-gather-query — the execution shape facets inherit
- concepts/market-group-country-isolation — the steady-state isolation primitive that the incident-time market-split instantiates
- concepts/blast-radius — what market-group isolation bounds
- concepts/tail-latency-spike-during-queueing — the user-visible symptom layer
- concepts/load-shedding-at-ingestion — the parent concept the app-side mitigations instantiate
- patterns/split-cluster-by-market-for-load-isolation — incident-time isolation via node-allocation settings
- patterns/application-side-query-limit-with-dynamic-threshold — the app-side query-cost guardrail
- patterns/per-client-slow-query-dashboard — X-Opaque-Id-keyed per-caller slow-query attribution
- patterns/cluster-wide-aggregation-guardrail — `search.max_buckets` cluster-level limit
- companies/zalando