Zalando Catalog Search
Identity
Zalando Catalog Search is the wiki's composite-identity page for the multi-layer search substrate operated by Zalando's Search & Browse team. It is not a single service — it is the end-to-end request path from a user's search action in the app to the Elasticsearch candidate set, spanning four presentation / execution layers and two enrichment sidecars.
The canonical architectural description is sources/2025-12-16-zalando-the-day-our-own-queries-dosed-us-inside-zalando-search, which introduces the layering to explain why a pathological facet query at the bottom layer produces the user-visible "search is slow" and "filters are broken" symptoms at the top.
Layers (top-down)
┌──────────────────────────────────────────────────────────┐
│  Catalog API                                             │
│    → systems/zalando-catalog-api                         │
└──────────────────┬───────────────────────────────────────┘
                   │ fan-out 1 request → N queries
┌──────────────────┴───────────────────────────────────────┐
│  NER Query Builder                                       │
│    → systems/zalando-ner-query-builder                   │
└──────────────────┬───────────────────────────────────────┘
                   │
┌──────────────────┴───────────────────────────────────────┐
│  Search API    ← Algorithm Gateway                       │
│                  (user-action + ML re-ranking)           │
│                ← Promotions Bidding Service              │
│                  (sponsored-result blending)             │
│    → systems/zalando-search-api                          │
└──────────────────┬───────────────────────────────────────┘
                   │
┌──────────────────┴───────────────────────────────────────┐
│  Base Search (Elasticsearch, coordinator + data nodes)   │
│    → systems/zalando-base-search                         │
└──────────────────────────────────────────────────────────┘
Each layer carries its own caches on the hot path:
| Layer | Cache role |
|---|---|
| Catalog API | Caches popular queries and filter combinations |
| NER query builder | Caches popular queries and filter combinations |
| Base Search coordinator nodes | Caches search results and aggregations (on separate machines from data nodes) |
Under normal conditions, facet queries (brand, size, colour, price-bucket aggregations) are well-behaved and benefit from this multi-cache topology. Under load, faceting against high-cardinality fields (SKU, unique product IDs) defeats every cache layer and overloads the coordinator-plus-data-node pair — the precise failure the 2025-12-16 incident canonicalises.
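To make the contrast concrete, the sketch below builds two Elasticsearch `terms` aggregation request bodies as plain Python dictionaries and counts how many buckets each would have to track over a toy product list. The field names (`brand`, `sku`), the `terms_facet` and `bucket_count` helpers, and the sample data are illustrative assumptions, not Zalando's actual mapping or queries.

```python
from collections import Counter

def terms_facet(field: str, size: int) -> dict:
    """Build a search body that returns no hits, only a terms aggregation."""
    return {
        "size": 0,  # skip hits; only the aggregation is wanted
        "aggs": {f"by_{field}": {"terms": {"field": field, "size": size}}},
    }

def bucket_count(docs: list[dict], field: str) -> int:
    """Number of distinct values, i.e. buckets a terms aggregation must track."""
    return len(Counter(doc[field] for doc in docs))

# Toy catalog (assumed data): 10,000 products over 50 brands, each with a unique SKU.
docs = [{"brand": f"brand-{i % 50}", "sku": f"sku-{i:06d}"} for i in range(10_000)]

brand_facet = terms_facet("brand", size=50)
sku_facet = terms_facet("sku", size=10_000)

print(bucket_count(docs, "brand"))  # 50
print(bucket_count(docs, "sku"))    # 10000
```

The two request bodies look almost identical; the cost difference comes entirely from the number of distinct values in the aggregated field.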
Downstream consumers
- The customer-facing catalog: the direct search-and-browse surface.
- The Designer experience — a curated browse view.
- Full-text search — the app's primary query box.
- Zalando Assistant — the conversational discovery surface that "depends on us to fetch and recommend products in real time" (sources/2025-12-16-zalando-the-day-our-own-queries-dosed-us-inside-zalando-search).
- Brand partner campaigns — sponsored placements blended via the Promotions Bidding Service; a catalog-search outage is a partner-campaigns outage.
Operational posture
- Market-group isolation at the Elasticsearch tier. Multiple ES clusters, each serving a subset of countries, such that saturation in one market cluster does not affect other markets. Validated in the 2025-12-16 incident: two of the largest markets co-tenanted on one cluster were saturated; all other market-group clusters remained healthy. Zalando subsequently split the two co-tenant markets into separate clusters during the incident — see patterns/split-cluster-by-market-for-load-isolation.
- Shared-across-market failure surface. Countries sharing an ES cluster share blast radius; the number of countries per cluster is a tuning knob that trades cluster operational cost against isolation strength.
- Presentation layers as control plane. During incidents, Catalog API and Search API act as the fast operator-control surface: turn off non-critical calls, reduce parallel queries per request, increase cache TTL, down-sample heavy ML-model integrations. Canonical instance of load shedding at the presentation boundary.
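The four presentation-layer levers named above can be pictured as a small set of runtime knobs. The following is a minimal sketch only: the `PresentationKnobs` type, its field names, the default values, and the degradation factors are all hypothetical and are not taken from Zalando's services.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class PresentationKnobs:
    """Hypothetical runtime settings for a presentation-layer service."""
    non_critical_calls_enabled: bool = True
    max_parallel_queries: int = 8   # backend queries fanned out per user request
    cache_ttl_seconds: int = 60
    ml_sample_rate: float = 1.0     # fraction of requests that call heavy ML models

NORMAL = PresentationKnobs()

def shed_load(knobs: PresentationKnobs) -> PresentationKnobs:
    """Return a degraded profile that applies the four levers at once."""
    return replace(
        knobs,
        non_critical_calls_enabled=False,
        max_parallel_queries=max(1, knobs.max_parallel_queries // 2),
        cache_ttl_seconds=knobs.cache_ttl_seconds * 5,
        ml_sample_rate=knobs.ml_sample_rate * 0.25,
    )

def plan_queries(queries: list[str], knobs: PresentationKnobs) -> list[str]:
    """Keep at most max_parallel_queries backend queries for one request."""
    return queries[: knobs.max_parallel_queries]

degraded = shed_load(NORMAL)
print(degraded)
print(plan_queries([f"q{i}" for i in range(8)], degraded))  # first 4 queries only
```

Keeping the knobs in one immutable value makes it easy to switch a whole profile atomically and to revert to `NORMAL` once the backend recovers.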
Load-bearing pathology surfaced on 2025-12-16
A single pathological caller pattern — ~20–100 req/s of
terms aggregations on the SKU field, triggered by an internal
application's maintenance workload + processing-logic bug —
saturated the coordinator CPU and search thread pool on one
market-pair cluster. "Queries that usually took milliseconds
were now dragging on for seconds, and some requests were timing
out altogether. Users started seeing empty result pages, or
pages with just a few items." The pathology escaped every cache
layer because the filter+SKU combinations were novel per-request.
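Why novel combinations bypass every cache can be shown with a toy simulation. The sketch below assumes a simple unbounded cache keyed on the filter set plus the aggregated field; the `cache_key` shape, the filter names, and the traffic mix are illustrative assumptions rather than the real cache design.

```python
from itertools import cycle, islice

def hit_rate(keys) -> float:
    """Fraction of lookups served from a simple unbounded cache keyed on the query."""
    cache, hits, total = set(), 0, 0
    for key in keys:
        total += 1
        if key in cache:
            hits += 1
        else:
            cache.add(key)  # a miss goes to the backend, then populates the cache
    return hits / total

def cache_key(filters: dict, agg_field: str) -> tuple:
    """Assumed key shape: sorted filter pairs plus the aggregated field."""
    return (tuple(sorted(filters.items())), agg_field)

# Normal facet traffic: 1,000 requests cycling over 20 popular filter combinations.
popular = [cache_key({"category": f"c{i}"}, "brand") for i in range(20)]
normal_traffic = list(islice(cycle(popular), 1_000))

# Pathological traffic: every request carries a filter set never seen before.
novel_traffic = [cache_key({"sku_batch": f"batch-{i}"}, "sku") for i in range(1_000)]

print(hit_rate(normal_traffic))  # 0.98
print(hit_rate(novel_traffic))   # 0.0
```

With a zero hit rate, every one of the pathological requests reaches the coordinator and data nodes, regardless of how many cache layers sit in front of them.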
Follow-up program:
- App-side query limiter with dynamically adjustable thresholds (patterns/application-side-query-limit-with-dynamic-threshold).
- Per-client slow-query dashboards via X-Opaque-Id (patterns/per-client-slow-query-dashboard).
- Cluster-wide aggregation guardrail via search.max_buckets (patterns/cluster-wide-aggregation-guardrail).
- Tighter market-level workload isolation and per-client rate limiting as a runbooks / playbook extension.
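The first three follow-ups can be sketched in a few lines. This is one possible reading, not the implementation described in the post-mortem: the `QueryLimiter` class limits in-flight queries (the source does not say which quantity the real limiter bounds), the client name `maintenance-job` is made up, and the `search.max_buckets` value of 10000 is an illustrative assumption. `X-Opaque-Id` and `search.max_buckets` themselves are standard Elasticsearch features.

```python
import threading

class QueryLimiter:
    """Caps in-flight queries; the cap can be changed while the app is running."""

    def __init__(self, max_in_flight: int) -> None:
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def set_threshold(self, max_in_flight: int) -> None:
        with self._lock:
            self._max = max_in_flight

    def try_acquire(self) -> bool:
        """Return True if the caller may send a query now, False if it should back off."""
        with self._lock:
            if self._in_flight >= self._max:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        with self._lock:
            self._in_flight = max(0, self._in_flight - 1)

def search_headers(client_name: str) -> dict:
    """Tag each request so slow logs and task listings can be grouped per client."""
    return {"X-Opaque-Id": client_name, "Content-Type": "application/json"}

# Body for PUT /_cluster/settings; the value 10000 is an illustrative assumption.
MAX_BUCKETS_SETTING = {"persistent": {"search.max_buckets": 10000}}

limiter = QueryLimiter(max_in_flight=2)
print([limiter.try_acquire() for _ in range(3)])  # [True, True, False]
limiter.set_threshold(3)                          # operator raises the cap at runtime
print(limiter.try_acquire())                      # True
print(search_headers("maintenance-job"))
```

The limiter protects the cluster from a single misbehaving caller, the header makes that caller identifiable after the fact, and the cluster setting acts as a last line of defence when both of the others are missing.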
Seen in
- sources/2025-12-16-zalando-the-day-our-own-queries-dosed-us-inside-zalando-search — canonical post-mortem introducing this composite system and its 2025-12-16 self-inflicted-DoS incident.
- sources/2026-03-16-zalando-search-quality-assurance-with-ai-as-a-judge — pre-launch validation instance. Zalando's Search Quality Framework is the offline LLM-as-a-judge evaluation substrate for this stack during new-market launches. Framework-under-test is the full catalog-search stack (NER + Search API + Base Search) wired with the target-market's locale / translations; the judge scores the stack's result sets on a 0–4 rubric before real users see them. Canonical wiki instance of concepts/pre-launch-market-validation applied to this substrate.
Related
- systems/elasticsearch — the storage substrate
- systems/zalando-base-search — the ES-cluster wrapper layer
- systems/zalando-catalog-api — top presentation layer
- systems/zalando-search-api — wraps Base Search, blends Algorithm Gateway + Promotions
- systems/zalando-ner-query-builder — intent-parsing + query-building middle layer
- systems/zalando-algorithm-gateway — enrichment sidecar
- systems/zalando-promotions-bidding — sponsored-content blender
- systems/zalando-assistant — conversational surface
- systems/zalando-search-quality-framework — offline LLM-as-judge evaluation framework for this stack
- systems/zalando-search-query-clustering — upstream NER-clustered test-query generator
- concepts/self-inflicted-dos — the failure mode canonicalised from this system
- concepts/high-cardinality-aggregation-overload — the per-query mechanism
- concepts/pre-launch-market-validation — how new countries are validated on this substrate
- companies/zalando