
SYSTEM

Zalando Catalog Search

Identity

Zalando Catalog Search is the wiki's composite-identity page for the multi-layer search substrate operated by Zalando's Search & Browse team. It is not a single service — it is the end-to-end request path from a user's search action in the app to the Elasticsearch candidate set, spanning four presentation / execution layers and two enrichment sidecars.

The canonical architectural description is sources/2025-12-16-zalando-the-day-our-own-queries-dosed-us-inside-zalando-search, which introduces the layering to explain why a pathological facet query at the bottom layer produces the user-visible "search is slow" and "filters are broken" symptoms at the top.

Layers (bottom-up)

┌──────────────────────────────────────────────────────────┐
│  Catalog API                                             │
│  → systems/zalando-catalog-api                           │
└──────────────────┬───────────────────────────────────────┘
                   │ fan-out 1 request → N queries
┌──────────────────┴───────────────────────────────────────┐
│  NER Query Builder                                       │
│  → systems/zalando-ner-query-builder                     │
└──────────────────┬───────────────────────────────────────┘
┌──────────────────┴───────────────────────────────────────┐
│  Search API   ←   Algorithm Gateway                      │
│                   (user-action + ML re-ranking)          │
│               ←   Promotions Bidding Service             │
│                   (sponsored-result blending)            │
│  → systems/zalando-search-api                            │
└──────────────────┬───────────────────────────────────────┘
┌──────────────────┴───────────────────────────────────────┐
│  Base Search  (Elasticsearch, coordinator + data nodes)  │
│  → systems/zalando-base-search                           │
└──────────────────────────────────────────────────────────┘

Each layer carries its own caches on the hot path:

Layer                          Cache role
Catalog API                    Caches popular queries and filter combinations
NER Query Builder              Caches popular queries and filter combinations
Base Search coordinator nodes  Caches search results and aggregations (on separate machines from the data nodes)
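The multi-cache topology amounts to a chain of read-through caches: a request is answered by the first layer that holds the key, and a full miss falls through to Elasticsearch. A minimal sketch of that chain, assuming a read-through design (class and variable names are hypothetical, not Zalando's code):

```python
class LayerCache:
    """One layer's cache on the hot path; falls through to the layer below on a miss."""

    def __init__(self, name, next_layer=None):
        self.name = name
        self.store = {}
        self.next_layer = next_layer  # the layer below, or None at the bottom

    def get(self, key, fetch):
        if key in self.store:
            return self.store[key]            # hit: lower layers are never touched
        if self.next_layer is not None:
            value = self.next_layer.get(key, fetch)
        else:
            value = fetch(key)                # miss at every layer: hits Elasticsearch
        self.store[key] = value               # populate this layer on the way back up
        return value


# Wired bottom-up, mirroring the diagram above.
base_search = LayerCache("base-search-coordinator")
ner_builder = LayerCache("ner-query-builder", next_layer=base_search)
catalog_api = LayerCache("catalog-api", next_layer=ner_builder)
```

A popular query only reaches Elasticsearch once; every repeat is absorbed by whichever layer cached it first.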

Under normal conditions, facet queries (brand, size, colour, price-bucket aggregations) are well-behaved and benefit from this multi-cache topology. Under load, faceting against high-cardinality fields (SKU, unique product IDs) defeats every cache layer and overloads the coordinator-plus-data-node pair — the precise failure the 2025-12-16 incident canonicalises.
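The difference between the two shapes can be sketched as Elasticsearch request bodies. The field names (`category`, `brand`, `sku`) are illustrative assumptions; Zalando's actual mapping is not public:

```python
# Well-behaved facet: terms aggregation over a low-cardinality field.
# Identical across many users, so every cache layer can serve it.
brand_facet = {
    "size": 0,
    "query": {"term": {"category": "jeans"}},
    "aggs": {"brands": {"terms": {"field": "brand", "size": 50}}},
}

# Pathological facet: terms aggregation over a high-cardinality field (SKU).
# The coordinator must merge enormous per-shard bucket sets, and because
# each filter+SKU combination is novel, no cache layer ever gets a hit.
sku_facet = {
    "size": 0,
    "query": {"term": {"category": "jeans"}},
    "aggs": {"skus": {"terms": {"field": "sku", "size": 10000}}},
}
```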

Downstream consumers

  • The customer-facing catalog — the direct search-and-browse surface.
  • The Designer experience — a curated browse view.
  • Full-text search — the app's primary query box.
  • Zalando Assistant — the conversational discovery surface that "depends on us to fetch and recommend products in real time" (sources/2025-12-16-zalando-the-day-our-own-queries-dosed-us-inside-zalando-search).
  • Brand partner campaigns — sponsored placements blended via the Promotions Bidding Service; a catalog-search outage is a partner-campaigns outage.

Operational posture

  • Market-group isolation at the Elasticsearch tier. Multiple ES clusters, each serving a subset of countries, such that saturation in one market cluster does not affect other markets. Validated in the 2025-12-16 incident: two of the largest markets co-tenanted on one cluster were saturated; all other market-group clusters remained healthy. Zalando subsequently split the two co-tenant markets into separate clusters during the incident — see patterns/split-cluster-by-market-for-load-isolation.
  • Shared-across-market failure surface. Countries sharing an ES cluster share blast radius; the number of countries per cluster is a tuning knob that trades cluster operational cost against isolation strength.
  • Presentation layers as control plane. During incidents, Catalog API and Search API act as the fast operator-control surface: turn off non-critical calls, reduce parallel queries per request, increase cache TTL, down-sample heavy ML-model integrations. Canonical instance of load shedding at the presentation boundary.
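The operator knobs listed above can be sketched as an incident-time configuration toggled at the Catalog/Search API boundary. All names and default values here are hypothetical; Zalando's actual controls are not public:

```python
from dataclasses import dataclass


@dataclass
class ShedConfig:
    """Load-shedding knobs exposed at the presentation boundary."""
    non_critical_calls: bool = True       # e.g. enrichment sidecar calls
    max_parallel_queries: int = 8         # fan-out: queries per user request
    cache_ttl_seconds: int = 60           # longer TTL -> fewer backend hits
    ml_rerank_sample_rate: float = 1.0    # fraction of requests sent to ML re-ranking


NORMAL = ShedConfig()

# Degraded mode: shed load at the presentation layers instead of letting
# it reach the saturated Elasticsearch tier.
DEGRADED = ShedConfig(
    non_critical_calls=False,
    max_parallel_queries=2,
    cache_ttl_seconds=600,
    ml_rerank_sample_rate=0.1,
)
```

The design point is that these switches live above Base Search, so flipping them takes effect immediately without touching the overloaded cluster.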

Load-bearing pathology surfaced on 2025-12-16

A single pathological caller pattern — ~20–100 req/s of terms aggregations on the SKU field, triggered by an internal application's maintenance workload + processing-logic bug — saturated the coordinator CPU and search thread pool on one market-pair cluster. "Queries that usually took milliseconds were now dragging on for seconds, and some requests were timing out altogether. Users started seeing empty result pages, or pages with just a few items." The pathology escaped every cache layer because the filter+SKU combinations were novel per-request.
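A toy illustration of why the pathology escaped every cache: each request minted a previously unseen filter+SKU cache key, so the hit rate was zero and every request fell through to the coordinator. The key scheme is a hypothetical stand-in:

```python
cache = set()
hits = misses = 0

# Each maintenance-job request carried a different SKU filter combination,
# so no two requests ever shared a cache key.
for i in range(1000):
    key = (("sku_filter", f"sku-{i}"), ("agg_field", "sku"))
    if key in cache:
        hits += 1
    else:
        misses += 1
        cache.add(key)   # cached, but never reused

# hits == 0, misses == 1000: every single request reaches Elasticsearch.
```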

Follow-up program:
