Zalando Base Search
Identity
Base Search is the bottom-layer Elasticsearch cluster in Zalando's catalog search substrate. It "provides initial candidates using both classic lexical matching and vector search" (sources/2025-12-16-zalando-the-day-our-own-queries-dosed-us-inside-zalando-search) and is the authoritative search index over the product catalog.
Topologically, Base Search is an Elasticsearch cluster with a dedicated coordinator-node tier on separate machines from the data nodes, giving the cluster two distinct failure / saturation surfaces:
- Coordinator nodes own the orchestration work: scatter requests to relevant shards, gather partial results, run Adaptive Replica Selection to pick the best shard copy, and perform batch / final reduction. They also provide "another caching layer for search results and aggregations."
- Data nodes own the shard storage and per-shard query work: Lucene segment I/O, `terms` aggregation execution, and vector distance calculation.
Role in the catalog-search request path
- NER query builder or Catalog API emits an ES search request.
- Base Search coordinator accepts the request, does cache lookups, scatters to relevant shards.
- Data nodes execute per-shard work on the `search` thread pool.
- Partial results return to the coordinator, which reduces them in batches and returns the final response.
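The request path above can be sketched as a small in-memory simulation. This is only an illustration of the scatter / gather / batched-reduce shape, not Zalando's or Elasticsearch's implementation: the `SHARDS` data, the function names, and the `batch_size` value are all hypothetical. In Elasticsearch the comparable knob is the `batched_reduce_size` search parameter.

```python
from concurrent.futures import ThreadPoolExecutor
from heapq import nlargest

# Hypothetical in-memory "shards": each holds (doc_id, score) pairs.
SHARDS = [
    [("a", 0.9), ("b", 0.4)],
    [("c", 0.7), ("d", 0.95)],
    [("e", 0.1), ("f", 0.8)],
]

def query_shard(shard, size):
    """Data-node role: return this shard's local top hits."""
    return nlargest(size, shard, key=lambda hit: hit[1])

def coordinate(shards, size, batch_size=2):
    """Coordinator role: scatter to all shards, then reduce in batches."""
    # Scatter: run the per-shard query concurrently.
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: query_shard(s, size), shards))
    # Gather: fold partial results into a running top-N, a few shards at a time,
    # so the coordinator never holds every partial result at once.
    top = []
    for start in range(0, len(partials), batch_size):
        batch = partials[start:start + batch_size]
        merged = top + [hit for partial in batch for hit in partial]
        top = nlargest(size, merged, key=lambda hit: hit[1])
    return top

print(coordinate(SHARDS, size=3))
```

The reduction loop is where coordinator CPU and memory go: the more shards and the larger each partial result (for example, wide aggregations), the more work each batch costs.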
Base Search does not face clients directly: Zalando describes it as "wrapped by a lightweight Search API component — another presentation layer."
Dual-retrieval posture
Base Search combines:
- Classic lexical matching — traditional inverted-index boolean queries + BM25 relevance.
- Vector search — semantic retrieval via dense embeddings, for cases where lexical matching returns sparse results or for hand-off to the newer neural-matching system the NER query builder can promote queries into.
The dual-retrieval shape is the reason the NER query builder also queries Base Search for product counts — it needs to know if lexical retrieval is producing a sparse result set so it can decide whether to expand via neural matching or narrow via implicit filter promotion.
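A minimal sketch of that count-then-decide step, under loud assumptions: the source does not give the field names, the thresholds, or the exact decision rule, so `title`, `sparse_below`, `broad_above`, and the three return labels are hypothetical. The request body uses standard Elasticsearch search options (`size: 0` and `track_total_hits`) to ask for a count without fetching documents.

```python
def count_request_body(text, field="title"):
    """Search body that asks only for the hit count of a lexical match."""
    return {
        "size": 0,                 # return no documents, only the total
        "track_total_hits": True,  # exact count rather than a capped estimate
        "query": {"match": {field: text}},  # 'title' is a placeholder field
    }

def next_step(lexical_count, sparse_below=20, broad_above=5000):
    """Pick a follow-up based on how many products lexical retrieval found.

    Thresholds are illustrative assumptions, not values from the source.
    """
    if lexical_count < sparse_below:
        return "expand_via_neural_matching"
    if lexical_count > broad_above:
        return "narrow_via_filter_promotion"
    return "keep_lexical"

print(count_request_body("red dress"))
print(next_step(3), next_step(100), next_step(10000))
```

Each count lookup is itself a search request against Base Search, so this decision step adds load to the same cluster it is steering queries into.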
Failure modes observed on 2025-12-16
The 2025-12-16 incident saturated one Base Search cluster (the one serving the two largest markets). Observed:
- Coordinator CPU pinned: the scatter/gather and partial-reduction work for pathological facet queries consumed the coordinator-node tier.
- `search` thread-pool queue overflow: "many tasks were being rejected before they could be completed or even accepted, because the queue just maxed out."
- Tasks index overflow: "the cluster was too overloaded to respond to the request. With the cluster being in distress, all queries became slow and the tasks index was overflowing with long-running queries." Even attempts to inspect slow-query samples via the tasks API failed because task management itself was saturated.
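The queue-overflow behaviour can be reproduced with a toy model of a fixed-size worker pool fronted by a bounded queue. This is a sketch with made-up numbers, not a model of the real cluster: `simulate` and all of its parameters are hypothetical. It shows how the same arrival rate goes from zero rejections to mostly rejections once individual tasks become slow.

```python
def simulate(arrivals_per_tick, workers, queue_size, task_ticks, ticks):
    """Return (completed, rejected) for a fixed pool with a bounded queue."""
    running = []  # remaining ticks for each task currently on a worker
    queued = 0
    completed = rejected = 0
    for _ in range(ticks):
        # Advance running tasks by one tick and retire the finished ones.
        running = [t - 1 for t in running]
        completed += sum(1 for t in running if t == 0)
        running = [t for t in running if t > 0]
        # Move waiting tasks onto any workers that just became free.
        while queued and len(running) < workers:
            queued -= 1
            running.append(task_ticks)
        # Admit new arrivals: free worker first, then queue, else reject.
        for _ in range(arrivals_per_tick):
            if len(running) < workers:
                running.append(task_ticks)
            elif queued < queue_size:
                queued += 1
            else:
                rejected += 1
    return completed, rejected

print(simulate(2, workers=4, queue_size=4, task_ticks=1, ticks=10))   # fast tasks
print(simulate(2, workers=4, queue_size=4, task_ticks=10, ticks=10))  # slow tasks
```

With fast tasks the pool keeps up and nothing is rejected. With slow tasks the workers fill, then the queue fills, and every later arrival is rejected without ever being accepted, which matches the shape of the quoted symptom.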
Follow-up operational changes:
- Cluster-level `search.max_buckets` guardrail via new runbook: patterns/cluster-wide-aggregation-guardrail.
- Per-client slow-query attribution via `X-Opaque-Id`: concepts/x-opaque-id-client-attribution.
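As a rough illustration of what these two changes look like on the wire, the sketch below builds the request pieces without sending anything. `search.max_buckets` is a dynamic Elasticsearch cluster setting and `X-Opaque-Id` is a standard request header that Elasticsearch echoes into the tasks API and slow logs. The helper names, the limit value, and the `catalog-api` client id are assumptions for the example, not values from the source or its runbook.

```python
import json

def max_buckets_settings(limit):
    """Body for PUT /_cluster/settings applying a cluster-wide bucket cap."""
    return {"persistent": {"search.max_buckets": limit}}

def search_headers(client_id):
    """Headers for a search request tagged with the calling client."""
    return {
        "Content-Type": "application/json",
        "X-Opaque-Id": client_id,  # surfaces in the tasks API and slow logs
    }

# Illustrative values only; the real limit would come from the runbook.
print("PUT /_cluster/settings")
print(json.dumps(max_buckets_settings(10000), indent=2))
print(search_headers("catalog-api"))
```

A cluster-level cap bounds the reduction work any single aggregation can demand, while the per-client header makes it possible to tell which caller issued a slow query once it shows up in logs.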
Seen in
- sources/2025-12-16-zalando-the-day-our-own-queries-dosed-us-inside-zalando-search — canonical wiki instance.