SYSTEM Cited by 1 source

Zalando Base Search

Identity

Base Search is the bottom-layer Elasticsearch cluster in Zalando's catalog search substrate. It "provides initial candidates using both classic lexical matching and vector search" (sources/2025-12-16-zalando-the-day-our-own-queries-dosed-us-inside-zalando-search) and is the authoritative search index over the product catalog.

Topologically, Base Search is an Elasticsearch cluster with a dedicated coordinator-node tier on separate machines from the data nodes, giving the cluster two distinct failure / saturation surfaces:

  • Coordinator nodes — own the orchestration work: scatter requests to relevant shards, gather partial results, run Adaptive Replica Selection to pick the best shard copy, and perform batch / final reduction. They also provide "another caching layer for search results and aggregations."
  • Data nodes — own the shard-storage + per-shard query work: Lucene segment I/O, terms aggregation execution, vector distance calculation.

Role in the catalog-search request path

  1. The NER query builder or the Catalog API emits an Elasticsearch search request.
  2. A Base Search coordinator accepts the request, performs cache lookups, and scatters it to the relevant shards.
  3. Data nodes execute the per-shard work on the search thread pool.
  4. Partial results return to the coordinator, which reduces them in batches and returns the final response.
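
The scatter / gather / batched-reduce flow in steps 2 to 4 can be sketched as follows. This is a toy model, not Zalando's or Elasticsearch's implementation: the `Hit` type, the shard callables, and the `batch_size` parameter are hypothetical stand-ins (Elasticsearch exposes a comparable knob as `batched_reduce_size` on the search request).

```python
from heapq import nlargest
from typing import Callable

Hit = tuple[float, str]                      # (score, product_id), hypothetical shape
ShardSearch = Callable[[str, int], list[Hit]]  # stand-in for per-shard work on a data node


def reduce_batch(partials: list[list[Hit]], k: int) -> list[Hit]:
    """Merge several partial top-k lists into one top-k list."""
    merged = [hit for partial in partials for hit in partial]
    return nlargest(k, merged)


def coordinate(shards: list[ShardSearch], query: str, k: int = 3, batch_size: int = 2) -> list[Hit]:
    """Scatter the query to every shard and reduce partial results in batches."""
    running: list[Hit] = []       # reduced result so far
    pending: list[list[Hit]] = []  # partial results waiting for the next reduction
    for search_shard in shards:
        pending.append(search_shard(query, k))
        if len(pending) == batch_size:
            running = reduce_batch([running, *pending], k)
            pending = []
    if pending:  # final reduction over whatever is left
        running = reduce_batch([running, *pending], k)
    return running
```

Reducing in batches bounds how many partial results the coordinator holds at once, but the merge work itself still runs on the coordinator, which is why expensive reductions show up as coordinator CPU rather than data-node CPU.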

Base Search is not exposed to callers directly: Zalando describes it as "wrapped by a lightweight Search API component — another presentation layer."

Dual-retrieval posture

Base Search combines:

  • Classic lexical matching — traditional inverted-index boolean queries + BM25 relevance.
  • Vector search — semantic retrieval via dense embeddings, for cases where lexical matching returns sparse results or for hand-off to the newer neural-matching system the NER query builder can promote queries into.
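
As an illustration of what a combined request could look like, the sketch below builds a search body with both a lexical `match` clause and a top-level `knn` clause, as supported by the Elasticsearch 8.x search API. The source does not say which Elasticsearch version Base Search runs or how its queries are shaped, so the field names `title` and `title_embedding`, the defaults, and the function itself are assumptions for illustration only.

```python
from typing import Any


def build_hybrid_search(
    text: str,
    query_vector: list[float],
    k: int = 50,
    num_candidates: int = 500,
) -> dict[str, Any]:
    """Build a request body that asks for lexical and vector candidates together."""
    return {
        # Lexical side: inverted-index match, scored with BM25 by default.
        "query": {"match": {"title": {"query": text}}},
        # Vector side: approximate kNN over a dense_vector field (hypothetical name).
        "knn": {
            "field": "title_embedding",
            "query_vector": query_vector,
            "k": k,
            "num_candidates": num_candidates,
        },
        "size": k,
    }
```

The body is a plain dictionary, so it can be passed to whichever client the caller already uses.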

The dual-retrieval shape is the reason the NER query builder also queries Base Search for product counts — it needs to know if lexical retrieval is producing a sparse result set so it can decide whether to expand via neural matching or narrow via implicit filter promotion.
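
The count-driven decision can be pictured with a small sketch. The thresholds, the action names, and the rule that narrowing applies to very broad result sets are all assumptions; the source only states that the product count informs the choice between expanding and narrowing.

```python
def choose_retrieval_action(
    lexical_count: int,
    sparse_below: int = 20,     # hypothetical threshold for "sparse"
    broad_above: int = 5000,    # hypothetical threshold for "too broad"
) -> str:
    """Pick a follow-up action from the lexical product count."""
    if lexical_count < sparse_below:
        # Too few lexical hits: widen recall through neural matching.
        return "expand_via_neural_matching"
    if lexical_count > broad_above:
        # Assumed rule: very broad results get narrowed by promoting implicit filters.
        return "narrow_via_filter_promotion"
    return "keep_lexical"
```

Note that each such decision costs an extra count request against Base Search, so the query builder is itself a source of load on the cluster.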

Failure modes observed in the 2025-12-16 incident

The 2025-12-16 incident saturated one Base Search cluster (the one serving the two largest markets). Observed:

  • Coordinator CPU pinned: the scatter/gather and partial-reduction work for pathological facet queries consumed the coordinator-node tier.
  • Search thread-pool queue overflow: "many tasks were being rejected before they could be completed or even accepted, because the queue just maxed out."
  • Tasks index overflow: "the cluster was too overloaded to respond to the request. With the cluster being in distress, all queries became slow and the tasks index was overflowing with long-running queries." Even attempting to inspect slow-query samples via the tasks API failed because task management itself was saturated.
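
The queue-overflow behaviour can be reproduced with a toy model of a fixed-size thread pool with a bounded queue, which is the general shape of Elasticsearch's search thread pool. The class name, sizes, and methods below are hypothetical and chosen only to make the rejection path visible; they are not taken from the incident.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SearchPool:
    workers: int          # number of search threads
    queue_size: int       # maximum number of waiting tasks
    active: int = 0
    queue: deque = field(default_factory=deque)
    rejected: int = 0

    def submit(self, task: str) -> bool:
        """Run, queue, or reject a task; returns False when rejected."""
        if self.active < self.workers:
            self.active += 1
            return True
        if len(self.queue) < self.queue_size:
            self.queue.append(task)
            return True
        self.rejected += 1  # queue is full: the task is never accepted
        return False

    def complete_one(self) -> None:
        """Finish one running task; a queued task takes over the freed thread."""
        if self.active == 0:
            return
        if self.queue:
            self.queue.popleft()
        else:
            self.active -= 1
```

When tasks run long, `complete_one` is called rarely, the queue stays full, and every new submission is rejected regardless of how cheap it would have been. That matches the reported symptom of tasks being rejected "before they could be completed or even accepted."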

Follow-up operational changes:

Seen in
