
SYSTEM Cited by 7 sources

Elasticsearch

Elasticsearch is the Apache Lucene-based distributed search engine that powers a large share of production full-text and filtered search at scale. It exposes a JSON-based Query DSL; the query type relevant to most structured search products is the bool query, which takes nested must / should / must_not / filter clauses that map naturally onto AND / OR / NOT logic.

Within the wiki this page is a stub created for cross-referencing from sources/2025-05-13-github-github-issues-search-now-supports-nested-queries-and-boolean; Elasticsearch is a large product with many capabilities not covered here (relevance tuning, aggregations, k-NN vector search, ILM / snapshotting).

Amazon OpenSearch Service is AWS's managed offering of OpenSearch, the AWS-led fork of Elasticsearch; it uses the same bool-query shape.

Role in the wiki

Backing store for GitHub Issues search (2025-05-13)

GitHub Issues search is backed by Elasticsearch. The 2025 rewrite's Query pipeline stage compiles an AST from user search input into a nested Elasticsearch bool query:

AST node                              Elasticsearch bool clause
AND                                   must
OR                                    should
NOT                                   must_not (the bool query has no should_not clause)
leaf filter term (author:monalisa)    term / terms / prefix

The mapping is recursive, which makes a tree of bool queries the natural target for an AST-driven search DSL; patterns/ast-based-query-generation is the structural fit. A worked before-and-after is in the source page. Same-field OR-of-values subtrees get compacted into a single terms clause as an intra-leaf optimization.
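The recursive mapping in the table above can be sketched as a small compiler. This is a minimal illustration, not GitHub's implementation: the node classes (Leaf, And, Or, Not), the function name compile_node, and the field names author, label, and state are all hypothetical, and every leaf is emitted as a plain term query for simplicity.

```python
from dataclasses import dataclass
from typing import Union


# Hypothetical AST node types; the real parser's classes are not described
# in the source.
@dataclass
class Leaf:
    field: str
    value: str


@dataclass
class And:
    children: list


@dataclass
class Or:
    children: list


@dataclass
class Not:
    child: "Node"


Node = Union[Leaf, And, Or, Not]


def compile_node(node):
    """Recursively translate an AST node into an Elasticsearch query dict."""
    if isinstance(node, Leaf):
        return {"term": {node.field: node.value}}
    if isinstance(node, And):
        return {"bool": {"must": [compile_node(c) for c in node.children]}}
    if isinstance(node, Not):
        return {"bool": {"must_not": [compile_node(node.child)]}}
    if isinstance(node, Or):
        leaves = [c for c in node.children if isinstance(c, Leaf)]
        same_field = len({leaf.field for leaf in leaves}) == 1
        # Compaction: an OR made only of leaves on one field becomes a single
        # terms clause instead of a bool/should with one term per value.
        if len(leaves) == len(node.children) and same_field:
            return {"terms": {leaves[0].field: [leaf.value for leaf in leaves]}}
        return {
            "bool": {
                "should": [compile_node(c) for c in node.children],
                "minimum_should_match": 1,
            }
        }
    raise TypeError(f"unknown AST node: {node!r}")


# author:monalisa AND (label:bug OR label:docs) AND NOT state:closed
ast = And([
    Leaf("author", "monalisa"),
    Or([Leaf("label", "bug"), Leaf("label", "docs")]),
    Not(Leaf("state", "closed")),
])
query = {"query": compile_node(ast)}
```

The explicit minimum_should_match on OR nodes is a defensive choice in this sketch: it keeps OR semantics intact even if the emitted bool later gains a must or filter sibling.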

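Scale: GitHub Issues search runs at ~2,000 QPS on this substrate, reported as ≈160 M queries/day; a sustained 2,000 QPS would be ≈173 M/day, so the two figures are rounded independently. (Source: sources/2025-05-13-github-github-issues-search-now-supports-nested-queries-and-boolean)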

Bool query shape (the relevant API surface)

The canonical nested shape Elasticsearch exposes:

{
  "query": {
    "bool": {
      "must":     [ ... ],   // AND: all must match
      "should":   [ ... ],   // OR-like: boosts score; at least one must match when there is no must or filter
      "must_not": [ ... ],   // NOT: none may match
      "filter":   [ ... ]    // AND with no scoring contribution
    }
  }
}

bool clauses nest inside each other, which is why any boolean-algebra AST can be emitted mechanically as a tree of bool objects with leaf clauses at the bottom.

The filter vs must distinction matters for relevance scoring: clauses under filter run in filter context, so they skip score computation and can be cached, which is the right choice for structured-equality predicates (state:open, author_id:X). Full-text queries such as match typically go in must so that they participate in scoring.
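As a rough illustration of that split, the sketch below puts a full-text clause in must and the structured predicates in filter. The function name and the title field are assumptions made for the example; state and author_id come from the predicates mentioned above.

```python
def issue_search_query(text, state, author_id):
    """Build a bool query: scored full-text match plus unscored exact filters."""
    return {
        "query": {
            "bool": {
                # Scored: how well the title matches affects ranking.
                "must": [{"match": {"title": text}}],
                # Unscored: a document either satisfies these or it does not.
                "filter": [
                    {"term": {"state": state}},
                    {"term": {"author_id": author_id}},
                ],
            }
        }
    }


body = issue_search_query("nested queries", "open", 42)
```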

See the upstream reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html

Cross Cluster Replication (CCR) — replication between clusters

Elasticsearch supports two distinct replication mechanisms:

  • Intra-cluster primary/replica shard replication — primary and replica shards of the same index live in one cluster; ES rebalances shards across the cluster's nodes as a health action.
  • Cross Cluster Replication (CCR) — one-way leader→follower replication between otherwise-independent ES clusters, at the Lucene segment granularity. Covered fully at concepts/cross-cluster-replication.

CCR's structural win is that it lets you align the storage topology to the application's primary/replica topology (see concepts/primary-replica-topology-alignment). The canonical wiki instance is the 2026-03-03 GHES search rewrite — GHES collapsed a multi-node ES cluster spanning its HA pair into per-node single-node clusters and joined them with CCR, removing ES's freedom to rebalance primary shards onto the read-only replica host (the old failure mode that caused mutual-dependency deadlocks). See patterns/single-node-cluster-per-app-replica.

CCR's auto-follow policy is new-only — it matches indexes created after the policy is installed and doesn't retroactively attach pre-existing indexes. Applying CCR to a long-lived deployment therefore requires an imperative bootstrap step for pre-existing indexes followed by the declarative auto-follow policy for future ones.
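The two-step shape described above can be sketched as a list of REST requests: one follow request per pre-existing index, then a single auto-follow pattern for future ones. The endpoints and body keys follow the Elasticsearch CCR follow and auto-follow APIs; the cluster alias, policy name, index names, and helper functions are hypothetical, and the sketch only builds request descriptions without sending them.

```python
def follow_request(follower_index, remote_cluster, leader_index):
    """Imperative step: attach one existing leader index to a new follower."""
    return (
        "PUT",
        f"/{follower_index}/_ccr/follow",
        {"remote_cluster": remote_cluster, "leader_index": leader_index},
    )


def auto_follow_request(policy_name, remote_cluster, patterns):
    """Declarative step: follow indexes created after the policy is installed."""
    return (
        "PUT",
        f"/_ccr/auto_follow/{policy_name}",
        {
            "remote_cluster": remote_cluster,
            "leader_index_patterns": list(patterns),
            # Keep the follower's index name identical to the leader's.
            "follow_index_pattern": "{{leader_index}}",
        },
    )


def bootstrap_plan(existing_indexes, remote_cluster, policy_name, patterns):
    """Bootstrap pre-existing indexes first, then install the auto-follow policy."""
    plan = [follow_request(name, remote_cluster, name) for name in existing_indexes]
    plan.append(auto_follow_request(policy_name, remote_cluster, patterns))
    return plan


# Hypothetical names, for illustration only.
plan = bootstrap_plan(["issues-v1", "code-v3"], "primary", "follow-all", ["*"])
```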

CCR only covers document replication. Failover, index deletion, and upgrades are the consumer's responsibility — "Elasticsearch only handles the document replication, and we're responsible for the rest of the index's lifecycle" (GitHub, 2026-03-03).

Shard-allocation awareness + drain livelock (2024-06-20)

The 2024-06-20 Zalando source canonicalises a tricky interaction between three Elasticsearch primitives:

  • Shard-allocation awareness — Elasticsearch is configured with an awareness.attributes value (AZ, rack, host) and refuses to place two copies of a shard on nodes sharing the attribute value.
  • cluster.routing.allocation.exclude._ip — the runtime-mutable exclusion list used to tell Elasticsearch "don't place shards on these node IPs"; used by operators as the drain primitive.
  • Zone-spread invariant — the combination of the above refuses to relocate a node's shards when the node is the only one in its awareness group (e.g. only pod in an AZ).
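A simplified model of the three primitives above may help. The sketch builds the cluster-settings request that sets the exclusion list, and checks the zone-spread invariant as stated here (a node that is alone in its awareness group cannot be drained). It is an illustration under assumptions: the helper names, node names, and zone names are hypothetical, the use of persistent rather than transient settings is a choice made for the example, and real allocation decisions also depend on which shard copies each node holds.

```python
from collections import Counter


def exclude_ips_request(ips):
    """Build the settings update that asks Elasticsearch to move shards off these IPs."""
    return (
        "PUT",
        "/_cluster/settings",
        {"persistent": {"cluster.routing.allocation.exclude._ip": ",".join(ips)}},
    )


def clear_exclusions_request():
    """Build the settings update that resets the exclusion list (null clears a setting)."""
    return (
        "PUT",
        "/_cluster/settings",
        {"persistent": {"cluster.routing.allocation.exclude._ip": None}},
    )


def drain_would_stall(node_zones, draining):
    """True if `draining` is the only node in its awareness zone.

    node_zones maps node name -> zone attribute value.
    """
    per_zone = Counter(node_zones.values())
    return per_zone[node_zones[draining]] == 1


# Hypothetical three-node layout across two zones.
node_zones = {"es-0": "zone-a", "es-1": "zone-b", "es-2": "zone-b"}
```

A drain loop that checks drain_would_stall before excluding a node can fail fast instead of retrying against an allocation decision that will not change.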

The drain-stuck-on-last-pod-in-zone failure mode (concepts/zone-aware-shard-allocation-stuck-drain) is inherent (Elasticsearch is correct to refuse the violation), but it makes the resulting livelock a correctness-critical concern for the operator's drain pattern. Combined with the Zalando-disclosed zombie-exclusion-list partial-failure bug in es-operator, this produced three consecutive morning scale-out failures at Zalando Lounge and the canonical wiki posture of "read the source code of the orchestrating operator when the abstract model doesn't explain the symptom."

Stub caveats

  • This page covers only what the ingested sources touch. Not covered: relevance/_score tuning, aggregations, ILM, k-NN vector search, Elasticsearch SQL, snapshot lifecycle, or operational runbooks (shard sizing, circuit breakers, mapping explosion).
  • The open-source / licensing split between Elasticsearch (Elastic) and OpenSearch (AWS fork) is not modelled here; the bool-query DSL is common to both.

Seen in

Seen in (migration off Elasticsearch)

Seen in (legacy / archetypal side)

  • sources/2025-10-12-mongodb-cars24-improves-search-for-300-million-users-with-atlas — post names "bolt-on search engine (such as Elasticsearch)" as the canonical example of the legacy search shape Cars24 left to consolidate on Atlas + Atlas Search. Cars24 had multiple engineering teams piping data into a single search index with race-logic + real-time-dashboard-update inefficiencies. The class is archetypal, not Cars24-specific; the wiki treats this as one instance of the synchronization-tax shape.

Seen in (operator drain failures)

  • Zalando Lounge (2024-06-20): canonical wiki instance of shard-allocation awareness as a drain-livelock producer. Zalando Lounge runs a 3-AZ Elasticsearch cluster on Kubernetes via es-operator; zone-aware shard allocation refused to relocate shards from the last pod in one AZ, stalling the operator's 999-retry drain loop overnight. Two es-operator bugs were uncovered by tracing through the source code: (1) ctx cancellation ignored in one retry loop (PR #405), (2) zombie exclusion-list state when a drain is interrupted between mark and cleanup (WIP PR #423). Closing lesson: "Read the code. For solving difficult problems, understanding the related processes in abstract terms might not be enough."

  • Netflix multimodal video search: canonical wiki instance of nested-document indexing for cross-modality query. Netflix's multimodal video-search index stores each temporal bucket as a root Elasticsearch document (associated_ids, time_bucket_start_ns, time_bucket_end_ns) with source_annotations typed nested carrying heterogeneous per-modality child docs (CHARACTER_SEARCH with label; SCENE_SEARCH with label + embedding_vector). Document _id is the composite (asset_id, time_bucket), making model re-runs idempotent via composite-key upsert. The nested shape preserves cross-annotation-within-same-bucket semantics ("find buckets where a character with label Joey co-occurs with a scene annotation with label kitchen") that a flat document model can't express. Netflix's framing: "this hierarchical data model is precisely what empowers users to execute highly efficient, cross-annotation queries at scale." See patterns/nested-elasticsearch-for-multimodal-query and patterns/three-stage-ingest-fusion-index.
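The nested shape described in this entry can be sketched as a mapping plus a co-occurrence query. This is an illustration under assumptions, not Netflix's schema: the annotation_type discriminator field, the field types, the vector dimension, the ":" separator in the document id, and the helper names are all hypothetical; only the root field names, source_annotations, label, and embedding_vector come from the entry.

```python
# Assumed mapping: one root document per time bucket, annotations as nested docs.
MAPPING = {
    "mappings": {
        "properties": {
            "associated_ids": {"type": "keyword"},
            "time_bucket_start_ns": {"type": "long"},
            "time_bucket_end_ns": {"type": "long"},
            "source_annotations": {
                "type": "nested",
                "properties": {
                    "annotation_type": {"type": "keyword"},  # hypothetical field
                    "label": {"type": "keyword"},
                    "embedding_vector": {"type": "dense_vector", "dims": 4},
                },
            },
        }
    }
}


def bucket_doc_id(asset_id, time_bucket):
    """Composite _id: re-running a model overwrites the same bucket document."""
    return f"{asset_id}:{time_bucket}"


def annotation_clause(annotation_type, label):
    """Match buckets holding one nested annotation with this type AND this label."""
    return {
        "nested": {
            "path": "source_annotations",
            "query": {
                "bool": {
                    "filter": [
                        {"term": {"source_annotations.annotation_type": annotation_type}},
                        {"term": {"source_annotations.label": label}},
                    ]
                }
            },
        }
    }


def co_occurrence_query(character, scene):
    """Buckets where a character annotation and a scene annotation co-occur."""
    return {
        "query": {
            "bool": {
                "must": [
                    annotation_clause("CHARACTER_SEARCH", character),
                    annotation_clause("SCENE_SEARCH", scene),
                ]
            }
        }
    }


body = co_occurrence_query("Joey", "kitchen")
```

Each nested clause is evaluated against one child document at a time, so the type and label must match on the same annotation; a flat object mapping would let a CHARACTER_SEARCH type from one annotation pair with a label from another.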

Seen in (self-inflicted DoS via high-cardinality faceting)

Seen in (catalog discovery on a metadata graph)

  • sources/2026-05-04-netflix-democratizing-machine-learning-building-the-model-lifecycle-graph — canonical wiki instance of single unified entities index with entityType discriminator for an ML metadata catalog. Netflix MDS indexes models, features, pipelines, datasets, and A/B tests in one entities index (differentiated by an entityType field) plus a separate owners index. Free-text search over a model name becomes one query against one index; faceted filters on entityType, ownership, tags, and domain-specific attributes (stored as key-value tag pairs like team::personalization, env::production, model.state::released) compose post-search. Relevance boosting ensures exact name matches score significantly higher than fuzzy/related-metadata matches — the canonical search-quality lever for catalog UX. ES is the discovery surface in MDS's dual-store pattern; Datomic is the navigation surface for multi-hop graph traversal queries. Re-indexing is triggered both synchronously on the ingest path (after the Datomic write) and asynchronously on enrichment (background jobs that derive new edges re-index the affected entities so the relationship-metadata related field stays current).
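A catalog query of the kind described in this entry might look like the sketch below: one query against the entities index, an exact-name clause boosted above a fuzzy match, and facet selections applied as unscored filters. Only the entityType field and the key::value tag format come from the entry; the name, name.keyword, and tags fields, the boost value, and the function name are assumptions, and using filter context here (rather than a post_filter) is a simplification.

```python
def catalog_query(text, entity_type=None, tags=()):
    """Build a search body for a unified entities index."""
    filters = []
    if entity_type is not None:
        filters.append({"term": {"entityType": entity_type}})
    # Tags are assumed to be stored as "key::value" keyword strings.
    filters.extend({"term": {"tags": tag}} for tag in tags)
    return {
        "query": {
            "bool": {
                "should": [
                    # Exact name match, boosted so it outranks fuzzy matches.
                    {"term": {"name.keyword": {"value": text, "boost": 10.0}}},
                    # Tolerant full-text match on the analyzed name field.
                    {"match": {"name": {"query": text, "fuzziness": "AUTO"}}},
                ],
                "minimum_should_match": 1,
                "filter": filters,
            }
        }
    }


body = catalog_query("ranker-v2", entity_type="model", tags=["env::production"])
```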