Skip to content

CONCEPT Cited by 1 source

Adaptive Replica Selection (Elasticsearch)

Definition

Adaptive Replica Selection (ARS) is Elasticsearch's coordinator-node shard-copy selection algorithm: when a query's shard has multiple copies (primary + replicas), the coordinator picks "the 'best' shard copy based on past response times and the node's search thread-pool queue size" (sources/2025-12-16-zalando-the-day-our-own-queries-dosed-us-inside-zalando-search). The effect is load-aware routing inside the cluster: slower or more-queued shard copies are avoided in favour of faster ones.

Structurally, ARS is Elasticsearch's in-cluster analogue of the power-of-two-choices family — pick amongst candidates using a freshness + queue-depth signal, not round-robin.

What it protects against

  • Uneven per-node load — a node briefly busy (GC pause, disk contention, big merge) is avoided for the duration of the spike; subsequent queries route to replicas.
  • Head-of-line blocking inside a shard copy's queue — if one replica's search queue is deep, the coordinator prefers a shallower one.
  • Slow-node propagation — without ARS, every query that needs data on a slow node waits on the slow node. ARS lets traffic shift to replicas automatically.

What it does NOT protect against

ARS is a routing primitive, not an admission-control primitive. Once the coordinator has accepted a query for execution, ARS cannot:

  • Reject pathological queries. If every shard copy has the same query shape (all facet queries on SKU), ARS picks the least-bad copy, but all copies remain on fire. The cluster load stays pinned.
  • Defend against high-cardinality aggregation overload. The cost is per-query, not per-node; routing to a different replica just moves the problem, not the aggregate.
  • Solve cluster-wide saturation. ARS rebalances the distribution of work when some nodes are healthier than others. When all nodes are saturated (Zalando's 2025-12-16 incident), there is no "best" copy — everything is hot.

The Zalando incident explicitly notes ARS as a load-balancing primitive, but the mitigation path ran through app-side query limiting + cluster splits, not through ARS tuning, because the problem class is admission, not routing.

Interaction with coordinator-node caching

The coordinator layer also caches aggregation + search results. ARS routes the cache-miss paths — cache hits are served from the coordinator without picking a shard copy at all. So ARS's load-balancing leverage shrinks as cache hit rate rises, and disappears entirely on pathological queries that defeat the cache (as in the 2025-12-16 incident — novel filter+SKU combinations that missed every cache layer).

Seen in

  • sources/2025-12-16-zalando-the-day-our-own-queries-dosed-us-inside-zalando-search — Zalando's theory section on why the incident's facet queries were specifically pathological names ARS as the mechanism by which the coordinator tries to route around slow shard copies. ARS did not rescue the cluster because all shard copies were saturated by the pathological terms aggregations simultaneously — a structural limit: routing doesn't save you from admission mistakes.
Last updated · 507 distilled / 1,218 read