
PATTERN

Fewer, larger shards for latency-sensitive workloads

Intent

In a coordinator-fan-out search system (OpenSearch / Elasticsearch / any scatter-gather index) tuned for latency-sensitive document search with effective pre-filters, shrink the shard count and grow the per-shard size instead of following the vendor's throughput-workload shard-sizing guideline. Once filters cut per-shard work to near zero, the coordinator's fan-out/collect cost becomes the dominant term, and reducing N (the fan-out width) attacks p50, p99, and max QPS together.

Context

OpenSearch / Elasticsearch exposes three handles that interact:

  • Shard count per index — fan-out width per query.
  • Shard size — work each worker does per query.
  • Node CPU/RAM/disk mix — throughput ceiling per worker.

AWS's published sizing guidance (shards <50 GB, ~1 shard per 1.5 CPUs) is optimised for log-style throughput workloads with wide scans and no strong filters. For those workloads, fan-out is a feature: more shards parallelise the scan.
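Read literally, that guideline turns into arithmetic like the following sketch. The corpus size and vCPU count here are hypothetical illustration numbers, not Figma's, and the two thresholds are the rule-of-thumb values quoted above:

```python
import math

def throughput_guideline_shards(index_gb, total_vcpus,
                                max_shard_gb=50, vcpus_per_shard=1.5):
    """Shard counts implied by the throughput-oriented rule of thumb:
    keep each shard under ~50 GB, and budget ~1.5 vCPUs per active shard.
    Returns (minimum shards by size cap, maximum active shards by CPU)."""
    min_by_size = math.ceil(index_gb / max_shard_gb)
    max_by_cpu = math.floor(total_vcpus / vcpus_per_shard)
    return min_by_size, max_by_cpu

# A hypothetical 10 TB corpus on 384 total vCPUs:
print(throughput_guideline_shards(index_gb=10_240, total_vcpus=384))  # (205, 256)
```

Note that the size cap alone forces hundreds of shards on a large corpus; the pattern below argues that for filter-dominated latency workloads this floor is the wrong constraint to optimise for.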

For latency-sensitive document search where filters eliminate the majority of docs per shard (so per-shard work is already small), the coordinator cost of collecting and sorting results from N shards grows faster than any per-shard speedup, and the tail latency of the slowest of the N shards drags the whole query.
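A toy simulation makes both effects concrete. It assumes exponential per-shard service times and a collect cost linear in N; all constants are invented for illustration, not measurements from any real cluster:

```python
import random
import statistics

def scatter_gather_latency(n_shards, per_shard_ms=1.0,
                           collect_ms_per_shard=0.05, trials=5000):
    """Toy model of one coordinator-fan-out query: total latency is the
    slowest of N shard responses (max of N exponential draws) plus a
    collect/merge cost that grows linearly with N."""
    samples = []
    for _ in range(trials):
        slowest = max(random.expovariate(1.0 / per_shard_ms)
                      for _ in range(n_shards))
        samples.append(slowest + collect_ms_per_shard * n_shards)
    qs = statistics.quantiles(samples, n=100)
    return qs[49], qs[98]  # (p50, p99) in ms

random.seed(0)
for n in (450, 180):
    p50, p99 = scatter_gather_latency(n)
    print(f"{n:3d} shards: p50 ≈ {p50:.1f} ms, p99 ≈ {p99:.1f} ms")
```

Even in this crude model, cutting N lowers the median as well as the tail: the max-of-N term grows roughly like ln N and the collect term grows linearly, and both sit on every query, not just the slow ones.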

Mechanism

  1. Confirm pre-filters are effective (OpenSearch query profiler shows only a few hundred docs visited per shard per query).
  2. Measure end-to-end coordinator-view latency, not per-shard latency (concepts/metric-granularity-mismatch).
  3. Shrink shard count; grow shard size. Rerun load test with a custom harness.
  4. Watch for the counterintuitive signal: p50 should decrease too, not just the tail. If only the tail moves, you haven't removed enough fan-out.
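Step 2 can be sketched as a minimal harness that measures latency from the caller's side of the coordinator. `run_query` is a placeholder closure you supply (e.g. one issuing the real search request); nothing here is an OpenSearch API:

```python
import statistics
import time

def coordinator_view_latency(run_query, n=500):
    """Measure end-to-end latency as the caller sees it, rather than
    averaging per-shard timings, which hides the fan-out/collect cost
    and the slowest-of-N tail."""
    samples_ms = []
    for _ in range(n):
        t0 = time.perf_counter()
        run_query()  # hypothetical closure wrapping the real request
        samples_ms.append((time.perf_counter() - t0) * 1000.0)
    qs = statistics.quantiles(samples_ms, n=100)
    return {"p50": qs[49], "p99": qs[98], "max": max(samples_ms)}
```

Run the same closure before and after the reshard; per step 4, the signal to look for is a drop in p50, not only in p99.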

Canonical instance (Figma, 2026)

Figma's main search index went from 450 shards → 180 shards (−60%) for the same corpus. Results on their coordinator-view metrics:

  • ≥50% boost in max query rate before excess latency set in.
  • p50 latency decreased, not just p99 — direct evidence that coordinator collection cost had been dominating even the median.
  • Because the filters were so effective, per-shard work stayed small even at the larger shard size, so workers weren't overloaded: "our filters were very effective, we actually saw better performance with fewer, larger shards."

The change was paired with an index-size reduction (50%, then 90%) so the live set fit in OS disk cache; that cache-hit-rate stability is what unlocked the larger-shard math (see concepts/cache-locality).
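The cache-residency precondition reduces to back-of-envelope arithmetic like the following sketch. All the figures are hypothetical; the overhead constant is a placeholder, not a tuned value:

```python
def fits_in_page_cache(index_gb, node_ram_gb, jvm_heap_gb, node_count,
                       os_overhead_gb=2):
    """Rough check: RAM left over after the JVM heap and OS overhead,
    summed across data nodes, should cover the on-disk index size so
    the live set stays resident in the OS page cache."""
    cache_gb_per_node = node_ram_gb - jvm_heap_gb - os_overhead_gb
    return index_gb <= cache_gb_per_node * node_count

# Hypothetical: 600 GB index on 6 nodes with 64 GB RAM and 31 GB heaps
# leaves (64 - 31 - 2) * 6 = 186 GB of cache — not enough.
print(fits_in_page_cache(600, 64, 31, 6))  # False
```

If this check fails, shrink the index (as in the instance above) or add RAM before shrinking shard count, since an eviction-churning page cache erases the larger-shard win.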

Why it's counterintuitive

  • Parallelism intuition says more shards = faster. True for scan workloads; false once per-shard work is filter-dominated and coordinator fan-out is the bottleneck.
  • Vendor defaults and docs point the other way — AWS's sizing guidance was explicitly calibrated on throughput-intensive workloads and did not fit Figma's latency-sensitive shape. Measuring beat the recommendation.
  • Bigger shards sound like bigger blast radius. They are — but in a latency-sensitive deployment with RAM sized for disk-cache residency, the operational risk is managed, and the latency win is immediate.

Preconditions / when it applies

  • Latency-sensitive workload (not log search / analytics scan).
  • Pre-filters cut per-shard doc count to hundreds not millions.
  • Sufficient RAM that the reduced-shard-count index still fits in OS disk cache on fewer nodes.
  • Observability is correct — you're measuring coordinator-view latency (concepts/metric-granularity-mismatch) not per-shard.

Anti-patterns / when not to apply

  • Log or analytics workloads that actually scan most docs.
  • Clusters where per-shard work is CPU-bound on each worker — bigger shards will just move the bottleneck.
  • Single-index-per-tenant deployments where shard count is driven by isolation / quota concerns, not performance.
