
PATTERN

Fewer, larger shards for latency-sensitive workloads

Intent

In a coordinator-fan-out search system (OpenSearch / Elasticsearch / any scatter-gather index) tuned for latency-sensitive document search with effective pre-filters, shrink the shard count and grow the per-shard size instead of following the vendor's throughput-workload shard-sizing guideline. Once filters cut per-shard work to near zero, the coordinator's fan-out/collect cost becomes the dominant term, and reducing N (the fan-out width) attacks p50, p99, and max QPS together.

Context

OpenSearch / Elasticsearch exposes three handles that interact:

  • Shard count per index — fan-out width per query.
  • Shard size — work each worker does per query.
  • Node CPU/RAM/disk mix — throughput ceiling per worker.

AWS's published sizing guidance (shards <50 GB, ~1 shard per 1.5 CPUs) is optimised for log-style throughput workloads with wide scans and no strong filters. For those workloads, fan-out is a feature: more shards parallelise the scan.
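Read literally, that guideline turns into arithmetic like the following sketch. The corpus size and vCPU count here are hypothetical illustration numbers, not Figma's, and the two thresholds are the rule-of-thumb values quoted above:

```python
import math

def throughput_guideline_shards(index_gb, total_vcpus,
                                max_shard_gb=50, vcpus_per_shard=1.5):
    """Shard counts implied by the throughput-oriented rule of thumb:
    keep each shard under ~50 GB, and budget ~1.5 vCPUs per active shard.
    Returns (minimum shards by size cap, maximum active shards by CPU)."""
    min_by_size = math.ceil(index_gb / max_shard_gb)
    max_by_cpu = math.floor(total_vcpus / vcpus_per_shard)
    return min_by_size, max_by_cpu

# A hypothetical 10 TB corpus on 384 total vCPUs:
print(throughput_guideline_shards(index_gb=10_240, total_vcpus=384))  # (205, 256)
```

Note that the size cap alone forces hundreds of shards on a large corpus; the pattern below argues that for filter-dominated latency workloads this floor is the wrong constraint to optimise for.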

For latency-sensitive document search where filters eliminate the majority of docs per shard (so per-shard work is already small), the coordinator cost of collecting and sorting results from N shards grows faster than any per-shard speedup, and the tail latency of the slowest of the N shards drags the whole query.
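A toy simulation makes both effects concrete. It assumes exponential per-shard service times and a collect cost linear in N; all constants are invented for illustration, not measurements from any real cluster:

```python
import random
import statistics

def scatter_gather_latency(n_shards, per_shard_ms=1.0,
                           collect_ms_per_shard=0.05, trials=5000):
    """Toy model of one coordinator-fan-out query: total latency is the
    slowest of N shard responses (max of N exponential draws) plus a
    collect/merge cost that grows linearly with N."""
    samples = []
    for _ in range(trials):
        slowest = max(random.expovariate(1.0 / per_shard_ms)
                      for _ in range(n_shards))
        samples.append(slowest + collect_ms_per_shard * n_shards)
    qs = statistics.quantiles(samples, n=100)
    return qs[49], qs[98]  # (p50, p99) in ms

random.seed(0)
for n in (450, 180):
    p50, p99 = scatter_gather_latency(n)
    print(f"{n:3d} shards: p50 ≈ {p50:.1f} ms, p99 ≈ {p99:.1f} ms")
```

Even in this crude model, cutting N lowers the median as well as the tail: the max-of-N term grows roughly like ln N and the collect term grows linearly, and both sit on every query, not just the slow ones.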

Mechanism

  1. Confirm pre-filters are effective (OpenSearch query profiler shows only a few hundred docs visited per shard per query).
  2. Measure end-to-end coordinator-view latency, not per-shard latency (concepts/metric-granularity-mismatch).
  3. Shrink shard count; grow shard size. Rerun load test with a custom harness.
  4. Watch for the counterintuitive signal: p50 should decrease too, not just the tail. If only the tail moves, you haven't removed enough fan-out.
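Step 2 can be sketched as a minimal harness that measures latency from the caller's side of the coordinator. `run_query` is a placeholder closure you supply (e.g. one issuing the real search request); nothing here is an OpenSearch API:

```python
import statistics
import time

def coordinator_view_latency(run_query, n=500):
    """Measure end-to-end latency as the caller sees it, rather than
    averaging per-shard timings, which hides the fan-out/collect cost
    and the slowest-of-N tail."""
    samples_ms = []
    for _ in range(n):
        t0 = time.perf_counter()
        run_query()  # hypothetical closure wrapping the real request
        samples_ms.append((time.perf_counter() - t0) * 1000.0)
    qs = statistics.quantiles(samples_ms, n=100)
    return {"p50": qs[49], "p99": qs[98], "max": max(samples_ms)}
```

Run the same closure before and after the reshard; per step 4, the signal to look for is a drop in p50, not only in p99.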

Canonical instance (Figma, 2026)

Figma's main search index went from 450 shards → 180 shards (−60%) for the same corpus. Results on their coordinator-view metrics:

  • ≥50% boost in max query rate before excess latency set in.
  • p50 latency decreased, not just p99 — direct evidence that coordinator collection cost had been dominating even the median.
  • Because the filters were so effective, per-shard work stayed small even at the larger shard size, so workers weren't overloaded: "our filters were very effective, we actually saw better performance with fewer, larger shards."

The change was paired with an index-size reduction (50%, then 90%) so the live set fit in OS disk cache; that cache-hit-rate stability is what unlocked the larger-shard math (see concepts/cache-locality).
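The cache-residency precondition reduces to back-of-envelope arithmetic like the following sketch. All the figures are hypothetical; the overhead constant is a placeholder, not a tuned value:

```python
def fits_in_page_cache(index_gb, node_ram_gb, jvm_heap_gb, node_count,
                       os_overhead_gb=2):
    """Rough check: RAM left over after the JVM heap and OS overhead,
    summed across data nodes, should cover the on-disk index size so
    the live set stays resident in the OS page cache."""
    cache_gb_per_node = node_ram_gb - jvm_heap_gb - os_overhead_gb
    return index_gb <= cache_gb_per_node * node_count

# Hypothetical: 600 GB index on 6 nodes with 64 GB RAM and 31 GB heaps
# leaves (64 - 31 - 2) * 6 = 186 GB of cache — not enough.
print(fits_in_page_cache(600, 64, 31, 6))  # False
```

If this check fails, shrink the index (as in the instance above) or add RAM before shrinking shard count, since an eviction-churning page cache erases the larger-shard win.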

Why it's counterintuitive

  • Parallelism intuition says more shards = faster. True for scan workloads; false once per-shard work is filter-dominated and coordinator fan-out is the bottleneck.
  • Vendor defaults and docs point the other way — AWS's sizing guidance was explicitly calibrated on throughput-intensive workloads and did not fit Figma's latency-sensitive shape. Measuring beat the recommendation.
  • Bigger shards sound like bigger blast radius. They are — but in a latency-sensitive deployment with RAM sized for disk-cache residency, the operational risk is managed, and the latency win is immediate.

Preconditions / when it applies

  • Latency-sensitive workload (not log search / analytics scan).
  • Pre-filters cut per-shard doc count to hundreds not millions.
  • Sufficient RAM that the reduced-shard-count index still fits in OS disk cache on fewer nodes.
  • Observability is correct — you're measuring coordinator-view latency (concepts/metric-granularity-mismatch) not per-shard.

Anti-patterns / when not to apply

  • Log or analytics workloads that actually scan most docs.
  • Clusters where per-shard work is CPU-bound on each worker — bigger shards will just move the bottleneck.
  • Single-index-per-tenant deployments where shard count is driven by isolation / quota concerns, not performance.
