PATTERN
Fewer, larger shards for latency-sensitive workloads
Intent
In a coordinator-fan-out search system (OpenSearch / Elasticsearch / any scatter-gather index) tuned for latency-sensitive document search with effective pre-filters, shrink the shard count and grow the per-shard size instead of following the vendor's throughput-workload shard-sizing guideline. Once filters cut per-shard cost to near zero, the coordinator's fan-out/collect cost becomes the dominant term, and reducing N (the fan-out width) directly attacks p50, p99, and max QPS together.
Context
OpenSearch / Elasticsearch exposes three handles that interact:
- Shard count per index — fan-out width per query.
- Shard size — work each worker does per query.
- Node CPU/RAM/disk mix — throughput ceiling per worker.
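Of the three handles, shard count is the least flexible: it is fixed at index creation time, so changing it means reindexing into a new index (or using the shrink API). A sweep over shard counts is therefore usually run by creating candidate indices. A minimal sketch — the endpoint and index name (`docs-v2`) are hypothetical:

```shell
# Create a candidate index with a reduced shard count.
# number_of_shards is fixed at creation; number_of_replicas can be changed later.
curl -X PUT "https://search-cluster:9200/docs-v2" \
  -H 'Content-Type: application/json' \
  -d '{
    "settings": {
      "index": {
        "number_of_shards": 180,
        "number_of_replicas": 1
      }
    }
  }'
```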
AWS's published sizing guidance (shards <50 GB, ~1 shard per 1.5 CPUs) is optimised for log-style throughput workloads with wide scans and no strong filters. For those workloads, fan-out is a feature: more shards parallelise the scan.
For latency-sensitive document search where filters eliminate the majority of docs per shard (so per-shard work is already small), the coordinator cost of collecting and sorting results from N shards grows faster than any per-shard speedup, and the tail latency of the slowest of the N shards drags down every query.
Mechanism
- Confirm pre-filters are effective (OpenSearch query profiler shows only a few hundred docs visited per shard per query).
- Measure end-to-end coordinator-view latency, not per-shard latency (concepts/metric-granularity-mismatch).
- Shrink shard count; grow shard size. Rerun load test with a custom harness.
- Watch for the counterintuitive signal: p50 should decrease too, not just the tail. If only the tail moves, you haven't removed enough fan-out.
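The load-test step can be as small as a loop that issues representative queries and reports percentiles from the coordinator's own timing. A minimal harness sketch — the query runner is stubbed here with simulated latencies; in practice it would POST each query to the cluster and read the response's coordinator-view `took` field (in ms):

```python
import random

def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

def run_sweep(run_query, queries, iterations=1000):
    """Collect coordinator-view latencies and summarize p50 / p99 / max."""
    latencies = [run_query(random.choice(queries)) for _ in range(iterations)]
    return {
        "p50": percentile(latencies, 50),
        "p99": percentile(latencies, 99),
        "max": max(latencies),
    }

# Stub standing in for an HTTP call that returns the response's 'took' field.
random.seed(7)
def fake_runner(query):
    return random.lognormvariate(2.0, 0.5)

report = run_sweep(fake_runner, ["q1", "q2"], iterations=2000)
```

Rerun the same harness at each candidate shard count and compare the whole report, not just one percentile.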
Canonical instance (Figma, 2026)
Figma's main search index went from 450 shards → 180 shards (−60%) for the same corpus. Results on their coordinator-view metrics:
- ≥50% boost in max query rate before excess latency set in.
- p50 latency decreased, not just p99 — direct evidence that coordinator collection cost had been dominating even the median.
- Filter-effectiveness meant per-shard workers weren't under-utilized at the larger per-shard size: "our filters were very effective, we actually saw better performance with fewer, larger shards."
This was paired with an index-size reduction (50%, then 90%) so the live set fit in the OS disk cache — cache-hit-rate stability is what unlocked the larger-shard math (see concepts/cache-locality).
Why it's counterintuitive
- Parallelism intuition says more shards = faster. True for scan workloads; false once per-shard work is filter-dominated and coordinator fan-out is the bottleneck.
- Vendor defaults and docs point the other way — AWS's sizing guidance was explicitly calibrated on throughput-intensive workloads and did not fit Figma's latency-sensitive shape. Measuring beat the recommendation.
- Bigger shards sound like bigger blast radius. They are — but in a latency-sensitive deployment with RAM sized for disk-cache residency, the operational risk is managed, and the latency win is immediate.
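The slowest-of-N effect can be demonstrated with a toy model — nothing OpenSearch-specific, just assumed lognormal per-shard latencies (unchanged by shard size, per the filter-dominated premise) plus a per-shard collection cost at the coordinator:

```python
import random

def simulate(n_shards, trials=2000, collect_cost_ms=0.05, seed=42):
    """Query latency = max of N per-shard latencies + coordinator collect cost ~ N."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        slowest = max(rng.lognormvariate(1.0, 0.4) for _ in range(n_shards))
        samples.append(slowest + collect_cost_ms * n_shards)
    samples.sort()
    return samples[len(samples) // 2], samples[int(0.99 * len(samples))]  # p50, p99

p50_wide, p99_wide = simulate(450)      # many small shards
p50_narrow, p99_narrow = simulate(180)  # fewer, larger shards
# Both the median and the tail improve when fan-out shrinks: the max over
# fewer samples is smaller, and the coordinator collects from fewer shards.
```

The parameters (0.05 ms collect cost per shard, lognormal shape) are illustrative assumptions, but the direction of the result holds whenever per-shard latency barely grows with shard size.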
Preconditions / when it applies
- Latency-sensitive workload (not log search / analytics scan).
- Pre-filters cut per-shard doc count to hundreds not millions.
- Sufficient RAM that the reduced-shard-count index still fits in OS disk cache on fewer nodes.
- Observability is correct — you're measuring coordinator-view latency (concepts/metric-granularity-mismatch) not per-shard.
Anti-patterns / when not to apply
- Log or analytics workloads that actually scan most docs.
- Clusters where per-shard work is CPU-bound on each worker — bigger shards will just move the bottleneck.
- Single-index-per-tenant deployments where shard count is driven by isolation / quota concerns, not performance.
Seen in
- sources/2026-04-21-figma-the-search-for-speed-in-figma — 450 → 180 shards cut both p50 and p99, ≥50% QPS headroom, combined with a fitter-in-disk-cache index (50% then 90% trim) and a cheaper node mix (1/3 CPU, 25% more RAM, ≈1/2 price).
Related
- patterns/custom-benchmarking-harness — prerequisite for measuring the shard-count sweep at all; vendor benchmarks surface client-side latency by default.
- concepts/metric-granularity-mismatch — the observability prerequisite for doing this work at all.
- systems/amazon-opensearch-service / systems/elasticsearch — the substrate the pattern targets.