CONCEPT Cited by 1 source
Document sharding¶
Definition¶
Document sharding is the horizontal-scale-out primitive used in Meta's SilverTorch retrieval substrate when "the neural network exceeds a single host's capacity" (Source: sources/2026-05-26-meta-silvertorch-index-as-model-a-new-retrieval-paradigm-for-recommendation-systems):
"We use document sharding: split the item inventory (videos, posts, photos) across hosts, like splitting a large library's catalog across branches."
In recsys-retrieval terminology, "documents" = candidate items (videos / posts / photos / ads); "sharding" = partition the item set across hosts so each host holds a subset. At request time, the user's query vector fans out to all shards in parallel, each shard returns its local top-K, and the per-shard top-K's are merged into a global top-K.
Why it's the third tier of scale-out¶
SilverTorch's scale-up-first strategy proceeds in tiers:
- Scale up on a single GPU by orchestrating its memory hierarchy (SRAM → HBM → host DRAM → remote DRAM).
- Scale out within a host across high-bandwidth GPU-to-GPU interconnects.
- Document-shard across hosts — this concept.
- Sparse-table sharding via TorchRec for embedding tables, which follows independent placement rules.
Document sharding sits at the boundary where single-host capacity is exceeded — when the dense retrieval model + item-index tensors no longer fit in one machine's memory hierarchy.
Why it works¶
The retrieval workload is inherently embarrassingly-parallel across the item axis. Each shard can independently:
- Run ANN search over its local item subset.
- Apply eligibility filtering on its local items.
- Compute composite scores on its local survivors.
The merge stage is cheap — N shards each return small top-K sets, the global top-K is a single sort. This is why document sharding scales near-linearly with shard count for the retrieval workload: per-request work scales with total_items / num_shards, the merge overhead scales with num_shards × top_K_per_shard, and for typical operating points the per-shard work dominates.
What's not shared across shards¶
Each shard runs an independent SilverTorch model instance with its own slice of item embeddings. What is shared:
- The user-tower computation — the user query embedding is computed once and fanned out to all shards (the tower itself can be replicated on every shard).
- The model architecture and weights for the shared layers (user tower, neural reranker, scoring) — shipped to every shard via standard model deployment.
- Streaming weight updates — apply per-tensor, but routed to the shard that owns the affected item.
Relationship to other sharding primitives on the wiki¶
- concepts/dynamic-sharding focuses on online resharding (adding / removing shards while serving).
- concepts/multi-card-embedding-sharding is the within-host multi-GPU embedding-table instance — splitting one embedding table across cards on the same machine. Document sharding sits one tier above this (across hosts, not within one host).
- systems/torchrec is the library that orchestrates sparse-table sharding across the GPU memory hierarchy. Document sharding is the dense-model horizontal scale-out tier complementary to TorchRec's sparse-table sharding — they address different model dimensions.
Caveats¶
- The post does not disclose the document-sharding policy: hash-based / range-based / popularity-aware / freshness-aware. Each has different cost/quality tradeoffs (popularity-aware reduces tail-latency variance but complicates streaming updates; range-based simplifies updates but creates hot shards on viral content).
- Replication factor across shards is not disclosed — likely production deployments replicate each shard for availability + load.
- The merge stage's exact algorithm (parallel top-K merge, global re-rank, ...) is not detailed.
- The shard count at which SilverTorch operates in production is not disclosed.