CONCEPT Cited by 1 source

Document sharding¶

Definition¶

Document sharding is the horizontal-scale-out primitive used in Meta's SilverTorch retrieval substrate when "the neural network exceeds a single host's capacity" (Source: sources/2026-05-26-meta-silvertorch-index-as-model-a-new-retrieval-paradigm-for-recommendation-systems):

"We use document sharding: split the item inventory (videos, posts, photos) across hosts, like splitting a large library's catalog across branches."

In recsys-retrieval terminology, "documents" = candidate items (videos / posts / photos / ads); "sharding" = partition the item set across hosts so each host holds a subset. At request time, the user's query vector fans out to all shards in parallel, each shard returns its local top-K, and the per-shard top-K's are merged into a global top-K.

Why it's the third tier of scale-out¶

SilverTorch's scale-up-first strategy proceeds in tiers:

Scale up on a single GPU by orchestrating its memory hierarchy (SRAM → HBM → host DRAM → remote DRAM).
Scale out within a host across high-bandwidth GPU-to-GPU interconnects.
Document-shard across hosts — this concept.
Sparse-table sharding via TorchRec for embedding tables, which follows independent placement rules.

Document sharding sits at the boundary where single-host capacity is exceeded — when the dense retrieval model + item-index tensors no longer fit in one machine's memory hierarchy.

Why it works¶

The retrieval workload is inherently embarrassingly-parallel across the item axis. Each shard can independently:

Run ANN search over its local item subset.
Apply eligibility filtering on its local items.
Compute composite scores on its local survivors.

The merge stage is cheap — N shards each return small top-K sets, the global top-K is a single sort. This is why document sharding scales near-linearly with shard count for the retrieval workload: per-request work scales with total_items / num_shards, the merge overhead scales with num_shards × top_K_per_shard, and for typical operating points the per-shard work dominates.

What's not shared across shards¶

Each shard runs an independent SilverTorch model instance with its own slice of item embeddings. What is shared:

The user-tower computation — the user query embedding is computed once and fanned out to all shards (the tower itself can be replicated on every shard).
The model architecture and weights for the shared layers (user tower, neural reranker, scoring) — shipped to every shard via standard model deployment.
Streaming weight updates — apply per-tensor, but routed to the shard that owns the affected item.

Relationship to other sharding primitives on the wiki¶

concepts/dynamic-sharding focuses on online resharding (adding / removing shards while serving).
concepts/multi-card-embedding-sharding is the within-host multi-GPU embedding-table instance — splitting one embedding table across cards on the same machine. Document sharding sits one tier above this (across hosts, not within one host).
systems/torchrec is the library that orchestrates sparse-table sharding across the GPU memory hierarchy. Document sharding is the dense-model horizontal scale-out tier complementary to TorchRec's sparse-table sharding — they address different model dimensions.

Caveats¶

The post does not disclose the document-sharding policy: hash-based / range-based / popularity-aware / freshness-aware. Each has different cost/quality tradeoffs (popularity-aware reduces tail-latency variance but complicates streaming updates; range-based simplifies updates but creates hot shards on viral content).
Replication factor across shards is not disclosed — likely production deployments replicate each shard for availability + load.
The merge stage's exact algorithm (parallel top-K merge, global re-rank, ...) is not detailed.
The shard count at which SilverTorch operates in production is not disclosed.

Seen in¶

sources/2026-05-26-meta-silvertorch-index-as-model-a-new-retrieval-paradigm-for-recommendation-systems