Multi-card embedding sharding¶
Definition¶
Multi-card embedding sharding is the serving-side mechanism that splits a recommendation model's embedding tables into segments distributed across multiple GPUs when the combined embedding footprint exceeds the memory capacity of a single GPU. Each GPU holds a shard; lookups are routed to the owning card and results are aggregated using hardware-specific communication optimisations. The mechanism achieves performance parity with single-card setups while decoupling model complexity from individual GPU hardware constraints (Source: sources/2026-03-31-meta-adaptive-ranking-model-bending-the-inference-scaling-curve).
Why recsys hits the memory wall first¶
Unlike LLMs, recommendation models are dominated by sparse categorical features mapped to high-dimensional embedding tables. As model quality grows with table size (more unique IDs → fewer hash collisions — see concepts/hash-collision-embedding-tradeoff), embedding tables eventually cross the terabyte boundary, exceeding the 80-192 GB capacity of any single GPU.
Meta: "As LLM-scale model embeddings approached the terabyte level, they exceeded the memory capacity of any single GPU."
The LLM comparison point: Llama 3.1 405B at BF16 is ~810 GB of weights; Meta's Adaptive Ranking Model's embedding tables are of the same order. The difference is that LLM weights are fairly uniform dense matrix blocks, while embedding tables are accessed by sparse lookups, which calls for a different sharding geometry.
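A back-of-envelope calculation shows how tables cross that boundary; the ID count and dimension below are illustrative assumptions, not disclosed figures:

```python
# Back-of-envelope embedding table size. ID count and dimension are
# assumptions for illustration, not Meta's disclosed figures.
num_ids = 5_000_000_000        # hypothetical unique IDs after hashing
dim = 128                      # embedding dimension
bytes_per_param = 2            # FP16/BF16
table_bytes = num_ids * dim * bytes_per_param
table_tb = table_bytes / 1e12  # 1.28 TB, beyond any single GPU's 80-192 GB
```

Even at 1 byte per parameter via quantisation, 5B IDs at dim 128 would still be 640 GB, so sharding remains necessary at this scale.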
How sharding works mechanically¶
- Split the embedding table(s) into segments — typically by partitioning the ID space (hash-based, range-based, or feature-per-shard).
- Distribute segments across an optimised GPU cluster — each GPU owns a disjoint subset of the table.
- Route lookups to the owning shard — via the NIC / NVLink / intra-host interconnect depending on topology.
- Aggregate results — return the assembled embedding vector to the ranking pipeline.
The Adaptive Ranking Model post describes it as: "a multi-card sharding mechanism splits embedding tables into segments distributed across an optimized hardware cluster. By leveraging hardware-specific communication optimizations, the system maintains high throughput and efficient communication between shards."
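The four steps above can be sketched as a minimal hash-partitioned lookup service. The shard count, function names, and synchronous loop are all illustrative assumptions; Meta does not disclose its actual partitioning scheme or transport:

```python
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical card count; not disclosed by Meta

def owning_shard(feature_id: int) -> int:
    """Hash-based partitioning: each ID deterministically maps to one GPU."""
    return hash(feature_id) % NUM_SHARDS

def route_lookups(feature_ids):
    """Group a batch of IDs by owning shard so each GPU gets one request."""
    per_shard = defaultdict(list)
    for fid in feature_ids:
        per_shard[owning_shard(fid)].append(fid)
    return per_shard

def serve_batch(feature_ids, shard_tables):
    """Gather each ID's vector from its owning shard, reassemble in order."""
    per_shard = route_lookups(feature_ids)
    results = {}
    # In production this loop would be parallel RPCs or a NCCL all-to-all,
    # not a sequential host-side loop.
    for shard, ids in per_shard.items():
        for fid in ids:
            results[fid] = shard_tables[shard][fid]
    return [results[fid] for fid in feature_ids]
```

Range-based or feature-per-shard partitioning would change only `owning_shard`; the route/gather/reassemble skeleton stays the same.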
The parity-with-single-card claim¶
"This multi-card architecture achieves performance parity with single-card setups, effectively decoupling model complexity from individual GPU hardware constraints."
Achieving parity requires:
- Low-latency interconnect (NVLink / NVSwitch / similar) so cross-shard lookups don't bottleneck on PCIe.
- Communication-optimised lookup aggregation so the per-request network overhead is small relative to the embedding-gather time.
- Load balance across shards — achieved by hashing or careful feature placement, avoiding hotspotting on a single card.
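A rough latency budget illustrates the interconnect and aggregation requirements above. The bandwidth and timing numbers are public ballpark figures (NVLink 4 per-direction, PCIe 4.0 x16) and assumed batch shapes, not Meta's measurements:

```python
# Illustrative latency budget for one cross-shard lookup batch.
# All numbers are assumptions, not disclosed by Meta.
batch_lookups   = 10_000          # embedding IDs per request batch
dim             = 128             # embedding dimension
bytes_per_float = 2               # FP16
payload_bytes   = batch_lookups * dim * bytes_per_float   # ~2.56 MB

nvlink_bps = 450e9                # ~NVLink 4 per-direction, bytes/s
pcie_bps   = 32e9                 # ~PCIe 4.0 x16, bytes/s

transfer_nvlink_us = payload_bytes / nvlink_bps * 1e6     # ~5.7 us
transfer_pcie_us   = payload_bytes / pcie_bps * 1e6       # ~80 us

gather_time_us = 50               # assumed on-card gather/aggregate time
# Parity needs transfer << gather: comfortably true over NVLink,
# marginal or worse over PCIe.
```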
Distinction from tensor / pipeline parallelism¶
Multi-card embedding sharding is not:
- Tensor parallelism — sharding dense weight matrices along one dimension for forward-pass parallelism.
- Pipeline parallelism — splitting model layers across stages.
- Data parallelism — replicating full model, splitting batches.
It's parameter sharding specific to lookup tables: the access pattern is gather-scatter, not matmul, and the parallelism axis is the ID space, not the hidden dimension or the batch.
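The access-pattern contrast can be shown in a few lines of NumPy; this is a toy illustration of the two parallelism axes, not either system's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Embedding sharding: the parallelism axis is the ID space.
table = rng.standard_normal((1000, 64))   # 1000 IDs, dim 64
ids = np.array([3, 501, 999])
emb = table[ids]                          # gather: a few random rows per lookup
assert emb.shape == (3, 64)

# Tensor parallelism: the parallelism axis is the hidden dimension.
x = rng.standard_normal((3, 64))
w = rng.standard_normal((64, 256))
w_shards = np.split(w, 4, axis=1)         # split columns across 4 "cards"
y = np.concatenate([x @ ws for ws in w_shards], axis=1)
assert np.allclose(y, x @ w)              # dense matmul: every shard active
```

In the gather case only the shards owning the requested IDs do work; in the matmul case every shard participates in every forward pass, which is why the two need different sharding geometries.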
Related memory-optimisation levers¶
Multi-card sharding is the last lever Meta applies to terabyte-scale embeddings. Earlier in the chain:
- Allocate hash sizes by feature sparsity — sparse features get smaller tables.
- Prune unused embeddings — drop never-accessed slots.
- Unified embeddings — share one table across multiple features.
Only when these fail to fit the model into single-GPU memory does multi-card sharding engage.
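The three earlier levers can be sketched together; the sizing rule, pruning criterion, and hashing scheme below are all hypothetical, since the source does not detail them:

```python
def allocate_hash_size(unique_ids_seen: int, cap: int = 1 << 24) -> int:
    """Lever 1: size each feature's table to its observed cardinality.
    Sparse features (few unique IDs) get small tables, capped at `cap`."""
    size = 1
    while size < unique_ids_seen:
        size <<= 1
    return min(size, cap)

def prune_unused(table: dict, access_counts: dict) -> dict:
    """Lever 2: drop embedding slots that were never looked up."""
    return {slot: vec for slot, vec in table.items()
            if access_counts.get(slot, 0) > 0}

def unified_slot(feature_name: str, raw_id: int, table_size: int) -> int:
    """Lever 3: unified embeddings. Several features share one table,
    disambiguated by hashing (feature, id) into a common ID space."""
    return hash((feature_name, raw_id)) % table_size
```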
Seen in¶
- 2026-03-31 Meta — Meta Adaptive Ranking Model — canonical wiki source; introduces multi-card embedding sharding as the mechanism unlocking O(1T) parameter serving, achieving performance parity with single-card setups (sources/2026-03-31-meta-adaptive-ranking-model-bending-the-inference-scaling-curve).
Caveats¶
- Meta does not disclose card count, sharding granularity (feature-level / ID-range / hash-based), interconnect used (NVLink? InfiniBand? RoCE?), or cross-shard lookup latency.
- "Performance parity" is asserted qualitatively; no latency delta (single-card vs. multi-card) is quantified.
- Load-balance mechanism is not described.