CONCEPT

Multi-card embedding sharding

Definition

Multi-card embedding sharding is the serving-side mechanism that splits a recommendation model's embedding tables into segments distributed across multiple GPUs when the combined embedding footprint exceeds the memory capacity of a single GPU. Each GPU holds a shard; lookups are routed to the owning card and results are aggregated using hardware-specific communication optimisations. The mechanism achieves performance parity with single-card setups while decoupling model complexity from individual GPU hardware constraints (Source: sources/2026-03-31-meta-adaptive-ranking-model-bending-the-inference-scaling-curve).

Why recsys hits the memory wall first

Unlike LLMs, recommendation models are dominated by sparse categorical features mapped to high-dimensional embedding tables. As model quality grows with table size (more unique IDs → fewer hash collisions — see concepts/hash-collision-embedding-tradeoff), embedding tables eventually cross the terabyte boundary, exceeding the 80-192 GB capacity of any single GPU.

Meta: "As LLM-scale model embeddings approached the terabyte level, they exceeded the memory capacity of any single GPU."
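
A quick back-of-envelope (illustrative ID count and dimension, not Meta's disclosed figures) shows how a single table crosses the terabyte boundary:

```python
# Back-of-envelope embedding footprint; the ID count and dimension below
# are illustrative, not Meta's disclosed figures.
def table_bytes(num_ids: int, dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes for one embedding table (bytes_per_elem=2 for FP16/BF16)."""
    return num_ids * dim * bytes_per_elem

# A single table with 5 billion IDs at dimension 128 in BF16:
print(table_bytes(5_000_000_000, 128) / 1e12)  # → 1.28 (terabytes)
```

At that size even one table exceeds any single card's 80-192 GB, before counting the dozens of other feature tables a production ranker carries.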

The LLM comparison point: Llama 3.1 405B at BF16 is ~810 GB of weights; Meta Adaptive Ranking Model's embedding tables are in the same order. The difference is that LLM weights are fairly uniform dense matrix blocks, while embedding tables are accessed through sparse lookups — a different sharding geometry.

How sharding works mechanically

  1. Split the embedding table(s) into segments — typically by partitioning the ID space (hash-based, range-based, or feature-per-shard).
  2. Distribute segments across an optimised GPU cluster — each GPU owns a disjoint subset of the table.
  3. Route lookups to the owning shard — via the NIC / NVLink / intra-host interconnect depending on topology.
  4. Aggregate results — return the assembled embedding vector to the ranking pipeline.

The Adaptive Ranking Model post describes it as: "a multi-card sharding mechanism splits embedding tables into segments distributed across an optimized hardware cluster. By leveraging hardware-specific communication optimizations, the system maintains high throughput and efficient communication between shards."
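
The four steps can be sketched in miniature — a toy hash-partitioned lookup in which Python dicts stand in for per-card device memory and a local dict access stands in for the interconnect hop (shard count and sizes are hypothetical, not Meta's):

```python
import numpy as np

NUM_SHARDS = 4   # hypothetical card count; Meta does not disclose theirs
DIM = 8
rng = np.random.default_rng(0)

# Steps 1-2: each "GPU" owns a disjoint hash partition of the ID space
# (a dict stands in for per-card device memory).
shards = [dict() for _ in range(NUM_SHARDS)]

def owner(feature_id: int) -> int:
    return hash(feature_id) % NUM_SHARDS  # hash-based partitioning

def insert(feature_id: int, vec) -> None:
    shards[owner(feature_id)][feature_id] = vec

def lookup(ids):
    # Step 3: route each ID to its owning shard -- in a real system this
    # is an all-to-all over NVLink / the NIC; here it is a local access.
    # Step 4: aggregate per-ID vectors back into request order.
    return np.stack([shards[owner(i)][i] for i in ids])

for i in range(100):
    insert(i, rng.standard_normal(DIM))

batch = lookup([3, 42, 7])
print(batch.shape)  # → (3, 8)
```

The production version replaces the dict access with collective communication, but the routing-then-aggregation shape is the same.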

The parity-with-single-card claim

"This multi-card architecture achieves performance parity with single-card setups, effectively decoupling model complexity from individual GPU hardware constraints."

Achieving parity requires:

  • Low-latency interconnect (NVLink / NVSwitch / similar) so cross-shard lookups don't bottleneck on PCIe.
  • Communication-optimised lookup aggregation so the per-request network overhead is small relative to the embedding-gather time.
  • Load balance across shards — achieved by hashing or careful feature placement, avoiding hotspotting on a single card.
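
The load-balance bullet is the subtlest. A toy simulation (Pareto-skewed request stream with a made-up skew parameter, not Meta's traffic) shows why hashing alone is not enough: hashing spreads the ID space evenly, but a single hot ID still lands entirely on one card.

```python
import random
from collections import Counter

NUM_SHARDS = 8
random.seed(0)

# A Pareto-skewed request stream: a few hot IDs dominate, as is typical
# of recsys traffic (the skew parameter is invented for illustration).
ids = [int(random.paretovariate(1.2)) for _ in range(100_000)]

def shard_loads(placement):
    """Per-shard request counts under a given ID -> shard placement."""
    counts = Counter(placement(i) for i in ids)
    return [counts.get(s, 0) for s in range(NUM_SHARDS)]

hash_loads = shard_loads(lambda i: hash(i) % NUM_SHARDS)
mean = sum(hash_loads) / NUM_SHARDS
# The hottest shard absorbs the hottest IDs wholesale, so its load is a
# multiple of the mean -- the hotspotting that feature placement must fix.
print(max(hash_loads) / mean)
```

This is presumably why the parity requirement lists "careful feature placement" alongside hashing.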

Distinction from tensor / pipeline parallelism

Multi-card embedding sharding is not tensor parallelism (which splits dense weight matrices along the hidden dimension) or pipeline parallelism (which splits the model layer-wise across devices).

It's parameter sharding specific to lookup tables: the access pattern is gather-scatter, not matmul, and the parallelism axis is the ID space, not the hidden dimension or the batch.
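
A minimal NumPy contrast of the two access patterns (toy shapes, chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Embedding lookup: a gather over table *rows*; the shardable axis is
# the ID space (rows of the table).
table = rng.standard_normal((1000, 16))   # 1000 IDs, dim 16 (toy sizes)
ids = np.array([7, 7, 512, 3])
gathered = table[ids]                      # shape (4, 16)

# Tensor parallelism's target: a dense matmul; the shardable axis is
# the hidden dimension (columns of w).
x = rng.standard_normal((4, 16))
w = rng.standard_normal((16, 32))
dense = x @ w                              # shape (4, 32)

print(gathered.shape, dense.shape)  # → (4, 16) (4, 32)
```

Sharding the gather means partitioning table rows across cards; sharding the matmul means partitioning `w`'s columns — different geometries, hence different systems.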

Position in the memory-reduction chain

Multi-card sharding is the last lever Meta applies to terabyte-scale embeddings. Earlier in the chain:

  • Allocate hash sizes by feature sparsity — sparse features get smaller tables.
  • Prune unused embeddings — drop never-accessed slots.
  • Unified embeddings — share one table across multiple features.

Only when these fail to fit the model into single-GPU memory does multi-card sharding engage.
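
A sketch of that decision, with hypothetical capacities and table sizes (Meta discloses none of these numbers):

```python
import math

GPU_CAPACITY_GB = 141.0  # e.g. an H200-class card; purely illustrative

def fits_single_gpu(table_sizes_gb, dense_gb=20.0, capacity_gb=GPU_CAPACITY_GB):
    """True if the embeddings (after hash-size allocation, pruning, and
    unification) plus the dense part of the model fit on one card."""
    return sum(table_sizes_gb) + dense_gb <= capacity_gb

# Hypothetical post-optimisation footprint: 1.1 TB of unified tables.
tables_gb = [600.0, 350.0, 150.0]
print(fits_single_gpu(tables_gb))  # → False: multi-card sharding engages
# Lower bound on card count, ignoring replication and activation overhead:
print(math.ceil((sum(tables_gb) + 20.0) / GPU_CAPACITY_GB))  # → 8
```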

Caveats

  • Meta does not disclose card count, sharding granularity (feature-level / ID-range / hash-based), interconnect used (NVLink? InfiniBand? RoCE?), or cross-shard lookup latency.
  • "Performance parity" is asserted qualitatively; no latency delta (single-card vs. multi-card) is quantified.
  • Load-balance mechanism is not described.