

Multi-card sharded embedding serving

Pattern

When a ranking model's embedding tables exceed single-GPU memory:

  1. Split each (or all) embedding table(s) into disjoint segments across multiple GPUs in a hardware-aware cluster topology.
  2. Route lookups to the owning shard over a low-latency interconnect (NVLink / NVSwitch / equivalent).
  3. Aggregate the gathered embeddings using hardware-specific communication optimisations so cross-shard lookup overhead is small relative to the embedding-gather time.

Achieves performance parity with single-card setups while decoupling model parameter count from single-GPU memory ceilings. Enables O(1T) parameter serving at Meta scale (Source: sources/2026-03-31-meta-adaptive-ranking-model-bending-the-inference-scaling-curve).
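The three steps above can be sketched with a toy range-sharded table. Everything here is illustrative (the class name, NumPy arrays standing in for per-GPU memory, a local index standing in for the NVLink all-to-all), not Meta's implementation:

```python
import numpy as np

class RangeShardedTable:
    """Toy embedding table split into disjoint contiguous ID ranges, one per 'GPU'."""

    def __init__(self, num_ids, dim, num_shards, seed=0):
        rng = np.random.default_rng(seed)
        self.dim = dim
        self.rows_per_shard = -(-num_ids // num_shards)  # ceiling division
        # Step 1: each shard owns a disjoint segment of the table.
        self.shards = [
            rng.standard_normal(
                (min(self.rows_per_shard, num_ids - s * self.rows_per_shard), dim)
            )
            for s in range(num_shards)
        ]

    def lookup(self, ids):
        ids = np.asarray(ids)
        out = np.empty((len(ids), self.dim))
        # Step 2: route each ID to its owning shard (in production this is an
        # all-to-all over NVLink/NVSwitch; here it is a local index).
        shard_of = ids // self.rows_per_shard
        for s, table in enumerate(self.shards):
            mask = shard_of == s
            # Step 3: gather the owned rows and scatter them back into
            # request order, aggregating the results from all shards.
            out[mask] = table[ids[mask] - s * self.rows_per_shard]
        return out

table = RangeShardedTable(num_ids=1000, dim=8, num_shards=4)
vecs = table.lookup([3, 999, 250, 251])
print(vecs.shape)  # (4, 8)
```

The result is bit-identical to an unsharded lookup, which is what "performance parity" presupposes functionally; the engineering work in the source is about making it latency-equivalent too.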

Problem

Recsys models are dominated by sparse categorical features mapped to high-dimensional embedding tables. As tables grow (driven by feature cardinality and the hash-collision tradeoff), the aggregate embedding footprint eventually crosses the terabyte boundary — exceeding the 80-192 GB memory of any single GPU:

"As LLM-scale model embeddings approached the terabyte level, they exceeded the memory capacity of any single GPU."

Before this pattern engages, earlier memory-optimisation levers are applied: sparsity-aware hash-size allocation, unused-embedding pruning, and unified embeddings. Multi-card sharding is the last lever, applied when those alone cannot fit the model within a single GPU.

Solution

"A multi-card sharding mechanism splits embedding tables into segments distributed across an optimized hardware cluster. By leveraging hardware-specific communication optimizations, the system maintains high throughput and efficient communication between shards. This multi-card architecture achieves performance parity with single-card setups, effectively decoupling model complexity from individual GPU hardware constraints."

Sharding geometry

  • ID-range partitioning — each GPU owns a disjoint range of the ID space.
  • Hash-based partitioning — each GPU owns IDs whose hash falls into its range.
  • Feature-per-shard — each GPU owns entire tables for specific features.

The Adaptive Ranking Model post does not name which geometry Meta uses; in practice production systems mix these depending on access patterns.

Communication layer

  • Intra-host interconnect (NVLink / NVSwitch) for the hottest lookups inside a single host.
  • Optimised collectives (all-to-all or hierarchical gather) tuned to the embedding-lookup pattern, not the generic matmul collective.
  • Overlap of lookup with downstream ranking computation so cross-shard latency is hidden behind useful work.
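The overlap point can be sketched as software pipelining: issue the next batch's cross-shard lookup before ranking the current batch, so communication latency hides behind compute. A thread pool stands in for an async collective here; the sleeps and the doubling/summing functions are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fetch_embeddings(batch):
    time.sleep(0.01)  # stand-in for cross-shard all-to-all latency
    return [x * 2 for x in batch]

def rank(embs):
    time.sleep(0.01)  # stand-in for dense ranking compute
    return sum(embs)

batches = [[1, 2], [3, 4], [5, 6]]
scores = []
with ThreadPoolExecutor(max_workers=1) as pool:
    pending = pool.submit(fetch_embeddings, batches[0])
    for nxt in batches[1:] + [None]:
        embs = pending.result()
        if nxt is not None:
            # Kick off the next lookup before ranking the current batch,
            # so the interconnect and the GPU compute run concurrently.
            pending = pool.submit(fetch_embeddings, nxt)
        scores.append(rank(embs))
print(scores)  # [6, 14, 22]
```

With perfect overlap, steady-state latency per batch is max(lookup, compute) rather than their sum.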

Performance parity

Parity with single-card setups requires:

  • Low enough cross-shard latency that the lookup is not the bottleneck.
  • Sufficient load balance across shards that no single card hotspots.
  • Communication optimisations that exploit hardware-specific interconnect (NVLink topology awareness, switch-level routing).
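The load-balance condition can be made measurable. A hedged sketch of a skew metric (the metric choice and routing functions are illustrative, not from the source):

```python
from collections import Counter

def shard_skew(ids, num_shards, route):
    """Max-to-mean load ratio across shards; 1.0 means perfectly balanced,
    and the hottest card is `skew`x slower than the balanced ideal."""
    loads = Counter(route(i) % num_shards for i in ids)
    mean = len(ids) / num_shards
    return max(loads.values()) / mean

ids = range(1000)
print(shard_skew(ids, 4, lambda i: i % 4))    # 1.0 - round-robin, balanced
print(shard_skew(ids, 4, lambda i: i // 500)) # 2.0 - two shards take all load
```

Because the aggregation step waits on the slowest shard, end-to-end lookup latency scales with the skew ratio, which is why parity requires keeping it near 1.0.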

Forces

  • Embedding tables at LLM scale exceed single-GPU memory.
  • Lookup latency is on the critical path of the sub-second ranking budget — sharded lookups must not add significant overhead.
  • Load balance matters — skewed feature access patterns can hotspot a single card and destroy parity.
  • Heterogeneous hardware is a reality — the sharding layer has to work across vendor / generation mixes.

Consequences

Positive:

  • Unblocks O(1T) parameter ranking models.
  • Performance parity with single-card setups (per Meta).
  • Decouples model scale from single-GPU memory ceilings — accommodates future model growth without waiting for larger GPUs.

Negative / tradeoffs:

  • Interconnect dependency — sharding only works well on hosts with fast intra-node interconnect; commodity hardware is not suitable.
  • Operational complexity — multi-GPU serving is harder to debug, monitor, and roll out than single-GPU.
  • Load balance maintenance — access pattern drift can require rebalancing shards over time.
  • Failure domain widens — a single GPU failure affects the full model's embedding availability, not just a shard.

Complementary patterns (the full memory stack)

The Adaptive Ranking Model post describes the complete memory stack Meta applies to embedding-scale challenges:

  1. Sparsity-aware hash size allocation — right-size each feature's table.
  2. Unused-embedding pruning — drop never-accessed slots.
  3. Unified embeddings — multiple features share one table.
  4. Multi-card sharded embedding serving (this pattern) — when the shrunken aggregate still exceeds single-GPU memory.

The first three shrink the footprint; this pattern handles the residual that remains.
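The stack's ordering can be sketched as a decision pipeline. All numbers below (GPU capacity, per-lever savings ratios) are invented for illustration; the source gives no figures for the first three levers:

```python
GPU_MEMORY_GB = 80  # assumed single-GPU capacity

def plan(embedding_gb):
    # Hypothetical fraction of the footprint each lever keeps.
    levers = [
        ("sparsity-aware hash sizing", 0.6),
        ("unused-embedding pruning",   0.8),
        ("unified embeddings",         0.9),
    ]
    applied = []
    for name, keep in levers:
        if embedding_gb <= GPU_MEMORY_GB:
            break  # fits on one card; stop early
        embedding_gb *= keep
        applied.append(name)
    if embedding_gb > GPU_MEMORY_GB:
        # The residual still exceeds one GPU: engage this pattern.
        shards = -(-int(embedding_gb) // GPU_MEMORY_GB)  # ceiling division
        applied.append(f"multi-card sharding across {shards} GPUs")
    return embedding_gb, applied

residual, applied = plan(1024)  # 1 TB of raw embeddings
print(round(residual, 1), applied[-1])  # 442.4 multi-card sharding across 6 GPUs
```

The ordering matters: shrinking first reduces the shard count (and interconnect traffic) that the final lever must pay for.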

Canonical industrial instance

Meta's Adaptive Ranking Model (Source: sources/2026-03-31-meta-adaptive-ranking-model-bending-the-inference-scaling-curve).
