

Multi-card sharded embedding serving

Pattern

When a ranking model's embedding tables exceed single-GPU memory:

  1. Split each (or all) embedding table(s) into disjoint segments across multiple GPUs in a hardware-aware cluster topology.
  2. Route lookups to the owning shard over a low-latency interconnect (NVLink / NVSwitch / equivalent).
  3. Aggregate the gathered embeddings using hardware-specific communication optimisations so cross-shard lookup overhead is small relative to the embedding-gather time.

Achieves performance parity with single-card setups while decoupling model parameter count from single-GPU memory ceilings. Enables O(1T) parameter serving at Meta scale (Source: sources/2026-03-31-meta-adaptive-ranking-model-bending-the-inference-scaling-curve).
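The three steps above can be sketched with a toy range-sharded table. Everything here is illustrative (the class name, NumPy arrays standing in for per-GPU memory, a local index standing in for the NVLink all-to-all), not Meta's implementation:

```python
import numpy as np

class RangeShardedTable:
    """Toy embedding table split into disjoint contiguous ID ranges, one per 'GPU'."""

    def __init__(self, num_ids, dim, num_shards, seed=0):
        rng = np.random.default_rng(seed)
        self.dim = dim
        self.rows_per_shard = -(-num_ids // num_shards)  # ceiling division
        # Step 1: each shard owns a disjoint segment of the table.
        self.shards = [
            rng.standard_normal(
                (min(self.rows_per_shard, num_ids - s * self.rows_per_shard), dim)
            )
            for s in range(num_shards)
        ]

    def lookup(self, ids):
        ids = np.asarray(ids)
        out = np.empty((len(ids), self.dim))
        # Step 2: route each ID to its owning shard (in production this is an
        # all-to-all over NVLink/NVSwitch; here it is a local index).
        shard_of = ids // self.rows_per_shard
        for s, table in enumerate(self.shards):
            mask = shard_of == s
            # Step 3: gather the owned rows and scatter them back into
            # request order, aggregating the results from all shards.
            out[mask] = table[ids[mask] - s * self.rows_per_shard]
        return out

table = RangeShardedTable(num_ids=1000, dim=8, num_shards=4)
vecs = table.lookup([3, 999, 250, 251])
print(vecs.shape)  # (4, 8)
```

The result is bit-identical to an unsharded lookup, which is what "performance parity" presupposes functionally; the engineering work in the source is about making it latency-equivalent too.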

Problem

Recsys models are dominated by sparse categorical features mapped to high-dimensional embedding tables. As tables grow (driven by feature cardinality and the hash-collision tradeoff), the aggregate embedding footprint eventually crosses the terabyte boundary — exceeding the 80-192 GB memory of any single GPU:

"As LLM-scale model embeddings approached the terabyte level, they exceeded the memory capacity of any single GPU."

Before this pattern engages, earlier memory-optimisation levers are applied: sparsity-aware hash-size allocation, unused-embedding pruning, and unified embeddings. Multi-card sharding is the last lever, applied when those alone cannot fit the model within a single GPU.

Solution

"A multi-card sharding mechanism splits embedding tables into segments distributed across an optimized hardware cluster. By leveraging hardware-specific communication optimizations, the system maintains high throughput and efficient communication between shards. This multi-card architecture achieves performance parity with single-card setups, effectively decoupling model complexity from individual GPU hardware constraints."

Sharding geometry

  • ID-range partitioning — each GPU owns a disjoint range of the ID space.
  • Hash-based partitioning — each GPU owns IDs whose hash falls into its range.
  • Feature-per-shard — each GPU owns entire tables for specific features.

The Adaptive Ranking Model post does not name which geometry Meta uses; in practice production systems mix these depending on access patterns.

Communication layer

  • Intra-host interconnect (NVLink / NVSwitch) for the hottest lookups inside a single host.
  • Optimised collectives (all-to-all or hierarchical gather) tuned to the embedding-lookup pattern, not the generic matmul collective.
  • Overlap of lookup with downstream ranking computation so cross-shard latency is hidden behind useful work.
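The overlap point can be sketched as software pipelining: issue the next batch's cross-shard lookup before ranking the current batch, so communication latency hides behind compute. A thread pool stands in for an async collective here; the sleeps and the doubling/summing functions are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fetch_embeddings(batch):
    time.sleep(0.01)  # stand-in for cross-shard all-to-all latency
    return [x * 2 for x in batch]

def rank(embs):
    time.sleep(0.01)  # stand-in for dense ranking compute
    return sum(embs)

batches = [[1, 2], [3, 4], [5, 6]]
scores = []
with ThreadPoolExecutor(max_workers=1) as pool:
    pending = pool.submit(fetch_embeddings, batches[0])
    for nxt in batches[1:] + [None]:
        embs = pending.result()
        if nxt is not None:
            # Kick off the next lookup before ranking the current batch,
            # so the interconnect and the GPU compute run concurrently.
            pending = pool.submit(fetch_embeddings, nxt)
        scores.append(rank(embs))
print(scores)  # [6, 14, 22]
```

With perfect overlap, steady-state latency per batch is max(lookup, compute) rather than their sum.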

Performance parity

Parity with single-card setups requires:

  • Low enough cross-shard latency that the lookup is not the bottleneck.
  • Sufficient load balance across shards that no single card hotspots.
  • Communication optimisations that exploit hardware-specific interconnect (NVLink topology awareness, switch-level routing).
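The load-balance condition can be made measurable. A hedged sketch of a skew metric (the metric choice and routing functions are illustrative, not from the source):

```python
from collections import Counter

def shard_skew(ids, num_shards, route):
    """Max-to-mean load ratio across shards; 1.0 means perfectly balanced,
    and the hottest card is `skew`x slower than the balanced ideal."""
    loads = Counter(route(i) % num_shards for i in ids)
    mean = len(ids) / num_shards
    return max(loads.values()) / mean

ids = range(1000)
print(shard_skew(ids, 4, lambda i: i % 4))    # 1.0 - round-robin, balanced
print(shard_skew(ids, 4, lambda i: i // 500)) # 2.0 - two shards take all load
```

Because the aggregation step waits on the slowest shard, end-to-end lookup latency scales with the skew ratio, which is why parity requires keeping it near 1.0.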

Forces

  • Embedding tables at LLM scale exceed single-GPU memory.
  • Lookup latency is on the critical path of the sub-second ranking budget — sharded lookups must not add significant overhead.
  • Load balance matters — skewed feature access patterns can hotspot a single card and destroy parity.
  • Heterogeneous hardware is a reality — the sharding layer has to work across vendor / generation mixes.

Consequences

Positive:

  • Unblocks O(1T) parameter ranking models.
  • Performance parity with single-card setups (per Meta).
  • Decouples model scale from single-GPU memory ceilings — accommodates future model growth without waiting for larger GPUs.

Negative / tradeoffs:

  • Interconnect dependency — sharding only works well on hosts with fast intra-node interconnect; commodity hardware is not suitable.
  • Operational complexity — multi-GPU serving is harder to debug, monitor, and roll out than single-GPU.
  • Load balance maintenance — access pattern drift can require rebalancing shards over time.
  • Failure domain widens — a single GPU failure affects the full model's embedding availability, not just a shard.

Complementary patterns (the full memory stack)

The Adaptive Ranking Model post describes the complete memory stack Meta applies to embedding-scale challenges:

  1. Sparsity-aware hash size allocation — right-size each feature's table.
  2. Unused-embedding pruning — drop never-accessed slots.
  3. Unified embeddings — multiple features share one table.
  4. Multi-card sharded embedding serving (this pattern) — when the shrunken aggregate still exceeds single-GPU memory.

The first three shrink the footprint; this pattern handles the residual that remains.
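The stack's ordering can be sketched as a decision pipeline. All numbers below (GPU capacity, per-lever savings ratios) are invented for illustration; the source gives no figures for the first three levers:

```python
GPU_MEMORY_GB = 80  # assumed single-GPU capacity

def plan(embedding_gb):
    # Hypothetical fraction of the footprint each lever keeps.
    levers = [
        ("sparsity-aware hash sizing", 0.6),
        ("unused-embedding pruning",   0.8),
        ("unified embeddings",         0.9),
    ]
    applied = []
    for name, keep in levers:
        if embedding_gb <= GPU_MEMORY_GB:
            break  # fits on one card; stop early
        embedding_gb *= keep
        applied.append(name)
    if embedding_gb > GPU_MEMORY_GB:
        # The residual still exceeds one GPU: engage this pattern.
        shards = -(-int(embedding_gb) // GPU_MEMORY_GB)  # ceiling division
        applied.append(f"multi-card sharding across {shards} GPUs")
    return embedding_gb, applied

residual, applied = plan(1024)  # 1 TB of raw embeddings
print(round(residual, 1), applied[-1])  # 442.4 multi-card sharding across 6 GPUs
```

The ordering matters: shrinking first reduces the shard count (and interconnect traffic) that the final lever must pay for.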

Canonical industrial instance

Meta's Adaptive Ranking Model (Source: sources/2026-03-31-meta-adaptive-ranking-model-bending-the-inference-scaling-curve).
