CONCEPT Cited by 2 sources

GPU memory hierarchy¶

Definition¶

The GPU memory hierarchy is the multi-tier memory layout that any high-performance GPU workload must explicitly reason about. The four tiers, ordered from closest-to-compute to farthest:

On-chip SRAM — per-SM register files + shared memory + L1 / L2 caches; fastest but smallest (tens of MB total per device).
GPU-resident HBM — high-bandwidth memory soldered to the GPU package; the canonical "GPU memory" (e.g., 80 GB on H100, 192 GB on B200).
Host DRAM — system RAM on the same machine, reachable via PCIe / NVLink-C2C; tens of GB/s to ~hundreds of GB/s bandwidth, hundreds of GB to TB capacity.
Remote DRAM — DRAM on a different machine, reached over the cluster fabric; lowest bandwidth, highest capacity.

The named hierarchy comes verbatim from Meta's 2026-05-26 SilverTorch post:

"We make the most of the single high-performance GPU by carefully orchestrating its memory hierarchy (on-chip SRAM, GPU-resident HBM, host DRAM, remote DRAM) so data lives close to where it's computed."

(Source: sources/2026-05-26-meta-silvertorch-index-as-model-a-new-retrieval-paradigm-for-recommendation-systems)

Why it matters for ML serving¶

For dense models that fit comfortably in HBM, memory placement is uninteresting — everything lives on-GPU. For modern recsys + LLM models the constraint is real: total parameter / state size routinely exceeds single-GPU HBM, and explicit placement decisions determine end-to-end performance.

The wiki's canonical instances:

SilverTorch (2026-05-26) uses the hierarchy as the basis for its scale-up-first strategy: maximise SRAM + HBM utilisation, then scale-out within a host across NVLink, then shard documents across hosts. For TorchRec-managed sparse embedding tables, the hierarchy explicitly extends across HBM + GPU host DRAM + "even remote CPU-host DRAM, decoupling sparse data movement from computation."
Meta Adaptive Ranking Model (MARM, 2026-03-31) uses the same framing under the concepts/hardware-aware-model-architecture heading — selective FP8 quantisation, kernel fusion, and grouped-GEMM are all decisions optimised against the SRAM/HBM/host-DRAM bandwidth ratios.
concepts/multi-card-embedding-sharding is the recsys-serving instance of the hierarchy at the multi-GPU-within-host level — embedding tables exceeding single-GPU memory get split across cards on the same machine.

Bandwidth + latency asymmetry — why placement decisions matter¶

Each tier has roughly an order-of-magnitude bandwidth gap to the next. Crossing tiers costs cycles the GPU's compute cores spend stalled. The two structural responses:

Co-locate hot data with compute. Embeddings frequently looked up should live in HBM; rarely-touched parameters can sit in host DRAM with explicit prefetch. Architectures like SilverTorch's fused Int8 ANN explicitly pack item embeddings into Int8 to fit larger candidate pools in HBM.
Decouple data movement from compute. TorchRec "decoupl[es] sparse data movement from computation" by pipelining lookups across the hierarchy so the dense compute can run uninterrupted.

Hardware substrate¶

The hierarchy is physical, not abstract. Specific machines on the wiki:

H100 — 80 GB HBM3, 3 TB/s HBM bandwidth, NVLink to 7 peer H100s on a 700 W Grand Teton baseboard.
GB200 Grace Blackwell — Blackwell GPU + Grace CPU coherent via NVLink-C2C; the GPU↔host-DRAM tier becomes substantially less expensive on this generation.
RTX Pro 6000 Blackwell — 96 GB HBM, 1.7 TB/s.

Each generation shifts the relative costs of crossing tiers, which in turn shifts the optimal placement policy. Architectures like SilverTorch and TorchRec are written to express the placement policy explicitly rather than hardcode for a generation.

Caveats¶

The four-tier framing in this page is the SilverTorch / MARM / TorchRec view. Some literature partitions further (register file vs shared memory vs L1, on-package HBM vs off-package CXL memory, NVMe-as-memory tiers). The four-tier shape is the most useful framing for recsys-serving placement decisions today.
Future tiers will land — CXL-attached pooled memory, NVLink-extended remote DRAM, on-chip MRAM. The principle (place hot data closest to compute) is stable; the specific tier count and bandwidth gaps move with hardware generations.