PATTERN Cited by 1 source
Scale up first, then scale out (GPU)
When to apply
Use this pattern when:
- A workload runs on a GPU substrate with non-trivial memory placement decisions, as is typical of recsys-serving, LLM-serving, large-vector-search, and embedding-table-heavy ML workloads.
- The model and data do not fit easily on one GPU, but cross-host scale-out costs orders of magnitude more in network and coordination overhead than within-host scale-out.
- The placement choice between (a) bigger single-GPU memory, (b) multi-GPU within one host, (c) sharding across hosts has measurable cost / latency / quality tradeoffs.
The pattern
Saturate the closest-to-compute resources first; only move outward when those are exhausted. SilverTorch's explicit four-tier strategy (Source: sources/2026-05-26-meta-silvertorch-index-as-model-a-new-retrieval-paradigm-for-recommendation-systems):
- Scale up on a single GPU. "We make the most of the single high-performance GPU by carefully orchestrating its memory hierarchy (on-chip SRAM, GPU-resident HBM, host DRAM, remote DRAM) so data lives close to where it's computed." Place hot tensors in HBM, warm tensors in host DRAM, cold tensors in remote DRAM, all behind one GPU's compute.
- Scale out within a host. "Once we've maximized a single GPU, we scale out within a host, taking advantage of high-bandwidth interconnects between GPU cards on the same machine." NVLink / NVSwitch bandwidth typically exceeds cross-host network bandwidth by roughly an order of magnitude.
- Document-shard across hosts. "When the neural network exceeds a single host's capacity, we use document sharding: split the item inventory (videos, posts, photos) across hosts."
- Sparse-table sharding via TorchRec. "For the very large sparse networks inside the model — embedding tables that map every item and every user feature to a learned vector — we use TorchRec, PyTorch's library for sparse-table sharding. TorchRec spreads these tables across HBM, GPU host DRAM, and even remote CPU-host DRAM, decoupling sparse data movement from computation."
This is not a rule that says "never scale out across hosts"; it is a rule about ordering. Cross-host sharding is the right tool when single-host capacity is genuinely exceeded; reaching for it before exhausting single-GPU and within-host options pays cross-host overhead unnecessarily.
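The ordering rule above can be sketched as a first-fit walk over the steps, closest to compute first. This is a minimal illustration, not SilverTorch's placement logic: the function name `first_fitting_step`, the step labels, and the capacities (80 GB per GPU, 8 GPUs per host) are all assumptions chosen for the example.

```python
import math

# Hypothetical capacities in GB, ordered from closest-to-compute outward.
# 80 GB per GPU and 8 GPUs per host are illustrative assumptions only.
STEPS: list[tuple[str, float]] = [
    ("scale up on a single GPU", 80.0),
    ("scale out within a host", 8 * 80.0),
    ("document-shard across hosts", math.inf),
]


def first_fitting_step(footprint_gb: float, steps: list[tuple[str, float]]) -> str:
    """Return the first step whose capacity covers the footprint.

    Steps are tried in order, so a farther (more expensive) step is chosen
    only when every closer step is too small.
    """
    for name, capacity_gb in steps:
        if footprint_gb <= capacity_gb:
            return name
    raise ValueError("no step can hold the footprint")


if __name__ == "__main__":
    for footprint in (60.0, 300.0, 2000.0):
        print(footprint, "GB ->", first_fitting_step(footprint, STEPS))
```

Sparse-table sharding is left out of the walk on purpose: in the description above it applies to the embedding tables specifically, alongside whichever step the dense part of the model lands on.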
Why it works
Bandwidth falls and capacity grows at each step away from the compute cores, with the sharpest drop, often an order of magnitude or more, between HBM and host DRAM:
| Tier | Bandwidth (rough) | Capacity |
|---|---|---|
| On-chip SRAM | ~10 TB/s | tens of MB |
| HBM | ~3 TB/s (H100) | tens to hundreds of GB |
| Host DRAM (PCIe / NVLink-C2C) | ~tens to hundreds of GB/s | hundreds of GB to TB |
| Remote DRAM (cluster fabric) | ~tens of GB/s (RoCE, IB) | many TB |
Every access that crosses into a farther tier costs cycles during which the GPU's compute cores stall. Saturating the closest tier first extracts the most compute per byte moved before paying the next tier's overhead.
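To make the gap concrete, the sketch below estimates how long a bulk read takes from each tier. The bandwidth figures are single illustrative points picked from the rough ranges in the table (they are assumptions, not measurements), and the model ignores latency, contention, and overlap with compute.

```python
# Assumed sustained bandwidths in GB/s, one illustrative point per tier.
TIER_BANDWIDTH_GB_PER_S = {
    "on-chip SRAM": 10_000.0,
    "HBM": 3_000.0,
    "host DRAM": 64.0,
    "remote DRAM": 25.0,
}


def transfer_ms(num_bytes: float, bandwidth_gb_per_s: float) -> float:
    """Bandwidth-only transfer time in milliseconds (no latency term)."""
    return num_bytes / (bandwidth_gb_per_s * 1e9) * 1e3


def stall_table(num_bytes: float) -> dict[str, float]:
    """Estimated transfer time per tier for a read of num_bytes."""
    return {
        tier: transfer_ms(num_bytes, bandwidth)
        for tier, bandwidth in TIER_BANDWIDTH_GB_PER_S.items()
    }


if __name__ == "__main__":
    for tier, ms in stall_table(1e9).items():
        print(f"{tier:>13}: {ms:8.3f} ms per GB")
```

Under these assumed numbers, reading 1 GB takes about 0.33 ms from HBM, about 15.6 ms from host DRAM, and 40 ms from remote DRAM, which is why data that can stay in the closer tier should.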
Disclosed manifestation in SilverTorch
The four-tier strategy isn't theoretical — it manifests in specific architectural choices:
- Fused Int8 ANN is fundamentally about HBM residency: Int8 storage halves embedding memory relative to 16-bit storage, so roughly twice the candidate pool fits in HBM.
- Kernel fusion keeps intermediate results in registers / shared memory instead of round-tripping through HBM.
- TorchRec sparse-table sharding is the explicit solution for embedding tables that exceed HBM but don't need to be cross-host (the sparse / dense placement asymmetry).
- Document sharding is the explicit cross-host tier — invoked only when the dense model has exhausted within-host options.
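The HBM-residency argument behind Int8 storage is simple arithmetic, sketched below. The 16 GiB budget and the embedding dimension of 128 are hypothetical values, and the calculation ignores per-vector scale factors, index metadata, and alignment padding that a real Int8 index would also need.

```python
def max_candidates(hbm_budget_bytes: int, dim: int, bytes_per_element: int) -> int:
    """How many dim-sized embeddings fit in the given HBM budget."""
    return hbm_budget_bytes // (dim * bytes_per_element)


if __name__ == "__main__":
    budget = 16 * 2**30  # hypothetical 16 GiB slice of HBM reserved for embeddings
    dim = 128            # hypothetical embedding dimension
    fp16 = max_candidates(budget, dim, bytes_per_element=2)
    int8 = max_candidates(budget, dim, bytes_per_element=1)
    print(f"fp16: {fp16:,} candidates")
    print(f"int8: {int8:,} candidates ({int8 / fp16:.1f}x)")
```

With these assumed inputs the same budget holds 67,108,864 candidates at 16 bits per element and 134,217,728 at 8 bits.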
When the pattern is wrong
- Workloads that do not benefit from data locality. Pure compute-bound stencils with negligible state can scale out trivially without paying for tiered placement.
- Operationally simpler scale-out is cheaper. When the development cost of a sophisticated placement policy exceeds the GPU-cycle savings, horizontal scale-out via more hosts is the right answer. Cloud workloads with effectively infinite GPUs and per-hour pricing often fit this case.
- Cross-host coordination is a competitive moat. Some platforms (e.g., very-large-scale ML training fleets) have already invested heavily in cross-host orchestration; layering tier-1 + tier-2 awareness on top of that gives diminishing returns.
Relationship to existing wiki material
- concepts/gpu-memory-hierarchy is the substrate concept this pattern operates over.
- concepts/document-sharding is the third-tier primitive.
- patterns/multi-card-sharded-embedding-serving (MARM 2026-03-31) is the within-host multi-GPU embedding-table instance that lives at the boundary between tiers 1, 2, and 4.
- patterns/unified-pytorch-model-as-retrieval-system is the architectural-level companion pattern — what the system looks like once the placement is correct.
Caveats
- The four-tier framing is the SilverTorch / TorchRec / MARM view. Other workloads partition further (CXL-attached pooled memory tiers, NVMe-as-memory tiers, on-chip MRAM in future hardware generations).
- The pattern assumes the workload has a clear hot / warm / cold partition that can be statically or near-statically mapped to tiers. Workloads with rapidly-shifting access patterns may need adaptive placement (which the post does not address).
- Document sharding's exact policy (hash / range / popularity-aware / freshness-aware) is not disclosed in the SilverTorch post — the right policy is workload-specific and is itself a within-pattern design decision.
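As one concrete point in that design space, the sketch below shows plain hash-based document sharding. It is not the policy SilverTorch uses, which the post does not disclose; the helper names `shard_for` and `partition` and the string item IDs are hypothetical.

```python
import hashlib


def shard_for(item_id: str, num_shards: int) -> int:
    """Map an item ID to a shard with a stable hash.

    hashlib is used instead of the built-in hash() because the latter is
    salted per process for strings, so hosts would disagree on placement.
    """
    digest = hashlib.sha256(item_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def partition(item_ids: list[str], num_shards: int) -> list[list[str]]:
    """Split the item inventory so each item lands on exactly one shard."""
    shards: list[list[str]] = [[] for _ in range(num_shards)]
    for item_id in item_ids:
        shards[shard_for(item_id, num_shards)].append(item_id)
    return shards


if __name__ == "__main__":
    inventory = [f"video:{i}" for i in range(10_000)]
    for index, shard in enumerate(partition(inventory, num_shards=4)):
        print(f"shard {index}: {len(shard)} items")
```

Hash sharding spreads items evenly but ignores popularity and freshness, so a hot subset of the inventory can still concentrate load on whichever shards happen to hold it; the other policies listed above trade that simplicity for better load or recency behavior.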