NVIDIA H100¶
The NVIDIA H100 (Hopper architecture, 2022-2023) is the data-centre-class GPU that has become the default substrate for frontier LLM training and large-model inference. The 80GB SXM variant (HBM3, ~3 TB/s memory bandwidth, Transformer Engine FP8) is the one most commonly referenced in large-scale training deployments in this wiki.
Seen in (wiki)¶
- eBay e-Llama training cluster. 60 nodes × 8 × H100 80GB = 480 H100s used for continued pretraining of Llama 3.1 8B + 70B on 1T tokens. Intra-node NVLink + inter-node InfiniBand. ~1 month wall-clock / ~340k GPU-hours on the 70B run. (Source: sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development)
- Fly.io GPU lineup — negative-space datum. H100 was the scarce frontier part Fly.io was chasing before actual customer data revealed most inference workloads don't need H100-class compute. The 2024-08-15 L40S price cut is Fly.io's pivot away from the H100-centric framing: "If you're trying to do something GPU-accelerated in response to an HTTP request, the right combination of GPU, instance RAM, fast object storage for datasets and model parameters, and networking is much more important than getting your hands on an H100." Useful counter-weight to the training-side view of the H100. (Source: sources/2024-08-15-flyio-were-cutting-l40s-prices-in-half; complement: concepts/inference-vs-training-workload-shape)
- Fly.io 2025-02-14 retrospective — H100 as the serious-AI ceiling Fly.io cannot reach. "People doing serious AI work want galactically huge amounts of GPU compute. A whole enterprise A100 is a compromise position for them; they want an SXM cluster of H100s." The H100-SXM-ganged-cluster shape sits above the insurgent-cloud supply ceiling (see concepts/insurgent-cloud-constraints) — NVIDIA allocations skew hyperscaler, and a Fly-shaped insurgent doesn't have the capacity to serve the training/frontier-inference segment. Complements both the 2024-08 L40S customer-data datum and the 2025-01 eBay training-side deployment. (Source: sources/2025-02-14-flyio-we-were-wrong-about-gpus)
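The eBay figures above cross-check with back-of-envelope arithmetic. A minimal sketch — the MFU value is an assumption, not from the source, and the BF16 peak is the published H100 SXM dense figure:

```python
# Back-of-envelope check of the eBay e-Llama cluster numbers.
gpus = 480                      # 60 nodes x 8 H100s
wall_clock_days = 30            # "~1 month" per the source
gpu_hours = gpus * wall_clock_days * 24
print(gpu_hours)                # 345600 -- consistent with "~340k GPU-hours"

# Chinchilla-style training-FLOP estimate: ~6 * params * tokens.
params, tokens = 70e9, 1e12     # 70B model, 1T continued-pretraining tokens
train_flops = 6 * params * tokens            # ~4.2e23 FLOPs
bf16_peak = 989e12                           # H100 SXM dense BF16, FLOP/s
mfu = 0.40                                   # ASSUMED utilization, not disclosed
days = train_flops / (gpus * bf16_peak * mfu) / 86_400
print(round(days, 1))                        # ~25.6 -- order-of-a-month, as reported
```

That the 6ND estimate lands within days of the reported wall-clock suggests the run was BF16-dominated at healthy utilization, though the source does not confirm either.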
Why it matters¶
- HBM capacity + bandwidth are the dominant constraints for both long-context training and decode-bound inference. 80GB HBM3 at ~3 TB/s sets the scale at which, e.g., a 70B-parameter model can reasonably be sharded per GPU under tensor/pipeline parallelism.
- NVLink (intra-node) + InfiniBand / RoCE (inter-node) are the communication substrate that makes concepts/3d-parallelism (DP + TP + PP) tractable at 480-GPU scale. Tensor parallelism's per-layer all-reduce only fits within the NVLink domain; pipeline parallelism tolerates the slower InfiniBand hop between stages.
- Transformer Engine FP8 offers ~2× throughput over BF16 on supported kernels, though the eBay post does not disclose FP8 vs BF16 choice for e-Llama training.
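The 80GB constraint and the parallelism layout interact directly. A minimal sketch of per-GPU training-state memory under 3D parallelism — the TP/PP degrees are assumptions for illustration (the eBay source does not disclose its layout), and 16 bytes/param is the standard mixed-precision + Adam accounting:

```python
# Per-GPU training-state memory under 3D parallelism (sketch).
# 16 bytes/param = bf16 weights (2) + bf16 grads (2) + fp32 master (4)
#                + Adam moment m (4) + Adam moment v (4).
def per_gpu_state_gb(params, tp, pp, bytes_per_param=16):
    # Data parallelism replicates state, so only TP x PP shards it
    # (absent ZeRO-style optimizer sharding).
    return params * bytes_per_param / (tp * pp) / 1e9

p70b = 70e9
print(per_gpu_state_gb(p70b, tp=8, pp=1))   # 140.0 GB -> exceeds 80GB HBM
print(per_gpu_state_gb(p70b, tp=8, pp=4))   # 35.0 GB  -> fits, leaving room
                                            #             for activations
```

This is why TP=8 alone (one NVLink domain) is not enough for a 70B training run and some combination of pipeline stages or optimizer sharding must carry the rest.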
Stub — expand as more sources cite H100 specifics (per-GPU TFLOPs, HBM configurations, FP8 adoption, power/cooling, lifetime, NVL72-class topologies).
Related¶
- systems/nvlink — intra-node GPU-to-GPU interconnect.
- systems/infiniband — inter-node fabric.
- systems/megatron-lm — the LLM training framework most commonly orchestrating H100 clusters in this wiki.
- systems/e-llama — an eBay model trained on an H100 fleet.
- systems/flash-attention-2 — the attention kernel commonly run on H100s (Hopper-specific tuning arrived later, with FlashAttention-3).
- concepts/multi-gpu-serving — serving models too large for a single GPU.
- concepts/3d-parallelism — DP + TP + PP at cluster scale.