SYSTEM Cited by 7 sources
NVIDIA H100
The NVIDIA H100 (Hopper architecture, 2022-2023) is the data-centre-class GPU that has become the default substrate for frontier LLM training and large-model inference. The 80GB SXM variant (HBM3, ~3 TB/s memory bandwidth, Transformer Engine FP8) is the one most commonly referenced in large-scale training deployments in this wiki.
Seen in (wiki)
- eBay e-Llama training cluster. 60 nodes × 8 × H100 80GB = 480 H100s used for continued pretraining of Llama 3.1 8B + 70B on 1T tokens. Intra-node NVLink + inter-node InfiniBand. ~1 month wall-clock / ~340k GPU-hours on the 70B run. (Source: sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development)
- Fly.io GPU lineup — negative-space datum. H100 was the scarce frontier part Fly.io was chasing before actual customer data revealed most inference workloads don't need H100-class compute. The 2024-08-15 L40S price cut is Fly.io's pivot away from the H100-centric framing: "If you're trying to do something GPU-accelerated in response to an HTTP request, the right combination of GPU, instance RAM, fast object storage for datasets and model parameters, and networking is much more important than getting your hands on an H100." Useful counter-weight to the training-side view of the H100. (Source: sources/2024-08-15-flyio-were-cutting-l40s-prices-in-half; complement: concepts/inference-vs-training-workload-shape)
- Fly.io 2025-02-14 retrospective: H100 as the serious-AI ceiling Fly.io cannot reach. "People doing serious AI work want galactically huge amounts of GPU compute. A whole enterprise A100 is a compromise position for them; they want an SXM cluster of H100s." The H100-SXM-ganged-cluster shape sits above the insurgent-cloud supply ceiling (see concepts/insurgent-cloud-constraints): Nvidia allocations skew hyperscaler, and a Fly-shaped insurgent doesn't have the capacity to serve the training/frontier-inference segment. Complements both the 2024-08 L40S customer-data datum and the 2025-01 eBay training-side deployment. (Source: sources/2025-02-14-flyio-we-were-wrong-about-gpus)
- Instacart Intent Engine SRL — H100 as the latency-ceiling move. Instacart's production real-time SRL model (LoRA-fine-tuned Llama-3-8B, adapter-merged) missed its 300 ms latency target at ~700 ms on A100; the combination of adapter merge + upgrade to H100 was load-bearing to hit the target. FP8 quantization offered another 10% but was not shipped due to a recall regression. GPU autoscaling at off-peak manages cost. Canonical single-GPU-per-query inference deployment of H100 in the wiki — distinct from the training-cluster (eBay) and scale-of-ambition (Fly.io) framings. (Source: sources/2025-11-13-instacart-building-the-intent-engine)
- Meta 24K-GPU GenAI clusters for Llama 3 (2024-06). Meta built two 24,000-GPU H100 clusters concurrently — one on RoCE, one on InfiniBand — on a modified Grand Teton platform. The H100 rollout was made at 700 W TDP (increased from stock), with HBM3, and retained air cooling because data-center cooling infrastructure could not change in time. Llama 3 trained on both clusters; the largest model on the RoCE cluster. First wiki datum on H100 at 24K-GPU-per-cluster scale; complements eBay's 480-GPU datum and Fly.io's single-chip framings. (Source: sources/2024-06-12-meta-how-meta-trains-large-language-models-at-scale)
- Meta OCP AI-hardware vision (2024-10) — H100 as the predecessor generation Catalina supersedes. At OCP Summit 2024 Meta disclosed that Llama 3.1 405B "pushed our infrastructure to operate across more than 16,000 NVIDIA H100 GPUs" (training on a subset of the two 24K-GPU clusters). Meta then positioned the H100-based Grand Teton as the prior generation; the new Catalina rack (NVIDIA GB200 Blackwell, 140 kW liquid-cooled) is the successor platform. Canonical wiki statement that H100 is the Hopper-generation anchor and the GB200 succession is underway at Meta scale. (Source: sources/2024-10-15-meta-metas-open-ai-hardware-vision)
- Databricks × Superhuman 200K-QPS small-fast-LLM inference (2026-05-08): per-pod 1,200 QPS at 50/50 token shape. The joint Databricks Model Serving / Superhuman post discloses per-pod throughput on H100 for a small fast LLM at the Superhuman grammar-correction request shape (~50 input + ~50 output tokens): 750 QPS pre-optimisation → 1,200 QPS post-optimisation (+60%). The post stacks the throughput gains: FP8 quantisation (up to +30%, single largest win) + multiprocessing RPC server for the CPU-bound regime small fast LLMs hit on H100 (+20%) + single-call C++ tensor manipulation in the CUDA-graph decode step (a few %) + async CPU-GPU scheduler (a few %). FP8 attention quantisation (Q/K/V/output projections) plus FP8 MLP projections through the H100 Transformer Engine FP8 path produce "no measurable quality degradation" on Superhuman's evals when the kernels use per-channel rather than per-tensor scaling. KV-cache quantisation was explicitly disabled for the workload ("weight quantization was where the throughput wins came from"). This is the wiki's first canonical small-fast-LLM-on-H100 disclosure, distinct from the eBay 480-GPU training-cluster, Meta 24K-GPU-cluster, and Instacart adapter-merged-Llama-3-8B datums above. It is also the first canonical disclosure of an L40S → H100 hardware-class migration at 200K+ QPS sustained: Superhuman ran the same grammar-correction workload on a DIY vLLM-on-L40S stack pre-migration. The post cautions that "some of the throughput improvement attributed to software optimisations is enabled by the H100's faster Transformer-Engine FP8 path; the L40S baseline is not separately re-tested with the new optimisations." End-to-end SLO envelope for the H100 deployment: sub-1-second p99 latency at 200K+ QPS peak with four-nines (99.99%) reliability. (Source: sources/2026-05-08-databricks-how-superhuman-and-databricks-built-a-200k-qps-inference-platform-together)
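Several of the figures quoted in the entries above can be cross-checked with simple arithmetic. The following Python sketch is illustrative only and does not come from any of the cited sources. The function names are hypothetical. It assumes that the Superhuman throughput gains compound multiplicatively (the post does not say how they combine) and that the pod count is a plain division of peak QPS by per-pod QPS with no headroom, which is a lower bound rather than a disclosed fleet size.

```python
import math
from typing import Iterable


def gpu_count(nodes: int, gpus_per_node: int) -> int:
    """Total GPUs in a cluster of identical nodes."""
    return nodes * gpus_per_node


def wall_clock_days(gpu_hours: float, gpus: int) -> float:
    """Wall-clock days if all GPUs are busy for the whole run."""
    return gpu_hours / gpus / 24


def stacked_speedup(gains: Iterable[float]) -> float:
    """Combined speedup factor, ASSUMING the gains multiply."""
    factor = 1.0
    for gain in gains:
        factor *= 1.0 + gain
    return factor


def min_pods(peak_qps: float, qps_per_pod: float) -> int:
    """Lower bound on pods needed at peak, with zero headroom."""
    return math.ceil(peak_qps / qps_per_pod)


# eBay e-Llama: 60 nodes x 8 GPUs, ~340k GPU-hours on the 70B run.
ebay_gpus = gpu_count(60, 8)
ebay_days = wall_clock_days(340_000, ebay_gpus)

# Superhuman: 750 -> 1,200 QPS per pod; the two largest quoted gains
# are "up to +30%" (FP8) and +20% (multiprocessing RPC server).
observed_speedup = 1_200 / 750
two_largest_gains = stacked_speedup([0.30, 0.20])

# Hypothetical lower bound on pod count for the 200K QPS peak.
pods_at_peak = min_pods(200_000, 1_200)

print(f"eBay GPUs: {ebay_gpus}, run length: {ebay_days:.1f} days")
print(f"observed speedup: {observed_speedup:.2f}x, "
      f"FP8 + RPC alone: {two_largest_gains:.2f}x")
print(f"pods at peak (no headroom): {pods_at_peak}")
```

Under these assumptions the eBay numbers agree with the "~1 month" wall-clock figure, and the two largest Superhuman gains account for most of the +60%, leaving the remaining few percent to the two smaller optimisations.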
Why it matters
- HBM capacity + bandwidth are the dominant constraints for both long-context training and decode-bound inference. With 80GB of HBM3 at ~3 TB/s per GPU, a 70B-parameter model (~140 GB of weights in BF16) does not fit on a single H100 and must be sharded with tensor/pipeline parallelism.
- NVLink (intra-node) + InfiniBand / RoCE (inter-node) are the communication substrate that makes 3D parallelism (DP + TP + PP; see concepts/3d-parallelism) tractable at 480-GPU scale. Tensor parallelism's per-layer all-reduce is only practical within the NVLink domain; pipeline parallelism tolerates the slower InfiniBand / RoCE hop between stages.
- Transformer Engine FP8 offers up to ~2× the peak throughput of BF16 on supported kernels, though the eBay post does not disclose its FP8 vs BF16 choice for e-Llama training.
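The capacity, bandwidth, and parallelism points above can be made concrete with a back-of-the-envelope sketch. The Python below is an illustration under stated assumptions, not a sizing tool: it counts weights only (no KV cache, activations, or optimizer state), uses decimal gigabytes, takes the ~3 TB/s figure from this page, and treats decode as purely bandwidth-bound at batch size 1. All function names are hypothetical, and the example DP/TP/PP layout is made up for illustration; the eBay source layout is not stated on this page.

```python
import math

# Assumptions: decimal GB, weights only, ~3 TB/s as quoted on this page.
HBM_BYTES_PER_GPU = 80 * 10**9
HBM_BANDWIDTH_BYTES_PER_S = 3 * 10**12
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def weight_bytes(params: int, dtype: str) -> int:
    """Bytes needed to hold the model weights in the given dtype."""
    return params * BYTES_PER_PARAM[dtype]


def min_gpus_for_weights(params: int, dtype: str) -> int:
    """Fewest 80GB GPUs whose combined HBM can hold the weights."""
    return math.ceil(weight_bytes(params, dtype) / HBM_BYTES_PER_GPU)


def decode_tokens_per_s_bound(params: int, dtype: str, gpus: int) -> float:
    """Bandwidth-only upper bound for batch-1 decode.

    Assumes each generated token streams every weight from HBM once and
    that the shards are read in parallel across the GPUs.
    """
    return gpus * HBM_BANDWIDTH_BYTES_PER_S / weight_bytes(params, dtype)


def layout_is_valid(dp: int, tp: int, pp: int,
                    nodes: int, gpus_per_node: int) -> bool:
    """Check that DP x TP x PP uses every GPU and that each
    tensor-parallel group fits inside one node's NVLink domain."""
    uses_all_gpus = dp * tp * pp == nodes * gpus_per_node
    tp_fits_in_node = tp <= gpus_per_node and gpus_per_node % tp == 0
    return uses_all_gpus and tp_fits_in_node


PARAMS_70B = 70 * 10**9

bf16_gpus = min_gpus_for_weights(PARAMS_70B, "bf16")
fp8_gpus = min_gpus_for_weights(PARAMS_70B, "fp8")
bf16_bound = decode_tokens_per_s_bound(PARAMS_70B, "bf16", bf16_gpus)

# Hypothetical layouts on a 60-node x 8-GPU cluster.
ok_layout = layout_is_valid(dp=15, tp=8, pp=4, nodes=60, gpus_per_node=8)
bad_layout = layout_is_valid(dp=30, tp=16, pp=1, nodes=60, gpus_per_node=8)

print(f"70B weights: {weight_bytes(PARAMS_70B, 'bf16') / 1e9:.0f} GB in BF16")
print(f"min GPUs for weights: BF16={bf16_gpus}, FP8={fp8_gpus}")
print(f"batch-1 decode bound on {bf16_gpus} GPUs: {bf16_bound:.1f} tokens/s")
print(f"TP=8 layout valid: {ok_layout}, TP=16 layout valid: {bad_layout}")
```

Real deployments need more GPUs than the weight-only minimum, and measured decode rates fall below the bandwidth bound once KV-cache reads and inter-GPU communication are included.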
Stub — expand as more sources cite H100 specifics (per-GPU TFLOPs, HBM configurations, FP8 adoption, power/cooling, lifetime, NVL72-class topologies).
Related
- systems/nvlink — intra-node GPU-to-GPU interconnect.
- systems/infiniband — inter-node fabric.
- systems/megatron-lm — the LLM training framework most commonly orchestrating H100 clusters in this wiki.
- systems/e-llama — an eBay model trained on an H100 fleet.
- systems/flash-attention-2 — the attention kernel tuned for H100.
- systems/grand-teton — Meta's OCP server platform modified to host H100 at 700 W.
- systems/meta-genai-cluster-roce / systems/meta-genai-cluster-infiniband — Meta's paired 24K-GPU H100 deployments.
- systems/llama-3 — trained on the 24K-GPU H100 clusters above.
- concepts/gpu-training-failure-modes — failure modes observed on H100 fleets at 24K-GPU scale.
- concepts/multi-gpu-serving / concepts/3d-parallelism