Grand Teton¶
Grand Teton is Meta's AI training/inference server platform, introduced at Open Compute Summit 2022. It is the chassis under the 24K-GPU RoCE and InfiniBand clusters on which Meta trained Llama 3. In 2024-10 Meta extended the platform to host AMD Instinct MI300X accelerators and contributed the new version to OCP.
2024 H100 adaptation¶
For the H100 rollout, Meta made three substantial changes to the original NVIDIA H100-based Grand Teton platform:
- GPU TDP increased to 700 W from the platform's original configuration.
- HBM3 memory on the GPUs (upgraded from earlier HBM2e).
- Kept air cooling rather than moving to liquid — "Since we did not have time to change the cooling infrastructure, we had to remain in an air-cooled environment." The mechanical and thermal design had to change to accommodate this, and a full validation cycle was triggered to support large-scale deployment.
"All of these hardware-related changes were challenging because we had to find a solution that fit within the existing resource constraints, with a very small degree of freedom to change and meet a tight schedule." (Source: sources/2024-06-12-meta-how-meta-trains-large-language-models-at-scale)
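The power implication of the 700 W TDP can be sketched with back-of-envelope arithmetic. The 8-accelerator chassis count (typical of OAM baseboards) and the non-GPU overhead factor below are illustrative assumptions, not figures from the cited Meta sources:

```python
# Back-of-envelope chassis power for a Grand Teton-class system.
# GPUS_PER_CHASSIS and the overhead factor are illustrative assumptions.
GPUS_PER_CHASSIS = 8   # OAM baseboards typically carry 8 modules (assumption)
GPU_TDP_W = 700        # per-GPU TDP cited for the H100 variant

gpu_power_w = GPUS_PER_CHASSIS * GPU_TDP_W  # 5600 W from GPUs alone

def chassis_power_w(gpu_count=GPUS_PER_CHASSIS, tdp_w=GPU_TDP_W, overhead=1.3):
    """Estimate total chassis power; `overhead` (CPUs, NICs, fans) is a guess."""
    return gpu_count * tdp_w * overhead

print(gpu_power_w)               # 5600
print(round(chassis_power_w()))  # 7280 (illustrative)
```

Even before overhead, 5.6 kW of GPU heat per chassis is what the retained air-cooled mechanical design had to dissipate, which is why the thermal redesign and validation cycle were nontrivial.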
2024-10 AMD MI300X extension¶
At OCP Summit 2024, Meta announced that Grand Teton had been extended to support the AMD Instinct MI300X accelerator, and that it had contributed the new version to OCP:
"Like its predecessors, this new version of Grand Teton features a single monolithic system design with fully integrated power, control, compute, and fabric interfaces. This high level of integration simplifies system deployment, enabling rapid scaling with increased reliability for large-scale AI inference workloads." (Source: sources/2024-10-15-meta-metas-open-ai-hardware-vision)
This makes Grand Teton the canonical wiki instance of the modular-rack-for-multi-accelerator pattern: one monolithic platform hosting accelerators from multiple vendors (NVIDIA H100 and AMD MI300X), with the OAM-compliant module form factor enabling the swap.
In the AMD MI300X framing, Grand Teton is positioned for inference workloads, complementing its prior training-centric role with the NVIDIA H100. The next-generation Catalina rack supersedes Grand Teton for the NVIDIA Blackwell (GB200) generation and its 140 kW liquid-cooled regime.
Seen in (wiki)¶
- Meta 24K-GPU GenAI clusters (2024). Grand Teton (modified) is the platform under both the RoCE and InfiniBand 24K-GPU clusters on which Llama 3 was trained. Retaining air cooling — despite the 700 W per-GPU TDP — was a direct consequence of data-center power/cooling infrastructure being harder to change than silicon. (Source: sources/2024-06-12-meta-how-meta-trains-large-language-models-at-scale)
- Meta Grand Teton AMD MI300X extension (2024-10). Grand Teton extended to support AMD Instinct MI300X; new variant contributed to OCP; positioned for large-scale AI inference workloads. (Source: sources/2024-10-15-meta-metas-open-ai-hardware-vision)
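The scale of the air-cooling constraint noted above can be made concrete. The 24,576-GPU cluster size is Meta's reported figure for each of the two clusters; the arithmetic is a simple sketch, not a total-facility power estimate (it counts GPU TDP only):

```python
# GPU power draw per 24K-GPU cluster at the modified 700 W TDP.
# Counts GPU TDP only; hosts, networking, and cooling are excluded.
GPUS_PER_CLUSTER = 24_576  # Meta's reported per-cluster GPU count
GPU_TDP_W = 700

cluster_gpu_power_mw = GPUS_PER_CLUSTER * GPU_TDP_W / 1e6
print(f"{cluster_gpu_power_mw:.1f} MW of GPU power per cluster")  # 17.2 MW
```

Dissipating on the order of 17 MW of GPU heat per cluster with air rather than liquid is why the cooling decision shaped the platform's mechanical design.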
Why it matters¶
- Open Compute footprint. Grand Teton is an open-sourced platform, so Meta's hardware decisions feed back into the OCP ecosystem. The decision to run the H100 at 700 W with HBM3 is publicly documented because of this.
- Constraint-shaped design. The H100 adaptation is a canonical case of data-center infrastructure constraining silicon deployment: the platform's mechanical design was constrained by the inability to change cooling quickly, not by silicon availability.
- Multi-accelerator platform. The MI300X extension makes Grand Teton the canonical wiki instance of a hyperscale AI platform designed to host multiple accelerator-vendor silicon generations on the same chassis.
- Scale-step validation. Meta describes a "validation cycle to support large-scale deployment" — at 24K-GPU fleet sizes, hardware-platform bring-up is a load-bearing project in its own right.
Related¶
- systems/nvidia-h100 — the GPU the 2024 Grand Teton variant hosts at 700 W.
- systems/amd-instinct-mi300x — the AMD accelerator the 2024-10 variant hosts.
- systems/meta-genai-cluster-roce / systems/meta-genai-cluster-infiniband — the two 24K-GPU clusters built on top.
- systems/catalina-rack — the next-generation successor rack for GB200 Blackwell silicon + 140 kW liquid-cooled regime.
- systems/oam-open-accelerator-module — the accelerator-module standard Grand Teton adheres to.
- companies/meta — Meta's broader hardware-engineering portfolio.
- concepts/hardware-reliability-at-scale — failure-rate implications of 700 W air-cooled nodes at 24K-GPU scale.
- patterns/modular-rack-for-multi-accelerator — the pattern Grand Teton instantiates.
- patterns/open-hardware-for-ai-scaling — the broader thesis OCP-contributed Grand Teton serves.