
Grand Teton

Grand Teton is Meta's AI training/inference server platform, introduced at Open Compute Summit 2022. It is the chassis under the 24K-GPU RoCE and InfiniBand clusters on which Meta trained Llama 3. In 2024-10 Meta extended the platform to host AMD Instinct MI300X accelerators and contributed the new version to OCP.

2024 H100 adaptation

For the H100 rollout, Meta adapted the original Grand Teton platform to the NVIDIA H100, making three substantial changes:

  • GPU TDP raised to 700 W (up from the prior generation's rating).
  • HBM3 memory on the GPUs (upgraded from earlier HBM2e).
  • Kept air cooling rather than moving to liquid: "Since we did not have time to change the cooling infrastructure, we had to remain in an air-cooled environment." This forced changes to the mechanical and thermal design, and it triggered a full validation cycle to support large-scale deployment.

"All of these hardware-related changes were challenging because we had to find a solution that fit within the existing resource constraints, with a very small degree of freedom to change and meet a tight schedule." (Source: sources/2024-06-12-meta-how-meta-trains-large-language-models-at-scale)

2024-10 AMD MI300X extension

At OCP Summit 2024, Meta announced that Grand Teton had been extended to support the AMD Instinct MI300X accelerator, and contributed the new version to OCP:

"Like its predecessors, this new version of Grand Teton features a single monolithic system design with fully integrated power, control, compute, and fabric interfaces. This high level of integration simplifies system deployment, enabling rapid scaling with increased reliability for large-scale AI inference workloads." (Source: sources/2024-10-15-meta-metas-open-ai-hardware-vision)

This makes Grand Teton the canonical wiki instance of a modular, multi-accelerator rack: one monolithic platform hosting silicon from multiple vendors (NVIDIA H100 and AMD MI300X), with the OAM-compliant module form factor enabling the swap.

In the MI300X framing, Grand Teton is positioned for inference workloads, complementing its prior training-centric role with the NVIDIA H100. The next-generation Catalina rack supersedes Grand Teton for the NVIDIA Blackwell (GB200) generation and the 140 kW liquid-cooled regime.

Why it matters

  • Open Compute footprint. Grand Teton is an open-sourced platform — Meta's hardware decisions feed back into the OCP ecosystem. The decision to run the H100 at 700 W with HBM3 is publicly visible because of this.
  • Constraint-shaped design. The H100 adaptation is a canonical case of data-center infrastructure constraining silicon deployment: the platform's mechanical design was constrained by the inability to change cooling quickly, not by silicon availability.
  • Multi-accelerator platform. The MI300X extension makes Grand Teton the canonical wiki instance of a hyperscale AI platform designed to host multiple accelerator-vendor silicon generations on the same chassis.
  • Scale-step validation. Meta describes a "validation cycle to support large-scale deployment" — at 24K-GPU fleet sizes, hardware-platform bring-up is a load-bearing project in its own right.