
Grand Teton

Grand Teton is Meta's AI training/inference server platform, introduced at Open Compute Summit 2022. It is the chassis under the 24K-GPU RoCE and InfiniBand clusters on which Meta trained Llama 3. In 2024-10 Meta extended the platform to host AMD Instinct MI300X accelerators and contributed the new version to OCP.

2024 H100 adaptation

For the H100 rollout, Meta adapted the original Grand Teton platform to the NVIDIA H100, making three substantial changes:

  • GPU TDP raised to 700 W (up from the prior generation's rating).
  • HBM3 memory on the GPUs (upgraded from earlier HBM2e).
  • Kept air cooling rather than moving to liquid: "Since we did not have time to change the cooling infrastructure, we had to remain in an air-cooled environment." This forced changes to the mechanical and thermal design, and it triggered a full validation cycle to support large-scale deployment.

"All of these hardware-related changes were challenging because we had to find a solution that fit within the existing resource constraints, with a very small degree of freedom to change and meet a tight schedule." (Source: sources/2024-06-12-meta-how-meta-trains-large-language-models-at-scale)

2024-10 AMD MI300X extension

At OCP Summit 2024, Meta announced that Grand Teton had been extended to support the AMD Instinct MI300X accelerator, and contributed the new version to OCP:

"Like its predecessors, this new version of Grand Teton features a single monolithic system design with fully integrated power, control, compute, and fabric interfaces. This high level of integration simplifies system deployment, enabling rapid scaling with increased reliability for large-scale AI inference workloads." (Source: sources/2024-10-15-meta-metas-open-ai-hardware-vision)

This makes Grand Teton the canonical wiki instance of a modular, multi-accelerator rack: one monolithic platform hosting silicon from multiple vendors (NVIDIA H100 and AMD MI300X), with the OAM-compliant module form factor enabling the swap.

In the MI300X framing, Grand Teton is positioned for inference workloads, complementing its prior training-centric role with the NVIDIA H100. The next-generation Catalina rack supersedes Grand Teton for the NVIDIA Blackwell (GB200) generation and the 140 kW liquid-cooled regime.

Why it matters

  • Open Compute footprint. Grand Teton is an open-sourced platform — Meta's hardware decisions feed back into the OCP ecosystem. The decision to run the H100 at 700 W with HBM3 is publicly visible because of this.
  • Constraint-shaped design. The H100 adaptation is a canonical case of data-center infrastructure constraining silicon deployment: the platform's mechanical design was constrained by the inability to change cooling quickly, not by silicon availability.
  • Multi-accelerator platform. The MI300X extension makes Grand Teton the canonical wiki instance of a hyperscale AI platform designed to host multiple accelerator-vendor silicon generations on the same chassis.
  • Scale-step validation. Meta describes a "validation cycle to support large-scale deployment" — at 24K-GPU fleet sizes, hardware-platform bring-up is a load-bearing project in its own right.