
Modular rack for multi-accelerator

Context

Hyperscale AI infrastructure must:

  1. Absorb silicon generations faster than the data-hall lifecycle (chassis ~7 years, silicon ~2 years).
  2. Serve multiple accelerator vendors — both for supply resilience and for workload/silicon fit (training vs inference, large-model vs small-model).
  3. Preserve fully integrated system design (unified power, control, compute, fabric) so that deployment remains rapid and reliable at fleet scale.

A naive approach — one chassis per silicon generation per vendor — multiplies engineering cost, operations complexity, and supply risk.

The pattern

Design a single platform chassis with standardised accelerator module slots (via OCP OAM) + standardised host/power/fabric integration; hold the chassis integration stable across silicon generations and across accelerator vendors.

The chassis becomes the stable substrate; the accelerator module is the variable element.
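
The stable-substrate/variable-module split can be sketched in code. This is a hypothetical model, not Meta's software: the class names and power budget are illustrative, and the TDP figures (H100 ~700 W, MI300X ~750 W) stand in for the real envelopes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OamModule:
    """One accelerator in the standardised (OAM-style) module form factor."""
    vendor: str
    silicon: str
    tdp_watts: int

@dataclass(frozen=True)
class Chassis:
    """Stable substrate: power, control, compute, and fabric integration
    are designed once and held constant across silicon generations."""
    name: str
    slots: int
    power_budget_watts: int

    def populate(self, module: OamModule) -> dict:
        # The same chassis accepts any module that fits the slot standard
        # and the power budget; swapping vendors changes no integration.
        if module.tdp_watts * self.slots > self.power_budget_watts:
            raise ValueError(f"{module.silicon} exceeds chassis power budget")
        return {"chassis": self.name, "module": module.silicon, "count": self.slots}

# Same Grand Teton-style chassis, two vendors' silicon (figures illustrative):
grand_teton = Chassis("grand-teton", slots=8, power_budget_watts=8_000)
print(grand_teton.populate(OamModule("NVIDIA", "H100", 700)))
print(grand_teton.populate(OamModule("AMD", "MI300X", 750)))
```

The chassis type never changes; only the `OamModule` value varies — which is the whole pattern.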

Meta's 2024-10 instances

Grand Teton — NVIDIA H100 → AMD MI300X

"Like its predecessors, this new version of Grand Teton features a single monolithic system design with fully integrated power, control, compute, and fabric interfaces. This high level of integration simplifies system deployment, enabling rapid scaling with increased reliability for large-scale AI inference workloads." (Source: sources/2024-10-15-meta-metas-open-ai-hardware-vision)

The Grand Teton platform — 2022 Zion-EX successor, originally NVIDIA-GPU-only — is extended in 2024-10 to host the AMD Instinct MI300X. Same monolithic-integration principle, new accelerator.

Catalina — NVIDIA GB200 Blackwell, liquid-cooled ORv3

"We aim for Catalina's modular design to empower others to customize the rack to meet their specific AI workloads while leveraging both existing and emerging industry standards." (Source: sources/2024-10-15-meta-metas-open-ai-hardware-vision)

Catalina extends the modular-rack-for-multi-accelerator principle into the Blackwell generation + 140 kW liquid-cooled regime. The chassis modularity is now specified at rack-scale, not just platform-scale.

Preconditions

  • Standardised accelerator module form factor — OAM is the OCP standard that makes this feasible.
  • Consistent host/accelerator interface — PCIe, NVLink/InfiniBand interconnect, and power delivery must remain compatible across silicon generations.
  • Vendor willingness to ship to the open standard — NVIDIA H100-SXM, AMD MI300X-OAM, and NVIDIA GB200 rack-scale solution are all publicly announced variants.
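
The three preconditions amount to a compatibility contract between chassis and module. A minimal sketch, assuming a hypothetical contract dict (the field names and the 1,000 W ceiling are illustrative, not from any spec):

```python
# A module is deployable only if form factor, host interface, and power
# delivery all match what the stable chassis integration provides.
CHASSIS_CONTRACT = {
    "form_factor": "OAM",
    "host_interface": "PCIe",
    "max_module_watts": 1000,
}

def deployable(module: dict) -> bool:
    return (
        module["form_factor"] == CHASSIS_CONTRACT["form_factor"]
        and module["host_interface"] == CHASSIS_CONTRACT["host_interface"]
        and module["watts"] <= CHASSIS_CONTRACT["max_module_watts"]
    )

mi300x = {"form_factor": "OAM", "host_interface": "PCIe", "watts": 750}
proprietary = {"form_factor": "custom", "host_interface": "PCIe", "watts": 700}
print(deployable(mi300x))       # True: meets the open-standard contract
print(deployable(proprietary))  # False: proprietary form factor breaks precondition 1
```

Precondition 3 (vendor willingness) is what makes the `"OAM"` branch non-empty in practice: a contract no vendor ships to checks nothing.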

Trade-offs

  • Chassis must over-engineer for multiple use cases. A single-vendor chassis can be tuned precisely for one GPU; a multi-accelerator chassis has to accept some generic-design overhead.
  • Thermal envelope must be set for the hottest supported silicon. Cooling design must handle MI300X, H100, or GB200 without redesign — or accept that a new rack generation (Catalina vs Grand Teton) is needed when a silicon generation exceeds the chassis's envelope.
  • Supply chain is simpler at the chassis level but still per-silicon at the accelerator level. The pattern shifts where supply risk lives; it doesn't eliminate it.
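
The thermal-envelope trade-off reduces to a worst-case calculation: the chassis cooling budget is sized for the hottest silicon it must ever host, and anything beyond that forces a new rack generation. A sketch with illustrative TDP figures (the GB200 number is a placeholder, not a spec value):

```python
# Cooling is sized for the worst case across every supported accelerator,
# so cooler modules carry the generic-design overhead named above.
SUPPORTED_TDP_WATTS = {"H100": 700, "MI300X": 750}

def cooling_envelope(supported: dict) -> int:
    """Chassis cooling budget = max TDP across supported silicon."""
    return max(supported.values())

envelope = cooling_envelope(SUPPORTED_TDP_WATTS)
print(envelope)  # 750 -- set by MI300X; over-provisioned for H100

# A generation beyond the envelope forces a new rack design
# (Catalina vs Grand Teton); illustrative figure for GB200:
gb200_watts = 1200
needs_new_rack = gb200_watts > envelope
print(needs_new_rack)  # True
```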