PATTERN
Modular rack for multi-accelerator¶
Context¶
Hyperscale AI infrastructure must:
- Absorb silicon generations faster than the data-hall lifecycle (chassis ~7 years, silicon ~2 years).
- Serve multiple accelerator vendors — both for supply resilience and for workload/silicon fit (training vs inference, large-model vs small-model).
- Preserve fully integrated system design (unified power, control, compute, fabric) so that deployment remains rapid and reliable at fleet scale.
A naive approach — one chassis per silicon generation per vendor — multiplies engineering cost, operations complexity, and supply risk.
The pattern¶
Design a single platform chassis with standardised accelerator module slots (via the OCP Open Accelerator Module, OAM) and standardised host/power/fabric integration; hold the chassis integration stable across silicon generations and across accelerator vendors.
The chassis becomes the stable substrate; the accelerator module is the variable element.
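The stable-substrate / variable-element split can be sketched as a contract between chassis and module. This is an illustrative model only — the `SlotSpec`, `AcceleratorModule`, and `Chassis` names, fields, and figures are hypothetical, not Meta's actual specification:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SlotSpec:
    """Stable chassis-level contract, held fixed across silicon generations."""
    form_factor: str      # e.g. "OAM" (OCP Accelerator Module)
    host_interface: str   # e.g. "PCIe Gen5"
    max_power_w: int      # power-delivery limit per slot

@dataclass(frozen=True)
class AcceleratorModule:
    """Variable element: swapped per vendor and per silicon generation."""
    name: str
    form_factor: str
    host_interface: str
    power_w: int

class Chassis:
    """Single platform chassis with standardised accelerator module slots."""
    def __init__(self, slot: SlotSpec, num_slots: int):
        self.slot = slot
        self.num_slots = num_slots
        self.modules: list[AcceleratorModule] = []

    def fits(self, m: AcceleratorModule) -> bool:
        # A module is acceptable iff it honours the stable slot contract.
        return (m.form_factor == self.slot.form_factor
                and m.host_interface == self.slot.host_interface
                and m.power_w <= self.slot.max_power_w)

    def install(self, m: AcceleratorModule) -> None:
        if len(self.modules) >= self.num_slots:
            raise RuntimeError("chassis full")
        if not self.fits(m):
            raise ValueError(f"{m.name} violates the slot contract")
        self.modules.append(m)

# Different vendors' silicon plugs into the same chassis, provided each
# module meets the slot contract (hypothetical names and wattages).
slot = SlotSpec(form_factor="OAM", host_interface="PCIe Gen5", max_power_w=750)
chassis = Chassis(slot, num_slots=8)
chassis.install(AcceleratorModule("vendor-A GPU", "OAM", "PCIe Gen5", 700))
chassis.install(AcceleratorModule("vendor-B GPU", "OAM", "PCIe Gen5", 750))
```

The point of the sketch: only `AcceleratorModule` instances vary over time; `SlotSpec` and `Chassis` are the engineering investment that is amortised across generations and vendors.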
Meta's 2024-10 instances¶
Grand Teton — NVIDIA H100 → AMD MI300X¶
"Like its predecessors, this new version of Grand Teton features a single monolithic system design with fully integrated power, control, compute, and fabric interfaces. This high level of integration simplifies system deployment, enabling rapid scaling with increased reliability for large-scale AI inference workloads." (Source: sources/2024-10-15-meta-metas-open-ai-hardware-vision)
The Grand Teton platform — 2022 Zion-EX successor, originally NVIDIA-GPU-only — is extended in 2024-10 to host the AMD Instinct MI300X. Same monolithic-integration principle, new accelerator.
Catalina — NVIDIA GB200 Blackwell, liquid-cooled ORv3¶
"We aim for Catalina's modular design to empower others to customize the rack to meet their specific AI workloads while leveraging both existing and emerging industry standards." (Source: sources/2024-10-15-meta-metas-open-ai-hardware-vision)
Catalina extends the modular-rack-for-multi-accelerator principle into the Blackwell generation and its 140 kW liquid-cooled regime. The chassis modularity is now specified at rack-scale, not just platform-scale.
Preconditions¶
- Standardised accelerator module form factor — OAM is the OCP standard that makes this feasible.
- Consistent host/accelerator interface — PCIe, NVLink/InfiniBand interconnect, and power delivery must remain compatible across silicon generations.
- Vendor willingness to ship to the open standard — NVIDIA H100-SXM, AMD MI300X-OAM, and NVIDIA GB200 rack-scale solution are all publicly announced variants.
Trade-offs¶
- Chassis must over-engineer for multiple use cases. Single-vendor chassis can be tuned precisely for one GPU; a multi-accelerator chassis has to accept some generic-design overhead.
- Thermal envelope must be set for the hottest supported silicon. The cooling design has to handle every supported part (H100, MI300X) without redesign; silicon beyond the chassis's envelope (the GB200's liquid-cooled 140 kW regime) forces a new rack generation, as Catalina does relative to Grand Teton.
- Supply chain simpler at the chassis level, still per-silicon at the accelerator level. The pattern shifts where supply risk lives; it doesn't eliminate it.
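The thermal-envelope trade-off reduces to a simple sizing rule: the chassis cooling design is dimensioned by the maximum per-module power across every part it claims to support, and a part beyond that envelope forces a new rack generation. A minimal sketch, with approximate public per-module TDP figures and a hypothetical GB200-class number, used purely for illustration:

```python
# Per-module power (watts) for the parts a chassis claims to support.
# Approximate public TDP figures; treat as illustrative, not authoritative.
supported = {"H100-SXM": 700, "MI300X-OAM": 750}

# The cooling envelope must be sized for the hottest supported silicon.
envelope_w = max(supported.values())

def needs_new_rack_generation(module_power_w: int, envelope_w: int) -> bool:
    """True when a part exceeds the chassis envelope and so requires a new
    rack generation (the Catalina-vs-Grand-Teton situation)."""
    return module_power_w > envelope_w

# A liquid-cooled Blackwell-class part (hypothetical 1200 W figure)
# exceeds an envelope sized for air-cooled H100/MI300X:
print(needs_new_rack_generation(1200, envelope_w))  # → True
```

This is also where the over-engineering cost shows up: every slot pays for the 750 W envelope even when populated with a 700 W part.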
Related¶
- systems/grand-teton — canonical multi-accelerator-capable chassis.
- systems/catalina-rack — rack-scale successor for Blackwell.
- systems/oam-open-accelerator-module — the module standard that makes the pattern feasible.
- systems/nvidia-h100 / systems/amd-instinct-mi300x / systems/nvidia-gb200-grace-blackwell — the silicon variants that plug into Meta's platforms.
- patterns/open-hardware-for-ai-scaling — the broader pattern this instantiates.
- companies/meta.