Catalina (Meta OCP AI rack)¶
Catalina is Meta's next-generation high-powered AI rack, announced at OCP Global Summit 2024. It is a full rack-scale solution built on the NVIDIA Blackwell platform, supporting the NVIDIA GB200 Grace Blackwell Superchip, and it succeeds the air-cooled Grand Teton platform in Meta's hardware lineage, the platform that underpinned the two 24K-GPU H100 training clusters.
Configuration¶
Catalina introduces the ORv3 (Open Rack v3) high-power rack (HPR), "capable of supporting up to 140kW". The full solution is liquid-cooled (unlike Grand Teton's air-cooled 700 W H100 configuration) and consists of:
- A power shelf feeding a compute tray
- A switch tray
- The ORv3 HPR chassis
- The Wedge 400 fabric switch
- A management switch
- A battery backup unit
- A rack management controller
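The bill of materials above can be captured as a minimal structured sketch; the component names mirror this note, while field names such as `power_budget_kw` and `cooling` are illustrative choices, not Meta's schema.

```python
# A minimal structured sketch of the Catalina rack configuration listed above.
# Component names come from the note; the dictionary layout itself is an
# assumption for illustration, not an official Meta or OCP schema.

CATALINA_RACK = {
    "chassis": "ORv3 HPR",        # Open Rack v3 high-power rack
    "power_budget_kw": 140,       # "capable of supporting up to 140kW"
    "cooling": "liquid",          # full solution is liquid-cooled per the note
    "components": [
        "power shelf",
        "compute tray (NVIDIA GB200)",
        "switch tray",
        "Wedge 400 fabric switch",
        "management switch",
        "battery backup unit",
        "rack management controller",
    ],
}
```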
Design principles¶
"We aim for Catalina's modular design to empower others to customize the rack to meet their specific AI workloads while leveraging both existing and emerging industry standards." (Source: sources/2024-10-15-meta-metas-open-ai-hardware-vision)
Two stated principles:
- Modularity — consumers of the design can swap compute trays, fabric switches, and cooling stages to match their workload.
- Flexibility — the rack should accommodate multiple generations of accelerator silicon.
Positioning in Meta's AI-hardware lineage¶
- 2022 — Grand Teton introduced (air-cooled, NVIDIA H100 @ 700 W).
- 2024-06 — 2× 24K-GPU clusters on modified Grand Teton (Llama 3 training substrate; see sources/2024-06-12-meta-how-meta-trains-large-language-models-at-scale).
- 2024-10 — Grand Teton extended to AMD Instinct MI300X (multi-accelerator).
- 2024-10 — Catalina introduced for NVIDIA GB200 Blackwell at 140 kW liquid-cooled.
Catalina represents the break from the "data-center cooling infrastructure cannot change quickly" constraint named in the 2024-06 Grand Teton / H100 post. Rack-level power density jumps from Grand Teton's air-cooled envelope of roughly ≤ 40 kW to 140 kW via a fully liquid-cooled redesign.
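The density jump can be sanity-checked with back-of-envelope arithmetic. The figures (700 W H100, ~40 kW air-cooled envelope, 140 kW ORv3 HPR limit) come from this note; the "accelerators per rack" bound is a hypothetical upper limit that ignores CPU, fabric, and cooling overhead.

```python
# Back-of-envelope comparison of the two rack power envelopes in this note.
# Figures are taken from the note; the per-rack accelerator count is an
# idealized bound that ignores all non-accelerator power draw.

H100_TDP_W = 700            # air-cooled H100 in Grand Teton
AIR_RACK_LIMIT_W = 40_000   # ~ceiling of Grand Teton's air-cooled envelope
CATALINA_LIMIT_W = 140_000  # ORv3 HPR liquid-cooled limit

def max_accelerators(rack_limit_w: float, accel_tdp_w: float) -> int:
    """Upper bound on accelerators per rack if the whole budget went to silicon."""
    return int(rack_limit_w // accel_tdp_w)

density_jump = CATALINA_LIMIT_W / AIR_RACK_LIMIT_W  # 3.5x rack power density
```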
Seen in¶
- sources/2024-10-15-meta-metas-open-ai-hardware-vision — the canonical Meta OCP 2024 announcement.
Why it matters¶
- First Meta AI rack > 100 kW. Canonical wiki instance of the > 100 kW liquid-cooled AI rack shape; complements concepts/rack-level-power-density's 16 kW air-cooled Dropbox datum at the opposite end of the power-density spectrum.
- OCP-contributed. Catalina is being contributed to OCP, which means the design propagates to other hyperscalers/NCPs and is not Meta-proprietary.
- Blackwell-generation proof point. Catalina is one of the first publicly-detailed rack-scale Blackwell platforms outside NVIDIA reference designs.
Related¶
- systems/orv3-rack — the 140 kW HPR chassis.
- systems/nvidia-gb200-grace-blackwell — the hosted silicon.
- systems/meta-wedge-400 — the fabric switch used.
- systems/grand-teton — the air-cooled H100 predecessor.
- concepts/liquid-cooled-ai-rack / concepts/rack-level-power-density — framing concepts.
- patterns/modular-rack-for-multi-accelerator / patterns/open-hardware-for-ai-scaling — patterns instantiated.
- companies/meta.