
META 2024-10-15 Tier 1


Meta — Meta's open AI hardware vision

Summary

Meta's 2024-10-15 post — timed to the Open Compute Project (OCP) Global Summit 2024 — announces the next generation of Meta's AI-hardware stack and contributes the designs to OCP. Four headline disclosures: (1) Catalina, a new high-powered AI rack built on the NVIDIA Blackwell platform (GB200 Grace Blackwell Superchip), introducing the liquid-cooled OCP ORv3 high-power rack (HPR) capable of up to 140 kW; (2) Grand Teton expanded to support AMD Instinct MI300X and contributed to OCP; (3) Disaggregated Scheduled Fabric (DSF) — Meta's vendor-agnostic AI networking backend, powered by OCP-SAI + FBOSS + Ethernet/RoCE, enabling multi-vendor endpoint/NIC/accelerator integration; plus new 51T fabric switches built on Broadcom and Cisco ASICs and FBNIC, a NIC module containing Meta's first in-house network ASIC; (4) Mount Diablo, a Meta/Microsoft co-developed disaggregated power rack with a 400 VDC scalable unit. The post also projects forward: Meta anticipates injection bandwidth of ~1 TB/s per accelerator and matching normalized bisection bandwidth — more than an order-of-magnitude growth over today's AI fabrics.

Key takeaways

  1. Scaling trajectory named explicitly — an order-of-magnitude jump in network bandwidth is coming. "In the next few years, we anticipate greater injection bandwidth on the order of a terabyte per second, per accelerator, with equal normalized bisection bandwidth. This represents a growth of more than an order of magnitude compared to today's networks!" Meta frames the supporting requirement: "a high-performance, multi-tier, non-blocking network fabric that can utilize modern congestion control to behave predictably under heavy load." This is the forward projection under which Catalina + DSF + FBNIC + 51T switches are being designed. (Source text)
  2. Cluster scale already at 24K × 2 and growing. "[Llama 3.1 405B] pushed our infrastructure to operate across more than 16,000 NVIDIA H100 GPUs… Today, we're training our models on two 24K-GPU clusters. We don't expect this upward trajectory for AI clusters to slow down any time soon." The post re-anchors the Meta training substrate from sources/2024-06-12-meta-how-meta-trains-large-language-models-at-scale (two 24K-GPU H100 clusters, Grand Teton @ 700 W air-cooled) as the previous generation. Catalina is the next-step platform. (Source text; see sources/2024-06-12-meta-how-meta-trains-large-language-models-at-scale)
  3. Catalina — 140 kW liquid-cooled ORv3 on NVIDIA GB200 Blackwell. "With Catalina we're introducing the ORv3, a high-power rack (HPR) capable of supporting up to 140kW. The full solution is liquid cooled and consists of a power shelf that supports a compute tray, switch tray, the ORv3 HPR, the Wedge 400 fabric switch, a management switch, battery backup unit, and a rack management controller." Catalina is built on the NVIDIA Blackwell platform as a full rack-scale solution supporting the NVIDIA GB200 Grace Blackwell Superchip. Modularity and flexibility are the stated design principles — "to empower others to customize the rack to meet their specific AI workloads while leveraging both existing and emerging industry standards." A major shift from the air-cooled, 700 W Grand Teton approach; see systems/catalina-rack. (Source text)
  4. Grand Teton expanded to AMD MI300X — monolithic platform principle preserved. Meta's 2022-era Grand Teton AI platform (successor to Zion-EX, designed for DLRM + content-understanding workloads) gets a new variant supporting the AMD Instinct MI300X accelerator and the design is contributed to OCP. "Like its predecessors, this new version of Grand Teton features a single monolithic system design with fully integrated power, control, compute, and fabric interfaces. This high level of integration simplifies system deployment, enabling rapid scaling with increased reliability for large-scale AI inference workloads." Grand Teton is now a multi-accelerator platform (NVIDIA H100 + AMD MI300X); see patterns/modular-rack-for-multi-accelerator. (Source text)
  5. Disaggregated Scheduled Fabric (DSF) — vendor-agnostic AI backend. "Developing open, vendor-agnostic networking backend is going to play an important role going forward… Disaggregating our network allows us to work with vendors from across the industry to design systems that are innovative as well as scalable, flexible, and efficient." DSF "offers several advantages over our existing switches. By opening up our network fabric we can overcome limitations in scale, component supply options, and power density." Powered by the OCP-SAI standard + FBOSS (Meta's own network operating system) + Ethernet-based RoCE to the endpoints. Multi-vendor NIC/GPU support: NVIDIA + Broadcom + AMD named explicitly. See systems/meta-dsf-disaggregated-scheduled-fabric + concepts/network-fabric-disaggregation. (Source text)
  6. 51T fabric switches + FBNIC. "We have also developed and built new 51T fabric switches based on Broadcom and Cisco ASICs. Finally, we are sharing our new FBNIC, a new NIC module that contains our first Meta-design network ASIC." Silicon-level response to the projected TB/s-per-accelerator bandwidth. FBNIC is Meta's first in-house network ASIC — vertical integration step analogous to the server/rack self-design lineage (OCP, Grand Teton). (Source text)
  7. Mount Diablo (with Microsoft) — disaggregated 400 VDC power rack. "Our current collaboration focuses on Mount Diablo, a new disaggregated power rack. It's a cutting-edge solution featuring a scalable 400 VDC unit that enhances efficiency and scalability. This innovative design allows more AI accelerators per IT rack, significantly advancing AI infrastructure." Disaggregates the power rack from the IT rack — the same architectural stance as DSF at the network level, applied to power delivery. Higher voltage (400 VDC vs. the conventional 48 VDC OCP bus) means lower current, less copper, and lower resistive losses for the same delivered power; a quick sizing sketch follows this list. See systems/mount-diablo-power-rack + concepts/400-vdc-rack-power. (Source text)
  8. The open-hardware thesis stated explicitly. "Scaling AI at this speed requires open hardware solutions… By investing in open hardware, we unlock AI's full potential and propel ongoing innovation in the field." And later: "We also need open AI hardware systems. These systems are necessary for delivering the kind of high-performance, cost-effective, and adaptable infrastructure necessary for AI advancement." Meta positions OCP-contribution as the natural consequence of the scaling curve — closed-source hardware cannot keep pace. Canonical patterns/open-hardware-for-ai-scaling. (Source text)
  9. Meta × Microsoft OCP lineage named. "Meta and Microsoft have a long-standing partnership within OCP, beginning with the development of the Switch Abstraction Interface (SAI) for data centers in 2018." Other joint contributions: Open Accelerator Module (OAM) standard, SSD standardization, and now Mount Diablo. See patterns/co-design-with-ocp-partners. (Source text)
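
A back-of-envelope for the 400 VDC point in takeaway 7, as a minimal Python sketch. Treating Catalina's 140 kW ORv3 HPR rating as the load on a single DC feed with fixed conductor resistance is an illustrative simplification, not a disclosed Mount Diablo parameter:

    # Back-of-envelope: delivering the same rack power at 48 VDC vs. 400 VDC.
    # Assumes one DC feed and fixed conductor resistance (illustrative only).

    RACK_POWER_W = 140_000  # Catalina ORv3 HPR rating: up to 140 kW

    baseline_current = RACK_POWER_W / 48  # I = P / V at the conventional 48 VDC
    for voltage in (48, 400):
        current = RACK_POWER_W / voltage
        # Resistive (I^2 * R) loss scales with current squared at fixed R.
        relative_loss = (current / baseline_current) ** 2
        print(f"{voltage:>3} VDC: {current:7.0f} A, "
              f"I^2R loss ~{relative_loss:.1%} of the 48 VDC case")

    # 48 VDC -> ~2,917 A; 400 VDC -> 350 A. Same kilowatts at ~8x less current
    # and ~69x less resistive loss in the distribution path, which is what lets
    # a disaggregated 400 VDC power rack feed more accelerators per IT rack.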

Systems / hardware extracted

Concepts extracted

Existing concepts reinforced:

  • concepts/rack-level-power-density — Catalina's 140 kW ORv3 extends the upper bound disclosed on the wiki (Dropbox's 7th-gen sits at ~16 kW/rack air-cooled; Catalina at 140 kW liquid-cooled is roughly a 9× step up, approaching an order of magnitude).

Patterns extracted

  • patterns/open-hardware-for-ai-scaling — Meta's thesis: AI scale requires the hardware layer to move at the pace of the software layer, which requires open-source contribution rather than vendor-locked designs.
  • patterns/modular-rack-for-multi-accelerator — Grand Teton's "single monolithic system design with fully integrated power, control, compute, and fabric interfaces" extended across NVIDIA + AMD accelerators; Catalina extending the pattern to GB200.
  • patterns/co-design-with-ocp-partners — Meta × Microsoft lineage (SAI 2018 → OAM → Mount Diablo 2024) as the operational model.

Operational numbers

  • Catalina rack power: up to 140 kW (ORv3 HPR), liquid-cooled.
  • Mount Diablo: 400 VDC scalable unit.
  • Fabric switches: 51 Tbps on Broadcom + Cisco ASICs.
  • Projected per-accelerator injection bandwidth: ~1 TB/s.
  • Projected bisection bandwidth: "equal normalized" to injection — i.e., a non-blocking, non-oversubscribed fabric (see the sketch after this list).
  • Current training scale anchors: Llama 3.1 405B at > 16,000 H100 GPUs on 15T tokens; two concurrent 24,000-GPU training clusters today.
  • Growth projection: > 10× bandwidth scale vs. today's networks.
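
To make "equal normalized bisection bandwidth" concrete, here is a minimal Python sketch under stated assumptions: the 24K endpoint count is today's cluster anchor from this post, while the size and topology of a future Catalina/DSF-era cluster are not disclosed and are assumed here only for illustration.

    # Minimal sketch of what "equal normalized bisection bandwidth" implies for
    # a non-blocking fabric, using this post's numbers. The 24K endpoint count
    # is today's cluster anchor; future cluster sizes are an assumption.

    INJECTION_TB_PER_S = 1.0    # projected per-accelerator injection (~1 TB/s)
    NUM_ACCELERATORS = 24_000   # one of today's 24K-GPU clusters

    # Normalized bisection bandwidth = bandwidth crossing a worst-case half/half
    # cut, divided by the endpoints on one side. "Equal" to injection bandwidth
    # means the fabric is non-oversubscribed end to end.
    bisection_tb_per_s = (NUM_ACCELERATORS / 2) * INJECTION_TB_PER_S
    print(f"Required bisection bandwidth: {bisection_tb_per_s:,.0f} TB/s "
          f"(= {bisection_tb_per_s * 8 / 1000:.0f} Pb/s across the cut)")

    # Why the 51T switches matter: 1 TB/s per endpoint is 8 Tbps. A 51.2 Tbps
    # switch ASIC that reserves half its capacity for uplinks (non-blocking
    # leaf) can serve only 51.2 / 2 / 8 ~= 3 such endpoints per ASIC, hence the
    # need for multi-tier fabrics and much denser switching silicon.
    endpoints_per_asic = int(51.2 / 2 / 8)
    print(f"8 Tbps endpoints per non-blocking 51.2 Tbps leaf ASIC: {endpoints_per_asic}")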

Caveats

  • Announcement voice, not retrospective. The post is keyed to OCP Summit 2024; it announces designs rather than reports on operational production experience. No Catalina production numbers, no DSF deployment scale, no FBNIC silicon perf data.
  • Open-source release timing not fully specified — the post says "upcoming release" for Catalina; Grand-Teton-with-MI300X is being contributed to OCP; exact availability dates not disclosed.
  • No disclosure of Catalina GPU count per rack — the post lists the rack's composition (power shelf, compute tray, switch tray, Wedge 400 fabric switch, management switch, battery backup unit, rack management controller) but not the number of compute trays or GPUs per rack.
  • FBNIC feature set not disclosed. Meta names it as "first Meta-design network ASIC" — packet-processing feature set, software offload model, pipeline depth, or any perf data are not in this post.
  • Mount Diablo deployment timing not given. The collaboration is described as "current" but power-rack availability dates are not disclosed.
  • No comparison of Catalina vs Grand-Teton-H100 TCO, no disclosure of how Catalina's 140 kW + liquid cooling change data-center facility design (CRAH, CDU, manifold density). Implicit: they change everything.
  • The "growth of more than an order of magnitude" is the post's own forward projection, not a published roadmap milestone.
  • Llama 3.1 405B training ran on > 16,000 H100s per this post, consistent with the 2024-06-12 post's two 24K-H100 clusters (a training run uses a subset of a cluster); this is not a contradiction, but Meta disclosing the subset scale for one specific training run.

Source
