Llama 3¶
Llama 3 is Meta's April-2024 open-weights foundation-model release (8B, 70B; 405B followed as Llama 3.1). For this wiki, the operationally interesting fact is that Llama 3 was trained on both of Meta's 24K-GPU H100 clusters simultaneously: one built on RoCE (RDMA over Converged Ethernet), one on InfiniBand. The largest Llama 3 model was trained on the RoCE cluster.
"We used both InfiniBand and RoCE clusters to train Llama 3, with the RoCE cluster used for training the largest model." (Source: sources/2024-06-12-meta-how-meta-trains-large-language-models-at-scale)
This is a rare datum in the open literature: a production-scale, like-for-like comparison of AI training fabrics. It is the reason Meta's 2024-06-12 post is architecturally significant.
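In practice, the fabric choice surfaces at the collective-communication layer rather than in the training code itself. A minimal sketch of how an NCCL-based job is typically steered onto a RoCE fabric (not from the Meta post; device names and values are illustrative assumptions, tuned per-cluster in real deployments):

```shell
# Hypothetical NCCL settings for a RoCEv2 fabric; all values are illustrative.
export NCCL_IB_HCA=mlx5_0            # RDMA NIC to use (RoCE NICs enumerate as IB devices)
export NCCL_IB_GID_INDEX=3           # select the RoCEv2 GID entry (pure IB fabrics skip this)
export NCCL_SOCKET_IFNAME=eth0       # interface for NCCL's out-of-band bootstrap traffic
export NCCL_IB_QPS_PER_CONNECTION=4  # multiple QPs can spread flows across ECMP paths on Ethernet
```

On an InfiniBand cluster the same job runs with essentially the same variables minus the GID selection, which is part of why the two fabrics are directly comparable at this scale.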
Seen in (wiki)¶
- Meta — How Meta trains large language models at scale. The 2024-06-12 post is the canonical Meta statement on the Llama 3 training substrate: two 24K-GPU H100 clusters (one RoCE, one InfiniBand), on a modified Grand Teton platform at 700 W, air-cooled. (Source: sources/2024-06-12-meta-how-meta-trains-large-language-models-at-scale)
Relation to Llama 3.1¶
Llama 3.1 is the July-2024 update to the Llama 3 family (8B / 70B / 405B), already documented elsewhere on this wiki as the adaptation base for eBay's e-Llama and Instacart's SRL model. The 2024-06-12 Meta infra post predates Llama 3.1's public release; the substrate it describes is the one Llama 3.1 was subsequently trained on (as of this wiki entry, Meta has not published a separate infrastructure retrospective for 3.1).
Relationship:
- Llama 3 — April 2024, 8B + 70B release.
- Llama 3.1 — July 2024 update, 8B + 70B + 405B; same infra family.
Stub¶
More content to add as Meta publishes further infra retrospectives (ingest candidates: Meta's 2024-03-12 Building Meta's GenAI Infrastructure post, Llama 3.1 model-card disclosures, the Llama 3 Herd paper).
Related¶
- systems/llama-3-1 — the 2024-07 successor family documented elsewhere in the wiki.
- systems/meta-genai-cluster-roce — the cluster that trained the largest Llama 3 model.
- systems/meta-genai-cluster-infiniband — the sibling cluster that trained other Llama 3 variants.
- systems/nvidia-h100 / systems/grand-teton — hardware substrate.
- companies/meta — Meta's company page.