PATTERN
Build both fabric alternatives¶
Context¶
When choosing between two qualitatively different architectural substrates at hyperscale — each with deep operational-ecosystem implications, each with asserted-but-unverified tradeoffs — the usual analytical approach (spec comparison, benchmarking, forecasting) fails to deliver confident guidance. The tradeoffs live at operational scale, not at benchmark scale.
The canonical wiki instance is RoCE vs InfiniBand for AI training fabric at 20K+ GPU scale, where:
- Benchmark performance alone cannot predict production AI workload performance.
- Operational tooling maturity differs between the two.
- Neither organization has run either fabric at the target scale with the target workload.
- The decision, once made, is effectively irreversible for the 2-3 year lifetime of the cluster.
Pattern¶
Do not forecast. Build both at production scale, run the target workload on both, learn operationally, carry forward the learnings.
Concretely, Meta's 2024-06-12 execution of the pattern:
- Identify the fabric options. RoCE (prior Meta experience at 4K-GPU scale) vs InfiniBand (prior Meta experience at 16K-GPU research-cluster scale, non-production).
- Build at the target scale — both. 24K GPUs each (same GPU count, same GPU model, same server platform).
- Optimise each for its native strength.
- RoCE cluster: optimised for fast build time (leverage existing Ethernet operational tooling).
- InfiniBand cluster: optimised for full-bisection bandwidth (its native advantage).
- Tune both to workload parity. Apply the same three stack-level optimisations — parallelism-axis → topology-layer mapping, topology-aware collectives, fat-flow load balancing — to both until performance on GenAI workloads is equivalent.
- Run the canonical workload on both. Llama 3 was trained on both 24K-GPU clusters; the largest model was trained on the RoCE cluster.
- Feed learnings back into the next decision. "These learnings will inform the future direction of GenAI fabrics."
"We decided to build both: two 24k clusters, one with RoCE and another with InfiniBand. Our intent was to build and learn from the operational experience." (Source: sources/2024-06-12-meta-how-meta-trains-large-language-models-at-scale)
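The "parallelism-axis → topology-layer mapping" optimisation above can be illustrated with a minimal sketch. This is not Meta's implementation — the rank-layout convention and cluster dimensions here are assumptions for illustration. The idea: order global ranks so that tensor-parallel (TP) peers, which exchange the most traffic, land on the same host (NVLink domain), pipeline-parallel (PP) peers stay within a rack, and data-parallel (DP) replicas span racks over the fabric:

```python
def rank_of(dp_idx: int, pp_idx: int, tp_idx: int, tp: int, pp: int) -> int:
    """Global rank with TP fastest-varying, then PP, then DP."""
    return (dp_idx * pp + pp_idx) * tp + tp_idx

def placement(rank: int, gpus_per_host: int, hosts_per_rack: int):
    """Physical location of a rank, assuming ranks fill hosts in order."""
    host = rank // gpus_per_host
    rack = host // hosts_per_rack
    return rack, host

# Assumed demo sizes: tp=8 matches an 8-GPU host, pp=4 matches a
# 4-host rack, dp=3 replicas across three racks.
tp, pp, dp = 8, 4, 3
gpus_per_host, hosts_per_rack = 8, 4

# All TP peers of the (dp=0, pp=0) group share one host; DP replicas of
# the same (pp, tp) coordinate land in different racks.
tp_hosts = {placement(rank_of(0, 0, t, tp, pp), gpus_per_host, hosts_per_rack)[1]
            for t in range(tp)}
dp_racks = {placement(rank_of(d, 0, 0, tp, pp), gpus_per_host, hosts_per_rack)[0]
            for d in range(dp)}
print(tp_hosts)  # {0} — TP traffic never leaves the host
print(dp_racks)  # {0, 1, 2} — the DP allreduce is what crosses the fabric
```

Under this layout, only the lowest-volume axis (DP) stresses the contested fabric layer, which is what makes the two fabrics comparable on the same workload.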
When to use it¶
- Irreversible decisions at hyperscale where the operational-ecosystem dimension dominates the technical-merit dimension.
- Organization-level capacity to absorb the duplicated build cost. This is not a small-org pattern.
- Availability of both substrates in the required shape (vendor capacity, operational tooling, talent).
- A canonical workload that can be shared between the two builds — here, Llama 3 training.
When not to use it¶
- When one option is dominated on all axes (technical, operational, cost) — just pick it.
- When the duplicated build cost would crowd out the workload it's meant to serve. Meta can afford 48K H100s; most organizations cannot.
- When the two options share most of their failure modes — running both doesn't de-risk what they share.
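The shared-failure-mode caveat can be made quantitative with a back-of-envelope sketch (the probabilities below are invented for illustration, not from the source): building both only removes the portion of failure risk the two options do not share.

```python
def p_both_fail(p_shared: float, p_indep_a: float, p_indep_b: float) -> float:
    """P(both builds fail), modelling each option as failing either via a
    shared mode (perfectly correlated across the two) or via its own
    independent mode."""
    # Either the shared mode fires, or both independent modes fire.
    return p_shared + (1 - p_shared) * p_indep_a * p_indep_b

# Mostly independent failure modes: hedging helps a lot.
print(round(p_both_fail(0.02, 0.20, 0.20), 4))  # 0.0592
# Mostly shared failure modes: hedging barely helps.
print(round(p_both_fail(0.20, 0.02, 0.02), 4))  # 0.2003
```

In the second case the hedge buys almost nothing over a single build, which is the situation this bullet warns against.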
Relationship to adjacent patterns¶
- patterns/dark-ship-for-behavior-parity — shares the "run both in parallel" shape, but dark-ship aims at verifying behaviour preservation before a migration. This pattern is the opposite: you keep both indefinitely to learn, without necessarily intending to consolidate.
- A/B testing on fabric choice is a false analogy — this isn't about end-user impact; it's about operational learning.
- "Two-pizza experiments" is too small a framing; this is tens-of-millions-of-dollars-scale parallel bets.
Why Meta framed it explicitly¶
Meta is unusually willing to announce the "we're hedging" framing publicly — most hyperscalers obscure it. The 2024-06-12 post's candor is architecturally significant:
"We optimized the RoCE cluster for quick build time, and the InfiniBand cluster for full-bisection bandwidth. Our intent was to build and learn from the operational experience."
Declaring "build and learn" makes two things explicit: (a) the team doesn't think the answer is knowable yet, and (b) the organization can afford not to know at this scale.
Wiki instances¶
- Meta 24K-GPU RoCE + 24K-GPU InfiniBand GenAI clusters (2024). Canonical wiki reference. (Source: sources/2024-06-12-meta-how-meta-trains-large-language-models-at-scale) — see systems/meta-genai-cluster-roce / systems/meta-genai-cluster-infiniband.
Related¶
- systems/roce-rdma-over-converged-ethernet / systems/infiniband — the two fabric technologies.
- systems/meta-genai-cluster-roce / systems/meta-genai-cluster-infiniband — the two 24K-GPU deployments.
- concepts/fat-flow-load-balancing / concepts/collective-communication-topology-awareness — the stack-level optimisations Meta applied to both.
- companies/meta.