DSF (Disaggregated Scheduled Fabric)
Disaggregated Scheduled Fabric (DSF) is Meta's open, vendor-agnostic AI networking backend for next-generation training clusters, announced at OCP Summit 2024. It is designed to overcome the scale, component-supply, and power-density limits of Meta's existing switches by using a disaggregated architecture built on open standards.
Architecture
DSF is stacked on three open substrates:
- OCP-SAI: the Switch Abstraction Interface, an OCP standard that provides a vendor-agnostic API for programming network ASICs. The source post dates Meta and Microsoft's collaboration on SAI within OCP to 2018.
- FBOSS: Meta's own network operating system for controlling switches, open-sourced in 2015.
- Ethernet-based RoCE: standard Ethernet with RDMA over Converged Ethernet as the endpoint-facing protocol, supporting NICs and accelerators from NVIDIA, Broadcom, and AMD.
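The value of a vendor-agnostic ASIC API can be illustrated with a minimal sketch. Everything here is hypothetical: `SwitchAsic`, `VendorAAsic`, `VendorBAsic`, and `program_fabric` are illustrative names, not part of SAI (which is a C API) or of FBOSS. The point is only that control-plane logic written once against a shared interface can drive ASICs from different vendors.

```python
from dataclasses import dataclass
from typing import Protocol


class SwitchAsic(Protocol):
    """Hypothetical vendor-neutral interface; NOT the real SAI C API."""

    vendor: str

    def create_route(self, prefix: str, next_hop: str) -> str: ...


@dataclass
class VendorAAsic:
    vendor: str = "vendor-a"

    def create_route(self, prefix: str, next_hop: str) -> str:
        # A real driver would program hardware tables here.
        return f"{self.vendor}: {prefix} -> {next_hop}"


@dataclass
class VendorBAsic:
    vendor: str = "vendor-b"

    def create_route(self, prefix: str, next_hop: str) -> str:
        # Same call shape, different (imaginary) silicon underneath.
        return f"{self.vendor}: {prefix} -> {next_hop}"


def program_fabric(asics: list[SwitchAsic], prefix: str, next_hop: str) -> list[str]:
    """NOS-level logic written once against the interface, reused per vendor."""
    return [asic.create_route(prefix, next_hop) for asic in asics]


results = program_fabric([VendorAAsic(), VendorBAsic()], "10.0.0.0/24", "10.0.1.1")
```

In this sketch, adding a third vendor means adding one class that satisfies `SwitchAsic`; `program_fabric` does not change.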
Design motivation
"Developing open, vendor-agnostic networking backend is going to play an important role going forward as we continue to push the performance of our AI training clusters. Disaggregating our network allows us to work with vendors from across the industry to design systems that are innovative as well as scalable, flexible, and efficient." (Source: sources/2024-10-15-meta-metas-open-ai-hardware-vision)
The source names three limits of Meta's existing switches that DSF is meant to overcome:
- Scale: DSF targets the regime of roughly 1 TB/s of injection bandwidth per accelerator that Meta projects for the next few years.
- Component supply options: open standards and multi-vendor NIC and accelerator support reduce single-vendor dependency.
- Power density: disaggregation lets compute and switching scale independently, so the fabric is not bound to the thermal envelope of a single vertically integrated system.
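To make the scale target concrete, here is a small arithmetic sketch that converts a per-accelerator injection bandwidth from TB/s to Tbit/s and sums it across a cluster. The function names are illustrative, and the 24,000-accelerator cluster size is an assumed example input, not a stated DSF design point.

```python
BITS_PER_BYTE = 8


def injection_tbit_per_s(tbyte_per_s: float) -> float:
    """Convert per-accelerator injection bandwidth from TB/s to Tbit/s."""
    return tbyte_per_s * BITS_PER_BYTE


def aggregate_injection_pbit_per_s(num_accelerators: int, tbyte_per_s: float) -> float:
    """Total injection bandwidth of a cluster, in Pbit/s (1 Pbit/s = 1000 Tbit/s)."""
    return num_accelerators * injection_tbit_per_s(tbyte_per_s) / 1000


per_accelerator = injection_tbit_per_s(1.0)             # 8.0 Tbit/s
cluster = aggregate_injection_pbit_per_s(24_000, 1.0)   # 192.0 Pbit/s
```

At 1 TB/s, each accelerator injects 8 Tbit/s, so the assumed 24,000-accelerator cluster would present 192 Pbit/s of aggregate injection bandwidth to the fabric.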
Companion silicon contributed alongside DSF
- 51T fabric switches: new Meta-built switch hardware based on Broadcom and Cisco ASICs.
- FBNIC: "a new NIC module that contains our first Meta-design network ASIC."
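As a rough sense of what "51T" means in port terms, the sketch below assumes it denotes 51.2 Tbit/s of switching capacity, which the source does not spell out. It computes how many 400G or 800G ports that capacity divides into, and compares it with the 1 TB/s (8 Tbit/s) per-accelerator injection figure. The last number is a pure capacity ratio that ignores uplinks, oversubscription, and topology.

```python
# Assumption: "51T" means 51.2 Tbit/s; held in Gbit/s to keep the math in integers.
ASIC_GBIT_PER_S = 51_200
ACCELERATOR_GBIT_PER_S = 8_000  # 1 TB/s of injection bandwidth = 8 Tbit/s


def port_count(port_gbit_per_s: int) -> int:
    """Number of ports of the given speed that fit in the ASIC's capacity."""
    return ASIC_GBIT_PER_S // port_gbit_per_s


ports_800g = port_count(800)
ports_400g = port_count(400)

# Capacity ratio only: accelerators' worth of full-rate injection per ASIC.
accelerators_per_asic = ASIC_GBIT_PER_S / ACCELERATOR_GBIT_PER_S
```

Under these assumptions one ASIC offers 64 ports at 800G or 128 at 400G, yet carries only about 6.4 accelerators' worth of full-rate injection traffic, which is why the fabric has to scale across many switches.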
Positioning on the wiki
DSF is the next-step evolution past the 24K-GPU RoCE cluster whose fabric design was documented in sources/2024-08-05-meta-a-roce-network-for-distributed-ai-training-at-scale (SIGCOMM 2024). That fabric solved the 4K-to-24K scale-step with a vertically-integrated Meta-designed substrate; DSF opens the same vertical to multi-vendor alternatives, consistent with the open-hardware-for-AI-scaling thesis of the 2024-10 OCP post.
Seen in
- sources/2024-10-15-meta-metas-open-ai-hardware-vision — the OCP 2024 announcement.
Why it matters
- First wiki instance of explicit AI-fabric disaggregation. Prior wiki coverage (Meta 2024-08 RoCE, 2024-06 InfiniBand) is vertically-integrated fabric design; DSF is the disaggregated counterpart.
- The multi-vendor endpoint story matters for supply-chain resilience. Once NICs and GPUs from multiple vendors can plug into the same fabric over standard Ethernet-based RoCE, accelerator choice becomes a workload-level decision instead of a fabric lock-in decision.
Related
- systems/fboss-meta-network-os — the control-plane NOS.
- systems/ocp-sai — the vendor-agnostic switch API.
- systems/fbnic — Meta's first in-house NIC ASIC for the DSF era.
- systems/roce-rdma-over-converged-ethernet — the endpoint protocol.
- systems/meta-genai-cluster-roce — the predecessor fabric generation.
- concepts/network-fabric-disaggregation — the architectural stance.
- concepts/injection-bandwidth-ai-cluster / concepts/bisection-bandwidth — the scaling targets.
- patterns/open-hardware-for-ai-scaling — the broader thesis DSF instantiates.
- companies/meta.