

Injection bandwidth (AI cluster)

Definition

Injection bandwidth, in the context of an AI training cluster, is the rate at which a single GPU/accelerator can inject data into the network fabric. It bounds the pace at which gradients, activations, KV-cache slices, or tensor shards can move off an accelerator during collective communication, and therefore caps how much parallelism the fabric can support before the network itself becomes the bottleneck.
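As a back-of-envelope illustration of that bound (all numbers here are illustrative assumptions, not figures from the source): the time to move any payload off an accelerator is at least payload size divided by injection bandwidth.

```python
GB = 1e9  # bytes

def min_transfer_time_s(payload_bytes: float, injection_bw_bytes_per_s: float) -> float:
    """Lower bound on the time to move a payload off one accelerator:
    no collective, however well scheduled, can beat payload / injection_bw."""
    return payload_bytes / injection_bw_bytes_per_s

# Assumed example: a 40 GB gradient exchange.
print(min_transfer_time_s(40 * GB, 50 * GB))    # 400 Gb/s (= 50 GB/s) NIC -> 0.8 s
print(min_transfer_time_s(40 * GB, 1000 * GB))  # ~1 TB/s regime -> 0.04 s
```

The order-of-magnitude gap between the two results is exactly the growth Meta's projection below describes.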

Meta's 2024-10 projection

Meta names a target regime for next-generation AI clusters:

"In the next few years, we anticipate greater injection bandwidth on the order of a terabyte per second, per accelerator, with equal normalized bisection bandwidth. This represents a growth of more than an order of magnitude compared to today's networks!" (Source: sources/2024-10-15-meta-metas-open-ai-hardware-vision)

The quote stacks two orthogonal claims:

  • Absolute target: ~1 TB/s per accelerator injection bandwidth.
  • Shape target: bisection bandwidth should scale in lockstep, meaning the fabric must be non-oversubscribed at the cluster level, not just at the pod level.

Why it matters

  • Bounds the communication fraction of training. 3D-parallelism communication is latency-hidden behind compute only up to the point where injection bandwidth saturates; past that point, communication dominates wall-clock time.
  • Dictates fabric generation. Roughly 400 Gb/s per GPU today, 800 Gb/s mid-term, and 1.6 Tb/s plus co-packaged optics for the TB/s-class regime Meta projects.
  • Drives silicon-level investment. Meta's FBNIC, its in-house NIC ASIC, is a response to hitting injection-bandwidth limits with off-the-shelf NICs.
  • Fat-flow problems get worse. As per-accelerator injection bandwidth grows, per-flow throughput grows; fat-flow load balancing and topology-aware collectives become more, not less, critical.
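The first bullet can be made concrete with the standard ring all-reduce cost model: for an S-byte tensor across N accelerators, each link carries about 2(N-1)/N x S bytes, so injection bandwidth gives a lower bound on per-step communication time. The tensor size, cluster size, and compute time below are assumptions for illustration, not Meta's numbers.

```python
GB = 1e9  # bytes

def ring_allreduce_time_s(tensor_bytes: float, n: int, injection_bw: float) -> float:
    """Bandwidth-only lower bound on ring all-reduce time (latency terms ignored):
    each accelerator's link moves 2*(n-1)/n * tensor_bytes."""
    return (2 * (n - 1) / n) * tensor_bytes / injection_bw

# Assumed: 20 GB of gradients, 1,024 accelerators, 400 Gb/s (= 50 GB/s) links.
comm = ring_allreduce_time_s(20 * GB, 1024, 50 * GB)
compute = 0.5  # assumed per-step compute time, seconds
print(comm, "hidden" if comm <= compute else "comm-bound")  # comm-bound at these numbers
```

At ~1 TB/s the same exchange drops to roughly 40 ms and tucks back under the compute time, which is the "communication fraction" argument in the bullet above.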
