
CONCEPT Cited by 1 source

Network-bound vs compute-bound

Definition

A system is network-bound when its scaling-limiting resource is network bandwidth (or packet rate) — adding more CPU / GPU does not increase throughput because bytes-per-second into or out of the host cap the rate at which useful work gets done. It is compute-bound when the scaling-limiting resource is CPU or GPU cycles — additional network capacity would not increase throughput because the host saturates its arithmetic units first.

A system can be bound on different resources at different tiers (e.g., root tier network-bound while leaf tier compute-bound) or change which resource it's bound on as architecture evolves.
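
To make the distinction concrete, a minimal sketch (all numbers hypothetical, not taken from the Pinterest post): each resource imposes a throughput ceiling of capacity divided by per-request cost, and the system runs at the lowest ceiling, which is why adding capacity on any other resource buys nothing.

```python
# Illustrative sketch; every number here is hypothetical. Each resource caps
# throughput at (capacity / per-request cost), and the system runs at the
# lowest of those ceilings.

def ceilings(nic_gbps, cores, payload_mb_per_req, cpu_ms_per_req):
    net = (nic_gbps * 1e9 / 8) / (payload_mb_per_req * 1e6)  # req/s the NIC allows
    cpu = cores * 1000.0 / cpu_ms_per_req                    # req/s the cores allow
    return {"network": net, "cpu": cpu}

c = ceilings(nic_gbps=25, cores=32, payload_mb_per_req=2.0, cpu_ms_per_req=4.0)
print(c, "-> bound on", min(c, key=c.get))   # network ~1562 req/s vs cpu 8000 req/s

# Doubling cores does not move throughput while the network binds:
c2 = ceilings(nic_gbps=25, cores=64, payload_mb_per_req=2.0, cpu_ms_per_req=4.0)
print(min(c2.values()))                      # still ~1562 req/s
```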

Canonical Pinterest datum

From the 2026-05-01 Feature Trimmer post (Source: sources/2026-05-01-pinterest-optimizing-ml-workload-network-efficiency-part-i-feature-trimmer):

"the network bandwidth between root and leaf became a performance bottleneck on the online serving path; we had to scale the system based on network usage rather than compute."

Two named symptoms:

  1. Leaf partitions network-bound, GPUs idle: "peak network usage was significantly higher than peak GPU SM activity … the network bottleneck prevented us from fully utilizing the available GPU compute power."
  2. Root forced onto network-optimized instance types: "we had to use the network optimized AWS instance type m6in to ensure the server latency met our internal SLA." m6in is roughly 20% more expensive than standard m6i — a direct cost signal of network-boundedness.

After Feature Trimmer shipped, the bottleneck relocated to CPU on the root cluster: "It effectively shifted the bottleneck from network to CPU cycles on the root cluster." This is textbook bottleneck relocation: solving a network bottleneck moves the pressure to the next-most-binding resource.

Diagnostic signals

Network-bound

  • Utilisation shape: network metrics (bandwidth, PPS, or link saturation) track closely with latency; CPU / GPU utilisation is comfortably below capacity.
  • Instance type pressure: a tier is on a network-optimized instance type (AWS m6in / c6in / m7in, etc.) that's more expensive per-core than the standard flavour, purely for network throughput.
  • Fleet scaling axis: capacity planning is expressed in bytes or packets per second, not in requests per second per core. You scale by adding hosts to get more aggregate network capacity, not more cores.
  • Latency responds to compression / payload reduction, not to faster CPUs. Pinterest's fbthrift lz4 lever (−20% bandwidth) reduced latency and cost at +5% CPU, a net win because the bottleneck wasn't CPU (the arithmetic is worked through after this list).
  • Downstream resource underutilization: GPUs idle waiting for features, CPUs idle waiting for data.
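
The compression lever above is worth working through numerically. A sketch using the −20% bandwidth / +5% CPU figures quoted from the post, against hypothetical baseline ceilings where the network is the binding resource:

```python
# Worked arithmetic for the compression lever, using the -20% bandwidth and
# +5% CPU figures quoted above; the baseline ceilings are hypothetical.
net_ceiling = 1500.0   # req/s the NIC allows before compression (hypothetical)
cpu_ceiling = 8000.0   # req/s the cores allow before compression (hypothetical)

before = min(net_ceiling, cpu_ceiling)   # 1500 req/s, network binds

after_net = net_ceiling / 0.80           # 1875 req/s: 20% fewer bytes per request
after_cpu = cpu_ceiling / 1.05           # ~7619 req/s: 5% more CPU per request
after = min(after_net, after_cpu)        # 1875 req/s, still network-bound

# Ceiling rises ~25% because the binding resource (network) got cheaper;
# the CPU hit lands on a resource with plenty of slack.
print(before, "->", after)
```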

Compute-bound

  • Utilisation shape: CPU / GPU utilisation at or near capacity; network headroom available.
  • Latency responds to algorithm / implementation improvements or to scale-out, not to payload reduction.
  • Cost shape: dominated by CPU / GPU instance hours, not by data transfer or network-optimized instance premiums. (A first-pass classification sketch using both signal lists follows.)
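
The two signal lists can be folded into a crude first-pass check against peak-window metrics. A minimal sketch; the field names and thresholds are hypothetical, not from the post:

```python
# Crude first-pass classification from peak-window utilization fractions.
# Field names and thresholds are hypothetical; a real diagnosis should also
# check whether latency tracks network metrics or compute metrics.

def classify(peak):
    net = peak["nic_util"]                                # fraction of NIC capacity at peak
    compute = max(peak["cpu_util"], peak["gpu_sm_util"])  # busiest compute resource at peak
    if net > 0.8 and compute < 0.5:
        return "likely network-bound"
    if compute > 0.8 and net < 0.5:
        return "likely compute-bound"
    return "mixed / inconclusive: correlate latency with each resource"

# High link saturation with idle GPUs: the leaf-tier symptom described above.
print(classify({"nic_util": 0.92, "cpu_util": 0.35, "gpu_sm_util": 0.20}))
```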

The relocation dynamic

Relieving the binding resource rarely yields unbounded throughput; it usually exposes the next-most-binding resource. Pinterest's post makes this explicit:

  • Pre-Feature-Trimmer: network-bound (root → leaf link + leaf inbound capacity).
  • Post-Feature-Trimmer: CPU-bound on root. "It effectively shifted the bottleneck from network to CPU cycles on the root cluster. This also allows the team to switch focus to optimizing the payload between the client and the root to further finetune the resource utilization end-to-end."

The Part II follow-up (client→root compression) is explicitly framed as the next lever once the root-leaf lever has played out.

Generalisable takeaway: architectural investment should be sequenced by whichever resource is currently binding; optimising a non-binding resource yields no throughput or cost win.
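
The relocation dynamic falls out of the same ceiling arithmetic. A sketch with hypothetical numbers: shrinking the root-to-leaf payload raises the network ceiling past the CPU ceiling, so the binding resource moves rather than disappearing.

```python
# Bottleneck relocation with hypothetical numbers: a Feature Trimmer-style
# payload reduction raises the network ceiling past the CPU ceiling, so the
# binding resource moves rather than disappearing.

def binding(net_ceiling, cpu_ceiling):
    return ("network", net_ceiling) if net_ceiling < cpu_ceiling else ("cpu", cpu_ceiling)

net, cpu = 1500.0, 4000.0        # req/s ceilings before trimming (hypothetical)
print(binding(net, cpu))         # ('network', 1500.0): network-bound

trimmed = net / 0.5              # suppose trimming halves the root->leaf payload
print(binding(trimmed, cpu))     # ('cpu', 4000.0): pressure relocates to root CPU
```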

Why this is easy to mis-diagnose

  • Mixed-mode hosts: a host might be network-bound during peak hours and compute-bound at off-peak. Capacity planning in aggregate can miss the peak bottleneck.
  • Hidden network cost: RPC serialisation, cross-host fan-out payloads, and compression CPU all show up in ways that look compute-ish but are really network-driven ("we're burning CPU because we're serialising too much").
  • Multi-tier systems can be bound on different resources at different tiers simultaneously (Pinterest's root CPU-bound post-trimmer while Homefeed leaf rightsizing was still in progress).
  • GPU underutilization is often a network symptom, not a model-serving efficiency problem — if your GPU SM activity is low but GPU-allocated capacity is high, network fan-out is a strong suspect.

Caveats

  • Not the only two modes — systems can be memory-bound, disk-I/O-bound, lock-contention-bound, or bound on external dependencies (feature store, database, auth service). Network-vs-compute is the commonly-discussed pair at the inference-serving altitude.
  • Instance-type choice as a diagnostic works on cloud providers with network-optimized flavours (AWS m6in, c6in; GCP C3D network-optimized; etc.); it is less informative on bare metal.