CONCEPT Cited by 1 source

CUDA throughput budget

Definition

CUDA throughput budget is the GPU-throughput cost profile of a serving workload — the queries-per-second per GPU (or equivalently per-request GPU time) that a given model consumes under production load. It is the sequencing axis for decisions about which workloads can be merged, which need isolation, and which need dedicated efficiency work before scaling.

Different product surfaces / workloads in the same organisation can have radically different CUDA throughput budgets: a short-context lightweight scoring model might serve 10k+ QPS per GPU, while a long-context LLM-scale ranker might serve <100 QPS per GPU on the same hardware. Attempting to unify these workloads onto one model without pairing them by throughput profile risks either (a) blowing the latency/cost budget for the cheaper workload or (b) throttling the expensive workload's quality.
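Using the hypothetical figures above (10k+ QPS per GPU vs <100 QPS per GPU on the same hardware), a back-of-envelope sketch of the cost gap — the $2.50/GPU-hour rate is an assumed placeholder, not a real price:

```python
# Illustrative only: cost gap between the two hypothetical workloads above.
GPU_HOURLY_RATE = 2.50          # USD per GPU-hour (assumed placeholder)
SECONDS_PER_HOUR = 3600

def cost_per_million_requests(qps_per_gpu: float) -> float:
    """USD of GPU time needed to serve 1M requests at a given per-GPU QPS."""
    gpu_seconds = 1_000_000 / qps_per_gpu
    return gpu_seconds / SECONDS_PER_HOUR * GPU_HOURLY_RATE

light = cost_per_million_requests(10_000)  # short-context lightweight scorer
heavy = cost_per_million_requests(100)     # long-context LLM-scale ranker

print(f"light: ${light:.2f}/M req, heavy: ${heavy:.2f}/M req, "
      f"ratio: {heavy / light:.0f}x")
```

At these illustrative throughputs the gap is 100×: merging the two workloads naively means the cheap workload's traffic is priced at the expensive workload's rate.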

Canonical wiki instance — Pinterest's staged unification

Pinterest's unified ads engagement model explicitly sequences unification by CUDA throughput (Source: sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces):

"Since the cost of Related Pins (RP), Home Feed (HF), and Search (SR) differ substantially, we first unified Home Feed and Search (similar CUDA throughput characteristics) and expanded to Related Pins only after throughput and efficiency work stabilized."

The practical logic:

  • HF and SR have similar per-request GPU cost → share architecture, share trunk, share training data → unified model is viable.
  • RP has substantially higher per-request GPU cost → efficiency work must land first (request-level broadcasting, projection layers, fused kernels) before RP's traffic can be added to the unified model.
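The pairing logic above can be sketched as a small grouping pass: band workloads whose per-request GPU cost sits within a factor-of-k of each other, and unify within a band first. The surface names reuse Pinterest's abbreviations, but the millisecond figures and the factor-of-2 band are invented for illustration:

```python
# Hypothetical sketch: group workloads by per-request GPU cost so that only
# cost-compatible workloads are candidates for unification in the same phase.
workloads = {          # per-request GPU milliseconds (assumed, not Pinterest's)
    "HF": 4.0,
    "SR": 5.0,
    "RP": 40.0,
}

def throughput_bands(costs: dict[str, float], k: float = 2.0) -> list[list[str]]:
    """Greedily band workloads sorted by cost; start a new band when the
    next workload costs more than k x the band's cheapest member."""
    bands: list[list[str]] = []
    current: list[str] = []
    for name, cost in sorted(costs.items(), key=lambda item: item[1]):
        if current and cost > k * costs[current[0]]:
            bands.append(current)
            current = []
        current.append(name)
    bands.append(current)
    return bands

print(throughput_bands(workloads))   # HF and SR band together; RP stands alone
```

With these assumed numbers the pass reproduces the staging in the quote: HF and SR land in one band and RP in its own, so RP waits for efficiency work before joining.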

Why CUDA throughput is the right axis

  • It's the serving-cost denominator. Infra cost ≈ traffic × (1 / GPU throughput) × GPU hourly rate. Two workloads with matched throughput share the same serving-economics regime; workloads with mismatched throughput need separate optimisation work.
  • Throughput differences often exceed 10×. Cost-mismatched workloads can't share model capacity without one of them paying heavily.
  • It's measurable in isolation. Per-workload throughput can be benchmarked before unification is attempted.
  • It's independent of model quality. A unified model is valuable if the shared trunk helps generalisation; pairing workloads by throughput ensures the cost side of the equation doesn't kill the value side.

Related quantities

  • MFU (Model FLOPs Utilisation) — how close realised throughput comes to hardware peak. Low MFU means the CUDA throughput budget isn't being fully spent on useful compute.
  • Per-request tail latency — the p99 budget that constrains the acceptable CUDA time per request.
  • Batch-size / throughput curve — throughput typically rises with batch size up to hardware saturation; a model's operating point is a compromise between latency and throughput.
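The batch-size / throughput trade-off above can be sketched with a toy latency model (fixed per-batch overhead plus a per-item cost). The overhead, per-item cost, and p99 budget are assumptions, not measured profiles:

```python
# Toy model of the batch-size / throughput curve: latency grows with batch
# size, throughput rises with it, and the p99 budget caps the operating point.
FIXED_MS = 5.0        # per-batch launch/overhead cost in ms (assumed)
PER_ITEM_MS = 0.5     # marginal GPU time per request in a batch (assumed)
P99_BUDGET_MS = 25.0  # latency budget that bounds acceptable batch size

def batch_latency_ms(batch: int) -> float:
    return FIXED_MS + PER_ITEM_MS * batch

def batch_throughput_qps(batch: int) -> float:
    return batch / (batch_latency_ms(batch) / 1000.0)

# Operating point: the largest batch whose latency still fits the p99 budget.
best = max(b for b in range(1, 257) if batch_latency_ms(b) <= P99_BUDGET_MS)
print(best, batch_throughput_qps(best))
```

In this toy model a batch of 1 yields ~182 QPS while the budget-limited batch of 40 yields 1,600 QPS — the same hardware spends its CUDA throughput budget far more efficiently at the larger batch, at the cost of latency.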

Generalisations

The sequencing principle — pair workloads by serving-cost profile, unify cheap pairs first — generalises beyond GPUs:

  • TPU throughput budget for equivalent TPU-served workloads.
  • CPU throughput budget for CPU-bound scoring or rules engines.
  • Memory bandwidth budget for embedding-heavy models where HBM bandwidth dominates cost.

Pinterest's specific phrasing uses "CUDA throughput" because its ads ranking stack is GPU-served.

Caveats

  • Pinterest doesn't disclose absolute throughput numbers for HF, SR, or RP. The "similar" / "substantially higher" framing is qualitative.
  • CUDA-specific framing assumes NVIDIA GPUs. The same sequencing logic applies to other accelerators but the benchmark tool differs.
  • Throughput alone is necessary but not sufficient — two workloads with matched throughput but incompatible feature sets or task objectives may still fail to merge cleanly.
  • Throughput profile can shift after unification. Pinterest's baseline unification "materially increased serving cost" — the merged model's throughput can be worse than both originals until efficiency work lands.
