CONCEPT Cited by 3 sources
Inference vs training workload shape¶
Definition¶
Inference and training are fundamentally different workload shapes, and the infrastructure design that fits each is correspondingly different. The clearest statement in the wiki comes from Fly.io (2024-08-15):
> Training workloads tend to look more like batch jobs, and inference tends to look more like transactions. Batch training jobs aren't that sensitive to networking or even reliability. Live inference jobs responding to end-user HTTP requests are.
(Source: sources/2024-08-15-flyio-were-cutting-l40s-prices-in-half)
This concept is the workload-shape half of the training / serving boundary story. The two concepts are complementary:
- Training-serving-boundary is about the organisational / infrastructure split — and whether that split is eroding at the frontier-model end of the spectrum because compute requirements converge.
- Inference-vs-training-workload-shape is about the inherent request shape — and how that shape drives different infra choices for latency, reliability, networking, hardware selection, and storage proximity.
The two shapes¶
Training (batch)¶
- Request shape: long-running job, minutes to months.
- Latency sensitivity: low. A checkpoint in 5 minutes vs 6 is a rounding error.
- Reliability sensitivity: low-to-moderate. Checkpointing recovers from preemption or host loss; the frameworks (Megatron-LM, PyTorch distributed) are built assuming failure is the norm.
- Networking: intense within the cluster (tensor / pipeline parallelism over NVLink + InfiniBand) but not sensitive to WAN networking — data is staged once and consumed repeatedly.
- Hardware fit: HBM-capacity-first parts — A100 SXM, H100 SXM — with NVLink/NVSwitch fabrics to close the tensor-parallel communication hop.
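The batch shape above can be made concrete with a minimal sketch: a long-running loop that periodically checkpoints and resumes after preemption, which is why reliability sensitivity stays low. All names here (`train_step`, the pickle-file checkpoint) are illustrative assumptions, not the API of Megatron-LM or PyTorch distributed.

```python
# Minimal sketch of the batch shape: a long job that survives preemption
# by checkpointing and resuming. Illustrative only; real frameworks
# checkpoint model/optimiser state, not a toy dict.
import os
import pickle

def train_step(state):
    # Stand-in for a real forward/backward pass.
    return {"loss": (state["loss"] or 1.0) * 0.99}

def train(num_steps, save_path="ckpt.pkl", checkpoint_every=100):
    # Resume from the last checkpoint if a previous run was preempted.
    step, state = 0, {"loss": None}
    if os.path.exists(save_path):
        with open(save_path, "rb") as f:
            step, state = pickle.load(f)

    while step < num_steps:
        state = train_step(state)
        step += 1
        if step % checkpoint_every == 0:
            # A checkpoint landing a minute late is invisible to the job.
            with open(save_path, "wb") as f:
                pickle.dump((step, state), f)
    return step, state
```

Re-running `train` after a kill simply fast-forwards to the last checkpoint, which is the sense in which "failure is the norm" is survivable for this shape.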
Inference (transaction)¶
- Request shape: single request → single response, typically tens of ms to a few seconds.
- Latency sensitivity: high. It is part of an end-user HTTP request; the p99 is a user-visible number.
- Reliability sensitivity: high. Every failed inference is a user-visible error.
- Networking: sensitive to WAN latency + bandwidth between the user and the GPU, and between the GPU and its object storage for model parameters / datasets. Anycast + edge placement matter.
- Hardware fit: "capable enough"-class cards — A10, L40S — co-resident with fast object storage, often without NVLink/NVSwitch.
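The transaction shape can be sketched the same way: one request in, one response out, with tail latency recorded per request because the p99 is user-visible. The handler, the default model stub, and the nearest-rank percentile are all illustrative assumptions, not any serving framework's API.

```python
# Minimal sketch of the transaction shape: per-request serving with a
# user-visible tail-latency (p99) metric. Illustrative assumptions only.
import math
import time

class LatencyTracker:
    def __init__(self):
        self.samples_ms = []

    def observe(self, ms):
        self.samples_ms.append(ms)

    def p99(self):
        # Nearest-rank percentile over recorded request latencies.
        ranked = sorted(self.samples_ms)
        idx = max(0, math.ceil(0.99 * len(ranked)) - 1)
        return ranked[idx]

def serve(request, tracker, model=lambda r: r.upper()):
    # Single request -> single response; every failure and every slow
    # request here is directly visible to an end user.
    start = time.perf_counter()
    response = model(request)
    tracker.observe((time.perf_counter() - start) * 1000)
    return response
```

Note the contrast with the training sketch: there is no checkpoint-and-retry escape hatch here, which is why reliability and latency sensitivity are both high.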
Why this concept matters¶
- It explains the Fly.io customer-data surprise. Fly.io expected demand for fractional-A100 slicing and NVLink-ganged training clusters; actual demand concentrated on the least capable card (A10) because that's what fits transaction-shaped inference, and Fly.io's customer base is primarily inference.
- It is the architectural basis for GPU + object-storage co-location. Transaction-shape workloads can't pay a WAN round-trip to a separate object store per request; the pattern collapses GPU and object storage onto one platform.
- It is the architectural basis for downmarket inference cards. If inference is transaction-shape, the rational card is the one that is capable enough + priced right + locally networked, not the frontier part. The L40S is explicitly positioned as this card by Fly.io.
- It is a counterpoint to the training/serving-boundary erosion claim. Vogels / SageMaker argue the two workloads are converging on compute at the frontier-model end. Fly.io's argument is that for most production inference, the two workloads are still shape-divergent, and infra design that assumes they're the same under-delivers. Both framings are in tension and both are well-founded. The wiki notes both rather than picking one.
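The co-location point above is back-of-envelope arithmetic: if every request must touch object storage, the storage round-trip is paid inside the latency budget. The specific numbers below (60 ms WAN RTT, 1 ms local RTT, 25 ms compute) are assumptions for illustration, not figures from the cited sources.

```python
# Back-of-envelope: per-request latency when storage is reached over the
# WAN vs co-located. All numbers are illustrative assumptions.
def request_latency_ms(storage_rtt_ms, compute_ms, fetches=1):
    # Each request pays `fetches` storage round-trips plus GPU compute.
    return fetches * storage_rtt_ms + compute_ms

wan_ms   = request_latency_ms(storage_rtt_ms=60, compute_ms=25)  # 85 ms
local_ms = request_latency_ms(storage_rtt_ms=1,  compute_ms=25)  # 26 ms
```

Under these assumed numbers the WAN path more than triples per-request latency, which is the shape-level argument for putting the object store next to the GPU.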
Implications for infra choice¶
| Axis | Training fit | Inference fit |
|---|---|---|
| GPU part | SXM + NVLink; HBM-first | PCIe + rack-form-factor; capacity-adequate |
| Storage path | Staged once, read-heavy | Per-request; locality matters |
| Network path | Cluster-local (NVLink / IB) | WAN + edge (Anycast) |
| Reliability | Checkpoint + retry | Per-request resilience |
| Cost basis | GPU-hours-as-budget (batch amortises) | $/request (long tail of idle) |
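The cost-basis row reduces to simple arithmetic: a saturated batch job amortises the full GPU-hour, while a transaction fleet bills for the idle tail between requests. The prices, throughput, and utilisation below are made-up assumptions, chosen only to show the shape of the calculation.

```python
# Illustrative cost arithmetic for the table's cost-basis row.
# All prices/rates are assumptions, not quoted provider pricing.
def batch_cost_per_sample(gpu_hour_usd, samples_per_hour):
    # Training: the GPU is saturated, so $/sample is amortised $/hour.
    return gpu_hour_usd / samples_per_hour

def inference_cost_per_request(gpu_hour_usd, peak_rps, utilisation):
    # Serving: the GPU bills for the whole hour, but only a utilisation
    # fraction of its peak throughput is serving requests -- the rest is
    # the "long tail of idle" paid per request.
    served_per_hour = peak_rps * 3600 * utilisation
    return gpu_hour_usd / served_per_hour
```

At 10% utilisation the effective $/request is 10x what a fully loaded card would cost, which is the economic pressure behind "capable enough + priced right" inference parts.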
Seen in (wiki)¶
- sources/2024-08-15-flyio-were-cutting-l40s-prices-in-half — Fly.io's canonical statement of the shape distinction; the pricing decision to drop L40S to A10 prices is the downstream consequence.
- sources/2025-08-06-allthingsdistributed-removing-friction-sagemaker-ai-development — Vogels's framing of the compute-convergence claim at the frontier-model end; complement / counterpoint to Fly.io's shape-divergence framing.
- sources/2025-02-14-flyio-we-were-wrong-about-gpus — Fly.io's 2025-02 retrospective re-affirms the shape-divergence framing and adds a demand-side overlay: even for transaction-shape workloads where the shape analysis says a GPU should be well-placed, most application developers want an LLM API, not a GPU. Inference-shape GPU workloads are still a real market, but they're a smaller one than the workload-shape analysis alone would suggest.
Related¶
- concepts/training-serving-boundary — the organisational / infrastructure split; paired concept.
- concepts/inference-compute-storage-network-locality — the infra-level consequence of the inference shape.
- patterns/co-located-inference-gpu-and-object-storage — the canonical pattern for the inference shape.
- patterns/workload-segregated-clusters — the broader idea of running dedicated clusters per workload shape.