CONCEPT Cited by 5 sources

Training / serving boundary

The training / serving boundary is the organizational and infrastructure split between the fleet that trains a model and the fleet that serves it in production. Historically, the split made sense: training was batch, bandwidth-hungry, memory-light per accelerator, and tolerant of preemption; serving was low-latency, memory-resident, and steady-state. The two workloads wanted different hardware, different placement, and different schedulers.

Why it's eroding

For foundation models, the two workloads have converged:

  • Both pin large model weights into GPU memory.
  • Both want large-memory, high-interconnect GPUs.
  • Both now use the same frameworks (PyTorch, JAX) and similar kernels.
  • The infra that trains a 70B-parameter model is well-shaped to serve that same model on the same hardware.
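
The memory-footprint convergence can be made concrete with back-of-envelope arithmetic (a sketch; the 70B size, bf16 precision, and 80 GB GPU capacity are illustrative assumptions, not figures from the source):

```python
def weight_memory_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """GPU memory needed just to keep model weights resident
    (bf16/fp16 = 2 bytes per parameter)."""
    return params_billion * 1e9 * bytes_per_param / 1e9

# A 70B-parameter model in bf16 needs ~140 GB for weights alone --
# more than any single GPU, so both training AND serving must shard
# the model across a multi-GPU, high-interconnect node.
print(weight_memory_gb(70))        # 140.0
print(weight_memory_gb(70) / 80)   # 1.75 "80 GB GPUs" worth, before activations / KV cache
```

Training additionally holds optimizer state and gradients, and serving holds KV cache, but the dominant shared term is the same: pinned weights on large-memory, well-interconnected GPUs.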

Vogels's framing:

Organizations [are] maintaining separate infrastructure for training models and serving them in production, a pattern that made sense when these workloads had fundamentally different characteristics, but one that has become increasingly inefficient as both have converged on similar compute requirements.

(Source: sources/2025-08-06-allthingsdistributed-removing-friction-sagemaker-ai-development)

Costs of keeping the boundary

  • Underutilization. Training clusters are idle when no job is running; serving clusters are over-provisioned for peak. Two fleets = two "over-provisioning taxes."
  • Operational complexity. Twice the IAM, twice the monitoring, twice the upgrade cadence, twice the on-call.
  • Data motion. The checkpoint artefact has to move between fleets on every deploy; non-trivial at hundreds-of-GB model scale.
  • Deployment latency. Train → publish → schedule → warm up = deploy window measured in hours, not minutes.
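
The data-motion tax is easy to estimate (a sketch; the artefact size, link speed, and efficiency factor are illustrative assumptions):

```python
def transfer_minutes(artifact_gb: float, link_gbps: float,
                     efficiency: float = 0.7) -> float:
    """Minutes to copy a checkpoint between fleets over a link of
    `link_gbps` gigabits/s, at an assumed effective utilization."""
    seconds = artifact_gb * 8 / (link_gbps * efficiency)
    return seconds / 60

# A 140 GB checkpoint over a 10 Gb/s inter-fleet link at 70% efficiency:
print(round(transfer_minutes(140, 10), 1))   # 2.7 minutes, per copy, per deploy
```

The copy itself is only minutes; the hours come from the surrounding publish → schedule → warm-up pipeline that a second fleet forces.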

Costs of collapsing it

  • Scheduler must cohabit long-running training jobs with latency-sensitive inference. Mis-scheduling means a training job's synchronous collectives saturate the interconnect and blow out serving tail latency. Preemption / priority policies become the design center.
  • Multi-tenancy on GPU is still harder than on CPU. MIG / MPS / GPUDirect isolation is less mature than KVM / cgroups.
  • Capacity planning couples. A surge in training demand can starve serving if the pool is shared.
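
The preemption-as-design-center point can be sketched as a toy policy (hypothetical class and job names; a real fleet expresses this through its cluster scheduler's priority and preemption machinery, not application code):

```python
import heapq
from dataclasses import dataclass, field

INFERENCE, TRAINING = 0, 1   # lower number = higher priority

@dataclass(order=True)
class Job:
    priority: int
    name: str = field(compare=False)

class SharedPoolScheduler:
    """Toy shared training+serving GPU pool: inference preempts training."""

    def __init__(self, gpus: int):
        self.free = gpus
        self.queue: list[Job] = []    # pending jobs, min-heap on priority
        self.running: list[Job] = []

    def submit(self, job: Job) -> None:
        heapq.heappush(self.queue, job)
        self._schedule()

    def _schedule(self) -> None:
        while self.queue:
            if self.free > 0:                       # spare GPU: just run it
                self.running.append(heapq.heappop(self.queue))
                self.free -= 1
            elif self.queue[0].priority == INFERENCE:
                victims = [r for r in self.running if r.priority == TRAINING]
                if not victims:
                    break                           # pool is all inference already
                # Preempt one trainer (a real system would checkpoint it first).
                self.running.remove(victims[0])
                heapq.heappush(self.queue, victims[0])
                self.running.append(heapq.heappop(self.queue))  # the inference job
            else:
                break                               # training waits for capacity

s = SharedPoolScheduler(gpus=2)
s.submit(Job(TRAINING, "pretrain-70b-a"))
s.submit(Job(TRAINING, "pretrain-70b-b"))
s.submit(Job(INFERENCE, "chat-replica-1"))   # pool full -> one trainer preempted
print([j.name for j in s.running])           # ['pretrain-70b-b', 'chat-replica-1']
```

The hard part elided here is exactly what the bullet names: the preempted trainer must checkpoint cleanly (its synchronous collective involves every peer), and the policy must cap how much of the pool training can reclaim so a demand surge cannot starve serving.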

Architectural response

HyperPod's model-deployment capability (2025) is explicitly about collapsing this boundary: train a foundation model on a HyperPod cluster, deploy it for inference on that same cluster, without a separate provisioning step. The product is pitched as a utilization win (one fleet instead of two) and a complexity win (one operational substrate). (Source: sources/2025-08-06-allthingsdistributed-removing-friction-sagemaker-ai-development)

This is adjacent to, but distinct from, concepts/compute-storage-separation (OLAP): both are about collapsing artificial fleet splits, but this is compute ↔ compute (training vs inference GPU fleets) whereas compute-storage-separation is compute ↔ storage (query engines vs. columnar files).

Seen in

  • sources/2025-08-06-allthingsdistributed-removing-friction-sagemaker-ai-development — Vogels frames the split as a historical artefact; HyperPod model deployment crosses it by running training + inference on the same compute.
  • sources/2025-12-18-dropbox-feature-store-powering-real-time-ai-dash — adjacent but distinct: the feature store keeps training and serving consistent on feature values (the same PySpark transformations feed both offline + online stores) even though the compute fleets remain separate. Same discipline — unified feature values — applied across the boundary, not collapsing it.
  • sources/2025-08-21-google-from-massive-models-to-mobile-magic-tech-behind-youtube-real-time-generative-ai — YouTube's real-time generative AI effects pipeline is a third flavour of boundary discipline: the training-time model (large generative teacher — StyleGAN2 then Imagen) and the serving-time model (small on-device student — UNet + MobileNet) are structurally different models bridged by knowledge distillation. HyperPod collapses the compute fleets; the feature store unifies feature values; distillation lets the teacher and student architectures themselves diverge so each can be shaped for its fleet's constraints.
  • sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development — a training-only retrospective: eBay's e-Llama continued pretraining is meticulous on the training side (480 H100s × Megatron-LM 3D parallelism × 1T tokens × benchmark methodology × 1-month wall-clock) and entirely silent on the serving side (no inference backend, no per-query latency, no QPS, no cost-per-token, no product-surface integration). Data point on how sparse the serving-infra half of this boundary remains in the public record, even when the training-infra half is disclosed in detail.
  • sources/2024-08-15-flyio-were-cutting-l40s-prices-in-half — counter-framing instance. Fly.io's thesis is that the two workloads are still shape-divergent in production (training = batch, inference = transactions; different sensitivities to networking + reliability), and that infra design that treats them as converged under-delivers for inference. Useful tension with Vogels's compute-convergence framing: the two claims operate at different ends of the spectrum (Vogels at frontier-model training + serving; Fly.io at mainstream inference). See the paired concept concepts/inference-vs-training-workload-shape.