CONCEPT Cited by 5 sources
Training / serving boundary¶
The training / serving boundary is the organizational and infrastructure split between the fleet that trains a model and the fleet that serves it in production. Historically, the split made sense: training was batch, bandwidth-hungry, memory-light per accelerator, and tolerant of preemption; serving was low-latency, memory-resident, and steady-state. The two workloads wanted different hardware, different placement, and different schedulers.
Why it's eroding¶
For foundation models, the two workloads have converged:
- Both pin large model weights into GPU memory.
- Both want large-memory, high-interconnect GPUs.
- Both now use the same frameworks (PyTorch, JAX) and similar kernels.
- The infra that trains a 70B-parameter model is well-shaped to serve that same model on the same hardware.
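The memory-convergence point can be made concrete with back-of-envelope arithmetic (illustrative numbers; the bf16-weights and Adam-optimizer assumptions are mine, not from the source):

```python
# Back-of-envelope GPU memory footprint for a 70B-parameter model.
# Assumptions (illustrative): bf16 weights, Adam optimizer for training.
PARAMS = 70e9
BF16 = 2  # bytes per parameter

# Serving: weights resident in GPU memory (KV cache ignored here).
serving_gb = PARAMS * BF16 / 1e9

# Training: bf16 weights + bf16 grads + fp32 master weights
# + fp32 Adam first/second moments = 2 + 2 + 4 + 4 + 4 = 16 bytes/param.
training_gb = PARAMS * 16 / 1e9

print(f"serving weights: {serving_gb:.0f} GB")   # 140 GB
print(f"training state:  {training_gb:.0f} GB")  # 1120 GB
```

Either footprint exceeds a single accelerator, so both workloads end up wanting the same large-memory, high-interconnect GPU nodes; training just wants more of them.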
Vogels's framing:
Organizations [are] maintaining separate infrastructure for training models and serving them in production, a pattern that made sense when these workloads had fundamentally different characteristics, but one that has become increasingly inefficient as both have converged on similar compute requirements.
(Source: sources/2025-08-06-allthingsdistributed-removing-friction-sagemaker-ai-development)
Costs of keeping the boundary¶
- Underutilization. Training clusters are idle when no job is running; serving clusters are over-provisioned for peak. Two fleets = two "over-provisioning taxes."
- Operational complexity. Twice the IAM, twice the monitoring, twice the upgrade cadence, twice the on-call.
- Data motion. Checkpoint artefacts have to move between fleets on every deploy, which is non-trivial at hundreds-of-gigabytes model scale.
- Deployment latency. Train → publish → schedule → warm up = deploy window measured in hours, not minutes.
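The double "over-provisioning tax" is just peak-vs-average arithmetic. A minimal sketch with hypothetical fleet sizes (the key fact: the two workloads' peaks rarely coincide, so a pooled fleet sized for the joint peak is smaller than the sum of per-fleet peaks):

```python
# Hypothetical GPU counts; the shape of the argument, not real capacity data.
train_peak, train_avg = 100, 60
serve_peak, serve_avg = 80, 50

two_fleets = train_peak + serve_peak  # each fleet sized for its own peak
pooled = 150                          # observed joint peak < sum of peaks

avg_load = train_avg + serve_avg
print(f"two fleets: {two_fleets} GPUs, avg utilization {avg_load/two_fleets:.0%}")
print(f"pooled:     {pooled} GPUs, avg utilization {avg_load/pooled:.0%}")
```

Same average load, fewer GPUs, higher utilization; the pooled number is only achievable if the scheduler can actually interleave the two workloads, which is the next section's cost.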
Costs of collapsing it¶
- The scheduler must co-locate long-running training jobs with latency-sensitive inference. Mis-scheduling means a training job's synchronous collectives inflate serving tail latency. Preemption and priority policies become the design center.
- Multi-tenancy on GPU is still harder than on CPU: MIG/MPS partitioning and GPU-direct isolation are less mature than KVM/cgroups.
- Capacity planning couples. A surge in training demand can starve serving if the pool is shared.
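The "preemption / priority policies become the design center" point reduces to a strict-priority queue with FIFO ordering within a class. A minimal sketch (hypothetical task names; not any real scheduler's API):

```python
import heapq
from itertools import count

# Lower value = higher priority: inference jumps ahead of training work
# on a shared pool; a sequence counter keeps FIFO order within a class.
INFER, TRAIN = 0, 1
seq = count()

q = []
def submit(priority, name):
    heapq.heappush(q, (priority, next(seq), name))

submit(TRAIN, "train-step-41")
submit(INFER, "chat-request-7")
submit(TRAIN, "train-step-42")
submit(INFER, "chat-request-8")

order = [heapq.heappop(q)[2] for _ in range(len(q))]
print(order)
# ['chat-request-7', 'chat-request-8', 'train-step-41', 'train-step-42']
```

A real design also has to decide what happens to the training step already running when an inference burst arrives (checkpoint-and-preempt vs. run-to-completion), which is where the tail-latency risk above actually lives.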
Architectural response¶
HyperPod's model-deployment capability (2025) is explicitly about collapsing this boundary: train a foundation model on a HyperPod cluster, deploy it for inference on that same cluster, without a separate provisioning step. The product is pitched as a utilization win (one fleet instead of two) and a complexity win (one operational substrate). (Source: sources/2025-08-06-allthingsdistributed-removing-friction-sagemaker-ai-development)
This is adjacent to, but distinct from, concepts/compute-storage-separation (OLAP): both are about collapsing artificial fleet splits, but this is compute ↔ compute (training vs inference GPU fleets) whereas compute-storage-separation is compute ↔ storage (query engines vs. columnar files).
Seen in¶
- sources/2025-08-06-allthingsdistributed-removing-friction-sagemaker-ai-development — Vogels frames the split as a historical artefact; HyperPod model deployment crosses it by running training + inference on the same compute.
- sources/2025-12-18-dropbox-feature-store-powering-real-time-ai-dash — adjacent but distinct: the feature store keeps training and serving consistent on feature values (the same PySpark transformations feed both offline + online stores) even though the compute fleets remain separate. Same discipline — unified feature values — applied across the boundary, not collapsing it.
- sources/2025-08-21-google-from-massive-models-to-mobile-magic-tech-behind-youtube-real-time-generative-ai — YouTube's real-time generative AI effects pipeline is a third flavour of boundary discipline: the training-time model (large generative teacher — StyleGAN2 then Imagen) and the serving-time model (small on-device student — UNet + MobileNet) are structurally different models bridged by knowledge distillation. HyperPod collapses the compute fleets; the feature store unifies feature values; distillation lets the teacher and student architectures themselves diverge so each can be shaped for its fleet's constraints.
- sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development — a training-only retrospective: eBay's e-Llama continued pretraining is meticulous on the training side (480 H100s × Megatron-LM 3D parallelism × 1T tokens × benchmark methodology × 1-month wall-clock) and entirely silent on the serving side (no inference backend, no per-query latency, no QPS, no cost-per-token, no product-surface integration). Data point on how sparse the serving-infra half of this boundary remains in the public record, even when the training-infra half is disclosed in detail.
- sources/2024-08-15-flyio-were-cutting-l40s-prices-in-half — counter-framing instance. Fly.io's thesis is that the two workloads are still shape-divergent in production (training = batch, inference = transactions; different sensitivities to networking + reliability), and that infra design that treats them as converged under-delivers for inference. Useful tension with Vogels's compute-convergence framing: the two claims operate at different ends of the spectrum (Vogels at frontier-model training + serving; Fly.io at mainstream inference). See the paired concept concepts/inference-vs-training-workload-shape.
Related¶
- systems/aws-sagemaker-hyperpod
- concepts/compute-storage-separation
- concepts/feature-store — adjacent concept: unifies feature values across training and serving; this concept unifies compute fleets. Same direction of travel, different axis.
- concepts/knowledge-distillation — a different boundary-crossing discipline: teacher / student architectures diverge; only the student ships.
- concepts/inference-vs-training-workload-shape — the workload-shape-divergence framing; paired concept.
- patterns/teacher-student-model-compression — the engineering pattern that operationalises distillation as a deployment shape.
- patterns/co-located-inference-gpu-and-object-storage — the inference-side infra pattern that follows from shape divergence.