CONCEPT Cited by 5 sources

Training / serving boundary

The training / serving boundary is the organizational and infrastructure split between the fleet that trains a model and the fleet that serves it in production. Historically, the split made sense: training was batch, bandwidth-hungry, memory-light per accelerator, and tolerant of preemption; serving was low-latency, memory-resident, and steady-state. The two workloads wanted different hardware, different placement, and different schedulers.

Why it's eroding

For foundation models, the two workloads have converged:

  • Both pin large model weights into GPU memory.
  • Both want large-memory, high-interconnect GPUs.
  • Both now use the same frameworks (PyTorch, JAX) and similar kernels.
  • The infra that trains a 70B-parameter model is well-shaped to serve that same model on the same hardware.
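
The memory-footprint convergence can be made concrete with back-of-envelope arithmetic (a sketch; the 70B size, bf16 precision, and 80 GB GPU capacity are illustrative assumptions, not figures from the source):

```python
def weight_memory_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """GPU memory needed just to keep model weights resident
    (bf16/fp16 = 2 bytes per parameter)."""
    return params_billion * 1e9 * bytes_per_param / 1e9

# A 70B-parameter model in bf16 needs ~140 GB for weights alone --
# more than any single GPU, so both training AND serving must shard
# the model across a multi-GPU, high-interconnect node.
print(weight_memory_gb(70))        # 140.0
print(weight_memory_gb(70) / 80)   # 1.75 "80 GB GPUs" worth, before activations / KV cache
```

Training additionally holds optimizer state and gradients, and serving holds KV cache, but the dominant shared term is the same: pinned weights on large-memory, well-interconnected GPUs.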

Vogels's framing:

Organizations [are] maintaining separate infrastructure for training models and serving them in production, a pattern that made sense when these workloads had fundamentally different characteristics, but one that has become increasingly inefficient as both have converged on similar compute requirements.

(Source: sources/2025-08-06-allthingsdistributed-removing-friction-sagemaker-ai-development)

Costs of keeping the boundary

  • Underutilization. Training clusters are idle when no job is running; serving clusters are over-provisioned for peak. Two fleets = two "over-provisioning taxes."
  • Operational complexity. Twice the IAM, twice the monitoring, twice the upgrade cadence, twice the on-call.
  • Data motion. The checkpoint artefact has to move between fleets on every deploy; non-trivial at hundreds-of-GB model scale.
  • Deployment latency. Train → publish → schedule → warm up = deploy window measured in hours, not minutes.
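
The data-motion tax is easy to estimate (a sketch; the artefact size, link speed, and efficiency factor are illustrative assumptions):

```python
def transfer_minutes(artifact_gb: float, link_gbps: float,
                     efficiency: float = 0.7) -> float:
    """Minutes to copy a checkpoint between fleets over a link of
    `link_gbps` gigabits/s, at an assumed effective utilization."""
    seconds = artifact_gb * 8 / (link_gbps * efficiency)
    return seconds / 60

# A 140 GB checkpoint over a 10 Gb/s inter-fleet link at 70% efficiency:
print(round(transfer_minutes(140, 10), 1))   # 2.7 minutes, per copy, per deploy
```

The copy itself is only minutes; the hours come from the surrounding publish → schedule → warm-up pipeline that a second fleet forces.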

Costs of collapsing it

  • Scheduler must cohabit long-running training jobs with latency-sensitive inference. Mis-scheduling means a training job's synchronous collectives saturate the interconnect and blow out serving tail latency. Preemption / priority policies become the design center.
  • Multi-tenancy on GPU is still harder than on CPU. MIG / MPS / GPUDirect isolation is less mature than KVM / cgroups.
  • Capacity planning couples. A surge in training demand can starve serving if the pool is shared.
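
The preemption-as-design-center point can be sketched as a toy policy (hypothetical class and job names; a real fleet expresses this through its cluster scheduler's priority and preemption machinery, not application code):

```python
import heapq
from dataclasses import dataclass, field

INFERENCE, TRAINING = 0, 1   # lower number = higher priority

@dataclass(order=True)
class Job:
    priority: int
    name: str = field(compare=False)

class SharedPoolScheduler:
    """Toy shared training+serving GPU pool: inference preempts training."""

    def __init__(self, gpus: int):
        self.free = gpus
        self.queue: list[Job] = []    # pending jobs, min-heap on priority
        self.running: list[Job] = []

    def submit(self, job: Job) -> None:
        heapq.heappush(self.queue, job)
        self._schedule()

    def _schedule(self) -> None:
        while self.queue:
            if self.free > 0:                       # spare GPU: just run it
                self.running.append(heapq.heappop(self.queue))
                self.free -= 1
            elif self.queue[0].priority == INFERENCE:
                victims = [r for r in self.running if r.priority == TRAINING]
                if not victims:
                    break                           # pool is all inference already
                # Preempt one trainer (a real system would checkpoint it first).
                self.running.remove(victims[0])
                heapq.heappush(self.queue, victims[0])
                self.running.append(heapq.heappop(self.queue))  # the inference job
            else:
                break                               # training waits for capacity

s = SharedPoolScheduler(gpus=2)
s.submit(Job(TRAINING, "pretrain-70b-a"))
s.submit(Job(TRAINING, "pretrain-70b-b"))
s.submit(Job(INFERENCE, "chat-replica-1"))   # pool full -> one trainer preempted
print([j.name for j in s.running])           # ['pretrain-70b-b', 'chat-replica-1']
```

The hard part elided here is exactly what the bullet names: the preempted trainer must checkpoint cleanly (its synchronous collective involves every peer), and the policy must cap how much of the pool training can reclaim so a demand surge cannot starve serving.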

Architectural response

HyperPod's model-deployment capability (2025) is explicitly about collapsing this boundary: train a foundation model on a HyperPod cluster, deploy it for inference on that same cluster, without a separate provisioning step. The product is pitched as a utilization win (one fleet instead of two) and a complexity win (one operational substrate). (Source: sources/2025-08-06-allthingsdistributed-removing-friction-sagemaker-ai-development)

This is adjacent to, but distinct from, concepts/compute-storage-separation (OLAP): both are about collapsing artificial fleet splits, but this is compute ↔ compute (training vs inference GPU fleets) whereas compute-storage-separation is compute ↔ storage (query engines vs. columnar files).

Seen in

  • sources/2025-08-06-allthingsdistributed-removing-friction-sagemaker-ai-development — Vogels frames the split as a historical artefact; HyperPod model deployment crosses it by running training + inference on the same compute.
  • sources/2025-12-18-dropbox-feature-store-powering-real-time-ai-dash — adjacent but distinct: the feature store keeps training and serving consistent on feature values (the same PySpark transformations feed both offline + online stores) even though the compute fleets remain separate. Same discipline — unified feature values — applied across the boundary, not collapsing it.
  • sources/2025-08-21-google-from-massive-models-to-mobile-magic-tech-behind-youtube-real-time-generative-ai — YouTube's real-time generative AI effects pipeline is a third flavour of boundary discipline: the training-time model (large generative teacher — StyleGAN2 then Imagen) and the serving-time model (small on-device student — UNet + MobileNet) are structurally different models bridged by knowledge distillation. HyperPod collapses the compute fleets; the feature store unifies feature values; distillation lets the teacher and student architectures themselves diverge so each can be shaped for its fleet's constraints.
  • sources/2025-01-17-ebay-scaling-large-language-models-for-e-commerce-the-development — a training-only retrospective: eBay's e-Llama continued pretraining is meticulous on the training side (480 H100s × Megatron-LM 3D parallelism × 1T tokens × benchmark methodology × 1-month wall-clock) and entirely silent on the serving side (no inference backend, no per-query latency, no QPS, no cost-per-token, no product-surface integration). Data point on how sparse the serving-infra half of this boundary remains in the public record, even when the training-infra half is disclosed in detail.
  • sources/2024-08-15-flyio-were-cutting-l40s-prices-in-half — counter-framing instance. Fly.io's thesis is that the two workloads are still shape-divergent in production (training = batch, inference = transactions; different sensitivities to networking + reliability), and that infra design that treats them as converged under-delivers for inference. Useful tension with Vogels's compute-convergence framing: the two claims operate at different ends of the spectrum (Vogels at frontier-model training + serving; Fly.io at mainstream inference). See the paired concept concepts/inference-vs-training-workload-shape.