PATTERN
Single SageMaker training job: train-and-infer¶
Problem¶
The textbook ML serving stack has a clean separation between training (produces model artifact, writes to S3) and inference (SageMaker endpoint / hosted model server / batch transform job that loads the artifact and serves predictions).
That separation makes sense for heavy models (deep learning, large language models) but buys nothing for a lightweight model whose inference already fits on the training job's instance — and in fact it costs you:
- A separate endpoint to provision, monitor, auto-scale, roll over.
- An artifact upload + download round-trip through S3.
- A model-versioning indirection.
- Checkpointing logic.
- Endpoint warm-up / cold-start latency.
For a weekly batch inference job with a lightweight model, this is overhead without benefit.
Pattern¶
Run training and batch inference in the same SageMaker Training Job:
- Training job spins up, pulls training data from S3.
- Trains the model in-memory.
- Runs inference against the input dataset in the same process — no artifact serialisation, no endpoint invocation.
- Writes predictions to S3.
- Terminates.
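The steps above can be sketched as a single script-mode entrypoint. This is a minimal, runnable illustration, not the production code: the directory constants mirror SageMaker's conventional channel mounts (`/opt/ml/input/data/...`, `/opt/ml/output`), and a per-SKU mean forecaster stands in for LightGBM/MLForecast so the sketch has no dependencies. Every name below is an assumption.

```python
import csv
import statistics
from pathlib import Path

# SageMaker mounts the S3 input channels and output prefix at these
# conventional paths; in this sketch they are plain directories.
TRAIN_DIR = Path("/opt/ml/input/data/train")
INFER_DIR = Path("/opt/ml/input/data/inference")
OUTPUT_DIR = Path("/opt/ml/output/data")


def train(rows):
    """'Train' entirely in-memory: a per-SKU mean demand model stands in
    for the real LightGBM/MLForecast model. No checkpoint is written."""
    history = {}
    for row in rows:
        history.setdefault(row["sku"], []).append(float(row["demand"]))
    return {sku: statistics.mean(vals) for sku, vals in history.items()}


def infer(model, skus, horizon=12):
    """Batch inference in the same process: no artifact serialisation,
    no endpoint invocation, just a call into the in-memory model."""
    return [
        {"sku": sku, "week": week, "prediction": model.get(sku, 0.0)}
        for sku in skus
        for week in range(1, horizon + 1)
    ]


def main(train_dir=TRAIN_DIR, infer_dir=INFER_DIR, output_dir=OUTPUT_DIR):
    """Job entrypoint: pull data, train, infer, write predictions, exit."""
    with open(train_dir / "train.csv", newline="") as f:
        model = train(list(csv.DictReader(f)))
    with open(infer_dir / "skus.csv", newline="") as f:
        skus = [row["sku"] for row in csv.DictReader(f)]
    predictions = infer(model, skus)
    output_dir.mkdir(parents=True, exist_ok=True)
    with open(output_dir / "predictions.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["sku", "week", "prediction"])
        writer.writeheader()
        writer.writerows(predictions)
    # The job terminates here; SageMaker uploads the output prefix to S3.
```

When the process exits, the only artifact that survives is the predictions file — exactly the point of the pattern.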
The key constraint: the model must be lightweight enough that:
- Training completes well within the job's wall-clock budget, leaving time for inference.
- Inference over the full input dataset fits in the instance's memory and disk.
- Inference needs no specialised hardware; a GPU-heavy inference workload that would benefit from dedicated inference instances breaks the pattern.
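The first constraint can be made operational with a small guard inside the job: estimate the inference pass and check it against what remains of the job's wall-clock budget (`MaxRuntimeInSeconds` in SageMaker terms), failing fast instead of being killed mid-write. A hypothetical sketch; the function names and the pre-flight check are illustrative, not from the source:

```python
import time


def remaining_budget(start, budget_seconds, now=None):
    """Seconds left in the job's wall-clock budget."""
    now = time.monotonic() if now is None else now
    return budget_seconds - (now - start)


def check_inference_budget(start, budget_seconds, est_inference_seconds, now=None):
    """Raise if the estimated inference pass no longer fits in the job,
    so the failure is explicit rather than a hard stop from SageMaker."""
    left = remaining_budget(start, budget_seconds, now)
    if est_inference_seconds > left:
        raise RuntimeError(
            f"inference needs ~{est_inference_seconds}s but only {left:.0f}s remain"
        )
    return left
```

In practice `start` would be captured once at job start (`time.monotonic()`), and the check run between the training and inference phases.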
Why¶
"Due to the ML model's lightweight training footprint, we bypass complexity, like for example not needing checkpointing, or separate infrastructure for inference. Instead, model training as well as model inference are executed in a single pipeline using AWS SageMaker Training Jobs. This approach reduces complexity, lowers infrastructure costs, and accelerates the pipeline."
Three concrete savings:
- No checkpointing. Model state lives in-process; no S3 write / read cycle.
- No separate inference infra. No SageMaker Endpoint, no model-hosting pipeline, no endpoint rollover on retrain.
- Pipeline acceleration. Fewer steps → lower wall-clock time.
Canonical instance (Zalando ZEOS)¶
systems/zeos-demand-forecaster uses a single SageMaker Training Job to train LightGBM (via Nixtla MLForecast) and run batch inference over 5M SKUs × 12-week horizon. LightGBM is the load-bearing choice — deep-learning models like TFT (which Zalando tried) wouldn't fit this pattern.
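For concreteness, the pattern's entire infrastructure footprint is one training-job definition. Below is a hypothetical boto3-style `CreateTrainingJob` request; every name, ARN, instance type, and limit is illustrative, not Zalando's actual configuration. Note what is absent: no `CreateModel`, no `CreateEndpoint`, no endpoint config.

```python
# Hypothetical CreateTrainingJob request body (boto3 SageMaker client).
# All identifiers below are placeholders, not real resources.
training_job_request = {
    "TrainingJobName": "demand-forecaster-2024-w01",  # illustrative name
    "AlgorithmSpecification": {
        "TrainingImage": "<account>.dkr.ecr.<region>.amazonaws.com/forecaster:latest",
        "TrainingInputMode": "File",
    },
    "RoleArn": "arn:aws:iam::<account>:role/forecaster-training",
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://<bucket>/history/",
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
        },
    ],
    # Predictions land here when the job uploads /opt/ml/output on exit.
    "OutputDataConfig": {"S3OutputPath": "s3://<bucket>/predictions/"},
    "ResourceConfig": {
        "InstanceType": "ml.m5.4xlarge",  # CPU instance; no GPU needed
        "InstanceCount": 1,
        "VolumeSizeInGB": 100,
    },
    # The wall-clock budget that training and inference must share.
    "StoppingCondition": {"MaxRuntimeInSeconds": 4 * 3600},
}
```

Retraining then means re-running this one job on a schedule; there is no endpoint to roll over.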
When the pattern is wrong¶
- Real-time inference required.
- Model is too heavy for in-job inference (GPUs needed, inference >> training time, memory doesn't fit).
- Retraining cadence much slower than inference cadence (you want to re-infer daily without retraining).