PATTERN Cited by 1 source
Warm node pool for cold-start reduction
Pattern
Maintain a pre-provisioned pool of Kubernetes nodes with the base runtime image already pulled, and size that pool with a predictive algorithm. When the autoscaler needs to add a replica:
- Schedule the replica onto a node from the warm pool (the node is already up and the image is already local).
- Only the model artifact download remains.
- Serve that download from a hot cache in cloud storage, using parallel chunk pulls.
This collapses cold start from "node boot + image pull + model download + init" to "model download + init". With parallel pulls from the hot cache, the download phase is compressed as well.
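The parallel chunk pull can be sketched as follows. This is a minimal illustration, not the platform's implementation: `chunk_ranges`, `parallel_download`, and `fake_fetch` are hypothetical names, and an in-memory blob stands in for the hot cache so the example runs without any cloud storage client. In practice `fetch_range` would issue a ranged read against the object store.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def chunk_ranges(total_size: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Split [0, total_size) into half-open byte ranges of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [(start, min(start + chunk_size, total_size))
            for start in range(0, total_size, chunk_size)]


def parallel_download(fetch_range: Callable[[int, int], bytes],
                      total_size: int,
                      chunk_size: int,
                      max_workers: int = 8) -> bytes:
    """Fetch all chunks concurrently and reassemble them in order."""
    ranges = chunk_ranges(total_size, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so the chunks
        # concatenate correctly even if they complete out of order.
        parts = pool.map(lambda r: fetch_range(*r), ranges)
        return b"".join(parts)


# Hypothetical stand-in for the hot cache: an in-memory model artifact.
ARTIFACT = bytes(range(256)) * 1000


def fake_fetch(start: int, end: int) -> bytes:
    """Pretend ranged read; a real version would call the storage API."""
    return ARTIFACT[start:end]


downloaded = parallel_download(fake_fetch, len(ARTIFACT), chunk_size=64 * 1024)
```

Threads suit this sketch because ranged reads are I/O-bound. The chunk size and worker count here are arbitrary and would need tuning against real storage throughput.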
Complementary mechanisms
- Provisioned concurrency: for latency-critical endpoints that cannot tolerate any cold start, keep a minimum number of pods fully warm with the model loaded.
- Zero-downtime updates: new pods are fully ready before traffic moves off the old pods, which avoids update-triggered cold starts.
- No-restart config changes: metadata and routing changes are applied without a pod restart.
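Two of these mechanisms reduce to simple rules, sketched below under stated assumptions. `desired_replicas` treats provisioned concurrency as a floor under the autoscaler's load-based target, and `pods_receiving_traffic` moves traffic only once every new pod reports ready. Both function names and the `is_ready` callback are hypothetical and do not correspond to any real platform API.

```python
from typing import Callable, List


def desired_replicas(load_based: int, provisioned_min: int, max_replicas: int) -> int:
    """Clamp the load-based target between the always-warm floor and a ceiling."""
    if provisioned_min > max_replicas:
        raise ValueError("provisioned_min cannot exceed max_replicas")
    return max(provisioned_min, min(load_based, max_replicas))


def pods_receiving_traffic(old_pods: List[str],
                           new_pods: List[str],
                           is_ready: Callable[[str], bool]) -> List[str]:
    """Keep traffic on old pods until every new pod is fully ready."""
    if new_pods and all(is_ready(p) for p in new_pods):
        return list(new_pods)
    return list(old_pods)


# With no load, the floor still keeps two pods warm.
idle_target = desired_replicas(load_based=0, provisioned_min=2, max_replicas=10)

# Mid-rollout: one new pod is still loading the model, so traffic stays put.
ready = {"new-a"}
mid_rollout = pods_receiving_traffic(["old-a", "old-b"], ["new-a", "new-b"],
                                     lambda p: p in ready)
```

The all-or-nothing switch is a simplification. A gradual rollout would shift traffic pod by pod, but the invariant is the same: no request reaches a pod that has not finished loading its model.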
Physics floor
"You cannot optimize cold starts away." Model initialization time grows with model size (minutes for large GPU models). Once the optimizable phases have been removed or compressed, the only remaining answer is provisioned concurrency.
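The floor can be made concrete with a small phase model. The durations below are invented for illustration and are not measurements from the source; `ColdStartPhases` and `download_speedup` are hypothetical names. The point is only that however far the download is compressed, the initialization term remains.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColdStartPhases:
    """Per-phase durations in seconds (illustrative values only)."""
    node_boot_s: float
    image_pull_s: float
    model_download_s: float
    model_init_s: float

    def cold_total(self) -> float:
        """Full cold start: every phase runs in sequence."""
        return (self.node_boot_s + self.image_pull_s
                + self.model_download_s + self.model_init_s)

    def warm_pool_total(self, download_speedup: float = 1.0) -> float:
        """Warm node: boot and image pull are skipped; download may be sped up."""
        if download_speedup <= 0:
            raise ValueError("download_speedup must be positive")
        return self.model_download_s / download_speedup + self.model_init_s


# Assumed numbers for a large GPU model; not taken from the source.
phases = ColdStartPhases(node_boot_s=90, image_pull_s=120,
                         model_download_s=60, model_init_s=180)

cold = phases.cold_total()                           # 450.0
warm = phases.warm_pool_total(download_speedup=4)    # 195.0
floor = phases.model_init_s                          # 180, untouched by the pool
```

Under these assumed numbers the warm pool cuts the total by more than half, yet 180 of the remaining 195 seconds are initialization, which only a pod that is already warm can avoid.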
Seen in
- sources/2026-06-10-databricks-ai-serving-platform-that-adapts-to-your-model — Databricks Custom Model Serving warm-pool disclosure.