PATTERN Cited by 1 source

Warm node pool for cold-start reduction

Pattern

Maintain a pre-provisioned pool of Kubernetes nodes with the base runtime image already pulled, sized by a predictive algorithm. When the autoscaler needs to add a replica:

Pick a node from the warm pool (already booted, base runtime image already local).
Download only the model artifact, the one transfer that remains.
Serve that download from a hot cache in cloud storage, pulling chunks in parallel.
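
The source does not disclose which predictive algorithm sizes the pool. As a minimal sketch, assuming a deliberately simple forecast (mean of recent per-interval scale-up demand plus a few standard deviations of headroom), pool sizing and node claiming might look like this. `target_size`, `WarmPool`, and their parameters are hypothetical names chosen for illustration, not part of any real API.

```python
import math
import statistics
from collections import deque
from typing import Optional


def target_size(recent_demand: list[int], headroom_sigmas: float = 2.0) -> int:
    """Forecast how many warm nodes to keep ready.

    Assumption: mean + k * stdev of recent per-interval scale-up demand.
    The real predictive algorithm is not described in the source.
    """
    if not recent_demand:
        return 0
    mean = statistics.fmean(recent_demand)
    spread = statistics.pstdev(recent_demand)
    return math.ceil(mean + headroom_sigmas * spread)


class WarmPool:
    """Hypothetical pool of nodes that are booted with the base image pulled."""

    def __init__(self) -> None:
        self._ready: deque[str] = deque()

    def add(self, node: str) -> None:
        self._ready.append(node)

    def claim(self) -> Optional[str]:
        """Return a warm node, or None if the pool is empty (cold path)."""
        return self._ready.popleft() if self._ready else None

    def deficit(self, target: int) -> int:
        """Number of nodes still to provision to reach the target size."""
        return max(0, target - len(self._ready))
```

When `claim()` returns `None`, the autoscaler would fall back to the full cold path (node boot plus image pull), so the headroom parameter trades idle-node cost against how often that fallback happens.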

This collapses cold-start from "node boot + image pull + model download + init" down to "model download + init" — and with parallel hot-cache pulls, even the download phase is compressed.
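
To illustrate the parallel chunk pulls, here is a minimal sketch that splits an artifact into byte ranges, fetches them concurrently, and reassembles them in order. It assumes the cache supports ranged reads (as HTTP Range requests do); `fetch_range` is a hypothetical stand-in that reads from an in-memory buffer, and the chunk size and worker count are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a model artifact in the hot cache (assumption: the store
# supports ranged reads, as HTTP Range requests do).
ARTIFACT = bytes(range(256)) * 4096  # 1 MiB of sample data


def fetch_range(start: int, end: int) -> bytes:
    """Hypothetical ranged read: return bytes [start, end) of the artifact."""
    return ARTIFACT[start:end]


def parallel_download(size: int, chunk_size: int, workers: int = 8) -> bytes:
    """Pull fixed-size chunks concurrently and reassemble them in order."""
    ranges = [(off, min(off + chunk_size, size))
              for off in range(0, size, chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in submission order, so chunks stay ordered
        # even when individual fetches finish out of order.
        chunks = pool.map(lambda r: fetch_range(*r), ranges)
        return b"".join(chunks)
```

Threads suit this case because each fetch is I/O-bound; with a real object store the gain depends on per-connection throughput limits, which this sketch does not model.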

Complementary mechanisms

Provisioned concurrency — for latency-critical endpoints that cannot tolerate any cold start, keep a minimum number of pods fully warm with the model loaded.
Zero-downtime updates — new pods are fully ready before traffic moves off old pods, avoiding update-triggered cold starts.
No-restart config changes — metadata/routing changes applied without pod restart.
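
Provisioned concurrency can be read as a floor on the replica count. The following is a minimal sketch, assuming a purely load-based autoscaler; `desired_replicas` and its parameters are hypothetical and do not correspond to a specific platform API.

```python
import math


def desired_replicas(requests_per_sec: float, capacity_per_replica: float,
                     provisioned_min: int = 0) -> int:
    """Replica count from load, never below the provisioned (always-warm) floor."""
    if capacity_per_replica <= 0:
        raise ValueError("capacity_per_replica must be positive")
    load_based = math.ceil(requests_per_sec / capacity_per_replica)
    return max(provisioned_min, load_based)
```

With `provisioned_min=0`, an idle endpoint scales to zero and the next request pays a cold start. With a floor of one or more, at least that many pods stay up with the model loaded regardless of load.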

Physics floor

"You cannot optimize cold starts away." — model initialization grows with model size (minutes for large GPU models). Past the optimizable phases, the only answer is provisioned concurrency.
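
The floor can be made concrete with a small phase model. This is a sketch under stated assumptions: the four phases are the ones named in this pattern, while the class, method names, and any timings passed in are hypothetical and are not measurements from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phases:
    """Cold-start phases in seconds. Any values supplied are hypothetical."""
    node_boot: float
    image_pull: float
    model_download: float
    init: float

    def full_cold_start(self) -> float:
        return self.node_boot + self.image_pull + self.model_download + self.init

    def warm_pool_start(self, download_speedup: float = 1.0) -> float:
        """Warm node: boot and image pull are already paid for."""
        return self.model_download / download_speedup + self.init

    def floor(self) -> float:
        """Initialization cannot be skipped by pooling or caching."""
        return self.init


def needs_provisioned_concurrency(phases: Phases, budget_s: float) -> bool:
    """True when even the irreducible init time exceeds the latency budget."""
    return phases.floor() > budget_s
```

For example, with illustrative timings `Phases(node_boot=120, image_pull=90, model_download=60, init=180)`, the full cold start is 450 s, the warm-pool path is 240 s, and a 4x download speedup brings it to 195 s. No amount of further download optimization gets below the 180 s of initialization, so a 30 s budget can only be met by keeping pods warm.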

Seen in

sources/2026-06-10-databricks-ai-serving-platform-that-adapts-to-your-model — Databricks Custom Model Serving warm-pool disclosure.