PATTERN Cited by 1 source

Warm node pool for cold-start reduction

Pattern

Maintain a pre-provisioned pool of Kubernetes nodes with the base runtime image already pulled, and size the pool with a predictive algorithm. When the autoscaler needs to add a replica:

  1. Schedule the replica onto a node from the warm pool; the node is already up and the image is already local.
  2. Download the model artifact, which is the only transfer that remains.
  3. Serve that download from a hot cache in cloud storage, using parallel chunk pulls.
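
The parallel chunk pull in step 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the source's implementation: the `fetch` callable is a hypothetical stand-in for a ranged read against the hot cache (here it only slices an in-memory byte string), and the chunk size and worker count are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def chunk_ranges(total_size: int, chunk_size: int) -> list[tuple[int, int]]:
    """Split [0, total_size) into half-open (start, end) byte ranges."""
    return [
        (start, min(start + chunk_size, total_size))
        for start in range(0, total_size, chunk_size)
    ]


def parallel_download(
    fetch: Callable[[int, int], bytes],
    total_size: int,
    chunk_size: int,
    workers: int = 8,
) -> bytes:
    """Fetch every chunk concurrently and reassemble the chunks in order."""
    ranges = chunk_ranges(total_size, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, so the chunks stay ordered
        # even when the underlying fetches complete out of order.
        parts = pool.map(lambda r: fetch(r[0], r[1]), ranges)
        return b"".join(parts)


# Hypothetical stand-in for the hot cache: a fake 10,240-byte model artifact.
artifact = bytes(range(256)) * 40


def fetch_from_cache(start: int, end: int) -> bytes:
    """Assumed ranged read; a real client would issue a ranged request here."""
    return artifact[start:end]


restored = parallel_download(fetch_from_cache, len(artifact), chunk_size=1000)
```

In a real client, `fetch_from_cache` would issue one ranged request per chunk, so the download time would be expected to shrink with the number of workers until the network link or the cache becomes the bottleneck.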

This pattern collapses cold start from "node boot + image pull + model download + init" to "model download + init". With parallel hot-cache pulls, the download phase is compressed as well.
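
The phase arithmetic can be made concrete with a small model. This is a sketch only: the `Phases` class, the `cold_start` function, and every duration below are hypothetical values chosen for illustration, not measurements from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phases:
    """Cold-start phase durations in seconds (all values are assumptions)."""
    node_boot: float
    image_pull: float
    model_download: float
    model_init: float


def cold_start(p: Phases, warm_node: bool = False, download_speedup: float = 1.0) -> float:
    """Total cold-start time for one new replica.

    warm_node=True removes node boot and image pull (the warm pool).
    download_speedup > 1 models parallel pulls from the hot cache.
    Model initialization is never reduced by either mechanism.
    """
    infra = 0.0 if warm_node else p.node_boot + p.image_pull
    return infra + p.model_download / download_speedup + p.model_init


# Hypothetical numbers for a large GPU model.
phases = Phases(node_boot=90.0, image_pull=120.0, model_download=200.0, model_init=180.0)

baseline = cold_start(phases)                                         # 590.0
warm_pool = cold_start(phases, warm_node=True)                        # 380.0
warm_and_parallel = cold_start(phases, warm_node=True, download_speedup=4.0)  # 230.0
```

Under these assumed numbers the warm pool removes 210 seconds and a 4x download speedup removes another 150 seconds, while the 180 seconds of model initialization remain in every case.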

Complementary mechanisms

  • Provisioned concurrency: for latency-critical endpoints that cannot tolerate any cold start, keep a minimum number of pods fully warm with the model loaded.
  • Zero-downtime updates: new pods become fully ready before traffic moves off the old pods, which avoids update-triggered cold starts.
  • No-restart config changes: metadata and routing changes are applied without restarting pods.
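
Provisioned concurrency amounts to a floor on the replica count. A minimal sketch, assuming a simple throughput-based autoscaler; the function name, its parameters, and the capacity figures are hypothetical and do not come from the source.

```python
import math


def desired_replicas(
    requests_per_sec: float,
    capacity_per_replica: float,
    provisioned_min: int = 0,
) -> int:
    """Replica count needed for the current load, never below the warm floor.

    provisioned_min is the number of pods kept fully warm (model loaded)
    even when there is no traffic, so the first request never waits on a
    cold start.
    """
    demand = math.ceil(requests_per_sec / capacity_per_replica)
    return max(provisioned_min, demand)


idle_with_floor = desired_replicas(0, 50, provisioned_min=2)     # 2
idle_without_floor = desired_replicas(0, 50)                     # 0
busy_with_floor = desired_replicas(120, 50, provisioned_min=2)   # 3
```

With a floor of 2, an idle endpoint keeps two warm pods instead of scaling to zero; the trade-off is paying for that capacity while it sits unused.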

Physics floor

"You cannot optimize cold starts away." Model initialization time grows with model size, reaching minutes for large GPU models. Once the optimizable phases (node boot, image pull, model download) have been reduced, the only remaining answer is provisioned concurrency.
