

VM lifetime prediction

VM lifetime prediction is the problem class of predicting how long a virtual machine will run — from creation to shutdown — as an input to placement scoring in a cloud scheduler. It's a subclass of performance prediction but with a distinct hazard profile that the LAVA paper's framing makes explicit.

Why it's load-bearing for bin-packing

VM allocation is online bin-packing with unknown disappearance times. If the scheduler knew each VM's lifetime, it could pack them so hosts drained together, preserving empty hosts and avoiding resource stranding. Without lifetime information the scheduler must either:

  • Ignore lifetime entirely — pack by current-state shape only, accepting stranding / empty-host loss as second-order costs.
  • Use a deterministic heuristic (job type → expected lifetime) — brittle when workloads diversify.
  • Learn a predictor — the ML-for-systems answer the LAVA paper operationalises.
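The payoff of the learned-predictor option can be sketched as a lifetime-aware placement score: prefer hosts whose resident VMs are predicted to exit around the same time as the candidate VM, so hosts drain together and empty hosts are preserved. This is a minimal illustrative sketch, not LAVA's actual scoring rule; all names, penalties, and numbers are assumptions.

```python
# Hypothetical sketch: lifetime-aware placement scoring (higher is better).
# Hosts whose resident VMs exit near the new VM's predicted exit score well,
# so hosts tend to drain together. The rule and constants are illustrative.

def placement_score(host_exit_times, predicted_exit, capacity_left, demand):
    """Score a candidate host for a new VM placement."""
    if demand > capacity_left:
        return float("-inf")  # infeasible placement
    if not host_exit_times:
        # Placing on an empty host "breaks" it; penalise to preserve empty hosts.
        return -1000.0
    # A VM that outlives (or dies long before) its co-residents strands capacity.
    misalignment = max(abs(t - predicted_exit) for t in host_exit_times)
    return -misalignment

# Pick the best host for a VM predicted to exit at t=10 (hours, say).
hosts = {
    "a": {"exits": [9, 11], "free": 4},   # drains around t≈10
    "b": {"exits": [100], "free": 4},     # long-lived co-resident
    "c": {"exits": [], "free": 8},        # empty host
}
best = max(hosts, key=lambda h: placement_score(
    hosts[h]["exits"], predicted_exit=10,
    capacity_left=hosts[h]["free"], demand=2))
```

Under this toy rule the scheduler picks host "a", keeping the empty host "c" free and avoiding stranding on "b".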

The single-shot hazard

The 2025-10-17 post names a structural failure mode of naive lifetime prediction: "AI can help with this problem by using learned models to predict VM lifetimes. However, this often relies on a single prediction at the VM's creation. The challenge with this approach is that a single misprediction can tie up an entire host for an extended period, degrading efficiency." (Source: sources/2025-10-17-google-solving-virtual-machine-puzzles-lava).

Why is the cost asymmetric? Because the placement decision the prediction drives is expensive to reverse — moving a VM (live migration, restart, or drain-and-replace) is not a free re-decision. A misprediction at t=0 locks the scheduler into a suboptimal placement until either (a) the VM actually exits, or (b) the scheduler pays migration cost to recover. Point prediction commits to the decision before most of the trajectory is observed.
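A back-of-envelope comparison makes the asymmetry concrete: with a single t=0 prediction the scheduler eats the full stranding cost of a bad placement, whereas a recoverable decision caps the loss at the time-to-detection plus the migration cost. All numbers below are illustrative assumptions, not measurements.

```python
# Hypothetical arithmetic on the single-shot hazard. A VM predicted to run
# 24h actually runs 30 days; the host it strands is the cost unit.
# All figures are made up for illustration.

host_hours_stranded = 30 * 24    # actual lifetime: host tied up ~720 host-hours
migration_cost = 2               # assumed cost of one live migration, host-hours
k = 24                           # hours until a repredictor flags the mistake

# Single-shot prediction: no re-decision, so the full stranding cost is paid.
naive_cost = host_hours_stranded
# Recoverable decision: pay the stranding up to t=k, then migrate to recover.
recoverable_cost = k + migration_cost
```

The gap between the two costs is what continuous reprediction buys; it grows with the actual lifetime, which is why the tail of the lifetime distribution dominates the expected loss.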

The LAVA family's answer: continuous reprediction + learned distributions

Two connected primitives:

  • Continuous reprediction — update the lifetime prediction as the VM continues to run, so the estimate sharpens with evidence. Makes misprediction at t=0 recoverable at t=k.
  • Learned lifetime distribution — emit a full distribution over remaining lifetime, not a point. Lets downstream consumers reason about tail risk (P(still alive at T+24h)) rather than mean error, which matters when the decision cost is asymmetric.
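The two primitives compose naturally: a learned distribution over lifetimes, conditioned on the survival observed so far, is a reprediction that sharpens with evidence. A minimal empirical sketch, assuming a made-up lifetime sample in place of a learned model:

```python
# Hypothetical sketch of continuous reprediction over a lifetime distribution:
# condition on the VM's observed age, so the estimate sharpens as it runs.
# The lifetime sample here is invented; in practice a learned model would
# supply the distribution.

lifetimes_h = [1, 1, 2, 4, 8, 24, 24, 168, 720]  # illustrative sample, hours

def p_alive_beyond(horizon_h, observed_age_h=0.0):
    """P(lifetime > horizon | lifetime > observed_age), empirically."""
    survivors = [t for t in lifetimes_h if t > observed_age_h]
    if not survivors:
        return 0.0
    return sum(t > horizon_h for t in survivors) / len(survivors)

# At creation, most VMs in the sample look short-lived...
p0 = p_alive_beyond(24)                      # P(alive past 24h) = 2/9
# ...but after surviving 24h, the conditional shifts toward long-lived.
p1 = p_alive_beyond(48, observed_age_h=24)   # both survivors exceed 48h
```

This is exactly the tail-risk query named above — P(still alive at T+24h) — and shows why a t=0 point estimate (the sample median is 8h) would badly misplace the VMs that end up mattering most.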

Contrast with [performance prediction](<./performance-prediction.md>) / RLM on Borg

The 2025-07-29 Google RLM post predicts the bin-packer's output (MIPS per GCU) given a cluster state. The 2025-10-17 LAVA work predicts a different object: the per-VM remaining-lifetime distribution given that VM's observed trajectory. Both are ML-for-systems cheap approximators for scheduler-internal decisions on Borg-adjacent infrastructure; they operate at different layers.

  • RLM — predict what the slow solver would say (useful when the solver is too slow to run inline).
  • LAVA — predict an input the solver currently lacks (useful when the solver has the right policy but missing information).

Both ship with uncertainty-gated deployment patterns, though the specific fallback shapes differ (distribution-over-lifetime plus reprediction for LAVA; a calibrated-distribution-width trigger that falls back to the slow bin-packer in RLM).
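The shared gating pattern can be sketched as a width check on the predicted distribution: act on the prediction only when it is tight enough, otherwise fall back (to a conservative default, or in the RLM setting to the slow solver). The quantile interface, threshold, and width metric below are illustrative assumptions, not either system's actual gate.

```python
# Hypothetical uncertainty gate over predicted lifetime quantiles.
# Tight distribution -> trust the median; wide -> signal a fallback path.
# The spread metric and threshold are illustrative.

def gated_lifetime_estimate(p10_h, p50_h, p90_h, max_spread_ratio=10.0):
    """Return (estimate_hours, used_model) from predicted lifetime quantiles."""
    spread = p90_h / max(p10_h, 1e-9)      # tail-to-head ratio as a width proxy
    if spread > max_spread_ratio:
        return None, False                  # too uncertain: take the fallback
    return p50_h, True                      # tight enough: use the prediction

est, used = gated_lifetime_estimate(p10_h=20, p50_h=24, p90_h=30)        # tight
fb, used_fb = gated_lifetime_estimate(p10_h=1, p50_h=24, p90_h=720)      # wide
```

The design choice mirrors the asymmetric-cost argument above: when the distribution is wide, the expected cost of acting on its median exceeds the cost of the slower or more conservative path, so the gate routes around the model.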
