CONCEPT Cited by 1 source

GPU Scale-to-Zero Cold Start

GPU scale-to-zero cold start is the latency penalty paid when inference traffic arrives at a fully stopped GPU Machine and must wait for the Machine to boot, the model weights to load into GPU RAM, and the first forward pass to complete — the explicit price of pure concepts/scale-to-zero at the GPU-inference tier.

Why the GPU tail is different from the CPU / serverless tail

concepts/cold-start on CPU serverless (Lambda, Cloud Run) is dominated by the runtime init (language VM + user code + SDKs). On a GPU inference worker the runtime is usually already warm by comparison; what dominates is loading the model into GPU RAM. A multi-gigabyte weights file has to traverse the disk → host RAM → PCIe → HBM path, and for large models (tens of GB) this step alone takes tens of seconds.

The three-component budget

A GPU inference Machine resuming from stopped pays a cold-start bill that decomposes cleanly into three stages:

  1. Machine start — the hypervisor boots the VM, the host attaches the GPU, the OS mounts the root filesystem. Seconds.
  2. Model load into GPU RAM — the weights file is read from persistent storage (or object storage, or baked-in rootfs) into host memory and then transferred to HBM; any late-stage initialisation (tokeniser, kernel compilation) runs here. Tens of seconds for a large model.
  3. First-response generation — the first forward pass, including any JIT CUDA-kernel compilation that wasn't cached. Seconds for a single prompt on a large model.

Sum: seconds + tens-of-seconds + seconds → tens of seconds, dominated by stage 2. For a warm Machine, stages 1 and 2 are zero; only stage 3's per-response latency is paid.
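The budget arithmetic above can be sketched directly. The per-stage durations below are illustrative assumptions (chosen so the total matches the canonical ~45 s figure), not measured platform values:

```python
# Three-stage cold-start budget (illustrative stage durations, not measurements)
machine_start_s = 5    # stage 1: hypervisor boots VM, GPU attach, rootfs mount
model_load_s = 35      # stage 2: weights traverse disk -> host RAM -> PCIe -> HBM
first_response_s = 5   # stage 3: first forward pass, incl. any uncached JIT kernels

cold_total = machine_start_s + model_load_s + first_response_s
warm_total = first_response_s  # a warm Machine pays only stage 3

print(f"cold: {cold_total}s, warm: {warm_total}s")  # cold: 45s, warm: 5s
```

Stage 2 dominates, so every mitigation below attacks either its byte count or its bandwidth.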

Canonical wiki datum — Fly.io a100-40gb / LLaVA-34b ≈ 45 s

Fly.io's 2024-05-09 image-description walkthrough discloses a concrete number on the a100-40gb preset with the 34b-parameter LLaVA: "starting it up took another handful of seconds, followed by several tens of seconds to load the model into GPU RAM. The total time from cold start to completed description was about 45 seconds." — canonical three-stage breakdown in a production Fly platform configuration. (Source: sources/2024-05-09-flyio-picture-this-open-source-ai-for-image-description)

Mitigations (and their cost)

  • Keep the Machine warm (scale-to-one-not-zero). Stage 1 and stage 2 gone; pay idle GPU rental. The trade-off this concept exists to frame.
  • Keep the model on a Fly Volume or local NVMe cache. Stage 2 disk-read cost capped by local NVMe bandwidth — avoids pulling weights from object storage on every cold start. See systems/fly-volumes.
  • Bake the model into the Docker image. No runtime fetch at stage 2; larger image, slower Machine-create on first boot.
  • Pre-warm on request arrival at L7 gateway. Route the first request to a warm path while the Machine starts in the background; useful only when the tolerable latency budget exceeds the warm path's own latency.
  • Quantised / smaller model. Fewer bytes to load, smaller HBM footprint — direct stage 2 reduction.
  • Compiled-kernel cache. Ships the torch.compile / triton autotune results with the image, cutting stage 3's first-forward-pass JIT cost.
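Why the NVMe-cache and quantisation mitigations both target stage 2 falls out of a back-of-envelope bandwidth model. A minimal sketch, where the bandwidth figures (3 GB/s NVMe read, 25 GB/s effective PCIe transfer) and the sequential read-then-transfer model are assumptions for illustration:

```python
def stage2_load_seconds(weight_bytes: float, read_gbps: float, pcie_gbps: float) -> float:
    """Rough stage-2 estimate: storage read then PCIe host-to-HBM transfer,
    modelled as sequential. Bandwidths are illustrative, not platform guarantees."""
    GB = 1e9
    return weight_bytes / (read_gbps * GB) + weight_bytes / (pcie_gbps * GB)

# ~68 GB of fp16 weights for a 34B-parameter model (2 bytes/param)
fp16_bytes = 34e9 * 2
int4_bytes = 34e9 * 0.5  # 4-bit quantisation: a quarter of the bytes

nvme_fp16 = stage2_load_seconds(fp16_bytes, read_gbps=3.0, pcie_gbps=25.0)
nvme_int4 = stage2_load_seconds(int4_bytes, read_gbps=3.0, pcie_gbps=25.0)
```

Under these assumptions the fp16 load lands in the mid-20s of seconds, consistent with the "tens of seconds" stage-2 figure, and 4-bit quantisation cuts it by 4x — fewer bytes through the same pipe.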

When scale-to-zero is worth the tail

  • Long idle gaps + small user base. GPU hourly cost dominates; 20-second cold starts on the first call of the day are acceptable for interactive dev tooling, weekend hobby services, low-QPS internal tools.
  • Deterministic start triggers. The Flycast + Fly Proxy autostop pattern ensures the Machine only wakes on legitimate app-internal traffic, not internet scans — so idle truly is idle and the tail is visible only to real users.
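The "long idle gaps + small user base" case is at bottom a cost comparison. A minimal sketch of the monthly break-even, where the hourly rate and busy-hours figures are hypothetical placeholders:

```python
# Back-of-envelope: always-warm GPU rental vs. scale-to-zero billing.
# Hourly rate and busy-hours-per-day are illustrative assumptions.
def monthly_cost_warm(hourly_usd: float) -> float:
    """Keep-warm: pay for every hour of a 30-day month."""
    return hourly_usd * 24 * 30

def monthly_cost_scale_to_zero(hourly_usd: float, busy_hours_per_day: float) -> float:
    """Scale-to-zero: pay only for hours the Machine is actually running."""
    return hourly_usd * busy_hours_per_day * 30

warm = monthly_cost_warm(3.0)                  # hypothetical a100-class rate
s2z = monthly_cost_scale_to_zero(3.0, 2.0)     # low-QPS internal tool, ~2 busy h/day
```

At these assumed numbers the always-warm bill is 12x the scale-to-zero bill, which is the asymmetry that makes a tens-of-seconds tail on the first call of the day an acceptable trade.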

When it's not

  • Interactive user-facing chat with any SLO. 45-second tails fail every modern chat-latency expectation.
  • Spiky bursty traffic where the second request lands during the first cold start. Without queuing semantics the second request pays the same full tail (or worse, if the runtime serialises requests during warmup).
  • Models whose warm per-request latency already exceeds a second or two. The cold start adds on top, so the user-facing tail is cold-start time plus warm latency, which grows quickly.
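The bursty-traffic failure mode above can be made concrete. A sketch of the tail seen by a second request landing mid-cold-start, assuming the runtime serialises requests during warmup (all durations are illustrative):

```python
# Tail latency for requests queued behind one cold start, assuming the
# runtime serialises requests during warmup. Durations are illustrative.
cold_start_s = 40.0    # stages 1 + 2: boot plus model load
warm_latency_s = 5.0   # stage 3: one warm forward pass

def tail_s(arrival_offset_s: float, queue_position: int) -> float:
    """Latency seen by a request arriving offset seconds after the first,
    with queue_position requests serialised ahead of it once warm."""
    remaining_warmup = max(0.0, cold_start_s - arrival_offset_s)
    return remaining_warmup + (queue_position + 1) * warm_latency_s

first = tail_s(0.0, 0)    # full cold start + its own response
second = tail_s(10.0, 1)  # lands 10s in, still pays most of the tail
```

Under these assumptions the first request waits 45 s and the second still waits 40 s — arriving mid-warmup buys back almost nothing, which is why spiky traffic defeats per-request autostart.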

Contrast — notebook-driven GPU cluster scale-to-zero

The sibling concepts/scale-to-zero instance from Fly.io's 2024-09-24 Livebook/FLAME post has a different shape: a 64-GPU cluster of Fly Machines spun up on notebook-cell execution, running a BERT hyperparameter sweep, evaporating on Livebook disconnect. That workload pays the same three-stage GPU cold-start per node, but the user-facing latency is the cluster-formation time ("start a cluster of GPUs in seconds rather than minutes" per concepts/seconds-scale-gpu-cluster-boot), not per-request — because the notebook workload doesn't need per-request autostart; the user triggered the cluster explicitly.

This concept captures the per-request-autostart variant where the end user sends an inference request and waits for the cold start, mediated by a proxy.
