
Seconds-scale GPU cluster boot

The property of a compute platform where a multi-node cluster of GPU-equipped machines, defined by a Docker image, can be booted in seconds rather than minutes. This is the platform-level difference that distinguishes an elastic-GPU workflow that feels like "run this code" from one that feels like "wait for the batch infrastructure to come up".

Definition

"Fly's infrastructure played a key role in making it possible to start a cluster of GPUs in seconds rather than minutes, and all it requires is a Docker image."

— (Source: sources/2024-09-24-flyio-ai-gpu-clusters-from-your-laptop-with-livebook)

The specific architectural commitments that hold this up on Fly Machines:

  • Firecracker boot time. Firecracker micro-VMs boot in hundreds of milliseconds to low seconds for a single VM. See concepts/cold-start for the general cold-start concern.
  • Docker image as the deployment artifact, not a custom AMI. No image build at provisioning time; the image is pulled and booted.
  • Per-Machine scheduling without a queue. A Fly Machine goes from "create" to "running" without passing through a batch scheduler's admission queue. 64 machines in parallel means 64 in-flight create calls, not a serialised fan-out.
  • GPU via whole-GPU passthrough. No per-boot GPU initialisation beyond standard PCI-passthrough bringup; see systems/nvidia-l40s for the specific GPU model in the canonical demo.
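The "per-Machine scheduling without a queue" point can be sketched against the Fly Machines REST API: 64 machines is just 64 concurrent `POST /apps/{app}/machines` calls, each carrying the Docker image as the whole deployment artifact. The app name, image tag, and guest sizing below are illustrative assumptions, not values from the source post.

```python
# Hedged sketch: fan out N Machine creates in parallel via the Fly
# Machines API. App name, image, and guest sizing are hypothetical.
import json
import os
import urllib.request
from concurrent.futures import ThreadPoolExecutor

API = "https://api.machines.dev/v1"

def machine_config(image: str, gpu_kind: str = "l40s") -> dict:
    """JSON body for POST /apps/{app}/machines (subset of fields)."""
    return {
        "config": {
            "image": image,  # the Docker image is the artifact; no AMI build
            "guest": {"gpu_kind": gpu_kind, "cpus": 8, "memory_mb": 32768},
        },
    }

def create_machine(app: str, token: str, body: dict) -> dict:
    req = urllib.request.Request(
        f"{API}/apps/{app}/machines",
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__" and os.environ.get("FLY_API_TOKEN"):
    token = os.environ["FLY_API_TOKEN"]
    bodies = [machine_config("registry.fly.io/my-livebook:latest")
              for _ in range(64)]
    # 64 in-flight create calls, not a serialised fan-out:
    with ThreadPoolExecutor(max_workers=64) as pool:
        machines = list(pool.map(
            lambda b: create_machine("my-gpu-cluster", token, b), bodies))
    print(f"started {len(machines)} machines")
```

Because each create is an independent API call with no batch-scheduler admission queue in front of it, the cluster's time-to-running is roughly the slowest single boot, not the sum of 64 boots.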

Why it matters

Without seconds-scale boot:

  • Notebook-driven workflows (see patterns/notebook-driven-elastic-compute) don't feel like local code — each cell would have an awkward, multi-minute warmup.
  • Scale-to-zero economics only pay off if the re-up cost is trivial; minutes-long cold starts push users toward keeping capacity warm.
  • Hyperparameter-tuning-style fan-outs (64 parallel variants) become a coordination problem instead of a quick experiment.

Caveats

  • No concrete p50/p95 in the source. Fly.io's language is "seconds rather than minutes"; the post does not publish a latency histogram for GPU-Machine cold-boot from a user Docker image. Treat as a directional claim.
  • GPU drivers + model weights still need to land. Boot time covers the VM; loading a 70B-parameter model's weights to VRAM is separate. Some of the pipeline patterns (patterns/co-located-inference-gpu-and-object-storage) are designed to keep this fast.
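The weights caveat is easy to make concrete with back-of-the-envelope arithmetic. The 5 GB/s read rate below is an illustrative assumption for co-located object storage, not a figure from the source:

```python
# Back-of-the-envelope: weight loading for a 70B-parameter fp16 model
# dwarfs a seconds-scale VM boot. The throughput figure is assumed.
params = 70e9
bytes_per_param = 2                            # fp16 weights
weights_gb = params * bytes_per_param / 1e9    # 140 GB of weights

read_gb_per_s = 5.0                            # assumed storage throughput
load_seconds = weights_gb / read_gb_per_s      # 28 s just for the pull

print(f"{weights_gb:.0f} GB of weights ~ {load_seconds:.0f} s at "
      f"{read_gb_per_s:.0f} GB/s")
```

Even under this generous assumption, the weight pull alone takes tens of seconds, which is why the boot-time claim and the time-to-first-inference are separate numbers.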