CONCEPT Cited by 1 source
Seconds-scale GPU cluster boot¶
The property of a compute platform whereby a multi-node cluster of GPU-equipped machines, defined by nothing more than a Docker image, can be booted in seconds rather than minutes. This is the platform-level difference that distinguishes an elastic-GPU workflow that feels like "run this code" from one that feels like "wait for the batch infrastructure to come up".
Definition¶
"Fly's infrastructure played a key role in making it possible to start a cluster of GPUs in seconds rather than minutes, and all it requires is a Docker image."
— (Source: sources/2024-09-24-flyio-ai-gpu-clusters-from-your-laptop-with-livebook)
The specific architectural commitments that hold this up on Fly Machines:
- Firecracker boot time. Firecracker micro-VMs boot in hundreds of milliseconds to low seconds for a single VM. See concepts/cold-start for the general cold-start concern.
- Docker-image as the deployment artifact, not a custom AMI. No image build at provisioning time; the image is pulled and booted.
- Per-Machine scheduling without a queue. A Fly Machine goes from "create" to "running" without passing through a batch scheduler's admission queue, so booting 64 Machines in parallel means 64 in-flight create calls, not a serialised fan-out.
- GPU via whole-GPU passthrough. No per-boot GPU initialisation beyond standard PCI-passthrough bringup; see systems/nvidia-l40s for the specific GPU model in the canonical demo.
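The fan-out shape these commitments imply can be sketched as parallel calls against the Fly Machines REST API. This is an illustrative sketch, not a tested client: the app name, image, guest sizing, and the stubbed `create_machine` body are assumptions; real code would POST with an auth token as shown in the comment.

```python
# Sketch: N in-flight Machine-create calls, each defined only by a Docker
# image. Payload shape follows the public Fly Machines API; values are
# illustrative placeholders.
import concurrent.futures

APP = "my-gpu-cluster"  # hypothetical app name

def machine_config(image: str, gpu_kind: str = "l40s") -> dict:
    """Build the JSON body for POST /v1/apps/{APP}/machines."""
    return {
        "config": {
            "image": image,  # the Docker image IS the deployment artifact
            "guest": {
                "gpu_kind": gpu_kind,   # whole-GPU passthrough
                "cpus": 8,
                "memory_mb": 32768,
            },
        }
    }

def create_machine(i: int) -> str:
    # A real client would do something like:
    #   requests.post(f"https://api.machines.dev/v1/apps/{APP}/machines",
    #                 json=machine_config("registry.fly.io/trainer:latest"),
    #                 headers={"Authorization": f"Bearer {token}"})
    return f"machine-{i}"  # stubbed: no network calls in this sketch

# 64 parallel creates: 64 in-flight calls, not a serialised fan-out.
with concurrent.futures.ThreadPoolExecutor(max_workers=64) as pool:
    ids = list(pool.map(create_machine, range(64)))
print(len(ids))  # 64
```

Because there is no admission queue, the wall-clock cost of the whole fan-out is roughly one boot latency, not 64 of them.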
Why it matters¶
Without seconds-scale boot:
- Notebook-driven workflows (see patterns/notebook-driven-elastic-compute) don't feel like local code — each cell would have an awkward, multi-minute warmup.
- Scale-to-zero economics only pay off if the re-up cost is trivial; minutes-long cold starts push users toward keeping capacity warm.
- Hyperparameter-tuning-style fan-outs (64 parallel variants) become a coordination problem instead of a quick experiment.
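A back-of-envelope model makes the list above concrete. The per-variant training time and both boot latencies below are illustrative assumptions, not figures from the source:

```python
# Wall-clock cost of a 64-variant sweep under two boot profiles.
# Boots happen in parallel, so end-to-end time pays one boot latency
# plus one training run (all 64 variants run concurrently).
nodes = 64
train_s = 120                  # assumed per-variant fine-tune time (s)
boot_fast, boot_slow = 5, 180  # "seconds" vs "minutes" cold boot (s)

fast = boot_fast + train_s  # 125 s end to end
slow = boot_slow + train_s  # 300 s end to end
print(fast, slow)
print(boot_slow / slow)  # minutes-scale boot is 60% of wall clock
```

Under these assumptions, a minutes-scale boot more than doubles the experiment's wall clock and dominates it, which is exactly the "warmup tax" that makes notebook cells stop feeling like local code.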
Seen in¶
- sources/2024-09-24-flyio-ai-gpu-clusters-from-your-laptop-with-livebook — canonical wiki instance; 64-node L40S cluster booted from a Livebook cell for BERT hyperparameter tuning, with real-time streamed fine-tuning curves.
Caveats¶
- No concrete p50/p95 in the source. Fly.io's language is "seconds rather than minutes"; the post does not publish a latency histogram for GPU-Machine cold-boot from a user Docker image. Treat as a directional claim.
- GPU drivers + model weights still need to land. Boot time covers the VM; loading a 70B-parameter model's weights to VRAM is separate. Some of the pipeline patterns (patterns/co-located-inference-gpu-and-object-storage) are designed to keep this fast.
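The second caveat is worth quantifying. A rough worked example, using assumed precision and link speeds (none of these figures are from the source):

```python
# Weight-loading cost is separate from VM boot time.
params = 70e9        # 70B-parameter model
bytes_per_param = 2  # assuming fp16/bf16 weights
weights_gb = params * bytes_per_param / 1e9  # 140 GB

# Assumed transfer rates; real throughput varies widely.
for label, gbit_s in [("1 Gbit/s pull", 1), ("10 Gbit/s co-located pull", 10)]:
    seconds = weights_gb * 8 / gbit_s
    print(f"{label}: ~{seconds / 60:.0f} min")
```

At these assumed rates the weights take on the order of minutes to land, which is why pipeline patterns like co-locating inference GPUs with object storage matter: a seconds-scale VM boot buys little if the weight transfer re-introduces the minutes-scale wait.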
Related¶
- systems/fly-machines — the platform whose boot profile makes this concept real.
- systems/firecracker — the isolation/boot primitive.
- concepts/cold-start — general serverless cold-start concept; this is the GPU-cluster specialisation.
- concepts/scale-to-zero — the economic-side complement; the two together make elastic GPU feel local.
- patterns/notebook-driven-elastic-compute — the end-user pattern this concept enables.