
Seconds-scale GPU cluster boot

The property of a compute platform where a multi-node cluster of GPU-equipped machines, defined by a Docker image, can be booted in seconds rather than minutes. This is the platform-level difference that distinguishes an elastic-GPU workflow that feels like "run this code" from one that feels like "wait for the batch infrastructure to come up".

Definition

"Fly's infrastructure played a key role in making it possible to start a cluster of GPUs in seconds rather than minutes, and all it requires is a Docker image."

— (Source: sources/2024-09-24-flyio-ai-gpu-clusters-from-your-laptop-with-livebook)

The specific architectural commitments that hold this up on Fly Machines:

  • Firecracker boot time. Firecracker micro-VMs boot in hundreds of milliseconds to low seconds for a single VM. See concepts/cold-start for the general cold-start concern.
  • Docker image as the deployment artifact, not a custom AMI. No image build at provisioning time; the image is pulled and booted.
  • Per-Machine scheduling without a queue. A Fly Machine goes from "create" to "running" without passing through a batch scheduler's admission queue. 64 machines in parallel means 64 in-flight create calls, not a serialised fan-out.
  • GPU via whole-GPU passthrough. No per-boot GPU initialisation beyond standard PCI-passthrough bringup; see systems/nvidia-l40s for the specific GPU model in the canonical demo.
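The "per-Machine scheduling without a queue" point can be sketched against the Fly Machines REST API: 64 machines is just 64 concurrent `POST /apps/{app}/machines` calls, each carrying the Docker image as the whole deployment artifact. The app name, image tag, and guest sizing below are illustrative assumptions, not values from the source post.

```python
# Hedged sketch: fan out N Machine creates in parallel via the Fly
# Machines API. App name, image, and guest sizing are hypothetical.
import json
import os
import urllib.request
from concurrent.futures import ThreadPoolExecutor

API = "https://api.machines.dev/v1"

def machine_config(image: str, gpu_kind: str = "l40s") -> dict:
    """JSON body for POST /apps/{app}/machines (subset of fields)."""
    return {
        "config": {
            "image": image,  # the Docker image is the artifact; no AMI build
            "guest": {"gpu_kind": gpu_kind, "cpus": 8, "memory_mb": 32768},
        },
    }

def create_machine(app: str, token: str, body: dict) -> dict:
    req = urllib.request.Request(
        f"{API}/apps/{app}/machines",
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__" and os.environ.get("FLY_API_TOKEN"):
    token = os.environ["FLY_API_TOKEN"]
    bodies = [machine_config("registry.fly.io/my-livebook:latest")
              for _ in range(64)]
    # 64 in-flight create calls, not a serialised fan-out:
    with ThreadPoolExecutor(max_workers=64) as pool:
        machines = list(pool.map(
            lambda b: create_machine("my-gpu-cluster", token, b), bodies))
    print(f"started {len(machines)} machines")
```

Because each create is an independent API call with no batch-scheduler admission queue in front of it, the cluster's time-to-running is roughly the slowest single boot, not the sum of 64 boots.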

Why it matters

Without seconds-scale boot:

  • Notebook-driven workflows (see patterns/notebook-driven-elastic-compute) don't feel like local code — each cell would have an awkward, multi-minute warmup.
  • Scale-to-zero economics only pay off if the re-up cost is trivial; minutes-long cold starts push users toward keeping capacity warm.
  • Hyperparameter-tuning-style fan-outs (64 parallel variants) become a coordination problem instead of a quick experiment.

Caveats

  • No concrete p50/p95 in the source. Fly.io's language is "seconds rather than minutes"; the post does not publish a latency histogram for GPU-Machine cold-boot from a user Docker image. Treat as a directional claim.
  • GPU drivers + model weights still need to land. Boot time covers the VM; loading a 70B-parameter model's weights to VRAM is separate. Some of the pipeline patterns (patterns/co-located-inference-gpu-and-object-storage) are designed to keep this fast.
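The weights caveat is easy to make concrete with back-of-the-envelope arithmetic. The 5 GB/s read rate below is an illustrative assumption for co-located object storage, not a figure from the source:

```python
# Back-of-the-envelope: weight loading for a 70B-parameter fp16 model
# dwarfs a seconds-scale VM boot. The throughput figure is assumed.
params = 70e9
bytes_per_param = 2                            # fp16 weights
weights_gb = params * bytes_per_param / 1e9    # 140 GB of weights

read_gb_per_s = 5.0                            # assumed storage throughput
load_seconds = weights_gb / read_gb_per_s      # 28 s just for the pull

print(f"{weights_gb:.0f} GB of weights ~ {load_seconds:.0f} s at "
      f"{read_gb_per_s:.0f} GB/s")
```

Even under this generous assumption, the weight pull alone takes tens of seconds, which is why the boot-time claim and the time-to-first-inference are separate numbers.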