
FLYIO 2024-09-24 Tier 3


AI GPU Clusters, From Your Laptop, With Livebook

Fly.io's 2024-09-24 recap of Chris McCord's and Chris Grainger's ElixirConf 2024 keynote on Livebook + FLAME + the Nx stack — three Elixir components that together let a developer drive elastic GPU compute on Fly Machines directly from a notebook running on a laptop. The substantive architectural claim is about what the Erlang VM's native cluster-computing primitives enable when Fly.io is the underlying machine-starter: a Livebook cell can spin up 64 L40S GPU Machines, distribute compiled code to each, stream results back in real time, and evaporate the whole cluster on disconnect — "in seconds rather than minutes, and all it requires is a Docker image".

Summary

Livebook is Elixir's Jupyter analogue but with one critical difference: it can connect its runtime to any Erlang/Elixir cluster — including nodes started on demand in Fly.io's cloud — and the BEAM cluster's code-distribution primitives let code and module definitions in the notebook be dispatched transparently across every node. FLAME is the Elixir framework that turns an arbitrary code block into elastically scheduled remote execution via FLAME.call, managing a pool of executors (with min/max/concurrency settings) without requiring the application to be decomposed into serverless functions. The Nx stack (Nx / Axon / Bumblebee) provides GPU-backed tensor compute, ML model interfaces, and a registry of pre-built models. The keynote demos two workflows: a batch job, driven from a notebook, that extracts stills from video files and summarises them with Llama; and a BERT hyperparameter-tuning run across 64 GPU-backed Fly Machines — each a different BERT variant — streaming loss curves back to the notebook in real time. The post closes with Fly.io's own architectural framing: seconds-scale boot of a Docker-image-defined GPU cluster is what makes this notebook-driven workflow feel like local code, and a Kubernetes-side integration by Michael Ruoss (Livebook v0.14.1) now offers the same runtime/FLAME combination on K8s clusters.

Key takeaways

  1. Notebook-driven elastic compute on Fly Machines. Any Livebook — including the one on your laptop — can start a runtime inside a Fly Machine in the user's default Fly.io org, giving the notebook networked access to every other app in that org's private network "without doing any network or infrastructure engineering to make that happen". Livebook also attaches as a debugger/introspector to any running Erlang/Elixir app in that infrastructure. This is the canonical wiki instance of patterns/notebook-driven-elastic-compute. (Source: sources/2024-09-24-flyio-ai-gpu-clusters-from-your-laptop-with-livebook)

  2. FLAME turns arbitrary blocks of code into elastic remote work. Mark off a region of code with FLAME.call; the framework runs it on a pool of executors configured with min/max instance counts and concurrency settings. The original motivating example was inline ffmpeg calls that would normally need a job queue or a Lambda function. FLAME is the framework-level instantiation of patterns/framework-managed-executor-pool. "It's the upside of serverless without committing yourself to blowing your app apart into tiny, intricately connected pieces." (Source: same)

  3. BEAM's native code-distribution means notebook code runs anywhere the cluster runs. Livebook "automatically synchronize[s] your notebook dependencies as well as any module or code defined in your notebook across nodes" — so a user-defined module written in the notebook is callable from any FLAME executor without packaging or deploying anything. Auto-completion on the notebook comes from modules defined on the remote node — i.e. Livebook is a thin client over the cluster's introspection primitives. The concept wiki instance is concepts/transparent-cluster-code-distribution. (Source: same)

  4. 64 L40S Fly Machines for BERT hyperparameter tuning, driven from a Livebook cell. Chris Grainger (CTO, Amplified) demonstrated generating a cluster of 64 Fly Machines, each with an L40S GPU, each compiling a different BERT variant (different parameters, optimizers) against the same patent corpus, with per-node fine-tuning curves streamed back to the Livebook in real time. The entire cluster terminates when the Livebook runtime disconnects — pure scale-to-zero economics over ephemeral GPU capacity. (Source: same)

  5. Seconds-scale GPU-cluster boot from a Docker image is the load-bearing Fly.io contribution. "Fly's infrastructure played a key role in making it possible to start a cluster of GPUs in seconds rather than minutes, and all it requires is a Docker image." This is the platform-level claim that makes the notebook UX work — a multi-minute per-machine boot would collapse the illusion of "run this locally or out there, same latency". Wiki concept: concepts/seconds-scale-gpu-cluster-boot. (Source: same)

  6. End-to-end AI pipeline inside a notebook, no queueing layer. The first demo walks an object-store bucket of video files, runs ffmpeg per file to extract stills, streams stills to Llama on GPU Fly Machines (org-scoped) for descriptions, streams results back to the notebook, and feeds them to Mistral for a final summary. As nodes finish their per-video work, new work is dispatched until the bucket is drained; each node idles out and the cluster terminates on disconnect. The pipeline is expressed as normal Elixir code, not as a DAG of queues/functions. (Source: same)

  7. Access control "mostly just does what you want it to". By default, Fly.io isn't exposing the notebook runtime Machine to the internet or to other Fly customers — it runs inside the user's org, with networked access to apps in that org (e.g. the database used to generate a report). The post frames this as the canonical not-thinking-about-network-engineering property that the Fly-org + private-network model buys you. (Source: same)

  8. Four months, part-time, three engineers — ElixirConf EU to US delivery. The Livebook/FLAME integration was first suggested by Chris Grainger at ElixirConf EU; Jonatan Kłosko, Chris McCord, and José Valim implemented it part-time over four months for ElixirConf US. Remote dataframes + distributed GC shipped in explorer#932 "over a weekend". Fly.io's framing: this is a testament to BEAM's capabilities — "bringing the same functionality to other ecosystems would take several additional months, sometimes accompanied by millions in funding, and often times as part of a closed-source product". (Source: same)

  9. Kubernetes integration at parity. Since the feature shipped, Michael Ruoss contributed the same runtime + FLAME functionality for Kubernetes — Livebook v0.14.1 can start Livebook runtimes inside a K8s cluster and elastically scale them with FLAME. The architectural pattern is substrate-independent: the same primitive (BEAM cluster with a machine-starter backend) works over Fly Machines or K8s Pods. (Source: same)
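The pool semantics in takeaway 2 can be sketched with FLAME's public API. A minimal sketch — the pool name, limits, and the ffmpeg invocation are illustrative, not the keynote's actual code:

```elixir
# In the application's supervision tree: a named pool of runners.
# FLAME boots up to `max` machines on the configured backend (e.g. the
# Fly backend), runs up to `max_concurrency` calls per machine, scales
# to `min` when idle, and shuts idle runners down after the timeout.
children = [
  {FLAME.Pool,
   name: MyApp.FFMpegRunner,       # hypothetical pool name
   min: 0,                         # scale-to-zero when nothing is queued
   max: 10,
   max_concurrency: 5,
   idle_shutdown_after: 30_000}
]

# Anywhere in the app, or in a Livebook cell attached to the cluster:
# the function runs on a pooled remote machine, not locally, and the
# result is returned to the caller.
{_output, 0} =
  FLAME.call(MyApp.FFMpegRunner, fn ->
    System.cmd("ffmpeg", ["-i", "video.mp4", "-frames:v", "1", "out.jpg"])
  end)
```

Because the cluster distributes module definitions (takeaway 3), a module defined in a notebook cell is callable inside the FLAME.call block without packaging or deploying anything.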

Systems named

  • systems/livebook — Elixir's Jupyter analogue; drives the runtime and distributes code/modules across the cluster.
  • systems/flame-elixir — Framework-managed elastic executor pool for arbitrary Elixir blocks; FLAME.call is the API.
  • systems/nx-elixir — Elixir AI/ML stack: Nx (tensors + GPU backends), Axon (ML model interface), Bumblebee (pre-built models as a couple of lines of code).
  • systems/erlang-vm — BEAM: the virtual machine whose native cluster primitives (code distribution, remote introspection) make FLAME + Livebook possible without reinventing the plumbing.
  • systems/fly-machines — Per-Machine Firecracker micro-VM in the user's Fly org; what Livebook starts as runtime, what FLAME pools.
  • systems/nvidia-l40s — The specific GPU shape demoed in the 64-node hyperparameter cluster.
  • systems/llama-3-1 — Named inference model in the video-description demo.
  • systems/kubernetes — Ruoss's v0.14.1 port of the same Livebook-runtime + FLAME model to K8s.

Concepts named

  • concepts/transparent-cluster-code-distribution — BEAM's native code distribution: notebook-defined modules and dependencies are synchronised across every node in the cluster.
  • concepts/seconds-scale-gpu-cluster-boot — Fly.io's platform-level claim: a Docker-image-defined GPU Machine boots in seconds rather than minutes.

Patterns named

  • patterns/notebook-driven-elastic-compute — a notebook (Livebook) as the driver for elastic cloud compute, with the cluster terminating on disconnect.
  • patterns/framework-managed-executor-pool — FLAME's model: arbitrary code blocks scheduled onto a framework-managed pool of executors (min/max/concurrency).

Operational numbers

  • 64 — number of L40S-equipped Fly Machines in the BERT hyperparameter tuning demo.
  • Seconds — Fly.io's claimed boot time for a Docker-image-defined GPU Machine; contrasted against the industry-typical "minutes".
  • 4 months — part-time implementation window between ElixirConf EU (proposal) and ElixirConf US (working demo) for the Livebook/FLAME integration.
  • A weekend — implementation time for remote dataframes + distributed GC in explorer#932.
  • Livebook v0.14.1 — version that shipped the Kubernetes-side runtime + FLAME integration (Michael Ruoss).

Caveats

  • Conference-keynote recap. This is a Fly.io Tier-3 post recapping an ElixirConf keynote; it is part product-PR and part architectural claim. The architectural substance — BEAM cluster + Fly-Machine runtime + FLAME executor pool + Nx on GPU — is the load-bearing content and is treated as such here. The marketing frame is noted but not wiki-loaded.
  • No internal-architecture detail on FLAME's pool manager. The post describes Flame.call semantics (min/max instances, concurrency, idle timeout, termination on disconnect) but does not cover the pool manager's failure model, backpressure, or retry semantics; those are FLAME-project concerns, not Fly.io's.
  • "Seconds" boot is aspirational framing. Fly.io does not give a concrete p50/p95 for cold-booting a GPU Machine with a user-supplied Docker image in this post — "seconds" is the claim. Related Fly.io posts (e.g. sources/2024-08-15-flyio-were-cutting-l40s-prices-in-half) provide more context on the GPU stack but not boot-latency numbers.
  • Elixir-ecosystem-specific. The pattern generalises, but the specific implementation ease Fly.io calls out ("weekends" to ship distributed GC) rests on BEAM's existing distribution primitives — i.e. other-ecosystem parity is not free, as Fly.io acknowledges.

Source

sources/2024-09-24-flyio-ai-gpu-clusters-from-your-laptop-with-livebook