# Notebook-driven elastic compute
A notebook cell is the unit of work; the cluster is elastic capacity behind it. The user writes code locally; the runtime spins up compute nodes on demand, streams results back, and tears the cluster down on disconnect. The UX commitment: the same code runs locally or remotely.
## Shape
- Notebook client runs anywhere (laptop, container). It drives a runtime that may be local or remote.
- Runtime attaches to a cluster substrate. On-demand compute nodes are provisioned from a Fly org, a Kubernetes cluster, or equivalent — the cluster is visible to the runtime through a runtime-level primitive (concepts/transparent-cluster-code-distribution in the BEAM case).
- Cells dispatch work across the cluster. A code block wrapped in (e.g.) `Flame.call` runs on a pool of executors; the pool size (min/max/concurrency) is declared by the user and enforced by a framework-managed executor pool.
- Results stream back to the notebook in real time. Per-node progress (fine-tuning loss curves, per-image descriptions, etc.) is visible as soon as each node produces it, not only at batch completion.
- Cluster terminates on disconnect. If the notebook runtime disconnects, executors shut down; no long-lived capacity is left behind. This is scale-to-zero fully realised.
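The shape above can be sketched with the public FLAME API. The pool name, limits, and `heavy_computation/0` below are illustrative assumptions, not values from the source:

```elixir
# Pool declaration, placed in the runtime's supervision tree.
# min: 0 gives scale-to-zero; max bounds the elastic capacity.
children = [
  {FLAME.Pool,
   name: MyApp.GpuPool,            # hypothetical pool name
   min: 0,
   max: 64,
   max_concurrency: 1,             # one unit of work per node
   idle_shutdown_after: :timer.minutes(1)}
]

# In a notebook cell: the wrapped block is the elastic region.
# It runs on a pooled executor; the result streams back to the caller.
FLAME.call(MyApp.GpuPool, fn ->
  heavy_computation()              # hypothetical workload
end)
```

The notebook code stays ordinary Elixir; only the `FLAME.call` boundary marks what runs remotely.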
## Why it works on Fly Machines
The 2024-09-24 Fly.io post makes the integration story explicit:
- Seconds-scale GPU-cluster boot from a Docker image. 64 L40S Fly Machines in the BERT hyperparameter-tuning demo start fast enough that the notebook UX remains interactive.
- Fly-org private network. Notebook runtimes start in the user's org, with networked access to all other apps in that org without explicit network engineering. "This is an access control situation that mostly just does what you want it to do without asking."
- FLAME as the executor-pool library. `Flame.call` marks the elastic region; the framework handles pool bring-up and tear-down.
- BEAM code distribution. A module defined in a Livebook cell runs on every executor without a deploy step.
The same architectural pattern is now available on Kubernetes (Livebook v0.14.1) via Michael Ruoss's runtime and FLAME port; the pattern is substrate-independent.
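The code-distribution point deserves emphasis: because the BEAM ships code to connected nodes, a module defined in a notebook cell is callable inside the elastic region with no build or deploy step. A minimal sketch (module, pool name, and the placeholder body are illustrative assumptions):

```elixir
# Defined in a Livebook cell; never packaged or deployed.
defmodule FrameDescriber do
  # Placeholder for a real model invocation (e.g. via Nx/Bumblebee).
  def describe(frame_path) do
    "description of #{frame_path}"
  end
end

# The module is transparently available on the executor node
# when the closure runs there.
FLAME.call(MyApp.GpuPool, fn ->
  FrameDescriber.describe("frame_001.jpg")
end)
```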
## Use cases in the canonical source
- End-to-end AI pipeline without a queue/DAG layer. Stills from videos → Llama on GPU Machines → descriptions → Mistral → final summary. Entire flow is Elixir code in a notebook.
- Hyperparameter-tuning fan-out. 64 BERT variants compiled and fine-tuned on 64 GPU Machines, driven from a single Livebook cell, with per-node curves streaming back.
- Debug/introspect a running production app. Livebook attaches to a running Elixir application (e.g. rtt.fly.dev) for ad-hoc introspection; auto-completion comes from the remote node's modules.
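The fan-out use case can be sketched as a concurrent map over trial configurations, with each trial dispatched through the pool. Everything here (`configs`, `run_trial/1`, `MyApp.GpuPool`) is a hypothetical stand-in for the demo's BERT fine-tuning code:

```elixir
# Hypothetical hyperparameter grid.
configs =
  for lr <- [1.0e-5, 3.0e-5], batch <- [16, 32], do: %{lr: lr, batch: batch}

configs
|> Task.async_stream(
  fn config ->
    # Each trial claims one pooled executor (a GPU Machine).
    FLAME.call(MyApp.GpuPool, fn -> run_trial(config) end)
  end,
  max_concurrency: length(configs),
  timeout: :infinity,
  ordered: false
)
|> Enum.each(fn {:ok, result} ->
  # Results arrive as each node finishes, so the notebook can
  # render per-trial loss curves incrementally rather than
  # waiting for the whole batch.
  IO.inspect(result)
end)
```

With `ordered: false`, the stream yields whichever trial completes first, which is what makes the per-node streaming UX possible.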
## Seen in
- sources/2024-09-24-flyio-ai-gpu-clusters-from-your-laptop-with-livebook — canonical wiki instance; Livebook + FLAME + Nx + Fly Machines.
## Anti-patterns avoided
- No function-per-operation decomposition. Unlike Lambda/Cloud-Functions style serverless, the whole app stays as one program; only the executor pool is elastic.
- No static cluster capacity. Unlike fixed Spark/Dask clusters, idle cost is zero.
- No separate orchestration DAG. Unlike Airflow/Kubeflow, the notebook cell is the pipeline description.
## Related
- systems/livebook — the notebook client.
- systems/flame-elixir — the framework that manages the executor pool.
- systems/fly-machines — the canonical substrate in the wiki source.
- systems/kubernetes — alternative substrate (v0.14.1).
- concepts/scale-to-zero — the economic property.
- concepts/seconds-scale-gpu-cluster-boot — the platform-latency property that makes the UX work.
- concepts/transparent-cluster-code-distribution — the runtime-level primitive Livebook exposes.
- patterns/framework-managed-executor-pool — the library-architecture pattern underneath.