PATTERN Cited by 1 source
Start-fast / create-slow Machine lifecycle¶
Shape¶
Expose two distinct Machine-lifecycle primitives through the compute API:
- create: instantiate a new Machine from an image. Slow (image pull, filesystem layout, orchestrator registration). Billed.
- start: re-start a Machine that has been stopped but exists. Fast (image already laid out, Machine already registered). Billed while running; not billed while stopped.
The API-level distinction is preserved — there is not a single
"run my Machine" button. Clients are expected to create
once, start/stop many times. The pattern trades a bit of
API surface complexity for an order-of-magnitude faster
resume-from-idle path.
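
A minimal sketch of the "create once, start/stop many times" client discipline described above. `FakeComputeAPI` and `ensure_running` are hypothetical names invented for illustration; the class is an in-memory stand-in, not the Fly Machines API, and it assumes that create leaves the Machine running.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    STARTED = "started"
    STOPPED = "stopped"


@dataclass
class FakeComputeAPI:
    """Hypothetical in-memory stand-in for an API with separate create/start calls."""
    machines: dict = field(default_factory=dict)  # Machine name -> State
    calls: list = field(default_factory=list)     # log of (operation, name)

    def create(self, name: str, image: str) -> None:
        # Slow path: a real platform would pull the image and lay out the disk.
        self.calls.append(("create", name))
        self.machines[name] = State.STARTED

    def start(self, name: str) -> None:
        # Fast path: only valid for a Machine that exists and is stopped.
        if self.machines.get(name) is not State.STOPPED:
            raise RuntimeError(f"{name} is not a stopped Machine")
        self.calls.append(("start", name))
        self.machines[name] = State.STARTED

    def stop(self, name: str) -> None:
        if self.machines.get(name) is not State.STARTED:
            raise RuntimeError(f"{name} is not a running Machine")
        self.calls.append(("stop", name))
        self.machines[name] = State.STOPPED


def ensure_running(api: FakeComputeAPI, name: str, image: str) -> str:
    """Bring a Machine up by the cheapest path; return the path taken."""
    state = api.machines.get(name)
    if state is None:
        api.create(name, image)   # first use: pay the slow path once
        return "create"
    if state is State.STOPPED:
        api.start(name)           # every later wake-up: fast path
        return "start"
    return "noop"                 # already running
```

A client that always goes through `ensure_running` issues create only on the first call for a given Machine and start on every wake-up after a stop.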
Canonical wiki statement¶
Fly.io, 2025-04-08:
There are two ways to start a Fly Machine: by creating it with a Docker container, or by starting it after it's already been created, and later stopped. Start is lightning fast; substantially faster than booting up even a non-virtualized K8s Pod. This is too subtle a distinction for humans, who (reasonably!) just mash the create button to boot apps up in Fly Machines. But the robots are getting a lot of value out of it.
(Source: sources/2025-04-08-flyio-our-best-customers-are-now-robots)
Why expose two paths¶
The obvious simplification is one button — "boot this Machine" — that hides whether the Machine needs to be created or can be resumed. Fly.io deliberately exposes both. Why:
- Latency asymmetry. start is double-digit ms; create is seconds. Collapsing them forces every call to pay the worst-case latency. Robots running HTTP-shape wake-on-request workflows can't afford that.
- Cost asymmetry. stopped Machines aren't billed; created-and-never-used Machines are. Clients that choose the lifecycle control when they pay.
- State asymmetry. stop preserves the filesystem on the worker's NVMe; create starts from the base image. LLM clients doing stateful incremental build need the preservation; they can't redo the build every cycle.
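
The latency asymmetry can be made concrete with a little arithmetic. The sketch below assumes illustrative values (5 s for create, 50 ms for start) that merely fit the "seconds" and "double-digit ms" ranges quoted above; they are not measurements, and the collapsed API is modelled as always paying the slow path.

```python
CREATE_LATENCY_S = 5.0   # assumed value within "seconds"; illustrative only
START_LATENCY_S = 0.05   # assumed value within "double-digit ms"; illustrative only


def total_wake_latency(wakeups: int, two_path: bool) -> float:
    """Total seconds spent waiting for the Machine over `wakeups` wake-ups."""
    if wakeups <= 0:
        return 0.0
    if two_path:
        # The first wake-up pays create; every later one pays only start.
        return CREATE_LATENCY_S + (wakeups - 1) * START_LATENCY_S
    # Collapsed single-button API: every wake-up pays the worst case.
    return wakeups * CREATE_LATENCY_S
```

Under these assumed numbers, 100 wake-ups cost about 10 s of waiting with the two-path API and 500 s with the collapsed one.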
Consequences at the orchestrator¶
For the orchestrator (in Fly.io's case flyd) the two-path split means:
- create allocates and pins. Decides which worker the Machine goes on. Sets up the root filesystem.
- stop keeps the allocation. Machine stays on the same worker; filesystem stays on the worker's NVMe. Freed-up resources: CPU / RAM (not disk).
- start re-runs Firecracker against the prepared disk on the same worker. No scheduling decision. No network set-up.
This is the critical design move: start does not
re-schedule. It doesn't pick a worker. It doesn't allocate
networking. It doesn't pull an image. Fly's claim that
start is "substantially faster than booting up even a
non-virtualized K8s Pod" rests on this: a K8s Pod boot goes
through scheduling (admission, node selection) every time.
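
The pinning behaviour can be sketched as a toy orchestrator. This is not flyd's implementation; `Worker`, `ToyOrchestrator`, and the "most free CPUs" placement policy are hypothetical, chosen only to show that scheduling happens in create and nowhere else.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    free_cpus: int
    disks: set = field(default_factory=set)  # Machine ids with a rootfs laid out here


@dataclass
class ToyOrchestrator:
    workers: list
    placement: dict = field(default_factory=dict)  # Machine id -> Worker (the pin)
    running: set = field(default_factory=set)
    scheduling_decisions: int = 0

    def _schedule(self) -> Worker:
        # Placeholder policy (an assumption): pick the worker with most free CPUs.
        self.scheduling_decisions += 1
        return max(self.workers, key=lambda w: w.free_cpus)

    def create(self, machine_id: str) -> None:
        worker = self._schedule()            # the only scheduling decision
        worker.disks.add(machine_id)         # lay out the root filesystem
        worker.free_cpus -= 1
        self.placement[machine_id] = worker  # pin the Machine to this worker
        self.running.add(machine_id)

    def stop(self, machine_id: str) -> None:
        worker = self.placement[machine_id]
        self.running.remove(machine_id)
        worker.free_cpus += 1                # CPU/RAM freed; disk and pin are kept

    def start(self, machine_id: str) -> None:
        if machine_id in self.running:
            return
        worker = self.placement[machine_id]  # reuse the pin, no _schedule() call
        if machine_id not in worker.disks:
            raise RuntimeError("prepared disk is missing on the pinned worker")
        worker.free_cpus -= 1
        self.running.add(machine_id)
```

Even if another worker becomes a better candidate while the Machine is stopped, start in this sketch returns to the pinned worker and `scheduling_decisions` stays at 1.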
Lambda / EC2 positional framing¶
Fly.io positions this shape as a hybrid of Lambda and EC2:
Like a Lambda invocation, a Fly Machine can start like it's spring-loaded, in double-digit millis. But unlike Lambda, it can stick around as long as you want it to: you can run a server, or a 36-hour batch job, just as easily in a Fly Machine as in an EC2 VM.
(Source: sources/2025-04-08-flyio-our-best-customers-are-now-robots)
The shape borrows from both sides:
- Lambda side: start latency (shared Firecracker hypervisor), scale-to-zero while stopped, billed only while running.
- EC2 side: Machine persists across start/stop; filesystem survives; long-running workloads allowed.
Why this is an RX primitive¶
The two-path split is the compute-side half of the RX argument. Vibe-coding workloads (concepts/vibe-coding) are bursty-then-idle:
- Active minute → start → pay while running.
- Idle hours → stop → don't pay, don't lose state.
- Next active minute → start again, fast → pay while running.
No other major cloud-compute primitive exposes this exact cadence. Lambda is per-invocation; EC2 stop/start is measured in minutes; containers in K8s go through scheduling again on every boot. Fly's API contract is the shape this workload wants.
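
The billing effect of the bursty-then-idle cadence can be sketched as follows. The function is a hypothetical model that assumes billing is exactly proportional to running time and ignores any cost of the stopped disk.

```python
def billed_seconds(active_bursts, horizon_s, stop_when_idle):
    """Seconds billed over `horizon_s`, given (start_s, end_s) active bursts."""
    if not stop_when_idle:
        # Always-on Machine: billed for the whole horizon.
        return horizon_s
    # Stopped between bursts: billed only while running.
    return sum(end - start for start, end in active_bursts)


# Three one-minute bursts spread across an eight-hour session (made-up workload).
bursts = [(0, 60), (3600, 3660), (7200, 7260)]
with_stop = billed_seconds(bursts, 8 * 3600, stop_when_idle=True)
always_on = billed_seconds(bursts, 8 * 3600, stop_when_idle=False)
```

For this made-up workload the stop/start cadence bills 180 seconds against 28,800 for an always-on Machine.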
Implementation prerequisites¶
For the pattern to work on another platform:
- A Machine-level stop that preserves the filesystem on the worker. Not just a "container exit"; the disk has to stay.
- A Machine-level start that reuses the filesystem without re-scheduling. The orchestrator has to keep the Machine pinned to a worker across stop/start.
- Billing granularity at start/stop. Otherwise tenants pay for idle time and the pattern degenerates.
- Fast-enough boot to fit in an HTTP request. Firecracker or equivalent. Otherwise the start path doesn't wake on demand.
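
The last prerequisite, a start that fits inside an HTTP request, is what makes wake-on-request possible. A minimal sketch, assuming a hypothetical front-end (`WakeOnRequest`) that is handed platform-specific `start_fn` and `stop_fn` callables and is driven by an explicit clock so it can be reasoned about without real timers:

```python
class WakeOnRequest:
    """Hypothetical front-end: start on demand, stop after an idle timeout."""

    def __init__(self, start_fn, stop_fn, idle_timeout_s: float):
        self._start = start_fn
        self._stop = stop_fn
        self._idle_timeout_s = idle_timeout_s
        self._running = False
        self._last_request_at = None

    def handle(self, now: float, request: str) -> str:
        if not self._running:
            # This call sits on the request path, so start must be fast.
            self._start()
            self._running = True
        self._last_request_at = now
        return f"handled {request}"

    def tick(self, now: float) -> None:
        # Called periodically; stops the Machine once it has idled long enough.
        if self._running and now - self._last_request_at >= self._idle_timeout_s:
            self._stop()
            self._running = False
```

If `start_fn` took seconds rather than milliseconds, the first request after every idle period would absorb that delay, which is the degeneration the list above warns about.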
Open questions / limits¶
- Worker eviction. A stopped Machine pinned to one worker is a scheduling-stiffness cost the orchestrator pays. If the worker fails or is drained, the Machine has to be migrated (or re-created elsewhere, losing state). Fly.io's 2024 migration rebuild (patterns/async-block-clone-for-stateful-migration) addresses the migration case.
- Long-stopped resource cost. The filesystem keeps consuming NVMe even when not billed. Fly's billing model accounts for this implicitly; another platform would have to decide how to price long-stopped disks.
- Re-creation semantics. If the Machine's base image changes, does start pick up the new image? No: start is a bit-for-bit re-boot of the laid-out disk. The tenant has to create a new Machine to pick up an image change.
- Failure modes of robots starting many Machines. A compromised / runaway LLM client could start thousands of stopped Machines in a loop. Quotas and rate limits have to catch this at the API tier.
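
One way the API tier could bound a runaway client is a token bucket in front of start. The sketch below is a generic token bucket, not a description of Fly.io's actual limits; the class name and parameters are hypothetical, and a real deployment would presumably keep one bucket per tenant.

```python
class StartRateLimiter:
    """Token bucket: up to `burst` starts at once, refilled at `rate_per_s`."""

    def __init__(self, rate_per_s: float, burst: int):
        self.rate_per_s = rate_per_s
        self.burst = burst
        self.tokens = float(burst)   # bucket starts full
        self.updated_at = 0.0

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.rate_per_s)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0       # spend one token per permitted start
            return True
        return False                 # over quota: reject this start call
```

With `rate_per_s=1.0` and `burst=5`, a loop issuing a thousand start calls at the same instant gets five through; the rest are rejected until tokens refill.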
Known uses¶
- Fly.io (2025-04-08 and earlier): canonical wiki instance. create/start/stop primitives on the Fly Machines API; the subject of the 2025-04-08 "robots" post's compute-side claim.
Related¶
- systems/fly-machines: the system the pattern lives inside.
- systems/flyd: the orchestrator that owns the per-Machine pinning that makes start fast.
- systems/aws-lambda: the comparator on the fast-boot axis; doesn't expose the stop/start-with-state half.
- concepts/fly-machine-start-vs-create: the wiki concept this pattern instantiates.
- concepts/cold-start: start is the fast-path cold start; create is the slow-path one.
- concepts/scale-to-zero: stop is the scale-to-zero leg.
- concepts/fast-vm-boot-dx: the boot-latency property that makes the pattern viable.
- concepts/robot-experience-rx: the product-design axis this pattern is an RX data point on.
- patterns/disposable-vm-for-agentic-loop: adjacent pattern. Disposable-VM discards the VM on teardown; this pattern keeps it around. They compose: disposable-VM for untrusted one-shot loops; start-fast/create-slow for trusted long-lived vibe-coding sessions.
- companies/flyio: canonical wiki source.