
CONCEPT Cited by 1 source

Fly Machine start vs create

Definition

Fly.io's Machines API exposes two distinct lifecycle paths that both end with a Machine running user code:

  • create — instantiate a brand-new Machine from a Docker container image. Pulls the OCI image (or reuses a cached copy), sets up the root filesystem, boots Firecracker, starts init, and runs the entrypoint. Slower than start (though still fast by hypervisor standards), and billed.
  • start — restart a Machine that already exists but is in the stopped state. The root filesystem is already laid out and no image pull is needed; Firecracker boots against the prepared disk and init starts. Lightning fast, and the Machine is not billed for the time it spent stopped.

The two paths are documented together in Machine states.
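
As a rough illustration of the two paths at the HTTP level, the sketch below builds (but does not send) one request for each. The base URL and endpoint paths follow the public Machines REST API as best understood here and should be checked against Fly.io's API reference; the app name, Machine ID, image, and token are placeholders, and the create body is reduced to the image field only.

```python
import json
import urllib.request

# Assumption: public Machines API base URL; verify against Fly.io's docs.
API_BASE = "https://api.machines.dev/v1"


def create_request(app: str, image: str, token: str) -> urllib.request.Request:
    """Build the request for the create path: a new Machine from an image."""
    body = json.dumps({"config": {"image": image}}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/apps/{app}/machines",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def start_request(app: str, machine_id: str, token: str) -> urllib.request.Request:
    """Build the request for the start path: resume an existing stopped Machine."""
    return urllib.request.Request(
        f"{API_BASE}/apps/{app}/machines/{machine_id}/start",
        method="POST",
        headers={"Authorization": f"Bearer {token}"},
    )
```

Passing either object to urllib.request.urlopen would send it. The visible difference is that create carries an image in its body, while start only names a Machine that already exists.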

Canonical wiki statement

Fly.io, 2025-04-08:

There are two ways to start a Fly Machine: by creating it with a Docker container, or by starting it after it's already been created, and later stopped. Start is lightning fast; substantially faster than booting up even a non-virtualized K8s Pod. This is too subtle a distinction for humans, who (reasonably!) just mash the create button to boot apps up in Fly Machines. But the robots are getting a lot of value out of it.

(Source: sources/2025-04-08-flyio-our-best-customers-are-now-robots)

Why the split exists

A Firecracker micro-VM boot is ms-scale; the overhead "above" Firecracker is the cost that start elides. create has to:

  • Pull or validate the OCI image (network-bound if not cached).
  • Lay out the root filesystem (block-layer writes).
  • Set up networking (allocate addresses on the 6PN mesh, wire up Fly Proxy routes).
  • Register the Machine with flyd state.

Once that one-time setup is done, stop leaves the filesystem intact on the worker's NVMe. start reuses it: Firecracker boots again against the prepared disk and networking is reattached. No image work. No filesystem layout. No Machine registration, because flyd already knows about this Machine.
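
The difference between the two paths can be written down as two step lists. This is only a bookkeeping sketch: the step names paraphrase the prose above and are not flyd's internal names, and no timings are implied.

```python
# Step names paraphrase the text; they are illustrative, not flyd internals.
CREATE_STEPS = [
    "pull_or_validate_image",
    "lay_out_root_filesystem",
    "set_up_networking",
    "register_with_flyd",
    "boot_firecracker",
]

START_STEPS = [
    "reattach_networking",
    "boot_firecracker",
]


def skipped_by_start() -> list[str]:
    """Return the create-path steps that the start path does not repeat."""
    return [step for step in CREATE_STEPS if step not in START_STEPS]


print(skipped_by_start())
```

Everything except the Firecracker boot drops out of the start path, with networking reduced from a full setup to a reattach.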

Why this is an RX primitive

A vibe-coding session is bursty-then-idle. Fly's framing:

A vibe coding session generates code conversationally, which is to say that the robots stir up frenzy of activity for a minute or so, but then chill out for minutes, hours, or days. You can create a Fly Machine, do a bunch of stuff with it, and then stop it for 6 hours, during which time we're not billing you. Then, at whatever random time you decide, you can start it back up again, quickly enough that you can do it in response to an HTTP request.

(Source: sources/2025-04-08-flyio-our-best-customers-are-now-robots)

The load-bearing properties the start path gives:

  1. Cheap to keep: a Machine in the stopped state is not billed; it only consumes some NVMe on the worker for the prepared filesystem. Fly.io can afford to leave many stopped Machines around per tenant.
  2. Fast to resume: start completes quickly enough to fit inside an HTTP request-handling budget (sub-second), which is what a serverless cold start aspires to.
  3. Preserves per-Machine state: this is the distinction that makes the primitive useful for stateful LLM-iteration workflows (see concepts/stateful-incremental-vm-build). The filesystem the LLM built up across package installs, source edits, and systemd units is still there when the Machine starts again. That is not true of create-from-image, which starts from the immutable base.
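
The three properties can be seen together in a toy in-memory model. ToyMachine is a hypothetical stand-in written for this page, not the Fly.io API: it only tracks a state string and a set of file paths, and its billed property encodes the text's claim that a stopped Machine is not billed.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMachine:
    """In-memory stand-in for a Machine; not the Fly.io API."""

    image_files: frozenset
    state: str = "stopped"
    disk: set = field(default_factory=set)

    @classmethod
    def create(cls, image_files):
        # create path: the disk starts as a copy of the immutable image contents.
        machine = cls(image_files=frozenset(image_files))
        machine.disk = set(machine.image_files)
        machine.state = "started"
        return machine

    def write(self, path):
        if self.state != "started":
            raise RuntimeError("Machine is not running")
        self.disk.add(path)

    def stop(self):
        # stop keeps the disk; only the running VM goes away.
        self.state = "stopped"

    def start(self):
        # start path: boot against the disk exactly as it was left.
        if self.state != "stopped":
            raise RuntimeError("only a stopped Machine can be started")
        self.state = "started"

    @property
    def billed(self):
        # Per the text, a stopped Machine is not billed.
        return self.state == "started"


m = ToyMachine.create({"/bin/sh"})
m.write("/app/main.py")
m.stop()   # idle window: m.billed is False, disk untouched
m.start()  # the file written before the stop is still on disk
```

A second ToyMachine.create call with the same image would begin without /app/main.py, which is the contrast the third property draws.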

Lambda / Pod comparison

Fly explicitly positions the start path relative to AWS Lambda and Kubernetes Pods:

  • Lambda invocation start — similar latency envelope, because both run on the Firecracker hypervisor. But Lambda's "start" means "invoke this function for up to 15 minutes, then go away"; no disk state persists between invocations. Fly's start means "resume the Machine I stopped, with its disk, and run it as long as I want."
  • K8s Pod boot — Fly's claim is that start is "substantially faster than booting up even a non-virtualized K8s Pod." A Pod boot involves container scheduling, image pull (or cache hit), container-runtime start, and app initialisation. A Fly Machine start skips scheduling (the Machine is already on a specific worker) and skips image work (the filesystem is already laid out), leaving Firecracker boot as the main remaining latency.

Human-vs-robot confusion

The post's framing is that humans don't grok the distinction — they "reasonably" mash create — but robots use the API programmatically and get the lifecycle right:

This is too subtle a distinction for humans, who (reasonably!) just mash the create button to boot apps up in Fly Machines. But the robots are getting a lot of value out of it.

An RX-shaped API surface has to keep both paths first-class instead of collapsing them behind a simpler primitive (like a single "run this Machine" button). The 2025-04-08 post is the wiki's first instance of a platform calling out that lifecycle-shape differentiation is an RX property — humans want one button, robots want the two-path split.
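
A programmatic client can make the choice between the two paths explicit rather than always creating. The helper below is a minimal sketch of that decision; choose_path is a hypothetical name, and the state strings are a simplified subset chosen for illustration rather than the full list of Machine states.

```python
def choose_path(state):
    """Pick the lifecycle call for a Machine in the given state.

    `state` is None when no Machine exists yet. The state names are a
    simplified subset chosen for illustration.
    """
    if state is None or state == "destroyed":
        return "create"
    if state == "stopped":
        return "start"
    if state == "started":
        return "noop"
    raise ValueError(f"unhandled state: {state!r}")
```

A single "run" button amounts to always taking the first branch; keeping both paths in the API is what lets a client take the second.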

Operational numbers

  • start latency: "double-digit millis" / "lightning fast"; comparable to Lambda invocation start, since both platforms run on Firecracker.
  • Stopped-idle window: "stop it for 6 hours, during which time we're not billing you"; offered as an example of how long a Machine might sit stopped between vibe-coding bursts.
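
To put the idle window in proportion, the arithmetic below computes what fraction of wall-clock time is billed for one burst followed by one stopped period. It assumes only running time is billed and ignores any storage or other charges; the one-minute burst is taken from the post's "a minute or so", and billed_fraction is a hypothetical helper.

```python
def billed_fraction(active_seconds: float, stopped_seconds: float) -> float:
    """Fraction of wall-clock time billed, assuming only running time is billed."""
    total = active_seconds + stopped_seconds
    if total <= 0:
        raise ValueError("total duration must be positive")
    return active_seconds / total


# One minute of activity followed by the post's six-hour stopped window.
fraction = billed_fraction(60, 6 * 60 * 60)
print(f"{fraction:.2%}")  # prints 0.28%
```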

Seen in
