
PATTERN

Start-fast / create-slow Machine lifecycle

Shape

Expose two distinct Machine-lifecycle primitives through the compute API:

  • create — instantiate a new Machine from an image. Slow (image pull, filesystem layout, orchestrator registration). Billed.
  • start — restart a Machine that already exists but has been stopped. Fast (image already laid out, Machine already registered). Billed while running; not billed while stopped.

The API-level distinction is preserved — there is not a single "run my Machine" button. Clients are expected to create once, start/stop many times. The pattern trades a bit of API surface complexity for an order-of-magnitude faster resume-from-idle path.
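The create-once / start-stop-many contract can be sketched as a thin client. The endpoint paths below follow the published shape of the Fly Machines API, but the helper names and the Machine ID are illustrative, not Fly's SDK:

```python
# Sketch of the create-once / start-stop-many contract. Helper names and the
# Machine ID are hypothetical; endpoint paths follow the Fly Machines API shape.

def create_machine(app: str, image: str) -> tuple[str, str, dict]:
    """Slow path: image pull, filesystem layout, orchestrator registration.
    Called once per Machine. Billed."""
    return ("POST", f"/v1/apps/{app}/machines", {"config": {"image": image}})

def start_machine(app: str, machine_id: str) -> tuple[str, str, dict]:
    """Fast path: re-boot an existing, stopped Machine. No image pull."""
    return ("POST", f"/v1/apps/{app}/machines/{machine_id}/start", {})

def stop_machine(app: str, machine_id: str) -> tuple[str, str, dict]:
    """Stop the Machine; filesystem stays on the worker, billing pauses."""
    return ("POST", f"/v1/apps/{app}/machines/{machine_id}/stop", {})

# One create, then many start/stop cycles.
calls = [create_machine("my-app", "registry.fly.io/my-app:v1")]
for _ in range(3):
    calls.append(start_machine("my-app", "3d8d9z01"))
    calls.append(stop_machine("my-app", "3d8d9z01"))
```

The expensive call appears exactly once; every later wake-up goes through the cheap path.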

Canonical wiki statement

Fly.io, 2025-04-08:

There are two ways to start a Fly Machine: by creating it with a Docker container, or by starting it after it's already been created, and later stopped. Start is lightning fast; substantially faster than booting up even a non-virtualized K8s Pod. This is too subtle a distinction for humans, who (reasonably!) just mash the create button to boot apps up in Fly Machines. But the robots are getting a lot of value out of it.

(Source: sources/2025-04-08-flyio-our-best-customers-are-now-robots)

Why expose two paths

The obvious simplification is one button — "boot this Machine" — that hides whether the Machine needs to be created or can be resumed. Fly.io deliberately exposes both. Why:

  1. Latency asymmetry. start is double-digit ms; create is seconds. Collapsing them forces every call to pay the worst-case latency. Robots running HTTP-shape wake-on-request workflows can't afford that.
  2. Cost asymmetry. stopped Machines aren't billed; created-and-never-used Machines are. Clients that choose the lifecycle control when they pay.
  3. State asymmetry. stop preserves the filesystem on the worker's NVMe; create starts from the base image. LLM clients doing stateful, incremental builds need that preservation; they can't afford to redo the build every cycle.

Consequences at the orchestrator

For the orchestrator (in Fly.io's case flyd) the two-path split means:

  • create allocates and pins. Decides which worker the Machine goes on. Sets up the root filesystem.
  • stop keeps the allocation. Machine stays on the same worker; filesystem stays on the worker's NVMe. Freed-up resources: CPU / RAM (not disk).
  • start re-runs Firecracker against the prepared disk on the same worker. No scheduling decision. No network set-up.

This is the critical design move: start does not re-schedule. It doesn't pick a worker. It doesn't allocate networking. It doesn't pull an image. Fly's claim that start is "substantially faster than booting up even a non-virtualized K8s Pod" holds because a K8s Pod boot re-runs scheduling (admission, node selection) every time.
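The split can be sketched as a transition table (step names are hypothetical, not flyd internals): only create runs scheduling, filesystem layout, and network setup, while start from stopped re-runs nothing but the VM boot.

```python
# Hypothetical transition table: which orchestrator work each operation runs.
TRANSITIONS = {
    ("absent",  "create"): ("created", ["pick_worker", "pull_image",
                                        "layout_rootfs", "setup_network"]),
    ("created", "start"):  ("running", ["boot_vm"]),
    ("running", "stop"):   ("stopped", ["halt_vm"]),  # keeps worker pin + rootfs
    ("stopped", "start"):  ("running", ["boot_vm"]),  # no scheduling, no net setup
}

def apply(state: str, op: str) -> tuple[str, list[str]]:
    """Return the new Machine state and the work the orchestrator must do."""
    if (state, op) not in TRANSITIONS:
        raise ValueError(f"illegal {op} from state {state}")
    return TRANSITIONS[(state, op)]
```

The table makes the asymmetry visible: every row for start carries only `boot_vm`, never a scheduling or network step.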

Lambda / EC2 positional framing

Fly.io positions this shape as a hybrid of Lambda and EC2:

Like a Lambda invocation, a Fly Machine can start like it's spring-loaded, in double-digit millis. But unlike Lambda, it can stick around as long as you want it to: you can run a server, or a 36-hour batch job, just as easily in a Fly Machine as in an EC2 VM.

(Source: sources/2025-04-08-flyio-our-best-customers-are-now-robots)

The shape borrows from both sides:

  • Lambda side: start latency (shared Firecracker hypervisor), scale-to-zero while stopped, billed only while running.
  • EC2 side: Machine persists across start/stop; filesystem survives; long-running workloads allowed.

Why this is an RX primitive

The two-path split is the compute-side half of the RX argument. Vibe-coding workloads (concepts/vibe-coding) are bursty-then-idle:

  • Active minute → start → pay while running.
  • Idle hours → stop → don't pay, don't lose state.
  • Next active minute → start again, fast → pay while running.

No other major cloud-compute primitive exposes this exact cadence. Lambda is per-invocation; EC2 stop/start takes measurable minutes; containers in K8s go through scheduling on every boot. Fly's API contract is the shape this workload wants.
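A toy billing model (all numbers illustrative, not Fly's pricing) makes the cadence concrete: with start/stop, one active minute per hour bills roughly 24 minutes a day, while an always-on Machine bills the full 24 hours.

```python
# Toy billing model; numbers are illustrative, not Fly's pricing.
def billed_s_start_stop(cycles: int, active_s: float,
                        start_latency_s: float = 0.05) -> float:
    """Pay only while running: each cycle bills active time plus the
    (double-digit-ms) start latency. Idle time costs nothing."""
    return cycles * (active_s + start_latency_s)

def billed_s_always_on(cycles: int, active_s: float, idle_s: float) -> float:
    """Leave the Machine running across idle gaps: bill everything."""
    return cycles * (active_s + idle_s)

# One active minute per hour, for 24 hours:
burst  = billed_s_start_stop(24, 60)        # ~24 billed minutes
always = billed_s_always_on(24, 60, 3540)   # full 24 billed hours
```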

Implementation prerequisites

For the pattern to work on another platform:

  1. A Machine-level stop that preserves the filesystem on the worker. Not just a "container exit" — the disk has to stay.
  2. A Machine-level start that reuses the filesystem without re-scheduling. The orchestrator has to keep the Machine pinned to a worker across stop/start.
  3. Billing granularity at start/stop. Otherwise tenants pay for idle time and the pattern degenerates.
  4. Fast-enough boot to fit in an HTTP request. Firecracker or equivalent. Otherwise the start path doesn't wake on demand.

Open questions / limits

  • Worker eviction. A stopped Machine pinned to one worker is a scheduling-stiffness cost the orchestrator pays. If the worker fails or is drained, the Machine has to be migrated (or re-created elsewhere, losing state). Fly.io's 2024 migration rebuild (patterns/async-block-clone-for-stateful-migration) addresses the migration case.
  • Long-stopped resource cost. The filesystem keeps consuming NVMe even when not billed. Fly's billing model accounts for this implicitly; another platform would have to decide how to price long-stopped disks.
  • Re-creation semantics. If the Machine's base image changes, does start pick up the new image? No — start is a bit-for-bit re-boot of the laid-out disk. The tenant has to create a new Machine to pick up an image change.
  • Failure modes of robots starting many Machines. A compromised / runaway LLM client could start thousands of stopped Machines in a loop. Quotas and rate limits have to catch this at the API tier.
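For the last point, a plain token bucket at the API tier is one possible guard; the class and its parameters are illustrative, not Fly's actual quota system.

```python
# Illustrative per-tenant rate limit on `start` calls at the API tier.
class TokenBucket:
    def __init__(self, rate_per_s: float, burst: int, now: float = 0.0):
        self.rate = rate_per_s      # sustained starts allowed per second
        self.capacity = burst       # short-burst allowance
        self.tokens = float(burst)
        self.last = now

    def allow(self, now: float) -> bool:
        """Refill by elapsed time, then spend one token if available."""
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# A tenant allowed 1 start/s with a burst of 2 saturates quickly in a tight loop:
bucket = TokenBucket(rate_per_s=1.0, burst=2)
```

A runaway loop exhausts the burst immediately and is then throttled to the sustained rate, regardless of how many stopped Machines the tenant owns.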

Known uses

  • Fly.io (2025-04-08 and earlier) — canonical wiki instance. create/start/stop primitives on the Fly Machines API; the subject of the 2025-04-08 "robots" post's compute-side claim.