PATTERN Cited by 1 source

Inside-out VM orchestration¶

Problem¶

Large VM platforms accrete host-side orchestration complexity: a per-host scheduler, a cluster-wide database, a network controller, a storage controller, a log collector, a metrics pipeline, an API gateway — all running outside the user's VM. Changes to any host-side component touch every VM on every host simultaneously. Platform-team velocity drops, not because the code is hard, but because any benign change has fleet-wide blast radius.

Pattern¶

Move the majority of orchestration code from the host to the VM's root namespace. Slide an inner container between the user and the kernel so that platform-owned services (storage stack, service manager, logs, ingress proxy, platform-API handler) have a stable operating surface inside each VM. The host becomes a minimal VM launcher; the VM becomes a self-contained orchestration unit.

Consequences:

Platform changes ship as new VMs picking up the new code, not as live host-daemon restarts. See patterns/blast-radius-in-vm-not-host.
VMs can bounce user code without rebooting the kernel: the inner container restarts while the root namespace stays up.
The API-to-VM path is short: the platform API for a given VM can be served by that VM's own root namespace.
Fewer metastable-failure modes — fleet-wide invariants are harder to violate because most code doesn't touch the fleet.
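
A toy model may help make the first two consequences concrete. The sketch below is an illustration under stated assumptions, not Fly.io's implementation: the `VM` and `Fleet` classes, the version strings, and the counters are hypothetical names. It assumes platform code is fixed when a VM is created, so shipping a new version touches only VMs launched afterwards, and that a bounce restarts only the inner container.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VM:
    vm_id: int
    platform_version: str      # assumption: fixed at VM creation time
    kernel_boots: int = 1      # kernel + root namespace lifetime counter
    container_starts: int = 1  # inner container (user code) lifetime counter

    def bounce(self) -> None:
        """Restart only the inner container; the root namespace stays up."""
        self.container_starts += 1


@dataclass
class Fleet:
    current_version: str
    vms: List[VM] = field(default_factory=list)

    def launch(self) -> VM:
        """Create a VM that picks up whatever platform code is current."""
        vm = VM(vm_id=len(self.vms), platform_version=self.current_version)
        self.vms.append(vm)
        return vm

    def ship(self, version: str) -> None:
        """Publish new platform code without touching running VMs."""
        self.current_version = version

    def running(self, version: str) -> List[VM]:
        return [vm for vm in self.vms if vm.platform_version == version]


fleet = Fleet(current_version="v1")
old_vms = [fleet.launch() for _ in range(3)]

fleet.ship("v2")           # no existing VM is restarted or modified
new_vm = fleet.launch()    # only this VM runs the new code

old_vms[0].bounce()        # user code restarts, kernel does not reboot
```

In this model the blast radius of `ship("v2")` is exactly the set of VMs launched after it, and `bounce()` leaves `kernel_boots` unchanged.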

Canonical wiki instance — Fly.io Sprites¶

"In the cloud hosting industry, user applications are managed by two separate, yet equally important components: the host, which orchestrates workloads, and the guest, which runs them. Sprites flip that on its head: the most important orchestration and management work happens inside the VM."

"With Sprites, we're pushing this idea as far as we can. The root environment hosts the majority of our orchestration code. […] Our storage stack, which handles checkpoint/restore and persistence to object storage, lives there; so does the service manager we expose to Sprites, which registers user code that needs to restart when a Sprite bounces; same with logs; if you bind a socket to *:8080, we'll make it available outside the Sprite — yep, that's in the root namespace too. […] When you talk to the global API, chances are you're talking directly to your own VM."

"Platform developers at Fly.io know how much easier it can be to hack on init (inside the container) than things like flyd, the Fly Machines orchestrator that runs on the host. Changes to Sprites don't restart host components or muck with global state. The blast radius is just new VMs that pick up the change. We sleep on how much platform work doesn't get done not because the code is hard to write, but because it's so time-consuming to ensure benign-looking changes don't throw the whole fleet into metastable failure. We had that in mind when we did Sprites."

(Source: [[sources/2026-01-14-flyio-the-design-implementation-of-sprites]])

Services Ptacek names as "inside"¶

Storage stack (JuiceFS-derived; checkpoint/restore; object-store persistence).
Service manager (user-registered services that restart with the Sprite).
Log collection / shipping.
Port-forwarding / ingress proxy (bound sockets → external HTTPS URLs).
Platform-API handler (the global API endpoint for that Sprite).
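
The service-manager entry in the list above can be sketched as a small registry that lives in the root namespace and relaunches registered user services after a bounce. This is a minimal model, assuming a simple name-to-command registry. `RootNamespace`, `register`, `start_all`, and `bounce` are hypothetical names, not the Sprites API, and no real processes are started.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RootNamespace:
    # Hypothetical registry: service name -> command line.
    registered: Dict[str, List[str]] = field(default_factory=dict)
    # User services currently "running" in the inner container.
    running: List[str] = field(default_factory=list)
    # Event log kept by the root namespace; it survives bounces.
    events: List[str] = field(default_factory=list)

    def register(self, name: str, argv: List[str]) -> None:
        """Record a user service that must come back after a bounce."""
        self.registered[name] = list(argv)

    def start_all(self) -> None:
        """Start every registered service, in registration order."""
        for name in self.registered:
            self.running.append(name)
            self.events.append(f"start {name}")

    def bounce(self) -> None:
        """Tear down the inner container's services, then relaunch them."""
        self.events.append("bounce")
        self.running.clear()
        self.start_all()


root = RootNamespace()
root.register("web", ["python", "-m", "http.server", "8080"])
root.register("worker", ["python", "worker.py"])
root.start_all()
root.bounce()
```

The registry and the event log belong to the root namespace, so they persist across the bounce while the list of running services is rebuilt from scratch.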

Relationship to blast-radius-containment patterns¶

Inside-out orchestration is a per-tenant blast-radius strategy. It composes with:

concepts/regionalization-blast-radius-reduction: regionalising services keeps a bad change in one region from contaminating other regions. Sprites' inside-out story reduces blast radius below the region boundary, down to individual VMs.
patterns/checkpoint-backup-to-object-storage: DR-level rebuild-from-backup. Inside-out orchestration doesn't eliminate this need; it reduces how often the rebuild has to fire.

Why it's not universal¶

Per-VM overhead. Every VM now runs a full platform-services set in its root namespace. Dense workers with many small idle VMs pay this per-VM tax.
Version skew across VMs. VMs running platform code versions N and N-1 coexist during rollouts. Every cross-VM or VM-to-global-API interaction must be version-compatible.
Security-boundary shift. The tenant-isolation boundary stays at the VM (KVM), but the platform/user boundary moves from host-vs-guest to outer-namespace-vs-inner-container. This trusts container isolation for the platform/user boundary where previously VM isolation did the job.
Heterogeneous workloads don't benefit equally. The argument is strongest for sandbox / per-user workloads where one VM = one tenant. For multi-tenant workloads per VM, inside-out buys less.
Not obvious on Day 1. Fly.io shipped Fly Machines without inside-out; Ptacek: "I wish we'd done Fly Machines this way to begin with. I'm not sure there's a downside." The retrospective clarity isn't the same as forward obviousness.
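
The version-skew constraint in the list above can be illustrated with a small compatibility check. This is a sketch under assumptions: platform versions are modeled as integers, peers are taken to interoperate only when their versions differ by at most one (the N and N-1 case), and `compatible`, `incompatible_pairs`, and the VM names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple

MAX_SKEW = 1  # assumption: only versions N and N-1 must interoperate


def compatible(a: int, b: int, max_skew: int = MAX_SKEW) -> bool:
    """Return True when two platform versions are within the allowed skew."""
    return abs(a - b) <= max_skew


def incompatible_pairs(versions: Dict[str, int]) -> List[Tuple[str, str]]:
    """List every pair of VMs whose platform versions are too far apart."""
    return [
        (x, y)
        for x, y in combinations(sorted(versions), 2)
        if not compatible(versions[x], versions[y])
    ]


# Hypothetical fleet snapshot mid-rollout: name -> platform code version.
snapshot = {"vm-a": 7, "vm-b": 6, "vm-c": 5}
stragglers = incompatible_pairs(snapshot)
```

Here `vm-a` and `vm-c` are two versions apart, so that pair is the only one flagged. A check of this kind could gate a rollout until VMs on the oldest version have been replaced.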

Adjacent patterns¶

Kata Containers: one-VM-per-container is a blast-radius-at-container-granularity story, not an inside-out story. Kata's agent lives in the guest but is narrow-RPC, not orchestration.
Firecracker + AWS Lambda: guest agent exists, but orchestration lives in the host (Firecracker-manager + Lambda-control-plane). The opposite of inside-out.
Kubernetes kubelet-in-pod experiments: same theme at a different granularity.

Seen in¶

[[sources/2026-01-14-flyio-the-design-implementation-of-sprites]]: canonical wiki instance.

concepts/inside-out-orchestration
concepts/inner-container-vm
concepts/regionalization-blast-radius-reduction
patterns/blast-radius-in-vm-not-host
systems/fly-sprites
systems/fly-machines — contrast case.
systems/flyd — the Fly-Machine-side host orchestrator this pattern argues against.
companies/flyio