PATTERN
Blast radius in VM, not host¶
Problem¶
On a conventional VM platform, orchestration code lives on the host. Shipping a change means restarting host daemons or, in the worst case, updating state shared by every VM on a host (or every host on the fleet). "Benign" changes become high-risk because their blast radius is fleet-wide. Platform teams slow down — not because coding is hard, but because shipping safely is expensive.
Pattern¶
Make platform changes ship via new VMs. Move orchestration services into the VM's root namespace (see patterns/inside-out-vm-orchestration). A platform rollout then becomes:
- Ship the new platform code baked into the standard container (uniform across VMs, since concepts/no-container-image-sprite).
- Drain warm-pool VMs that run the old code; bring up warm-pool VMs that run the new code (see patterns/warm-pool-zero-create-path).
- New user `create` operations land on the new code.
- Existing VMs keep running the old code until they bounce, migrate, checkpoint-restore, or get drained.
Changes now have blast radius = new VMs only. Host daemons aren't touched; global state isn't touched; existing VMs aren't interrupted.
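The rollout above can be sketched as a warm-pool swap. This is a minimal illustration, not Fly.io's actual interface; all names (`WarmPool`, `roll_out`, etc.) are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class VM:
    vm_id: str
    code_version: str

@dataclass
class WarmPool:
    """Pool of pre-created idle VMs that serve new create operations."""
    vms: list = field(default_factory=list)

    def fill(self, version: str, count: int) -> None:
        self.vms += [VM(f"{version}-{i}", version) for i in range(count)]

    def drain(self, version: str) -> None:
        # Discard only idle warm-pool VMs on the old version.
        # Running user VMs are elsewhere and untouched.
        self.vms = [vm for vm in self.vms if vm.code_version != version]

def roll_out(pool: WarmPool, old: str, new: str, size: int) -> None:
    """Blast radius = new VMs only: no host daemon restart, no global state."""
    pool.fill(new, size)   # bring up warm-pool VMs baked with the new code
    pool.drain(old)        # drop idle VMs still on the old code

pool = WarmPool()
pool.fill("v1", 3)
roll_out(pool, old="v1", new="v2", size=3)
assert all(vm.code_version == "v2" for vm in pool.vms)
```

Rollback in this sketch is the same operation with `old` and `new` swapped, which matches the pattern's claim that existing VMs are never touched either way.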
Canonical wiki statement¶
Fly.io Sprites, 2026-01-14:
"Platform developers at Fly.io know how much easier it can be to hack on `init` (inside the container) than things like `flyd`, the Fly Machines orchestrator that runs on the host. Changes to Sprites don't restart host components or muck with global state. The blast radius is just new VMs that pick up the change. We sleep on how much platform work doesn't get done not because the code is hard to write, but because it's so time-consuming to ensure benign-looking changes don't throw the whole fleet into metastable failure. We had that in mind when we did Sprites." (Source: [[sources/2026-01-14-flyio-the-design-implementation-of-sprites]])
The meta-argument is worth isolating: the dominant cost of platform engineering is often the care required to deploy benign changes safely, not the code itself. A pattern that shrinks the deploy-care cost shrinks the dominant term in platform velocity.
Composable with other blast-radius patterns¶
- concepts/regionalization-blast-radius-reduction — regionalising services keeps a bad change from crossing regions. Blast-radius-in-VM lives inside the region: bad change → some VMs in a region, not all.
- Canary / staged rollouts at the VM level — because new VMs pick up the change, a gradual rollout becomes "ramp the share of new VMs" rather than "ramp the share of restarted daemons on hosts".
- Feature flags inside the VM — new VMs can boot into an off-by-default posture, toggled per-tenant. Compatible with inside-out orchestration.
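"Ramp the share of new VMs" can be sketched as a weighted choice between two warm pools at create time. A hypothetical sketch, not a published Fly.io mechanism:

```python
import random

def pick_pool(new_share: float, rng: random.Random) -> str:
    """Route a create to the new-version warm pool with probability new_share."""
    return "new" if rng.random() < new_share else "old"

# Staged rollout: ramp new_share e.g. 0.05 -> 0.25 -> 1.0 as health signals pass.
rng = random.Random(0)  # seeded for reproducibility of the sketch
counts = {"old": 0, "new": 0}
for _ in range(1000):
    counts[pick_pool(0.25, rng)] += 1
# Roughly a quarter of creates land on the new code; the rest stay on old.
```

The point of the sketch: the ramp knob lives in create-time routing, not in restarting daemons on hosts, so a bad ramp step only ever affects the VMs created during it.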
Operational consequences¶
New-VMs-per-change flow¶
Fresh VMs carry the fresh code. Existing VMs carry whatever code they booted with. Side-effects:
- Stable VMs run stale code. A VM that doesn't bounce for weeks runs weeks-old orchestration code. The global API + cross-VM services must be version-compatible.
- Force-drain is how hotfixes propagate. For a critical orchestration fix, the platform team drains the affected VMs onto the new code — same primitive as fleet drain.
- Rollback is easy: stop rolling the new warm pool; the warm pool reverts; existing VMs weren't touched by the rollout anyway.
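The hotfix path above (force-drain as the propagation primitive) can be sketched as follows; `force_drain` and the dict shape are illustrative assumptions:

```python
def force_drain(vms: list[dict], fixed_version: str, is_affected) -> list[dict]:
    """Propagate a critical fix: drain affected VMs so their replacements
    boot the fixed code. Unaffected VMs keep whatever they booted with."""
    out = []
    for vm in vms:
        if is_affected(vm):
            # Same primitive as a fleet drain: bounce onto the new code.
            out.append({**vm, "version": fixed_version})
        else:
            out.append(vm)  # untouched; will pick up the fix whenever it bounces
    return out

fleet = [{"id": "a", "version": "v1"}, {"id": "b", "version": "v2"}]
patched = force_drain(fleet, "v3", lambda vm: vm["version"] == "v1")
assert [vm["version"] for vm in patched] == ["v3", "v2"]
```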
Version-skew is the main cost¶
The biggest engineering cost of the pattern: cross-version compatibility becomes the default operating condition. Any API the VM root-namespace services expose (to the global API, to other VMs, to users) must accept requests from N-1 and N-2 versions for however long old VMs live.
Fly.io hasn't disclosed how far back they support, but the wiki record implies long-lived VMs: Ptacek's own kid-MDM Sprite had been up for a month as of the 2026-01-09 post; many Sprites in the Fly.io team's own fleet "stay up for months".
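The N-1/N-2 compatibility requirement can be stated as a small predicate. The skew window and version scheme here are assumptions for illustration; the source doesn't disclose Fly.io's actual policy:

```python
SUPPORTED_SKEW = 2  # assumed policy: accept requests from N-1 and N-2 clients

def accepts(server_version: int, client_version: int) -> bool:
    """Cross-version compatibility as the default operating condition:
    the global API must serve VMs that booted up to SUPPORTED_SKEW
    releases ago and haven't bounced since."""
    return 0 <= server_version - client_version <= SUPPORTED_SKEW

assert accepts(10, 10)       # current VMs
assert accepts(10, 8)        # month-old VM still on N-2
assert not accepts(10, 7)    # too stale: must be drained before it can talk
assert not accepts(10, 11)   # client ahead of server rejected in this sketch
```

In practice long-lived VMs ("stay up for months") push the real window wider than two releases, which is exactly why this cost dominates.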
Platform changes are observable after VMs bounce¶
A metric change or bug fix that ships today isn't reflected on existing VMs until they bounce. The platform team's blast radius shrinks, but so does its ability to push forward changes uniformly. Tooling for "which VMs are on which code version" becomes load-bearing.
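The load-bearing "which VMs are on which code version" tooling reduces to a rollup over the fleet. A minimal sketch with an assumed per-VM record shape:

```python
from collections import Counter

def version_inventory(vms: list[dict]) -> Counter:
    """Rollup used to answer: has yesterday's change actually reached the
    fleet, or is it still parked on VMs that haven't bounced?"""
    return Counter(vm["version"] for vm in vms)

fleet = [
    {"id": "a", "version": "v2"},
    {"id": "b", "version": "v2"},
    {"id": "c", "version": "v1"},  # stable VM still on stale code
]
inv = version_inventory(fleet)
assert inv == Counter({"v2": 2, "v1": 1})
```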
Trade-offs¶
- Old-code long tail. VMs that never bounce never pick up new code. Some changes (security fixes) must ship even to stale VMs — requires an external push mechanism.
- Per-VM overhead of platform services — the cost Sprites pays to get this pattern. Every VM runs the full platform surface.
- Debugging a change that failed in production means reasoning about version skew — "this only breaks on VMs running `init` vN where the global API is at vM".
- Slower uniform behaviour changes across the fleet. Any fleet-wide invariant that depends on all VMs behaving the same is harder to adjust.
- Not applicable to host-only concerns — kernel security patches still require host reboots; network-fabric changes still need cluster-level coordination. Inside-out doesn't eliminate host ops; it reduces the orchestration surface area on hosts.
Other instances on the wiki¶
- Service meshes like Envoy: sidecar-per-pod updates the sidecar on pod restart. Same "blast radius = new pods" shape at container granularity.
- Pipelined Lambda runtime rollouts: new Lambda execution environments pick up new runtime versions; existing warm envs stay on old. Same idea at Lambda scale.
- Kubernetes node-pool upgrades: rolling-replace nodes with new image; blast radius = new-pod-comes-up failures, not whole-fleet.
Seen in¶
- [[sources/2026-01-14-flyio-the-design-implementation-of-sprites]] — canonical wiki statement.
Related¶
- concepts/inside-out-orchestration
- concepts/inner-container-vm
- concepts/regionalization-blast-radius-reduction
- patterns/inside-out-vm-orchestration
- systems/fly-sprites
- systems/fly-machines — contrast case (host-side orchestration).
- systems/flyd — the host orchestrator this pattern argues against.
- companies/flyio