Fly Volumes¶
Fly Volumes are Fly.io's local-NVMe block storage primitive for stateful Fly Machines, added in 2021 to bring stateful apps to a platform that had previously been stateless-compute-only. Each Volume is attached to a specific worker physical — one bus hop from the Machine that mounts it — and encrypted with a per-volume key via dm-crypt/LUKS2.
Design bargain¶
Fly chose locally-attached NVMe over an EBS-style SAN fabric for two reasons (both called out explicitly in the 2024-07-30 Making Machines Move post):
- Bus-hop read latency. "A Fly App accessing a file on a Fly Volume is never more than a bus hop away from the data." Canonical concepts/bus-hop-storage-tradeoff instance.
- Startup-era affordability. "We're a startup. Building SAN infrastructure in every region we operate in would be tremendously expensive. Look at any feature in AWS that normal people know the name of, like EC2, EBS, RDS, or S3 — there's a whole company in there. We launched storage when we were just 10 people."
The cost: "A Fly App with an attached Volume is anchored to a particular worker physical." A Machine with a Volume cannot be trivially relocated, which broke Fly's drain playbook for three years until the 2024 clone-based migration shipped.
Encryption¶
- Per-volume encryption keys, provisioned alongside the Volume itself. "No one worker has a volume skeleton key."
- Implemented over dm-crypt + LUKS2.
- cryptsetup, the userland tool that manages dm-crypt/LUKS2 devices, ships different LUKS2 header-size defaults across versions (4 MiB vs. 16 MiB). Different versions run across Fly's fleet, which makes encrypted Volumes non-uniform between workers. The 2024 migration work added an RPC to the migration FSM to carry LUKS2 header metadata between source and target workers.
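The header-size skew is visible in the LUKS2 on-disk format itself: a 4096-byte big-endian binary header is followed by a JSON metadata area, and the data segment's offset (i.e. the total header size in front of the payload — where the 4 MiB vs. 16 MiB difference shows up) lives in that JSON. A minimal Python sketch of reading it, ignoring checksums and the secondary header copy, with field offsets per the LUKS2 on-disk format spec:

```python
import json
import struct

LUKS2_MAGIC = b"LUKS\xba\xbe"
BIN_HDR_LEN = 4096  # LUKS2 binary header is 4096 bytes; the JSON area follows

def luks2_payload_offset(dev_bytes: bytes) -> int:
    """Return the data-segment offset of a LUKS2 device (sketch only).

    Reads the primary binary header, then the JSON area, and pulls
    segment 0's offset. Real headers also carry checksums and a
    secondary header copy, which this deliberately ignores.
    """
    if dev_bytes[0:6] != LUKS2_MAGIC:
        raise ValueError("not a LUKS2 primary header")
    (version,) = struct.unpack_from(">H", dev_bytes, 6)   # uint16 BE at offset 6
    if version != 2:
        raise ValueError(f"unexpected LUKS version {version}")
    (hdr_size,) = struct.unpack_from(">Q", dev_bytes, 8)  # binary hdr + JSON area
    json_area = dev_bytes[BIN_HDR_LEN:hdr_size].rstrip(b"\x00")
    meta = json.loads(json_area)
    return int(meta["segments"]["0"]["offset"])
```

Two workers formatted by different cryptsetup versions would return different offsets here, which is exactly the metadata the migration RPC has to carry across.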
Backup¶
Volumes are backed up "at an interval" to off-network storage. Backup-and-restore is called out explicitly as insufficient for drain migrations: "a 'restore from backup migration' will lose data, and a 'backup and restore' migration incurs untenable downtime." Migration needs something better.
Sparsity¶
Typical customer usage is sparse: "A 100GiB volume with just 5MiB used wouldn't be at all weird." The 2024 migration work exploits this via TRIM/DISCARD integration with dm-clone: run fstrim on the target-side decrypted view, have it issue DISCARDs to the clone device, and dm-clone marks those blocks as hydrated without fetching them.
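The fstrim trick reduces to a few lines of bookkeeping. This is a toy model of dm-clone's hydration map, not the kernel implementation; the 4 MiB region size and all names are invented for illustration:

```python
class CloneTarget:
    """Toy model of dm-clone hydration bookkeeping (not the kernel algorithm).

    Each region is either hydrated (served locally) or pending (must be
    fetched from the source over the network). A DISCARD on a pending
    region marks it hydrated without transferring any data, which is why
    fstrim on a mostly-empty filesystem collapses the hydration backlog.
    """

    REGION_SIZE = 4 * 2**20  # assumed 4 MiB hydration regions

    def __init__(self, regions: int):
        self.hydrated = [False] * regions
        self.bytes_fetched = 0

    def read(self, region: int) -> None:
        if not self.hydrated[region]:
            self.bytes_fetched += self.REGION_SIZE  # demand-fetch from source
            self.hydrated[region] = True

    def discard(self, region: int) -> None:
        self.hydrated[region] = True  # no fetch: region is known-empty

    def remaining(self) -> int:
        return self.hydrated.count(False)
```

For the 100 GiB / 5 MiB volume in the quote, almost every region takes the `discard` path, so hydration finishes after fetching only the handful of in-use regions.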
Seen in¶
- sources/2024-07-30-flyio-making-machines-move — Anchor source. The Volumes primitive is the direct cause of every architectural complication in the post: the inability to drain, the need for dm-clone, the temporary-SAN shape, iSCSI, LUKS2 header skew, 6PN address rewrites.
- sources/2024-05-09-flyio-picture-this-open-source-ai-for-image-description — Fly Volumes as model-weight storage for stopped GPU Machines. "If you're running Ollama in the cloud, you likely want to put the model onto storage that's persistent, so you don't have to download it repeatedly. You could also build the model into a Docker image ahead of deployment." When the GPU Machine is stopped by Fly Proxy autostop, stage-2 of GPU cold-start (model load into GPU RAM) still has to read the weights from somewhere. A Fly Volume keeps them locally on NVMe across Machine restarts — one-bus-hop read vs. re-fetching from object storage. The Fly Volume on the PocketBase Machine in the same demo holds the SQLite DB.
- sources/2025-04-08-flyio-our-best-customers-are-now-robots — Fly Volumes re-framed as the filesystem LLM-driven coding agents want for stateful incremental VM build. Load-bearing wiki quote that surfaces Fly.io's retrospective regret as a robot-driven value-add: "As product thinkers, our intuition about storage is 'just give people Postgres'. And that's the right answer, most of the time, for humans. But because LLMs are doing the Cursed and Defiled Root Chalice Dungeon version of app construction, what they really need is a filesystem, the one form of storage we sort of wish we hadn't done. That, and object storage." The robot workflow is incrementally-mutated filesystems (packages installed, source edited, systemd units added after boot) — exactly the shape Fly.io initially discouraged. Tigris covers the object-storage half of the same argument. First wiki datum on Fly Volumes as an RX-shaped storage primitive (see concepts/robot-experience-rx).
Mutation-MCP surface (2025-05-07)¶
Fly Volumes are the first mutating resource family exposed through flyctl's MCP server. Sam Ruby's 2025-05-07 post (sources/2025-05-07-flyio-provisioning-machines-using-mcps) announced that flyctl v0.3.117 now exposes the full fly volumes subcommand family over MCP — create, list, extend, show, fork, snapshots, destroy — following the 2025-04-10 read-only prototype (fly logs + fly status) by ~27 days. Volume creation via MCP "worked the first time"; "a few hours later, and with the assistance of GitHub Copilot, i added support for all fly volumes commands." Load-bearing safety property: flyctl's existing human-operator refusal invariant — "can't destroy a volume that is currently mounted" — carries through the MCP wrapper into the agent-driven flow. Canonical wiki instance of patterns/cli-safety-as-agent-guardrail. Also surfaces an emergent resource-hygiene UX: the LLM spontaneously noted several unattached volumes and offered to delete the oldest on request. Security posture: inherits local-MCP server risk, now with mutation authority; see patterns/plan-then-apply-agent-provisioning for the roadmap-target mitigation via plan-then-apply.
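The guardrail-by-inheritance property comes for free when the MCP tool is a thin shell around the CLI. A hypothetical Python handler sketch (not flyctl's actual MCP server; the exact command and `--yes` flag are assumptions) in which flyctl's own refusal propagates to the agent instead of being swallowed:

```python
import subprocess

def destroy_volume(volume_id: str, run=subprocess.run) -> dict:
    """Hypothetical MCP tool handler wrapping `fly volumes destroy`.

    Sketch of guardrail-by-inheritance: the wrapper adds no safety logic
    of its own. If the volume is mounted, flyctl refuses and exits
    nonzero, and that refusal becomes the tool result the agent sees.
    The `run` parameter is injectable purely for testing.
    """
    proc = run(
        ["fly", "volumes", "destroy", volume_id, "--yes"],  # flag shape assumed
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0:
        # e.g. the mounted-volume refusal reaches the LLM verbatim
        return {"ok": False, "error": proc.stderr.strip()}
    return {"ok": True, "output": proc.stdout.strip()}
```

The design choice worth noting: because the refusal lives in the CLI, every caller — human, script, or agent — gets the same invariant, with zero extra code in the MCP layer.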
Future direction¶
The post gestures at LSVD (log-structured virtual disk) as the medium-term storage direction — NVMe as a local cache in front of object storage, rather than NVMe as the durable tier. Tigris providing regional S3-compatible object storage in every Fly region makes this plausible without backhauling writes to us-east-1.
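The LSVD shape can be sketched as a read-through cache: NVMe holds hot blocks, object storage holds the durable copy. A toy Python model (all names invented; nothing here reflects Fly's actual implementation):

```python
from collections import OrderedDict

class ReadThroughBlockCache:
    """Toy LSVD-flavoured read path: local NVMe as an LRU cache of blocks
    whose durable copy lives in object storage.

    Reads hit NVMe when possible; misses fetch from the object store
    (e.g. a regional S3-compatible GET) and populate the cache. The
    object store stays the source of truth, so losing the cache loses
    no data.
    """

    def __init__(self, fetch_from_object_store, capacity_blocks: int):
        self.fetch = fetch_from_object_store
        self.capacity = capacity_blocks
        self.nvme = OrderedDict()              # block number -> bytes

    def read_block(self, blkno: int) -> bytes:
        if blkno in self.nvme:
            self.nvme.move_to_end(blkno)       # LRU hit: one bus hop
            return self.nvme[blkno]
        data = self.fetch(blkno)               # miss: go to object storage
        self.nvme[blkno] = data
        if len(self.nvme) > self.capacity:
            self.nvme.popitem(last=False)      # evict the coldest block
        return data
```

With Tigris in every Fly region, the miss path stays regional rather than backhauling to us-east-1, which is what makes the model plausible.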
Caveats / open questions¶
- Pricing, per-Volume size ceilings, and concurrent-mount semantics are not covered by the migration post.
- Single-node-cluster data-loss disclaimer: the warning "single-node clusters can lose data!" has appeared in Fly's storage documentation since launch.
- The migration post doesn't document how Volume-per-Machine quotas interact with migration; whether a Volume can live on a worker different from its Machine during hydration is ambiguous.
Related¶
- systems/fly-machines — What mounts Volumes.
- systems/fly-flyctl — The CLI whose MCP server exposes Volume CRUD (2025-05-07).
- systems/model-context-protocol — The transport.
- systems/dm-clone — How Volumes move.
- systems/dm-crypt-luks2 — How Volumes are encrypted.
- systems/lsvd — Where Fly is heading.
- patterns/async-block-clone-for-stateful-migration — The migration recipe.
- patterns/wrap-cli-as-mcp-server — The pattern; Volumes are the first resource family to cross its read-only-to-mutating transition.
- patterns/cli-safety-as-agent-guardrail — The mounted-volume-refusal invariant as zero-cost agent guardrail.
- patterns/plan-then-apply-agent-provisioning — The roadmap-target UX for multi-resource agent provisioning.
- concepts/natural-language-infrastructure-provisioning — The parent UX posture the MCP Volume surface enables.
- concepts/local-mcp-server-risk — The security posture the mutation-MCP surface inherits.