Fly.io¶
Fly.io is an edge/region-first application platform: "we transmute containers into VMs, running them on our hardware around the world with the power of Firecracker alchemy." Tier-3 source on the sysdesign-wiki — the blog is a mix of partnership announcements, product posts, and occasional architectural retrospectives. On-scope ingests cover the architectural ones (this wiki filters out the pure-product launches per AGENTS.md Tier-3 guidance).
Key systems¶
- systems/phoenix-new — Fly.io's browser-delivered coding agent for Elixir/Phoenix (2025-06-20, Chris McCord). Per-session Fly Machine with a root shell shared between developer and a Phoenix-tuned agent, a full Chrome instance the agent drives, `*.phx.run` preview URLs via integrated port-forwarding, `gh` CLI pre-installed. Canonical productised instance of patterns/ephemeral-vm-as-cloud-ide, patterns/agent-driven-headless-browser (colocated variant), and patterns/agentic-pr-triage.
- systems/phoenix-framework — Elixir web framework the Phoenix.new agent is tuned for (Channels, Presence, LiveView, Ecto). Also a hosting target for deployed Phoenix apps on Fly.io; adjacent to systems/livebook and systems/flame-elixir on the BEAM-on-Fly.io stack.
- systems/gh-cli — GitHub CLI pre-installed in Phoenix.new session VMs; makes the "close laptop and wait for a PR" async-agent workflow executable without a GitHub-specific agent tool schema.
- systems/tkdb — Fly.io's isolated Macaroon token authority. ~5000 lines of Go, SQLite-backed, replicated US/EU/AU via LiteFS with Litestream PITR. HTTP/Noise RPC (patterns/noise-over-http). Canonical wiki instance of patterns/isolated-token-service. DB size: "a couple dozen megs"; client verification cache hit rate >98%; 0 incident interventions in over a year.
- systems/petsem — "Pet Semetary". Fly's in-house Vault replacement; its own Macaroon authority. Uses third-party caveats for privilege separation between flyd and user secrets. Explicitly not merged with tkdb — "Rule #10 and all that."
- systems/litefs — primary/replica distributed SQLite; the subsecond cross-region replication + primary-failover substrate underneath tkdb. Works with unmodified SQLite libraries. Post-2025-05-20 its LTX format and LiteVFS extension are both imported into Litestream — the two tools architecturally converge on LTX as the shared wire format.
- systems/litestream — streaming WAL-to-object-storage PITR for SQLite; tkdb's DR substrate. Seconds-scale restore. The 2025-05-20 revamp imports LiteFS's LTX file format + LSM-style compaction + a CASAAS conditional-write lease (replacing Consul for single-leader enforcement) + SQLite-VFS-based read replicas (FUSE-free). Unlocks wildcard `data/*.db` replication and positions Litestream as a PITR / rollback / fork primitive for agentic coding platforms. The 2025-10-02 v0.5.0 ship delivers the write/archive half: three-level hierarchical compaction (30s / 5m / 1h, restore bounded to "a dozen or so files on average"), monotonic TXIDs replacing generations (`litestream wal` → `litestream ltx`), per-page compression + an end-of-file index in the LTX library (the random-access precondition for VFS replicas), CGO removed via `modernc.org/sqlite`, NATS JetStream added as a replica type, one-destination-per-database enforced, and a file-format break from v0.3.x. VFS read-replicas are still proof-of-concept.
- systems/macaroon-superfly — github.com/superfly/macaroon, Fly.io's open-source Go Macaroon library; the substrate underneath tkdb + petsem.
- systems/honeycomb — distributed-tracing backend Fly.io uses; Ptacek's explicit retraction of prior tracing skepticism ("I was wrong") joined with JP Phillips's "I'd have ragequit without OTel."
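The three-level compaction numbers in the Litestream entry above imply the "dozen or so files" restore bound directly. A toy model (assuming the windows behave as fixed 30s / 5m / 1h buckets, a simplification of the real LTX compactor):

```python
# Toy model of three-level hierarchical compaction (Litestream v0.5.0-style
# windows assumed: 30s, 5m, 1h). A restore replays one file from the top
# level, then every completed lower-level file since that level's boundary.

WINDOWS = [3600, 300, 30]  # seconds per level, largest first

def files_to_restore(t: int) -> int:
    """Number of compacted files replayed to reach second `t` of an hour."""
    count, offset = 1, t % WINDOWS[0]           # one top-level (hourly) file
    for coarse, fine in zip(WINDOWS, WINDOWS[1:]):
        count += (offset % coarse) // fine      # completed finer-level files
        offset %= fine
    return count

# Average over every possible restore point within an hour.
avg = sum(files_to_restore(t) for t in range(3600)) / 3600
print(round(avg))  # 11 — "a dozen or so files on average"
```

The worst case (restore at second 3599) is 1 + 11 + 9 = 21 files, so the bound is tight as well as small on average.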
- systems/opentelemetry — Fly.io's tracing standard; context propagation gives single-narrative request traces across primary API → tkdb.
- systems/tigris — globally distributed, S3-compatible object storage (Tigris Data, Inc.), integrated into Fly.io as the `fly storage create` primitive. Three-layer architecture: regional FoundationDB metadata + Fly.io NVMe byte cache + QuiCK-style queuing for distribution to replicas, demand regions, and pluggable third-party backends (including S3).
- systems/fly-machines — Fly.io's Firecracker-micro-VM compute primitive; the building block that Fly Kubernetes Pods map onto. GPUs attach via whole-GPU passthrough (the fractional-GPU / MIG / vGPU path was tried and abandoned). Stateful Machines (with attached Fly Volumes) now migrate via kill → clone → boot per the 2024-07-30 rebuild.
- systems/fly-volumes — Fly.io's locally-attached NVMe block-storage primitive for stateful Machines, encrypted with per-volume LUKS2 keys. The anchor point of the 2024-07-30 migration rebuild: locally-attached NVMe gave Fly.io bus-hop read latency but anchored Machines to a worker physical.
- systems/dm-clone — Linux kernel device-mapper target powering Fly's async-clone-with-background-hydration migration.
- systems/iscsi — Network block protocol Fly uses to expose source Volumes to target workers during migration; settled on after NBD kept getting stuck kernel threads under network disruption.
- systems/nbd — Fly's first attempt at a network block protocol; abandoned.
- systems/dm-crypt-luks2 / systems/cryptsetup — How Fly Volumes are encrypted; cryptsetup version skew across the fleet causes LUKS2 header-size drift (4 MiB vs 16 MiB), which required a metadata RPC in flyd's migration FSM.
- systems/linux-device-mapper — Kernel block-layer proxy; the substrate for dm-clone + dm-crypt.
- systems/corrosion-swim — Fly's SWIM-gossip CRDT-SQLite state distribution system (corrosion2 per the 2025-02-12 exit interview). Rust. "Any component on our fleet can do SQLite queries to get near-real-time information about any Fly Machine around the world." Migration broke its worker-as-source-of-truth invariant and forced a redesign. The 2025-05-28 parking_lot post clarifies the architectural relationship: Corrosion is the RIB to fly-proxy's in-memory Catalog FIB, with update propagation in "millisecond intervals of time" host-to-host. The 2025-10-22 dedicated deep-dive — the "deserves its own post" Fly.io had been promising — fills in the mechanism (OSPF-inspired link-state design, SWIM membership + QUIC reconciliation + systems/cr-sqlite CRDT with LWW-by-logical-timestamp), the explicit anti-consensus posture (concepts/no-distributed-consensus), the rejected alternatives (Consul, rqlite, FoundationDB), the three disclosed outages (contagious deadlock, nullable-column backfill, Consul cert-expiry → uplink saturation), and the five-mitigation response (fleet-wide Tokio watchdogs, Antithesis adoption, checkpoint-backups-to-object-storage as break-glass, eliminate-partial-updates with whole-object republish, and the two-level regional + global state regionalization project). Open-sourced: github.com/superfly/corrosion.
- systems/cr-sqlite — CRDT SQLite extension from vlcn-io; the conflict-resolution substrate under Corrosion. Tracks CRDT-managed table changes in `crsql_changes`; applies updates last-write-wins using logical timestamps. Known failure mode: nullable-column backfill amplification on large tables.
- systems/consul — rejected predecessor to Corrosion; HashiCorp's service-discovery + KV store built on Raft. "Consul is fantastic software. Don't build a global routing system on it." Also the distal trigger of the 2024 uplink-saturation outage (mTLS cert expiry → fleetwide backoff loops → Corrosion write storm).
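The last-write-wins rule named above can be sketched in a few lines. This is a simplified illustration, not cr-sqlite's implementation: real change rows live in `crsql_changes` with per-column versions, and the tuple shape here is invented.

```python
# Minimal last-write-wins merge in the style cr-sqlite/Corrosion apply:
# each change carries a logical timestamp, and ties break deterministically
# on an originating-site id so every node converges to the same state.
from typing import NamedTuple

class Change(NamedTuple):
    key: str      # (table, pk, column) flattened for the sketch
    value: object
    ts: int       # logical timestamp
    site: str     # tie-breaker: originating node id

def apply_lww(state: dict, change: Change) -> dict:
    current = state.get(change.key)
    # Accept the change only if (ts, site) orders strictly after the winner.
    if current is None or (change.ts, change.site) > (current.ts, current.site):
        state[change.key] = change
    return state

state: dict = {}
apply_lww(state, Change("machines.m1.region", "iad", ts=4, site="worker-a"))
apply_lww(state, Change("machines.m1.region", "fra", ts=7, site="worker-b"))
apply_lww(state, Change("machines.m1.region", "ord", ts=7, site="worker-a"))  # loses the tie
print(state["machines.m1.region"].value)  # "fra"
```

Because the merge is commutative and idempotent, gossip can deliver changes in any order and any number of times, which is what lets Corrosion skip consensus entirely.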
- systems/parking-lot-rust — Amanieu's replacement for Rust's `std::sync` locks (Mutex/RwLock/Condvar/Once). Not Fly.io-authored, but load-bearing in fly-proxy's Catalog and the subject of Fly.io's sixth upstream-the-fix contribution (parking_lot PR #466). 64-bit compact lock word (4 signal bits + 60-bit reader counter); `try_write_for(Duration)`; `read_recursive`; deadlock detector. Canonical wiki anchor for the concepts/bitwise-double-free bug class.
- systems/lsvd — Log-structured virtual disk; Fly's stated medium-term storage direction — NVMe-as-cache in front of regional Tigris S3.
- systems/nomad — Fly's pre-flyd orchestrator; referenced as the baseline against which flyd — and the 2024 migration rebuild — are sized ("the biggest thing our team has done since we replaced Nomad with flyd").
- systems/nvidia-a10 / systems/nvidia-l40s / systems/nvidia-a100 / systems/nvidia-h100 — the four NVIDIA GPU models Fly.io stocks. Customer-usage data (2024) revealed the A10 — the least capable — as the most popular by a wide margin; the 2024-08-15 L40S price cut to $1.25/hr (A10 price) was engineered to collapse the choice into a single inference default.
- systems/nvidia-mig — NVIDIA's fractional-GPU primitive; Fly.io tried and abandoned productising it inside Firecracker Machines via IOMMU PCI passthrough.
- systems/fly-kubernetes — Fly.io's managed Kubernetes distribution where every Pod is a Fly Machine (Firecracker micro-VM) orchestrated by flyd rather than containerd/runc.
- systems/flyd — Fly.io's orchestrator that schedules Firecracker-backed Pods. Durable-FSM design (per-step records in BoltDB) lineage-linked to HashiCorp Cadence + Compose.io/MongoHQ "recipes/operations" per JP Phillips's 2025-02-12 exit interview.
- systems/flaps — the Machines-API gateway routing incoming HTTPS into per-host flyd RPCs. Decentralised ("for the most part doesn't require any central coordination"), sub-5-second P90 on `machine create` globally (Johannesburg and Hong Kong excepted). JP's "whole Fly Machines API" framing.
- systems/fly-pilot — 2025 successor to init. OCI-compliant runtime with a defined API for flyd to drive; consolidates the feature-bag init described in the 2024-06-19 AWS-without-Access-Keys post. Third of Fly.io's three Rust services (after fly-proxy + corrosion2).
- systems/boltdb — flyd's state store. Deliberate non-SQL pick for blast-radius safety + scope discipline; JP's 2025-02-12 defence: "I've never lost a second of sleep worried that someone is about to run a SQL update statement on a host, or across the whole fleet."
- systems/cadence — HashiCorp-era durable-workflow engine; direct design-ancestry cite for flyd's FSM design via JP Phillips. Not a Fly.io runtime dependency.
- systems/firecracker — Fly.io runs user workloads on AWS Firecracker micro-VMs. Substrate/context — Fly.io itself is not the primary wiki source for Firecracker (that's AWS Lambda + Figma), but it is named as Fly.io's isolation layer in every Fly.io source.
- systems/intel-cloud-hypervisor — Fly.io's GPU-Machine hypervisor. A "very similar Rust codebase" to Firecracker that supports PCI passthrough; non-GPU Machines run on Firecracker, GPU Machines run on Cloud Hypervisor. First wiki appearance via the 2025-02-14 GPU retrospective.
- systems/qemu — the conventional-hypervisor alternative on Nvidia's driver happy path. Fly.io rejected it on millisecond-boot DX grounds. Wiki touchpoint.
- systems/vmware — the other conventional-hypervisor alternative on Nvidia's driver happy path. Fly.io explicitly rejected it ("Nvidia suggested VMware (heh)"). Wiki touchpoint.
- systems/virtual-kubelet — the CNCF-sandbox Virtual Kubelet project is the pivot that lets FKS run K8s without Nodes; Fly runs a small Go provider alongside K3s.
- systems/k3s — the lightweight K8s distribution FKS uses for the control-plane API surface.
- systems/fly-proxy — Fly's edge / private-network proxy; backs K8s Services under FKS.
- systems/fly-wireguard-mesh — internal IPv6 WireGuard mesh (6PN) that replaces CNI under FKS and connects every Fly Machine across hosts / regions.
- systems/flycast — `*.flycast` private-network hostnames; one of the three FKS Service access paths.
- systems/fly-gateway — regional fleet of servers that terminate external customer WireGuard connections from flyctl. Separate substrate from the internal 6PN mesh; the subject of the 2024-03-12 JIT WireGuard rewrite.
- systems/wggwd — gateway-side daemon that manages WireGuard interfaces; a pull-on-demand peer provisioner post-2024-03-12.
- systems/wireguard — underlying protocol for both the internal (6PN) and external (gateway) meshes.
- systems/fly-flyctl — Fly.io's CLI; conjures a TCP/IP stack per invocation and speaks WireGuard to Fly Machines.
- systems/fly-graphql-api — Fly.io's customer-facing control plane; formerly pushed peer configs to gateways, now serves pull requests on handshake arrival.
- systems/oidc-fly-io — Fly.io's in-house OpenID Connect identity provider (`oidc.fly.io/<org>`). Issues short-lived OIDC JWTs exclusively to Fly Machines, with a structured `sub` claim of shape `<org>:<app>:<machine>`. Lets counterparties (AWS, GCP, Azure, any OIDC-compliant cloud) trust Fly Machines as federated identities without any long-lived credential ever being shared. Canonical wiki instance of workload identity.
- systems/fly-init — the Rust-written process-zero binary in every Fly Machine. Hosts a Unix-socket API proxy at `/.fly/api` (Fly's answer to the EC2 Instance Metadata Service, but SSRF-resistant by design) and acts as the credential broker for AWS OIDC federation: detects `AWS_ROLE_ARN` at boot, fetches an OIDC token, writes it to `/.fly/oidc_token`, and exports `AWS_WEB_IDENTITY_TOKEN_FILE` for the AWS SDK's standard credential-provider chain. As of 2025, succeeded by pilot — a full OCI runtime with a formal flyd-driven API — consolidating init's feature bag.
- systems/fly-proxy — Fly.io's Rust edge router; one of the three Rust services on Fly's platform (alongside corrosion2 and pilot). Edge servers "exist almost solely" to run it. Built on Tokio + tokio-rustls; terminates TLS, handles HTTP routing decisions, forwards to worker-hosted Fly Machines over Fly's Anycast fabric. Canonical wiki Seen-in: the 2025-02 IAD CPU-busy-loop incident traced to a tokio-rustls `TlsStream` Waker bug under `close_notify` with a buffered trailer.
- systems/rustls / systems/tokio-rustls / systems/tokio — the Rust async / TLS stack fly-proxy is built on. Rustls is "an ultra-important, load-bearing function in the Rust ecosystem"; Fly.io contributed the 2025-02 upstream fix (rustls PR #1950) — canonical Rust-ecosystem instance of patterns/upstream-the-fix.
- systems/livebook / FLAME / Nx / BEAM — the Elixir-ecosystem pieces that, with Fly Machines as the executor substrate, compose into notebook-driven elastic GPU compute. FLAME `Flame.call` blocks pool executors on Fly Machines; Livebook drives them from a laptop; Nx/Axon/Bumblebee supply the GPU-backed AI/ML primitives; BEAM's native code distribution makes notebook-defined modules executable across the cluster. The Kubernetes-side FLAME port (Livebook v0.14.1, Michael Ruoss) confirms the pattern is substrate-independent.
- systems/tokenized-tokens — Fly.io's secret-tokenization system (2024 post referenced in the 2025-04-08 "Our Best Customers Are Now Robots" post). Hardware- isolated, "robot-free" Fly Machines hold real OAuth / API credentials and substitute them for placeholder tokens at egress; the LLM client never touches the real secret. Canonical wiki substrate for patterns/tokenized-token-broker and concepts/tokenized-secret.
- systems/model-context-protocol — the open LLM interop protocol Fly.io names in the 2025-04-08 post. Modern MCP uses long-lived SSE connections; multitenant MCP deployments need session-affinity routing; Fly's dynamic request routing is the platform-level answer. Canonical wiki datum on MCP-SSE-as-routing-requirement is concepts/mcp-long-lived-sse.
- systems/flymcp — Fly.io's open-source github.com/superfly/flymcp, a "most basic" MCP server for flyctl. ~90 lines of Go, 2 tools (`fly logs` + `fly status`), MCP `stdio` transport, built in "30 minutes". Canonical wiki instance of patterns/wrap-cli-as-mcp-server — the pattern works because `flyctl --json` was already done in 2020. Demonstration of an agentic-incident-diagnosis loop against unpkg; surfaces concepts/local-mcp-server-risk as the structural concern. (Source: sources/2025-04-10-flyio-30-minutes-with-mcp-and-flyctl.)
- systems/fly-mcp-launch — the `fly mcp launch` flyctl subcommand (shipped flyctl v0.3.125, 2025-05-19). Takes any stdio MCP server command and deploys it as a remote HTTP MCP server running on a Fly Machine, with bearer-token auth on by default on both ends, client-config JSON rewritten in place across 6 supported clients (Claude / Cursor / Neovim / VS Code / Windsurf / Zed), and `--secret` flags piped through to Machine secrets. Canonical wiki instance of patterns/remote-mcp-server-via-platform-launcher. Pairs with flymcp to span the local ↔ remote MCP-server deployment axis. (Source: sources/2025-05-19-flyio-launching-mcp-servers-on-flyio.)
- systems/aws-lambda — positional comparator. The 2025-04-08 post discloses that Fly Machines (non-GPU) run on Lambda's hypervisor — Firecracker. "Not coincidentally, our underlying hypervisor engine is the same as Lambda's." Fly's value-add is Lambda-like start latency plus EC2-like runtime duration + stateful filesystem persistence across stop/start cycles.
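The wrap-cli-as-mcp-server pattern in the flymcp entry reduces to a small dispatch loop over a CLI that already speaks JSON. A hedged sketch of the shape (flymcp itself is Go; the tool table, argv lists, and simplified JSON-RPC framing here are illustrative, not flymcp's actual code, and real MCP adds initialize handshakes, schemas, and error framing on top):

```python
# Sketch of patterns/wrap-cli-as-mcp-server: expose an existing CLI's
# machine-readable output as MCP tools over a stdio transport.
import json
import subprocess

TOOLS = {  # tool name -> argv of the wrapped CLI invocation (illustrative)
    "status": ["flyctl", "status", "--json"],
    "logs":   ["flyctl", "logs", "--json", "--no-tail"],
}

def handle(request: dict, run=subprocess.run) -> dict:
    """Dispatch one JSON-RPC-shaped request to the wrapped CLI."""
    if request["method"] == "tools/list":
        result = {"tools": sorted(TOOLS)}
    else:  # "tools/call"
        argv = TOOLS[request["params"]["name"]]
        out = run(argv, capture_output=True, text=True, check=True).stdout
        result = {"content": [{"type": "text", "text": out}]}
    return {"jsonrpc": "2.0", "id": request["id"], "result": result}

if __name__ == "__main__":
    import sys
    for line in sys.stdin:  # stdio transport: one JSON object per line
        print(json.dumps(handle(json.loads(line))), flush=True)
```

The load-bearing precondition is the `--json` flag on the wrapped CLI: the wrapper adds framing, not parsing.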
Key patterns / concepts¶
Production-incident debugging (2025-02-26 fly-proxy Rust-TLS incident)¶
- patterns/flamegraph-to-upstream-fix — canonical Fly.io instance. Symptom (CPU pegging + HTTP errors in IAD) → flamegraph from an angry fly-proxy → `tracing::Subscriber` hot frames as the busy-loop signature → fully-qualified `Future` type names `tokio_rustls::TlsStream` as the guilty layer → pre-existing upstream issue (tokio-rustls#72) → upstream fix as rustls PR #1950 → partner (Tigris) resumes load test, clean.
- patterns/upstream-the-fix — Rust-ecosystem variant. Fourth ecosystem instance (after V8/Node.js/OpenNext Cloudflare, Node Web-streams Cloudflare-as-maintainer, and Datadog's Go-toolchain quartet). "TlsStream is an ultra-important, load-bearing function in the Rust ecosystem. Everybody uses it" → patch upstream, not inside Fly.io.
- patterns/dependency-update-discipline — Fly.io's self-drawn lesson. "Keeping track of what needs to be updated is valuable work. The updates themselves are pretty fast and simple, but the process and testing infrastructure to confidently metabolize dependency updates is not."
- patterns/spurious-wakeup-metric — explicit instrumentation follow-up Fly.io commits to at the end of the post: "Spurious wakeups should be easy to spot, and triggering a metric when they happen should be cheap, because they're not supposed to happen often."
- concepts/async-rust-future / concepts/rust-waker / concepts/asyncread-contract — the primitives the incident teaches. The Fly.io post is a well-crafted in-blog primer on these, because the bug only makes sense once you see the contract Waker + AsyncRead must satisfy.
- concepts/spurious-wakeup-busy-loop / concepts/cpu-busy-loop-incident — the incident class; a flamegraph dominated by `tracing::Subscriber` is the signature.
- concepts/tls-close-notify — the TLS protocol state whose buffered-trailer edge case triggered the rustls bug.
- concepts/flamegraph-profiling — the diagnostic tool.
- concepts/durable-execution — flyd's per-FSM-step BoltDB records are Fly.io's canonical instance, framed by JP Phillips's 2025-02-12 exit interview as the load-bearing property of "deploy flyd all day, every day." Ancestry-linked to HashiCorp Cadence + Compose.io/MongoHQ "recipes."
- concepts/bolt-vs-sqlite-storage-choice — Fly.io makes the trade both ways in one stack: flyd picks BoltDB for blast-radius-safety on authoritative state; corrosion2 picks CRDT-SQLite for fleet-queryable read-side distribution. Canonical wiki instance of the design decomposition.
- patterns/per-instance-embedded-database — JP's "if I had to do it over" alternate design: one SQLite per Fly Machine, zip the DB to object storage on Machine destroy. Not built at Fly.io (flyd today is one BoltDB per host); schema management is the named open problem.
- concepts/jit-peer-provisioning — 2024-03-12 gateway rewrite keeps kernel-resident WireGuard peer state near zero by installing on first packet and evicting ruthlessly on cron.
- patterns/jit-provisioning-on-first-packet — the architectural pattern behind the gateway rewrite; Fly.io is the canonical wiki instance.
- patterns/initiator-responder-role-inversion — the sub-RTT install trick on the JIT path: install the peer in the initiator role so the kernel sends the next handshake to the client at install-time.
- patterns/bpf-filter-for-api-event-source — BPF-sniff the WireGuard handshake-initiation byte to manufacture the "incoming connection" event that Netlink doesn't expose.
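The BPF trick above leans on a stable fact of the WireGuard wire format: the first four bytes of every message are a one-byte type (1 = handshake initiation) plus three reserved zero bytes. The predicate the filter matches, written out in Python for illustration (the real filter is a BPF program attached to the gateway socket):

```python
# The gateway sniffs WireGuard UDP payloads and treats a handshake
# initiation as the "incoming connection" event Netlink never emits.
# Handshake initiation = message type 1; data packets are type 4.

HANDSHAKE_INITIATION = bytes([1, 0, 0, 0])  # type byte + 3 reserved zeros

def is_handshake_initiation(udp_payload: bytes) -> bool:
    """True iff this UDP payload opens a WireGuard handshake."""
    return udp_payload[:4] == HANDSHAKE_INITIATION

# A handshake-initiation message is 148 bytes; a data packet starts with 4.
print(is_handshake_initiation(bytes([1, 0, 0, 0]) + b"\x00" * 144))  # True
print(is_handshake_initiation(bytes([4, 0, 0, 0]) + b"\x00" * 28))   # False
```

Matching one fixed prefix is exactly why the event source is cheap enough to run on every inbound packet.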
- patterns/pull-on-demand-replacing-push — 2024-era architectural migration at Fly.io, retiring dropped-message NATS pushes on the WireGuard path (and more broadly — flyd went from NATS-driven to HTTP in the same timeframe).
- patterns/state-eviction-cron — cheap because JIT.
- concepts/packet-sniffing-as-event-source — the architectural move generalised.
- concepts/rate-limited-cache — SQLite-backed cache on gateways that shields the Fly API from WireGuard retry storms.
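A minimal sketch of the rate-limited-cache shape (in-memory rather than SQLite-backed as on Fly's gateways, and the 5-second cooldown is an invented number; Fly doesn't disclose the window):

```python
# Serve the cached peer config when a key was fetched recently, and go to
# the upstream API at most once per cooldown window per key, so a
# WireGuard retry storm collapses into a single API call.
import time

class RateLimitedCache:
    def __init__(self, fetch, cooldown=5.0, clock=time.monotonic):
        self.fetch, self.cooldown, self.clock = fetch, cooldown, clock
        self.entries = {}  # key -> (fetched_at, value)

    def get(self, key):
        now = self.clock()
        hit = self.entries.get(key)
        if hit and now - hit[0] < self.cooldown:
            return hit[1]               # retry storm absorbed here
        value = self.fetch(key)         # at most once per cooldown window
        self.entries[key] = (now, value)
        return value
```

Under a storm of N retries inside one window, the upstream sees one request; the other N-1 handshakes get the cached answer.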
- concepts/noise-protocol — the identity-hiding handshake framework underneath WireGuard; forced Fly to implement ~200 lines of crypto on the gateway.
- concepts/kernel-state-capacity-limit — the scale-driving constraint behind the gateway rewrite.
- concepts/nodeless-kubernetes — running a K8s API without any Node object; FKS is the canonical wiki instance.
- concepts/micro-vm-as-pod — Pod as a Firecracker micro-VM rather than a shared-kernel container; FKS instantiates this at the K8s API tier.
- concepts/managed-kubernetes-service — spectrum (self-managed → managed-control-plane → managed-data-plane → nodeless); FKS sits at the nodeless end.
- concepts/ipv6-service-mesh — WireGuard mesh as the CNI substitute; ClusterIPs are IPv6 under FKS.
- patterns/virtual-kubelet-provider — implement a managed-K8s offering by registering a Virtual-Kubelet provider that forwards Pod-creates into a cloud's existing compute API.
- patterns/primitive-mapping-k8s-to-cloud — map each K8s primitive 1:1 to a pre-existing cloud primitive instead of reimplementing the reference stack. FKS's {CRI, CNI, Pod, Service, Secret, DNS, PV} → Fly.io table is the canonical wiki instance.
- patterns/metadata-db-plus-object-cache-tier — the architectural shape Tigris instantiates on Fly.io (metadata in FDB + byte cache in regional NVMe + distribution queue + pluggable origin). Fly.io is the canonical wiki instance.
- patterns/partner-managed-service-as-native-binding — `fly storage create` turns a third-party Tigris service into a first-party-feeling Fly.io primitive, with app secrets auto-injected.
- patterns/unified-billing-across-providers — Tigris, Supabase, PlanetScale, Upstash billed through one Fly.io invoice. "Everything gets charged to your Fly.io bill and you pay one bill per month."
- concepts/demand-driven-replication — the Tigris replication policy for larger objects.
- concepts/immutable-object-storage — Tigris preserves the S3 immutable-objects contract on top of its distributed byte plane.
- concepts/inference-vs-training-workload-shape — Fly.io's canonical statement of the workload-shape distinction: "Training workloads tend to look more like batch jobs, and inference tends to look more like transactions. Batch training jobs aren't that sensitive to networking or even reliability. Live inference jobs responding to end-user HTTP requests are." Basis for the GPU product strategy pivot.
- concepts/inference-compute-storage-network-locality — GPU-instance RAM + object storage + Anycast network combined on one platform, rather than any single axis maximised. Fly.io's thesis for why inference doesn't need frontier GPUs and why hyperscaler GPU instances (with egress) underdeliver.
- concepts/egress-cost — the hyperscaler-surcharge + egress-fee squeeze is Fly.io's framing of why cross-cloud GPU-inference topologies lose.
- concepts/anycast — the network-locality axis of the inference-locality thesis (alongside Tigris + GPU Machines).
- patterns/co-located-inference-gpu-and-object-storage — the pattern the L40S + Tigris + Anycast combination instantiates; Fly.io × Tigris is the canonical wiki instance.
- concepts/workload-identity — Fly Machines as the canonical wiki instance of platform-attested identity (`<org>:<app>:<machine>` in the OIDC `sub` claim), used to obtain credentials to other clouds without sharing any long-lived secret.
- concepts/oidc-federation-for-cloud-access — Fly.io → AWS via systems/oidc-fly-io + systems/aws-sts's `AssumeRoleWithWebIdentity`; the canonical cross-cloud federation wiki source is sources/2024-06-19-flyio-aws-without-access-keys.
- concepts/short-lived-credential-auth — Fly.io's framing: "dead in minutes, sharply limited blast radius, rotate themselves, fail closed" — the canonical wiki line for what STS credentials buy you.
- concepts/machine-metadata-service — Fly's `/.fly/api` Unix socket is self-described as "our answer to the EC2 Instance Metadata Service", but SSRF-resistant by design because a Unix socket is not HTTP-reachable by default.
- concepts/unix-socket-api-proxy — the specific IPC shape Fly uses instead of link-local HTTP; gives the Macaroon attach-on-outbound model its privilege-separation properties.
- patterns/oidc-role-assumption-for-cross-cloud-auth — the end-to-end cross-cloud pattern Fly.io instantiates for AWS (and will instantiate for GCP / Azure when someone asks).
- patterns/init-as-credential-broker — Fly init as the guest-side plumbing that closes the loop between platform-attested identity and the AWS SDK's credential-provider chain with a single environment variable.
- patterns/sub-field-scoping-for-role-trust — Fly's choice of `<org>:<app>:<machine>` as the `sub` shape is what lets AWS trust policies scope a Role to an org, an app, or a single Machine via a `StringLike` match on a prefix.
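A worked illustration of why the `sub` shape composes with AWS `StringLike`, which does shell-style wildcard matching: `fnmatch` mirrors those `*`/`?` semantics, and the org/app/machine values below are invented.

```python
# Scoping a Role trust policy by where the * goes in the sub pattern:
# org-wide, app-scoped, or pinned to a single Machine.
from fnmatch import fnmatchcase

def sub_allowed(sub: str, pattern: str) -> bool:
    """Mimics an AWS StringLike condition on the OIDC sub claim."""
    return fnmatchcase(sub, pattern)

sub = "acme-org:billing-api:machine-90801"
print(sub_allowed(sub, "acme-org:*"))                          # True: org-wide Role
print(sub_allowed(sub, "acme-org:billing-api:*"))              # True: app-scoped Role
print(sub_allowed(sub, "acme-org:billing-api:machine-90801"))  # True: one Machine
print(sub_allowed(sub, "other-org:*"))                         # False: denied
```

The fixed `org:app:machine` ordering is what makes every useful scope a prefix; a flat or reordered `sub` would force enumerating Machines in the trust policy.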
Storage + migration (2024-07-30 rebuild)¶
- concepts/bus-hop-storage-tradeoff — Fly.io's canonical self-assessment: local NVMe trades operational simplicity (drain becomes hard) for bus-hop read latency and startup-era affordability.
- concepts/fleet-drain-operation — The on-call primitive "drain that worker"; the runbook that stateful Machines broke for three years until the 2024 rebuild restored it.
- concepts/kill-copy-boot-migration-tradeoff — The classical stateful-migration dilemma: copy-boot-kill loses data, kill-copy-boot takes too long. Fly.io's `clone` primitive resolves it.
- concepts/block-level-async-clone — The architectural pattern dm-clone instantiates at the kernel block-device tier; shape-parallel to Cloudflare Artifacts' async clone at the Git tier.
- concepts/trim-discard-integration — Using `fstrim` + `DISCARD` to short-circuit hydration of unused blocks on sparse Fly Volumes.
- concepts/heterogeneous-fleet-config-skew — cryptsetup version skew causing LUKS2 header-size drift across Fly's workers; generalises to any aging multi-host fleet.
- concepts/embedded-routing-in-ip-address — Fly's 6PN design; trades distributed-routing-protocol cost for address- stability-on-migration cost.
- concepts/hardcoded-literal-address-antipattern — Fly Postgres cluster configs with literal 6PN IPv6 addresses; the antipattern that forced a fleet-wide config rewrite.
- patterns/async-block-clone-for-stateful-migration — The end-to-end migration recipe: kill → clone → boot with iSCSI + dm-clone + background `kcopyd` hydration + flyd-orchestrated FSMs. Canonical wiki instance.
- patterns/temporary-san-for-fleet-drain — The fleet-level shape: worker physicals become temporary SANs serving Volumes to fresh-booted replicas on target workers, evaporating when hydration completes. Canonical wiki instance.
- patterns/embedded-routing-header-as-address — The general pattern behind 6PN.
- patterns/fsm-rpc-for-config-metadata-transfer — Fly's migration FSM carries LUKS2 header metadata from source to target worker; generalisable defence for any cross-host operation against fleet config skew.
- patterns/feature-gate-pre-migration-network-rewrite — Ship an in-init address-mapping feature first, then do the fleet-wide config rewrite; Fly's handling of hardcoded literal 6PN addresses in Fly Postgres configs.
Notebook-driven elastic GPU compute (2024-09-24 keynote)¶
- patterns/notebook-driven-elastic-compute — End-to-end shape: Livebook cell → ephemeral Fly Machine cluster → streamed-back results → cluster tear-down on disconnect. Canonical wiki instance.
- patterns/framework-managed-executor-pool — FLAME's architectural pattern: the library manages a min/max/concurrency-bounded pool of executor Machines; an inline `Flame.call` replaces function-per-operation decomposition.
- concepts/seconds-scale-gpu-cluster-boot — Fly's platform-level claim that a 64-node GPU cluster comes up in seconds from a Docker image; the load-bearing latency property behind the notebook UX.
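The bounded-pool shape can be sketched without any of FLAME's BEAM machinery. A hedged Python analogue where an "executor" is just a thread and the min/max/concurrency numbers are illustrative, not FLAME's defaults:

```python
# Sketch of patterns/framework-managed-executor-pool: a blocking call()
# that runs work on a bounded pool of executors, FLAME-style. Real FLAME
# boots Fly Machines and ships BEAM modules to them; this only shows the
# pool-bounding and blocking-call shape.
from concurrent.futures import ThreadPoolExecutor
import threading

class Pool:
    def __init__(self, max_executors=4, max_concurrency=2):
        # max_executors caps pool size; the semaphore caps total in-flight
        # calls, mirroring FLAME's per-executor concurrency bound.
        self._pool = ThreadPoolExecutor(max_workers=max_executors)
        self._slots = threading.Semaphore(max_executors * max_concurrency)

    def call(self, fn, *args):
        """Blocks like Flame.call: submit, wait, return the result inline."""
        with self._slots:
            return self._pool.submit(fn, *args).result()

pool = Pool()
print(pool.call(lambda x: x * x, 7))  # 49
```

The point of the pattern is the call site: the caller writes an inline function call, and the framework, not the application, decides where (and on how many executors) it runs.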
- concepts/transparent-cluster-code-distribution — BEAM primitive Livebook exposes: modules defined in a notebook cell run on any executor without a deploy step.
GPU scale-to-zero (single-Machine) — 2024-05-09 image-description walkthrough¶
- patterns/proxy-autostop-for-gpu-cost-control — Canonical Fly.io instance. Fly Proxy owns start/stop of a GPU Fly Machine: stops on idle silence (minutes-scale), starts on next internal request. App tier never decides; proxy does. The load-bearing cost-control primitive that makes single-Machine GPU inference hobby-project-affordable on a cloud.
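A minimal state-machine sketch of proxy-owned autostop. The 5-minute idle window is an assumption (the post says only minutes-scale), and `machine` stands in for the proxy's Machines-API handle:

```python
# Sketch of patterns/proxy-autostop-for-gpu-cost-control: the proxy, not
# the app, owns the Machine lifecycle — stop after an idle window, start
# on the next request.
class AutostopProxy:
    def __init__(self, machine, idle_timeout=300.0):
        self.machine, self.idle_timeout = machine, idle_timeout
        self.last_request = None

    def on_request(self, now: float):
        if not self.machine.running:
            self.machine.start()   # cold start: Machine boot + model load
        self.last_request = now
        return self.machine.handle()

    def on_tick(self, now: float):
        """The proxy's periodic idle sweep."""
        idle = (self.last_request is not None
                and now - self.last_request >= self.idle_timeout)
        if self.machine.running and idle:
            self.machine.stop()    # GPU billing stops here
```

Because the app tier never participates, any unmodified inference server gets scale-to-zero for free; the cost is the cold-start tail described in the next entry.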
- patterns/flycast-scoped-internal-inference-endpoint — The access-scoping pre-requisite that makes autostop's "idle" definition meaningful. Flycast hostname scopes inference-tier access to same-org 6PN traffic only, so random internet scans don't wake the GPU Machine.
- concepts/gpu-scale-to-zero-cold-start — The tail the pattern eats: a three-stage cold-start budget (Machine-start seconds + model-load-into-GPU-RAM tens of seconds + first-response seconds) disclosed as ~45 seconds on `a100-40gb` + LLaVA-34b. Different dominant stage from CPU/serverless cold-start.
Remote development / agentic loops (2025-02-07 commentary)¶
- concepts/remote-development-environment — the architectural space Fly.io's 2025-02-07 post operates in; names the two opposite architectures (Emacs Tramp vs VSCode Remote-SSH) on the same SSH substrate.
- concepts/live-off-the-land — the Tramp posture; canonical wiki instance is now on the wiki via Fly's 2025-02-07 framing.
- concepts/agentic-development-loop — Fly.io's canonical phrasing of the closed-loop LLM coding workflow: "close the loop between the LLM and the execution environment […] a semi-effective antidote to hallucination." The 2025-motivating use-case for why anyone wants disposable-VM dev sandboxes.
- patterns/stager-downloads-agent-for-remote-control — the VSCode Remote-SSH architectural pattern Fly critiques; ships a full Node.js runtime + agent to the target host via a bash stager, exposes a WebSocket RPC over an SSH port-forward, persists across reconnects.
- patterns/disposable-vm-for-agentic-loop — the architectural answer Fly.io's 2025-02-07 post argues for: run the agentic loop on a clean-slate, instant-boot, discardable VM (a Fly Machine, unsurprisingly). Canonical wiki instance.
GPU product retrenchment (2025-02-14 retrospective)¶
- concepts/developers-want-llms-not-gpus — Fly.io's canonical demand-side diagnosis. "Developers don't want GPUs. They don't even want AI/ML models. They want LLMs." The 10,000-vs-5-6-developer credo applied to GPU Machines — GPU workloads land on the 5-6 side. Fly.io's canonical wiki instance.
- concepts/gpu-as-hostile-peripheral — the security framing that shaped GPU Machines' productisation. "A GPU is just about the worst case hardware peripheral: intense multi-directional direct memory transfers, with arbitrary, end-user controlled computation, all operating outside our normal security boundary." Canonical wiki instance.
- concepts/nvidia-driver-happy-path — Fly.io canonically discloses the shape of the Nvidia driver happy path (K8s with shared kernel, or QEMU/VMware) and the cost of deviating. Months of failed Cloud Hypervisor integration work; hex-edited closed-source drivers to impersonate QEMU; no MIG path to thin-sliced GPUs because MIG presents as a UUID not a PCI device. Canonical wiki instance.
- concepts/fast-vm-boot-dx — the DX property Fly.io refused to trade for Nvidia-driver-happy-path compatibility. "We could not have offered our desired Developer Experience on the Nvidia happy-path." Canonical wiki statement.
- concepts/asset-backed-bet — Fly.io's risk-framing: GPUs are tradable assets with durable value, so the downside of being wrong about the GPU bet is partially recoverable via liquidation. Companion to the IPv4 address portfolio framing. Canonical wiki instance.
- concepts/insurgent-cloud-constraints — the broader structural framing for why Fly.io can't compete with OpenAI/Anthropic on the LLM-serving axis. Canonical wiki statement.
- concepts/product-market-fit — the meta-framing: "a startup is a race to learn stuff." Course-correction without shame when a bet doesn't hit PMF.
- patterns/dedicated-host-pool-for-hostile-peripheral — Fly's GPU Machines run on dedicated GPU-only workers on Cloud Hypervisor; non-GPU Machines run on Firecracker workers. Isolation posture: peripheral-class segregation at the placement tier. Canonical wiki instance.
- patterns/independent-security-assessment-for-hardware-peripheral — Fly.io's GPU deployment was cleared by two independent external security audits (Atredis, Tetrel) before productisation. "They were not cheap, and they took time." Canonical wiki instance.
- patterns/platform-retrenchment-without-customer-abandonment — Fly.io's 2025-02-14 announcement "if you're using Fly GPU Machines, don't freak out; we're not getting rid of them. But if you're waiting for us to do something bigger with them, a v2 of the product, you'll probably be waiting awhile." Keep-running + no-v2 = retrenchment without abandonment. Canonical wiki instance.
Robot Experience (RX) / robots-as-customers (2025-04-08)¶
- concepts/robot-experience-rx — Fly.io introduces RX (Robot Experience) as a product-design axis alongside UX and DX in the 2025-04-08 post. "One of our north stars has always been nailing the DX of a public cloud. But the robots aren't going anywhere. It's time to start thinking about what it means to have a good RX. […] We have not yet nailed the RX; nobody has. But it's an interesting question." Canonical wiki instance of the framing.
- concepts/vibe-coding — Fly.io's gloss on the LLM-driven conversational code-generation workflow: bursty-then-idle ("frenzy of activity for a minute or so, but then chill out for minutes, hours, or days"). The canonical wiki workload-shape phrasing.
- concepts/fly-machine-start-vs-create — the lifecycle-split primitive robots consume and humans don't grok. "`start` is lightning fast; substantially faster than booting up even a non-virtualized K8s Pod. This is too subtle a distinction for humans, who (reasonably!) just mash the `create` button to boot apps up in Fly Machines. But the robots are getting a lot of value out of it." Canonical wiki instance.
- concepts/stateful-incremental-vm-build — the robot-workload storage shape that forces filesystem + object-storage primitives over the Postgres-first human default. "As product thinkers, our intuition about storage is 'just give people Postgres'. […] But because LLMs are doing the Cursed and Defiled Root Chalice Dungeon version of app construction, what they really need is a filesystem, the one form of storage we sort of wish we hadn't done. That, and object storage."
- concepts/mcp-long-lived-sse — the networking-tier reason Fly.io's Fly Proxy dynamic request routing is "a robot attractant" — multitenant MCP SSE deployments require session-affinity routing.
- concepts/tokenized-secret — the identity-plane RX primitive: LLMs get a placeholder that a hardware-isolated Fly Machine substitutes for the real OAuth token at egress. Grounded in Fly.io's 2024 tokenized-tokens substrate.
- patterns/start-fast-create-slow-machine-lifecycle — expose two lifecycle paths (slow `create`, fast `start`) so bursty-then-idle workloads resume at invocation latency from idle. Canonical wiki instance.
- patterns/session-affinity-for-mcp-sse — route long-lived MCP SSE connections from a given client back to the same stateful instance. Fly.io instantiates via tenant-controlled dynamic request routing; canonical wiki instance.
- patterns/tokenized-token-broker — hardware-isolated broker substitutes real secrets for placeholders at egress; adjacent to Fly's init-as-credential-broker pattern (STS / OIDC federation variant) and extends it to arbitrary OAuth-style credentials. Canonical wiki instance.
Wrap-CLI-as-MCP / local-MCP-risk (2025-04-10 → 2025-05-07)¶
- patterns/wrap-cli-as-mcp-server — canonical wiki pattern (Fly.io's flymcp: 90 LoC Go, 2 tools, stdio transport, 30 minutes). Viable because `flyctl --json` was already done in 2020; pass-through CLI credentials; no auth/transport layer. Demonstrated generalisable via the unpkg incident-diagnosis flow. 2025-05-07 mutation transition: same server, now full `fly volumes` CRUD; first wiki instance of the pattern crossing the read-only → production-mutation boundary.
- concepts/local-mcp-server-risk — canonical wiki statement of the "giving a cloud LLM the ability to run a native program on my machine" concern. Ptacek: "Local MCP servers are scary." The 2025-05-07 post inherits the posture with mutation authority; "if you ask it to destroy a volume, that operation is not reversable." Mitigation: patterns/disposable-vm-for-agentic-loop (run the wrapped CLI inside a throwaway Fly Machine rather than on the operator's laptop — sketched two months earlier in the 2025-02-07 VSCode-SSH-bananas post).
- concepts/structured-output-reliability — extended with the upstream variant: Fly.io's 2020 `--json` decision is the producer-side instance of structured-output discipline that made LLM-consumer tooling trivially viable five years later. Different shape from the Dash-judge case (LLM as producer there, LLM as consumer here); same underlying lesson.
- concepts/agent-ergonomic-cli — cross-vendor confirmation of the Cloudflare framing: `flyctl`'s structured-output axis predates and survives the LLM era as a general automation property LLMs retroactively weaponise. 2025-05-07 extends with the three-way alternative-rejection framing (API / CLI / dashboard all lose to NL + MCP).
- concepts/natural-language-infrastructure-provisioning — canonical wiki thesis. "Today's state of the art is K8S, Terraform, web based UIs, and CLIs. Those days are numbered." "Make it so" as the target UX.
- patterns/plan-then-apply-agent-provisioning — the aspirational UX the 2025-05-07 post sketches: LLM scans code, presents a plan, human adjusts and approves, agent executes; on failure it examines logs and proposes next steps. Terraform's `plan`/`apply` discipline reimplemented as a conversation. Not yet shipped in flyctl v0.3.117; roadmap target.
- patterns/cli-safety-as-agent-guardrail — the mounted-volume-refusal invariant as the zero-cost guardrail that let Fly.io safely ship a mutation-authority MCP surface. "I would have received an error had I tried to destroy a volume that is currently mounted. Knowing that gave me the confidence to try the command." Mutation-side twin of the `--json`-as-load-bearing observation — mature CLI design pays an AI-integration dividend the original authors never intended.
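The wrap-a-CLI shape and the CLI-invariant guardrail compose naturally: the tool body shells out, and the CLI's own refusals surface to the model as tool errors. A self-contained sketch — `wrap_cli_tool` is a hypothetical helper, and a Python one-liner stands in for `flyctl` so the example runs anywhere:

```python
import json
import subprocess
import sys

def wrap_cli_tool(argv: list[str]) -> dict:
    """Generic MCP-style tool body: run a CLI that already speaks JSON and
    pass its refusal errors through to the model instead of hiding them.
    The CLI's pre-existing invariants (e.g. refusing to destroy a mounted
    volume) become the agent guardrail for free."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    if proc.returncode != 0:
        # the CLI said no; the agent sees the same error a human would
        return {"is_error": True, "content": proc.stderr.strip()}
    return {"is_error": False, "content": json.loads(proc.stdout)}

# Stand-in "CLI" invocations (the real thing would be e.g. `flyctl status --json`):
ok = wrap_cli_tool([sys.executable, "-c",
                    "import json; print(json.dumps({'status': 'passing'}))"])
refused = wrap_cli_tool([sys.executable, "-c",
                         "import sys; sys.exit('volume is mounted; refusing')"])
```

No auth or transport layer appears anywhere: credentials pass through exactly as they would for the human operator, which is the whole economy of the pattern.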
Cloud IDE for coding agents (2025-06-20 Phoenix.new)¶
- patterns/ephemeral-vm-as-cloud-ide — canonical wiki pattern. Productised as Phoenix.new: per-session Fly Machine, browser-delivered VSCode-style UI, root shell shared with agent, `*.phx.run` preview URLs. Substrate realisation of patterns/disposable-vm-for-agentic-loop four months after the 2025-02-07 sketch.
- patterns/agent-driven-headless-browser — canonical colocated-browser instance. Full Chrome inside the session VM the coding agent drives via CDP; sibling of the MoltWorker proxied-browser instance. Three-signal closed loop on Phoenix.new (server logs + browser DOM/JS state + `mix test` output).
- patterns/ephemeral-preview-url-via-port-forward — canonical `*.phx.run` instance. Any port bound in the session VM becomes a publicly-shareable URL automatically; the deploy step collapses to zero. Directly addresses Karpathy's "code was the easy part; getting it online took a week" pain.
- patterns/agentic-pr-triage — canonical wiki pattern. McCord's own usage of Phoenix.new against phoenix-core: "I close my laptop, grab a cup of coffee, and wait for a PR to arrive." Combines ephemeral-VM cloud IDE + `gh` + issue-tracker filter to let the agent pick work asynchronously.
- concepts/cloud-ide — product-category framing. Phoenix.new as agent-centered ephemeral cloud IDE, contrasted with human-centered persistent cloud IDEs (Codespaces / Gitpod).
- concepts/ephemeral-dev-environment — session-scoped dev environment concept; canonical productised instance.
- concepts/agent-with-root-shell — canonical wiki statement. "It owns the whole environment." Coarse-grained perimeter (VM boundary) posture contrasting with fine-grained capability sandbox posture (Cloudflare Project Think / EmDash).
- concepts/agent-driven-browser — canonical wiki statement. "Instead of trying to iterate on screenshots, the agent sees real page content and JavaScript state."
- concepts/ephemeral-preview-url — canonical `*.phx.run` instance.
- concepts/async-agent-workflow — coding-agent specialisation of the 2025-04-08 RX thesis. "The future of development … probably looks less like cracking open a shell and finding a file to edit, and more like popping into a CI environment with agents working away around the clock."
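The preview-URL mechanics reduce to a routing table keyed by bound port. A toy model (the `PortForwarder` class and the `{session}-{port}.phx.run` naming are illustrative, not Phoenix.new's actual scheme):

```python
class PortForwarder:
    """Toy model of the integrated port-forwarding layer: the session VM
    reports each bound port, and each becomes a public URL with no deploy
    step in between."""

    def __init__(self, session: str):
        self.session = session
        self.routes: dict[str, int] = {}   # public URL -> VM-local port

    def on_bind(self, port: int) -> str:
        if not (1 <= port <= 65535):
            raise ValueError("not a TCP port")
        url = f"https://{self.session}-{port}.phx.run"   # illustrative naming
        self.routes[url] = port
        return url

fwd = PortForwarder("sparkling-haze-1234")
url = fwd.on_bind(4000)   # the Phoenix dev server just bound a port
```

Anything the agent starts inside the VM — dev server, docs site, debug dashboard — is shareable the instant it binds, which is what collapses the deploy step to zero.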
Minimal-agent-loop pedagogy (2025-11-06 Ptacek essay)¶
Canonical statements of agent-architecture vocabulary the wiki had in use but un-named. From "You Should Write An Agent":
- concepts/agent-loop-stateless-llm — "The LLM itself is a stateless black box. The conversation we're having is an illusion we cast, on ourselves." Canonical 15-LoC Python statement of the primitive every agent on the wiki composes over.
- concepts/context-window-as-token-budget — "You're allotted a fixed number of tokens in any context window. Each input you feed in, each output you save, each tool you describe, and each tool output eats tokens." Independent confirmation of the context-window-as-budget framing also surfaced at Dropbox Dash (2025-11-17) and Datadog (2026-03-04). Degradation is nondeterministic: "the whole system begins getting nondeterministically stupider."
- concepts/context-engineering — "Turns out: context engineering is a straightforwardly legible programming problem. […] If Context Engineering was an Advent of Code problem, it'd occur mid-December. It's programming." Canonical statement repudiating the "magic spells" framing of prompt engineering.
- concepts/sub-agent — "Just a new context array, another call to the model. Give each call different tools." Demystifies Claude Code's sub-agents primitive; complementary to existing patterns/specialized-agent-decomposition + patterns/coordinator-sub-reviewer-orchestration.
- patterns/tool-call-loop-minimal-agent — ~30-LoC Python teaching shape for every tool-using agent on the wiki. Emergent multi-step planning: "Did you notice where I wrote the loop in this agent to go find and ping multiple Google properties? Yeah, neither did I."
- patterns/context-segregated-sub-agents — security-, budget-, and specialisation-motivated sub-agent pattern. "You can trivially build an agent with segregated contexts, each with specific tools. That makes LLM security interesting."
The essay is also the wiki's first explicit MCP-is-optional framing (extended onto the MCP system page): when you own both the agent and the tools, native tool-schema JSON against the LLM endpoint is sufficient — MCP earns its place as an interop layer for tools consumed by agents someone else built.
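The essay's loop shape, reduced to a sketch with a scripted stand-in model (the names and reply schema here are illustrative, not Ptacek's exact code — a real `llm` would be a chat-completion call with tool support):

```python
import json

def agent(llm, tools, user_msg, max_steps=8):
    """The whole agent: a context list and a loop. `llm` is stateless and
    sees the full context on every call; it returns either a tool call
    {"tool": name, "args": {...}} or a final {"text": ...}."""
    context = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):
        reply = llm(context)
        if "tool" in reply:
            result = tools[reply["tool"]](**reply["args"])
            context.append({"role": "assistant", "content": json.dumps(reply)})
            context.append({"role": "tool", "content": json.dumps(result)})
        else:
            return reply["text"], context
    return None, context

# Scripted "model" that pings two hosts, then answers. The multi-step plan
# lives in the model's replies — we never wrote that loop ourselves.
script = iter([
    {"tool": "ping", "args": {"host": "google.com"}},
    {"tool": "ping", "args": {"host": "8.8.8.8"}},
    {"text": "both hosts answered"},
])
answer, ctx = agent(lambda context: next(script),
                    {"ping": lambda host: {"host": host, "alive": True}},
                    "are google.com and 8.8.8.8 reachable?")
```

A sub-agent, on this shape, is literally just a second `context` list passed to another `agent(...)` call with a different tool dict — which is the demystification the bullets above record.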
Architectural framing¶
Fly.io's self-description positions it explicitly as a compute + networking platform, with adjacent concerns (storage, databases, object storage) delivered via partner integrations rather than in-house builds. Tigris is the canonical example of this model on the object-storage axis: "we partnered with Tigris, so that they can put their full resources into making object storage as magical as Fly.io is." The blog frames this as a Unix-philosophy posture — "you have individual parts that do one thing very well that are then chained together to create a composite result." The customer-facing trade is that all the parts bill through Fly.io.
The Tigris integration also shows Fly.io willing to be the substrate for the partner (Tigris runs on Fly.io's NVMe volumes and regions), not just a customer. That's a different shape from the typical cloud-partnership pattern of "we wire up your SDK" — Fly.io is renting Tigris the hardware.
Recent articles¶
- 2025-11-06 — You Should Write An Agent — Thomas Ptacek's pedagogical essay canonicalising agent-architecture vocabulary the wiki had in use but un-named. An agent is an HTTP client against one endpoint, a Python list as "context", and a `while` loop — "It's incredibly easy." Four demonstrations build up in ~60 lines of Python: a 15-LoC ChatGPT clone exposing the stateless-LLM + replayed-context illusion; an Alph/Ralph truth/lies personality split showing two context arrays cost the same as one; a three-function upgrade to a tool-using agent that "figures out" multi-step probing of `google.com` / `www.google.com` / `8.8.8.8` without the author writing the loop (patterns/tool-call-loop-minimal-agent); and a design-space survey covering sub-agents ("just a new context array, another call to the model"), summarisation-as-compression, and concepts/context-engineering as a "straightforwardly legible programming problem". Canonical statement of concepts/context-window-as-token-budget — tool descriptions, tool outputs, and stored replies all compete for the same token budget; past a threshold "the whole system begins getting nondeterministically stupider." Also the wiki's first explicit MCP-is-optional framing: "we didn't need MCP at all. […] MCP is just a plugin interface for Claude Code and Cursor, a way of getting your own tools into code you don't control. Write your own agent. Be a programmer. Deal in APIs, not plugins." Positions MCP as an interop layer for tools consumed by agents someone else built, not as a fundamental enabling technology — consistent with every production MCP instance on this wiki (systems/flymcp, systems/datadog-mcp-server, systems/unity-ai-gateway, Agent Lee). Four open problems the post flags as "noodle-able solo in a basement": titrating nondeterminism vs. structured programming, connecting agents to ground truth so they can't lie to themselves about early exit, reliable inter-agent intermediate formats (JSON / SQL / markdown summaries), and token allocation + cost containment. Canonicalises: concepts/agent-loop-stateless-llm + concepts/context-window-as-token-budget + concepts/context-engineering + concepts/sub-agent + patterns/tool-call-loop-minimal-agent + patterns/context-segregated-sub-agents. Extends concepts/agentic-development-loop with the minimal-loop foundation; extends patterns/tool-surface-minimization + patterns/specialized-agent-decomposition with the cross-agent sub-agent lever. "Your wackiest idea will probably (1) work and (2) take 30 minutes to code."
- 2025-10-22 — Corrosion — The canonical Corrosion deep-dive Fly.io had been promising for over a year ("Corrosion deserves its own post"). Three outages frame the post: (1) the 2024-09-01 contagious deadlock that took down Anycast globally in seconds (the bug was in fly-proxy's `if let`-over-RWLock consumer; Corrosion was "just a bystander" perfectly amplifying it); (2) a nullable-column DDL that forced cr-sqlite to backfill every row fleet-wide, melting the cluster; (3) a Consul mTLS cert expiry whose backoff loops on every worker saturated Fly's uplinks because the retry path wrote to Corrosion. Core architectural bet — canonical wiki anchor for concepts/no-distributed-consensus: "truly global distributed consensus promises deliciousness while yielding only immolation. Consensus protocols like Raft break down over long distances." Fly.io took cues from link-state routing protocols (OSPF) — workers are sources of truth for their own Machines and responsible for flooding changes; Fly's fully-connected WireGuard mesh gives OSPF-style connectivity for free. Stack: SWIM membership (concepts/gossip-protocol) + QUIC for broadcast/reconciliation + the systems/cr-sqlite CRDT extension for last-write-wins by logical timestamp. Thousands of workers; seconds to converge. Rejected alternatives named explicitly: Consul ("don't build a global routing system on it"), Zookeeper, etcd, Raft, rqlite ("came very close to using"), FoundationDB, S3-backed stores. Mitigations canonicalised: (i) Tokio watchdogs on every service (patterns/watchdog-bounce-on-deadlock); (ii) production adoption of Antithesis — "killer for distributed systems" — first-person confirmation of the investment JP Phillips's 2025-02-12 exit interview flagged as the external-adoption gate; (iii) checkpoint backups on object storage used "ultimately" to reboot the cluster when diagnosis exceeded restore time; (iv) eliminated partial updates in favour of whole-object republish ("we should have done it this way to begin with"); (v) regionalization project (patterns/two-level-regional-global-state) — per-region clusters + a small global app→region cluster, in progress at time of publication, in response to the 2024-09-01 contagious deadlock. Scope discipline reaffirmed: "not every piece of state we manage needs gossip propagation" — systems/tkdb + systems/petsem run on systems/litefs / systems/litestream, not Corrosion. Open source: github.com/superfly/corrosion.
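The last-write-wins rule at the bottom of that stack can be sketched as a merge keyed by (logical timestamp, node id) — an illustration of the idea, not cr-sqlite's implementation; names below are hypothetical:

```python
def lww_merge(local: dict, incoming: list) -> dict:
    """Merge flooded changes into local state, last write wins.
    `local` maps key -> (value, logical_ts, node_id);
    `incoming` is a list of (key, value, logical_ts, node_id) changes."""
    for key, value, ts, node in incoming:
        current = local.get(key)
        # higher logical timestamp wins; node id breaks ties deterministically,
        # so every worker converges on the same value regardless of arrival order
        if current is None or (ts, node) > (current[1], current[2]):
            local[key] = (value, ts, node)
    return local

state = {"app1": ("host-a", 3, "worker-7")}
lww_merge(state, [
    ("app1", "host-b", 5, "worker-2"),   # newer timestamp: wins
    ("app1", "host-c", 5, "worker-1"),   # same timestamp, lower node id: loses
])
# state["app1"] == ("host-b", 5, "worker-2")
```

Because the merge is commutative given the tie-break, workers can flood changes in any order over the mesh and still converge — which is exactly what lets Corrosion skip consensus.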
- 2025-06-20 — Phoenix.new: The Remote AI Runtime for Phoenix — Chris McCord (Phoenix framework creator) introduces Phoenix.new as his Fly.io skunkworks project: a browser-delivered coding agent for Elixir/Phoenix where every session is an ephemeral Fly Machine with a root shell shared between the developer and the agent, a full (not headless-only) Chrome the agent drives to verify UI changes ("instead of trying to iterate on screenshots, the agent sees real page content and JavaScript state"), automatic `*.phx.run` preview URLs from any bound port via integrated port-forwarding, and the GitHub `gh` CLI pre-installed. Canonical productised instance of patterns/ephemeral-vm-as-cloud-ide (four months after the 2025-02-07 VSCode-SSH-bananas sketch), canonical instance of patterns/agent-driven-headless-browser (colocated variant; sibling of the MoltWorker proxied variant), patterns/ephemeral-preview-url-via-port-forward, and patterns/agentic-pr-triage (McCord: "I close my laptop, grab a cup of coffee, and wait for a PR to arrive" against phoenix-core issues). Canonical wiki statements of concepts/cloud-ide, concepts/ephemeral-dev-environment, concepts/agent-with-root-shell ("it owns the whole environment"), concepts/agent-driven-browser, concepts/ephemeral-preview-url, and concepts/async-agent-workflow — the coding-agent specialisation of the 2025-04-08 RX thesis. Three-signal closed loop (server logs + browser DOM/JS state + test output) sharpens concepts/agentic-development-loop's previous two-signal framing. System prompt tuned for Phoenix / LiveView / Channels / Presence / Ecto today; "all languages you care about are already installed" (Rails / Expo / Svelte / Go work out of the box; new framework tuning is roadmap). Tetris demo at ElixirConf EU cited as existence proof for frontier-LLM world knowledge covering surface-pattern gaps in LiveView-specific training data.
- 2025-10-02 — Litestream v0.5.0 is Here — Ben Johnson's shipping-announcement post for the first batch of the 2025-05-20 Litestream redesign. Confirms what actually landed in v0.5.0: the LTX file format replaces raw-WAL shipping; a three-level compaction hierarchy (30s → 5m → 1h windows) gives "a dozen or so files on average" restore cost regardless of retention; the old "generations" abstraction ("Marvel Cinematic Universe parallel dimensions in which your database might be simultaneously living in. Yeah, we didn't like those movies much either") is fully retired in favor of monotonic transaction IDs (TXIDs) (`litestream wal` → `litestream ltx`); the LTX library now compresses per-page with an end-of-file index so individual pages can be plucked out without downloading the whole file (structural precondition for VFS read replicas); CGO is gone — switched from `mattn/go-sqlite3` to `modernc.org/sqlite` (cross-compile-from-Mac-to-Linux Just Works); NATS JetStream joins S3 / GCS / Azure as a replica type; one replica destination per database codified as a hard constraint; file-format break from v0.3.x (cutover — old WAL files preserved for rollback); config file is fully backwards compatible. Pedagogic opening example (the `sandwiches(id INTEGER PRIMARY KEY AUTOINCREMENT, description TEXT, star_rating INTEGER, reviewer_id INTEGER)` table with reviewers dithering between ⭐ and ⭐⭐) illustrates why raw-WAL-shipping restore cost scales with "raw WAL volume", not "distinct logical state." VFS-based read replicas still not shipped — "we already have a proof of concept working"; the read-replica layer is next. HN: 430 points. Canonical wiki instance of LTX compaction's concrete 30s / 5m / 1h ladder.
- 2025-05-28 — parking_lot: ffffffffffffffff… — Thomas Ptacek's long-form debugging retrospective on a weeks-long hunt for why fly-proxy instances in European regions (overwhelmingly WAW) started locking up after Fly broadened lazy-loading of the fly-proxy Catalog — the in-memory aggregation of Corrosion2 routing state that the proxy consults to forward requests. Architectural framing: fly-proxy is the Anycast router and the hard problem is state distribution ("managing millions of connections for millions of apps", state in constant flux); Catalog is the FIB to Corrosion's RIB (updates propagate host-to-host in millisecond intervals); the long-term fix is regionalization to shrink the global broadcast domain of routing updates. Sets up two pairs of production incidents. (2024-era Round 0) a global Anycast deadlock caused by an `if let` read-lock-scope bug — an update about an app nobody used propagated fleet-wide in ms and deadlocked the routing layer; canonical if-let-lock-scope-bug instance. Short-term response: a watchdog on the fly-proxy internal REPL control channel (concepts/watchdog-repl-channel) that bounces the proxy + snaps core dumps — canonical patterns/watchdog-bounce-on-deadlock instance. (2025 Rounds 1-5) broadening lazy-loading exposes writer contention + a suspicious `if let` → Catalog lock refactor: eliminate `if let`-over-locks, replace RAII with explicit closures (patterns/raii-to-explicit-closure-for-lock-visibility), adopt `try_write_for` timeouts + labeled-log telemetry (patterns/lock-timeout-for-contention-telemetry). Still locks up. Lock-timing instrumentation fires just before lockups in benign, quiet applications. parking_lot's deadlock detector finds nothing. Pavel Zakharov reads core dumps: "no thread running inside the critical section… a thread waiting to acquire write lock and a bunch of threads waiting to acquire a read lock." Every single stack trace. Everyone wants the lock; nobody has it. Descent into madness: `miri` (finds UB in tests, fixes don't help); guard pages (never trip); wild theories (Tokio and parking_lot both ruled out); close-reading parking_lot source. Desperation probe: switch `read()` to `read_recursive()` (patterns/read-recursive-as-desperation-probe), which produces a NEW error: `RwLock reader count overflow`. First direct evidence of lock-word corruption. Root cause: parking_lot's RwLock state is packed into a 64-bit word (4 signal bits + 60-bit reader counter); the `try_write_for` timeout path and the reader-release unpark path both try to clear `WRITER_PARKED`; the atomic self-synchronizing clear (via `fetch_add` of the two's-complement inverse) wraps instead of zeroing → the lock word becomes `0xFFFFFFFFFFFFFFFF`. Canonical concepts/bitwise-double-free instance. Thread 1 grabs a read lock; Thread 2 parks with timeout; Thread 1 releases, unparking Thread 2 and clearing `WRITER_PARKED`; Thread 2 wakes thinking its timeout fired and tries to clear `WRITER_PARKED` again — bitwise double-free. Fix: parking_lot PR #466 + issue #465 — Fly.io's sixth patterns/upstream-the-fix instance on the wiki and the second in the Rust ecosystem (first was the 2025-02-26 rustls PR #1950). The WAW-specific timing mystery remains unresolved — "the wax and wane of caribou populations… we'll never know because we fixed the bug." Permanent debugging dividends from the arc: Catalog-wide `if let`-over-locks audit, RAII → closure refactor, labeled slow-write logs, last-and-current writer context tracking. Canonical wiki anchor for: 2024 fly-proxy global Anycast outage + 2025 parking_lot bug + watchdog-bounce safety net + RAII-to-closure refactor + lock-timeout-for-contention telemetry + descent-into-madness debugging + bitwise double-free + parking_lot upstream fix. Tier-3 source; clears the bar on production-incident + distributed-systems-internals + ecosystem-primitive + upstream-the-fix content.
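The wrap itself is reproducible with plain modular arithmetic. A sketch of the bitwise double-free (bit positions and names are illustrative, not parking_lot's exact constants):

```python
MASK64 = (1 << 64) - 1
WRITER_PARKED = 1 << 0   # one of the signal bits in the packed lock word

def fetch_add_clear(word: int, bit: int) -> int:
    """'Clear' a flag the way the buggy path did: add the two's-complement
    inverse of the bit, which only works if exactly one party clears it."""
    return (word + (MASK64 + 1 - bit)) & MASK64

word = WRITER_PARKED                           # a writer is parked; no readers
word = fetch_add_clear(word, WRITER_PARKED)    # first clear: correct, word == 0
word = fetch_add_clear(word, WRITER_PARKED)    # second clear: double-free — wraps
# word == 0xFFFFFFFFFFFFFFFF: a lock nobody holds that nobody can ever take
```

Two independent paths each believing they own the clear is exactly the timeout-vs-unpark race the post describes; the subtraction that should zero a set bit instead underflows the reader counter through all 64 bits, which is where the post's title comes from.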
- 2025-05-19 — Launching MCP Servers on Fly.io — Sam Ruby's "part showing off, part opinion" developer-blog post announcing `fly mcp launch`, a new flyctl subcommand (shipped in flyctl v0.3.125) that takes any existing local/stdio-style MCP server command and one-shots it into a remote HTTP/Streamable-HTTP MCP server running as a Fly Machine, with bearer-token auth on by default on both ends, client-config JSON rewritten in place for six built-in clients (Claude, Cursor, Neovim, VS Code, Windsurf, Zed), `--secret KEY=value` flags piped through to Machine secrets, and all Fly platform knobs (auto-stop, Flycast, Volumes, region, VM size) available. Canonical invocation:
```shell
fly mcp launch "npx -y @modelcontextprotocol/server-slack" \
  --claude --server slack \
  --secret SLACK_BOT_TOKEN=xoxb-... \
  --secret SLACK_TEAM_ID=T01234567
```
The post leads with the three-shape MCP-server taxonomy ("basically two types of MCP servers. One small and nimble that runs as a process on your machine. And one that is a HTTP server that runs presumably elsewhere and is standardizing on OAuth 2.1. And there is a third type, but it is deprecated.") and the client-config fragmentation complaint that names Claude Desktop's `~/Library/Application Support/Claude/claude_desktop_config.json` (`MCPServer` key) vs Zed's `~/.config/zed/settings.json` (`context_servers` key) vs OS-dependent per-tool variants — canonical wiki statement of concepts/mcp-client-config-fragmentation. Canonical wiki instance of systems/fly-mcp-launch + patterns/remote-mcp-server-via-platform-launcher. Pairs with the 2025-04-10 flymcp post to span both axes of MCP-server ergonomics: wrap a local CLI as a local stdio MCP server (patterns/wrap-cli-as-mcp-server) and deploy a local MCP server as a remote MCP server (patterns/remote-mcp-server-via-platform-launcher). The post also enumerates supported deployment shapes (one Machine per server / one container per server / in-process library mode) and access paths (HTTP Authorization / WireGuard / Flycast / reverse proxies). Beta status acknowledged — "examples as shown are thought to work. Maybe."
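The fragmentation complaint reduces to emitting the same server entry under different top-level keys per client. A hedged sketch (key names and schemas here are illustrative, not normative for any client):

```python
def client_config(client: str, name: str, command: str, args: list[str]) -> dict:
    """Emit one stdio MCP server in two clients' config dialects — the
    per-client key divergence is the fragmentation the post complains
    about, and what `fly mcp launch` rewrites in place."""
    entry = {"command": command, "args": args}
    if client == "claude-desktop":
        return {"mcpServers": {name: entry}}        # claude_desktop_config.json
    if client == "zed":
        return {"context_servers": {name: entry}}   # ~/.config/zed/settings.json
    raise ValueError(f"no template for {client}")

cfg = client_config("zed", "slack", "npx",
                    ["-y", "@modelcontextprotocol/server-slack"])
```

Six supported clients means six such templates; the content of the entry barely changes, only the envelope.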
- 2025-05-07 — Provisioning Machines using MCPs — Sam Ruby's short developer-blog post marking the mutation transition of Fly.io's flyctl MCP server: the read-only 2-tool prototype from a month earlier (2025-04-10) now covers the full `fly volumes` subcommand family (create / list / extend / show / fork / snapshots / destroy) — shipped in flyctl v0.3.117. First wiki instance of patterns/wrap-cli-as-mcp-server crossing the read-only → mutating production-resource boundary. "I created my first fly volume using an MCP … and it worked the first time. A few hours later, and with the assistance of GitHub Copilot, i added support for all fly volumes commands." Load-bearing architectural thesis paragraph for concepts/natural-language-infrastructure-provisioning: "Today's state of the art is K8S, Terraform, web based UIs, and CLIs. Those days are numbered." Introduces the "Make it so" target UX — LLM scans code, presents a plan, human adjusts and approves, agent executes; on failure it examines logs and proposes next steps. Canonical wiki statement of patterns/cli-safety-as-agent-guardrail — the CLI's pre-existing human-operator refusal invariant ("can't destroy a volume that is currently mounted") becomes the MCP server's authorization boundary for free: "Since this support is built on flyctl, I would have received an error had I tried to destroy a volume that is currently mounted. Knowing that gave me the confidence to try the command." Emergent resource-hygiene UX: Claude spontaneously noted unattached volumes and offered to delete the oldest on request — the agentic troubleshooting loop extended from diagnosis into provisioning hygiene. Three-way alternative-rejection framing (HTTP Machines API, `flyctl` directly, web dashboard) as the concepts/agent-ergonomic-cli design-pressure confirmation. Gestures at future MCP servers running on Fly's private network — "on separate machines, or in 'sidecar' containers, or even integrated into your app" — pairing with the 2025-04-08 robot-routing / long-lived-SSE framing that gave flymcp its deployment substrate. Local-MCP security posture (concepts/local-mcp-server-risk) continues unchanged; the read-only → mutation transition raises the blast radius accordingly. Caveat: "Just be aware this is not a demo, if you ask it to destroy a volume, that operation is not reversable. Perhaps try this first on a throwaway application." Configuration template: a `claude_desktop_config.json` snippet wires `flyctl mcp server` as the stdio command; MCP Inspector (local port 6274) is named as the agent-free validation surface for server authors iterating on tool schemas.
2025-05-20 — Litestream: Revamped — Ben Johnson's retrospective on the largest Litestream redesign since its 2020 launch. Three ideas imported from LiteFS: (1) the LTX file format — sorted, transaction-aware SQLite page-range changesets — replaces raw-WAL shipping; adjacent LTX files k-way-merge via LSM-style compaction, so restore to any PITR target costs the compacted state size rather than cumulative WAL volume. (2) CASAAS — Compare-and-Swap as a Service — a time-based replication lease implemented via object-store conditional writes (S3's 2024-11 launch; Tigris also supports the primitive). Retires LiteFS's Consul dependency for single-leader enforcement; rolling deploys with overlapping Litestream processes against the same destination are now safe; the "generations" abstraction collapses to a single-generation invariant. (3) VFS-based read replicas — a SQLite Virtual Filesystem extension loaded into the application that fetches and caches pages directly from object storage; no FUSE required (LiteFS's usability wall). Works in WASM and restricted FaaS environments. Explicit trade named: "this approach isn't as efficient as a local SQLite database" — caching and prefetching are the performance knobs. Two secondary consequences: wildcard/directory replication (`/data/*.db`) of hundreds or thousands of SQLite databases from one Litestream process is viable for the first time — previously blocked on WAL-polling cost and slow restores — and an agent-storage framing: "the robots that write LLM code are going to like SQLite too. … coding agents like Phoenix.new want [a] way to try out code on live data, screw it up, and then rollback both the code and the state. These Litestream updates put us in a position to give agents PITR as a primitive. On top of that, you can build both rollbacks and forks." Ties to the 2025-04-08 RX framing and concepts/stateful-incremental-vm-build. Forward-looking — "we're building", "should be possible" — with no production numbers; 452 HN points (item 44045292). Extends the canonical [SQLite + LiteFS + Litestream](../patterns/sqlite-plus-litefs-plus-litestream.md) stack (canonical at tkdb): post-revamp, the three layers architecturally converge on LTX as the shared wire format. Wiki-first disclosures: LTX file format, CASAAS / object-store conditional-write lease, SQLite VFS as replication integration point, shadow WAL as legacy mechanism.
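The CASAAS lease idea above — single-leader enforcement via nothing but conditional writes on the destination object store — can be sketched as follows. This is a minimal illustration, not Litestream's implementation: `ObjectStore` is an in-memory stand-in for a store that supports compare-and-swap on an ETag, and `try_acquire` is a hypothetical name.

```python
import time

class ObjectStore:
    """In-memory stand-in for an object store with conditional writes
    (compare-and-swap on an ETag, the primitive S3 shipped in 2024-11)."""
    def __init__(self):
        self._data = {}   # key -> (etag, value)
        self._etag = 0

    def put_if(self, key, value, expected_etag):
        """Write only if the current ETag matches (None = key must not
        exist). Returns the new ETag on success, None on a lost race."""
        current = self._data.get(key)
        if (current[0] if current else None) != expected_etag:
            return None
        self._etag += 1
        self._data[key] = (self._etag, value)
        return self._etag

    def get(self, key):
        return self._data.get(key)  # (etag, value) or None

def try_acquire(store, owner, ttl=10.0, now=None):
    """Time-based single-writer lease: the CAS on the lease object is
    what makes overlapping replication processes (e.g. during a rolling
    deploy) safe against the same destination."""
    now = time.monotonic() if now is None else now
    current = store.get("lease")
    expected = None
    if current is not None:
        etag, lease = current
        if lease["expires"] > now and lease["owner"] != owner:
            return False            # someone else holds a live lease
        expected = etag             # take over an expired lease via CAS
    new_lease = {"owner": owner, "expires": now + ttl}
    return store.put_if("lease", new_lease, expected) is not None
```

Because every transition goes through the conditional write, two processes racing for an expired lease cannot both win — exactly the property that lets the revamp retire Consul.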
- 2025-04-10 — 30 Minutes With MCP and flyctl — Thomas Ptacek's internal-message-turned-blog post on building flymcp — the "most basic" MCP server for `flyctl` — in 30 minutes, in ~90 lines of Go on mark3labs/mcp-go. Two tools exposed: `fly logs` and `fly status`. Canonical wiki instance of patterns/wrap-cli-as-mcp-server. Load-bearing precondition: Fly.io's 2020 decision to give most `flyctl` commands a `--json` mode "to make them easier to drive from automation" (concepts/structured-output-reliability, concepts/agent-ergonomic-cli) — a five-year-old automation-friendliness decision that retroactively became an AI-integration-readiness decision. Pointed at unpkg (the Fly-hosted npm CDN), Claude reconstructed the 10-Machine regional topology, flagged 2 machines in critical status, correlated `oom_killed: true` events, pulled logs on follow-up, and produced a per-second incident timeline (OOM kill → SIGKILL → reboot → health-check fail → listener up → health-check pass, ~43s end-to-end; Bun process at ~3.7 GB of 4 GB allocated) — a canonical concepts/agentic-troubleshooting-loop instantiation with a deliberately minimal tool surface (patterns/tool-surface-minimization + patterns/allowlisted-read-only-agent-actions). Ptacek: "annoyingly useful … faster than I find problems in apps." The closing caveat is the canonical wiki statement of concepts/local-mcp-server-risk — "Local MCP servers are scary. I don't like that I'm giving a Claude instance in the cloud the ability to run a native program on my machine. I think `fly logs` and `fly status` are safe, but I'd rather know it's safe. It would be, if I was running `flyctl` in an isolated environment and not on my local machine." — gesturing at patterns/disposable-vm-for-agentic-loop (the 2025-02-07 VSCode-SSH-bananas companion) as the answer. Third RX-era post in a ~three-day window: complements 2025-04-08 Our Best Customers Are Now Robots (RX framing + MCP-SSE routing requirement) and ties back to the 2025-02-07 VSCode-SSH-bananas disposable-VM sandbox sketch.
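The load-bearing mechanic of wrap-cli-as-mcp-server is small enough to sketch: a tool handler is nothing more than "shell out to the CLI with `--json` and return parsed stdout." This is an illustrative Python sketch, not the actual Go flymcp code; `make_cli_tool` is a made-up helper and the commented wiring at the bottom only mirrors the post's two-tool surface.

```python
import json
import subprocess

def make_cli_tool(binary, *fixed_args):
    """Wrap a CLI subcommand as a tool callable. The entire 'integration'
    is shelling out with --json and parsing stdout -- which is why a
    machine-readable output mode makes a CLI trivially agent-callable."""
    def tool(**params):
        args = [binary, *fixed_args, "--json"]
        for key, value in params.items():
            args += [f"--{key}", str(value)]
        out = subprocess.run(args, capture_output=True, text=True, check=True)
        return json.loads(out.stdout)
    return tool

# Illustrative wiring (not flymcp's actual handlers):
# fly_status = make_cli_tool("flyctl", "status")
# fly_logs   = make_cli_tool("flyctl", "logs", "--no-tail")
```

An MCP server built this way is mostly schema plumbing around these callables, which is how the whole thing fits in ~90 lines.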
- 2025-03-27 — Operationalizing Macaroons — Thomas Ptacek's deepest architectural disclosure of Fly.io's security-token stack to date, written as Fly.io hands off internal ownership of the Macaroon project. Two years in as "the Internet's largest user of Macaroons," the user-facing pitch (end-user attenuation, emailing scoped tokens to partners) has been a mixed bag — "users don't really take advantage of token features" — but the infrastructure wins have made the token system "one of the nicer parts of our platform." Canonical wiki introduction of `tkdb` (~5,000 lines of Go managing a SQLite database via LiteFS + Litestream on isolated hardware in US/EU/AU; records encrypted with an injected secret; "one of the very few well-behaved" infra-SQLite databases at Fly.io) — canonical instance of patterns/isolated-token-service + patterns/sqlite-plus-litefs-plus-litestream. Transport is HTTP/Noise, with `Noise_IK` on the verification path (TLS-analog, broad verifier set) and `Noise_KK` on the signing path (mTLS-analog, "only a handful" of clients with the keypair). Verification cache hit rate is >98% thanks to the chained-HMAC construction's descendant-inheritance property. Revocation is the canonical feed-subscription pattern: `tkdb` exports a revocation-notifications endpoint; clients poll and prune caches, and dump the whole cache on connectivity loss; blacklist-to-every-region was explicitly rejected ("we certainly don't want to propagate the blacklist database to 35 regions around the globe"). Cosmetic logout is named as the anti-pattern this design prevents. Authorization-vs-authentication split via third-party caveats; service tokens use `tkdb`'s caveat-strip API to remove the authN caveat, and recipients further attenuate locally to bind the token to a specific flyd instance or Fly Machine — "exfiltrating it doesn't do you any good; to use it, you have to control the environment it's intended to be used in." The same caveat-for-privilege-separation pattern runs in reverse at Pet Semetary (Fly's Vault replacement): flyd's read-secret Macaroon has a third-party caveat dischargeable only by proving org permissions through `tkdb`. Explicit secure-design heuristic: concepts/keep-hazmat-away-from-complex-code — "root secrets for Macaroon tokens are hazmat." Telemetry: OpenTelemetry + Honeycomb + a permanent-retention OpenSearch audit trail (concepts/context-propagation-otel + concepts/audit-trail-in-opensearch) — Ptacek's explicit retraction of his prior OTel skepticism. Operational datum: "the `tkdb` code is remarkably stable and there hasn't been an incident intervention with our token system in over a year." Culture disclosure: Fly.io self-describes as "allergic to 'microservices'" — but "a second dedicated security service (Petsem)" alongside `tkdb` has "pulled its weight"; narrow-purpose security services are the carve-out exception. Closing nod to infrastructure-SQLite as a "total victory for LiteFS, Litestream" — an implicit contrast to corrosion's tens-of-gigabytes operational footprint.
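The chained-HMAC construction behind the >98% verification cache hit rate is compact enough to sketch. This is the textbook Macaroon signature chain, not tkdb's wire format; all names here are illustrative.

```python
import hashlib
import hmac

def _chain(key, msg):
    return hmac.new(key, msg.encode(), hashlib.sha256).digest()

def mint(root_key, identifier):
    """A token is an identifier, a caveat list, and a running HMAC."""
    return {"id": identifier, "caveats": [], "sig": _chain(root_key, identifier)}

def attenuate(token, caveat):
    """Anyone holding a token can narrow it offline: the new signature
    is HMAC(old_sig, caveat), so caveats can be added without the root
    key but never removed."""
    return {"id": token["id"],
            "caveats": token["caveats"] + [caveat],
            "sig": _chain(token["sig"], caveat)}

def verify(root_key, token, context):
    """Re-derive the chain from the root key; every caveat must also
    hold in the request context."""
    sig = _chain(root_key, token["id"])
    for caveat in token["caveats"]:
        if caveat not in context:
            return False
        sig = _chain(sig, caveat)
    return hmac.compare_digest(sig, token["sig"])
```

The descendant-inheritance property falls out of the chain: a verifier that has already validated a parent token can validate any locally-attenuated descendant by chaining forward from the parent's cached signature, without another round-trip to the token service.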
- 2025-04-08 — Our Best Customers Are Now Robots — Thomas Ptacek retrospective naming LLM-driven coding agents as the dominant growth driver on Fly.io over the prior ~6 months and introducing Robot Experience (RX) as a product-design axis alongside UX and DX. Four platform-primitive disclosures that make Fly.io "robot bait" without intending to: (1) compute lifecycle — the `start`-vs-`create` split (`start` is "lightning fast … substantially faster than booting up even a non-virtualized K8s Pod"; "too subtle a distinction for humans" but "the robots are getting a lot of value out of it"); pairs with the first wiki disclosure that non-GPU Machines run on Lambda's hypervisor ("Not coincidentally, our underlying hypervisor engine is the same as Lambda's" — Firecracker) and the Lambda-EC2 hybrid positioning ("start like it's spring-loaded, in double-digit millis […] can stick around as long as you want it to"). (2) Storage — LLMs build Machines incrementally; they want a filesystem plus object storage, not Postgres (concepts/stateful-incremental-vm-build; "what they really need is a filesystem, the one form of storage we sort of wish we hadn't done. That, and object storage."). Fly Volumes + Tigris. (3) Networking — MCP's long-lived SSE connections in multitenant deployments need session-affinity routing (concepts/mcp-long-lived-sse); Fly's dynamic request routing is "possibly a robot attractant" for exactly this shape. Canonical patterns/session-affinity-for-mcp-sse instance. (4) Secrets — tokenized OAuth tokens (concepts/tokenized-secret) let the LLM hold a placeholder while a "hardware-isolated, robot-free Fly Machine" substitutes the real credential at egress (patterns/tokenized-token-broker); grounded in Fly.io's 2024 tokenized-tokens substrate. Forward-looking note: "it should be easy to MCP our API" (not shipped at publication). DX is still primary ("the most important engineering work happening today at Fly.io is still DX, not RX; it's managed Postgres (MPG)") but RX is now a first-order concern. Canonical wiki datum for concepts/robot-experience-rx + concepts/vibe-coding + patterns/start-fast-create-slow-machine-lifecycle. Platform-demand-side companion to the 2025-02-14 GPU retrospective (concepts/developers-want-llms-not-gpus). Eighth Fly.io ingest on the wiki.
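The tokenized-token-broker shape from disclosure (4) above reduces to a small substitution step at egress. A minimal sketch under stated assumptions: `EgressBroker` and its methods are invented names, and a real broker would run in the isolated Machine, not in-process with the agent.

```python
class EgressBroker:
    """Holds real credentials in an environment the agent can't reach;
    the agent only ever sees opaque placeholders."""
    def __init__(self):
        self._vault = {}

    def tokenize(self, real_secret):
        placeholder = f"tok_{len(self._vault)}"   # opaque handle for the LLM
        self._vault[placeholder] = real_secret
        return placeholder

    def forward(self, headers):
        """At egress, swap a placeholder bearer token for the real
        secret. Exfiltrating the placeholder is useless outside this
        path -- the substitution only happens here."""
        out = dict(headers)
        auth = out.get("Authorization", "")
        if auth.startswith("Bearer tok_"):
            placeholder = auth.removeprefix("Bearer ")
            out["Authorization"] = f"Bearer {self._vault[placeholder]}"
        return out
```

The design point is that the LLM's context window never contains anything worth stealing; the binding between placeholder and secret lives only on the broker's side of the egress boundary.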
- 2024-05-09 — Picture This: Open Source AI for Image Description — Fly.io Machines-team developer-enablement walkthrough of a weekend-scale open-source image-description service (Ollama + LLaVA + PocketBase + LangChainGo) hosted on Fly Machines. The product framing is accessibility (AI-generated alt text for blind users, screen-reader integration via NVDA); the architectural substance is the GPU scale-to-zero recipe with a disclosed cold-start number. An Ollama Fly Machine on the `a100-40gb` preset runs LLaVA-34b behind Flycast; Fly Proxy autostart/autostop stops the Machine after ~minutes of idle silence and starts it on the next internal request from the PocketBase app-tier Machine. Canonical instance of patterns/proxy-autostop-for-gpu-cost-control + patterns/flycast-scoped-internal-inference-endpoint. Disclosed cold-start latency: "starting it up took another handful of seconds, followed by several tens of seconds to load the model into GPU RAM. The total time from cold start to completed description was about 45 seconds." — the canonical wiki datum for concepts/gpu-scale-to-zero-cold-start (three-stage budget: Machine start in seconds, model load into GPU RAM in tens of seconds, first response in seconds). The post also names two model-payload options on a stopped GPU Machine: a Fly Volume for model weights, or baking the model into the Docker image. Side notes: context-window blow-out on the simple follow-up chain ("you'll see the quality of responses get poorer — possibly incoherent — as the context exceeds the context window"); modularity claim (swap model and prompt for sentiment, joke-generation, or other tasks). Scope disposition: Tier-3 borderline — hobby-project framing, but on-scope as a GPU-inference scale-to-zero production recipe with a real cold-start number. Sibling to the 2024-09-24 Livebook/FLAME cluster scale-to-zero post (different shape: notebook-driven cluster of 64 L40S Machines vs. the single-Machine proxy-autostop here).
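The proxy autostart/autostop mechanic in the entry above is just an idle timer plus a restart-on-request rule. A minimal sketch, with invented names (`AutostopMachine` is not Fly Proxy's implementation); the caller of `handle_request` is the one who pays the disclosed ~45-second three-stage cold-start budget whenever it returns `True`.

```python
class AutostopMachine:
    """Scale-to-zero shape: the proxy stops a GPU Machine after an idle
    window and restarts it on the next internal request, trading a
    multi-second cold start for zero idle GPU-hours."""
    def __init__(self, idle_timeout=300.0):
        self.idle_timeout = idle_timeout
        self.running = False
        self.last_request = None

    def handle_request(self, now):
        """Returns True when this request hits a cold start."""
        cold = not self.running
        self.running = True       # autostart on first request after a stop
        self.last_request = now
        return cold

    def tick(self, now):
        """Periodic reaper: stop the Machine once idle past the timeout."""
        if self.running and now - self.last_request > self.idle_timeout:
            self.running = False
```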
- 2025-02-26 — Taming a Voracious Rust Proxy — Production-incident retrospective on `fly-proxy`. Two IAD edge hosts pegged CPU and spiked HTTP errors over "some number of hours." Bouncing `fly-proxy` cleared it; it came back. Pavel pulled a flamegraph — dominated by Rust's `tracing::Subscriber` (which is supposed to be very fast), the signature of a spurious-wakeup busy-loop. The fully-qualified `Future` type in the flamegraph pointed past Fly's own wrappers (`Duplex`, `ReusableReader`, `PeekableReader`, `MeteredIo`, `PermittedTcpStream`) to `tokio_rustls::server::TlsStream` — pre-existing upstream issue tokio-rustls#72 documented exactly this pattern: on an orderly TLS `close_notify` shutdown with still-buffered bytes on the underlying socket, `TlsStream` mishandles its Waker → 100% CPU. Trigger: Tigris Data load-testing — "tens of thousands of connections" with small HTTP bodies terminating early enough to set up the `close_notify`-before-EOF scenario. Fix: upstreamed as rustls PR #1950 — canonical Rust-ecosystem instance of patterns/upstream-the-fix, plus the full diagnostic workflow patterns/flamegraph-to-upstream-fix. Self-drawn lessons: (1) patterns/dependency-update-discipline — "Keep your dependencies updated. Unless you shouldn't…" — the value is in the process and test infrastructure, not the updates themselves; (2) patterns/spurious-wakeup-metric — "Spurious wakeups should be easy to spot, and triggering a metric when they happen should be cheap, because they're not supposed to happen often." Also the canonical one-line statement of `fly-proxy`'s edge-router role: "Edges exist almost solely to run a Rust program called `fly-proxy`, the router at the heart of our Anycast network." First Fly.io production-incident retrospective ingested; complements the prior architectural (Making Machines Move, JIT WireGuard) and identity (AWS without access keys) posts.
- 2025-02-14 — We Were Wrong About GPUs — Retrospective / course-correction post by Thomas Ptacek on Fly.io's 2022-era bet on productising GPU Fly Machines. "We're not getting rid of them" + "you'll probably be waiting awhile [for a v2]" — the canonical patterns/platform-retrenchment-without-customer-abandonment instance. Three load-bearing disclosures: (1) Hypervisor split — non-GPU Machines on Firecracker, GPU Machines on Intel Cloud Hypervisor (PCI passthrough); Fly rejected QEMU (ms-boot DX required) and VMware (institutional fit). (2) Nvidia driver happy-path disclosure — the supported path is K8s-shared-kernel or QEMU/VMware; Fly burned months (and ultimately failed) getting virtualized-GPU drivers working on Cloud Hypervisor, including hex-editing closed-source drivers to impersonate QEMU (concepts/nvidia-driver-happy-path). (3) Demand-side diagnosis — developers don't want GPUs, they want LLMs; insurgent clouds can't compete with OpenAI / Anthropic on tokens-per-second (concepts/insurgent-cloud-constraints). The security posture is load-bearing: GPUs are just about the worst-case peripheral, mitigated by dedicated GPU-only worker hosts (patterns/dedicated-host-pool-for-hostile-peripheral) plus two independent external audits from Atredis and Tetrel (patterns/independent-security-assessment-for-hardware-peripheral). MIG thin-slicing remained unreachable because MIG "gives you a UUID, not a PCI device." The L40S customer segment persists as the one SKU that found fit. Asset-backed-bet framing (concepts/asset-backed-bet): GPU hardware is liquidatable like Fly's IPv4 portfolio, so the downside is partially recoverable. Parallel drawn to Fly.io's earlier JS-edge-runtime course-correction: "we were wrong about Javascript edge functions, and I think we were wrong about GPUs." Paired with the JP Phillips exit interview two days earlier as the honest-retrospective half of Fly.io's 2025-Q1 blog posture.
- 2025-02-12 — The Exit Interview: JP Phillips — Q-and-A with the engineer who led flyd — Fly.io's Fly Machines orchestrator — over four years. Architectural retrospective disclosures: (1) flyd's FSM-plus-durable-steps design is ancestry-linked to HashiCorp Cadence + Compose.io/MongoHQ "recipes/operations" (concepts/durable-execution); deploy-tolerance ("pick back up where it left off, post-deploy") is the load-bearing property; (2) JP defends BoltDB over SQLite for flyd's state store — the blast-radius-of-an-ad-hoc-SQL-statement argument, plus "limiting the storage interface kept flyd's scope managed" (concepts/bolt-vs-sqlite-storage-choice); (3) the alternate design JP would consider: one SQLite per Fly Machine, with schema management as the named open problem (patterns/per-instance-embedded-database); (4) pilot — Fly's new OCI-compliant init — consolidates the feature-bag init and gives flyd a formal driving API; (5) flaps — the Machines-API gateway — is named as decentralised and hits a sub-5-second P90 on `machine create` globally (Johannesburg / Hong Kong excepted); (6) corrosion2 — Fly's SWIM-gossip CRDT-SQLite state-distribution system — is JP's "most impressive thing someone else built," with TLA+/Antithesis validation named as the investment gate for external adoption (patterns/formal-methods-before-shipping); (7) OpenTelemetry + Honeycomb are load-bearing ("without oTel it'd be a disaster … I'd have ragequit trying"). Also cultural content (2023 over-hiring, GPU distraction — sibling to sources/2024-08-15-flyio-were-cutting-l40s-prices-in-half) and a stated platform-completeness framing: "The Fly Machines platform is more or less finished … My original desire to join Fly.io was to make Machines a product that would rid us of HashiCorp Nomad, and I feel like that's been accomplished."
- 2025-02-07 — VSCode's SSH Agent Is Bananas — Architectural critique of VSCode Remote-SSH from the vantage of integrating Fly Machines into VSCode's remote-editing flow. Contrasts Emacs Tramp (lives off the land, no agent deployed) with VSCode Remote (bash stager → downloads a Node.js binary + agent → WebSocket RPC over an SSH port-forward → persists across reconnects; "murid in nature" — the RAT shape, as Fly calls it). Names the agentic development loop as the 2025-motivating use-case: "close the loop between the LLM and the execution environment […] a semi-effective antidote to hallucination." Argues for disposable-vm-for-agentic-loop as the answer — "a clean-slate Linux instance that spins up instantly and that can't screw you over in any way" — with Fly Machines as the implied substrate. A productisation post is deferred.
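flyd's FSM-plus-durable-steps design from the exit interview above — journal each completed step so a post-deploy restart "picks back up where it left off" — can be sketched in a few lines. A minimal illustration under stated assumptions: `DurableSteps` and the demo step names are invented, not flyd's actual operations, and a real journal would be persisted (flyd uses BoltDB), not a dict.

```python
class DurableSteps:
    """FSM-plus-durable-steps shape: each completed step is journaled,
    so a restarted process replays the journal and resumes instead of
    re-running side effects."""
    def __init__(self, journal=None):
        self.journal = journal if journal is not None else {}

    def run(self, steps):
        for name, action in steps:
            if name in self.journal:       # already done before the restart
                continue
            self.journal[name] = action()  # record completion, then advance

# Illustrative machine-create flow (step names are made up, not flyd's):
def demo(journal, log):
    steps = [
        ("reserve_capacity", lambda: log.append("reserve")),
        ("write_config",     lambda: log.append("config")),
        ("boot",             lambda: log.append("boot")),
    ]
    DurableSteps(journal).run(steps)
```

Running `demo` a second time against the same journal performs no new side effects — which is exactly the deploy-tolerance property the interview names as load-bearing.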
- 2024-09-24 — AI GPU Clusters, From Your Laptop, With Livebook — ElixirConf 2024 keynote recap. Livebook + FLAME + the Nx stack let a notebook running on a laptop drive elastic GPU-cluster compute on Fly Machines. Canonical demos: Llama-on-L40S summarising video-stills pipeline, and 64 L40S Fly Machines hyperparameter-tuning different BERT variants with per-node loss curves streamed back to the notebook in real time. The whole cluster terminates on notebook disconnect (scale-to-zero). Platform-level claim: "start a cluster of GPUs in seconds rather than minutes, and all it requires is a Docker image" (concepts/seconds-scale-gpu-cluster-boot). Same runtime+FLAME integration now also runs on Kubernetes (Livebook v0.14.1, Michael Ruoss). Canonical instance of patterns/notebook-driven-elastic-compute and patterns/framework-managed-executor-pool.
- 2024-08-15 — We're Cutting L40S Prices In Half — GPU strategy retrospective + price cut. Customer data surprised Fly.io: the least capable GPU (A10) is the most popular by a wide margin. Fractional-A100 via MIG / vGPU + IOMMU PCI passthrough failed ("a project so cursed"); pivot to whole-GPU attachment. L40S cut to $1.25/hr (= A10 price) to collapse the inference-GPU choice. Architectural thesis: inference = transaction, training = batch; for transaction-shaped inference, the combination of GPU + RAM + Tigris + Anycast beats a bigger GPU. Canonical instance of GPU + object-storage co-location.
- 2024-07-30 — Making Machines Move — Year-long rebuild of fleet-drain for stateful Fly Machines with attached Fly Volumes. Introduces a `clone` primitive (kill → clone → boot) built on the Linux kernel's `dm-clone` device-mapper target; clone returns immediately, a new Machine boots on the target worker, reads of un-hydrated blocks fall through to the source worker over iSCSI (NBD was tried first and abandoned — kernel threads got stuck under network disruption), and `kcopyd` rehydrates in the background. Gnarly complications: cryptsetup version skew → LUKS2 header-size drift (4 MiB vs 16 MiB) → an RPC in the migration FSM to carry metadata; 6PN address-embeds-routing → migration changes addresses → Fly Postgres configs hardcoded literal addresses → an in-init address-mapping bridge plus a fleet-wide config rewrite; Corrosion (SWIM-gossip SQLite)'s worker-is-source-of-truth invariant breaks. Ends with a nod to LSVD (NVMe cache + object store) as the medium-term direction. "This is the biggest thing our team has done since we replaced Nomad with flyd."
- 2024-06-19 — AWS without Access Keys — Fly.io's OIDC IdP (`oidc.fly.io/<org>`) + `AssumeRoleWithWebIdentity` → Fly Machines get AWS S3 access with no keypair ever stored; Fly init detects `AWS_ROLE_ARN`, fetches an OIDC token via `/.fly/api`, writes it to `/.fly/oidc_token`, and exports `AWS_WEB_IDENTITY_TOKEN_FILE` for the AWS SDK. Macaroon-scoped per-Machine identity; SSRF-resistant Unix-socket API proxy; `<org>:<app>:<machine>` `sub`-field scoping for trust policies. Canonical wiki source for OIDC federation for cloud access and patterns/oidc-role-assumption-for-cross-cloud-auth.
- 2024-03-12 — JIT WireGuard — Gateway-fleet WireGuard peer provisioning flipped from NATS-push to pull-on-first-packet. BPF-sniffs handshake initiations, runs ~200 lines of Noise crypto to identify the peer, pulls config from the Fly API, installs via Netlink. Kernel stale-peer count: ~550k → "rounds to none." Also documents Fly.io's broader migration off NATS for internal RPCs (flyd is now HTTP).
- 2024-03-07 — Fly Kubernetes does more now — Fly Kubernetes launched in public beta; Pods as Firecracker micro-VMs; the flyd orchestrator; integration surfaces across Fly Machines.
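The dm-clone read path from Making Machines Move — the destination serves reads immediately, un-hydrated blocks fall through to the source, and a background copier fills in the rest — can be sketched at the block level. This is an illustrative model, not the kernel target: `CloneTarget` is an invented name, the dicts stand in for block devices, and `hydrate_step` plays kcopyd's role.

```python
class CloneTarget:
    """dm-clone-shaped migration: the new worker serves reads at once;
    blocks not yet hydrated fall through to the source (iSCSI in the
    post), while a background copier rehydrates the remainder."""
    def __init__(self, source_blocks):
        self.source = source_blocks        # stand-in for the old worker
        self.local = {}                     # hydrated blocks on the new worker
        self.pending = set(source_blocks)   # what the copier still has to do

    def read(self, block):
        if block not in self.local:         # miss: fall through to source
            self.local[block] = self.source[block]
            self.pending.discard(block)
        return self.local[block]

    def hydrate_step(self):
        """One unit of background copy; False once fully hydrated."""
        if not self.pending:
            return False
        block = self.pending.pop()
        self.local[block] = self.source[block]
        return True
```

This is why `clone` can return immediately: correctness never depends on hydration having finished, only on the fall-through path staying reachable — which is exactly what makes the network-disruption behaviour of the transport (NBD vs. iSCSI) load-bearing.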
- 2024-02-15 — Globally Distributed Object Storage with Tigris — Tigris public beta; architectural framing (FoundationDB + NVMe + QuiCK-style queue + S3 backend); `fly storage create` CLI; unified-billing framing.
Notes on tier¶
Fly.io is a Tier-3 source on the sysdesign-wiki. Per AGENTS.md, Tier-3 ingests require the post to clearly cover distributed-systems internals, scaling trade-offs, infrastructure architecture, production incidents, or similar — product-PR and feature-announcement posts are skipped. The Tigris public-beta post is borderline (it's a launch announcement) but is ingested because the three-layer architectural statement (FDB + NVMe-cache + QuiCK-queue) is load-bearing distributed-systems content.