Fly.io¶
Fly.io is an edge/region-first application platform: "we transmute containers into VMs, running them on our hardware around the world with the power of Firecracker alchemy." Tier-3 source on the sysdesign-wiki — the blog is a mix of partnership announcements, product posts, and occasional architectural retrospectives. On-scope ingests cover the architectural ones (this wiki filters out the pure-product launches per AGENTS.md Tier-3 guidance).
Key systems¶
- systems/phoenix-new — Fly.io's browser-delivered coding agent for Elixir/Phoenix (2025-06-20, Chris McCord). Per-session Fly Machine with a root shell shared between developer and a Phoenix-tuned agent, a full Chrome instance the agent drives, `*.phx.run` preview URLs via integrated port-forwarding, `gh` CLI pre-installed. Canonical productised instance of patterns/ephemeral-vm-as-cloud-ide, patterns/agent-driven-headless-browser (colocated variant), and patterns/agentic-pr-triage.
- systems/phoenix-framework — Elixir web framework the Phoenix.new agent is tuned for (Channels, Presence, LiveView, Ecto). Also a hosting target for deployed Phoenix apps on Fly.io; adjacent to systems/livebook and systems/flame-elixir on the BEAM-on-Fly.io stack.
- systems/gh-cli — GitHub CLI pre-installed in Phoenix.new session VMs; makes the "close laptop and wait for a PR" async-agent workflow executable without a GitHub-specific agent tool schema.
- systems/tkdb — Fly.io's isolated Macaroon token authority. ~5000 lines of Go, SQLite-backed, replicated US/EU/AU via LiteFS with Litestream PITR. HTTP/Noise RPC (patterns/noise-over-http). Canonical wiki instance of patterns/isolated-token-service. DB size: "a couple dozen megs"; client verification cache hit rate >98%; 0 incident interventions in over a year.
- systems/petsem — "Pet Semetary". Fly's in-house Vault replacement; its own Macaroon authority. Uses third-party caveats for privilege separation between flyd and user secrets. Explicitly not merged with tkdb — "Rule #10 and all that."
- systems/litefs — primary/replica distributed SQLite; the subsecond cross-region replication + primary-failover substrate underneath tkdb. Works with unmodified SQLite libraries. Post-2025-05-20 its LTX format and LiteVFS extension are both imported into Litestream — the two tools architecturally converge on LTX as the shared wire format.
- systems/litestream — streaming WAL-to-object-storage PITR for SQLite; tkdb's DR substrate. Seconds-scale restore. The 2025-05-20 revamp imports LiteFS's LTX file format + LSM-style compaction + a CASAAS conditional-write lease (replacing Consul for single-leader enforcement) + SQLite-VFS-based read replicas (FUSE-free). Unlocks wildcard `data/*.db` replication and positions Litestream as a PITR / rollback / fork primitive for agentic coding platforms. The 2025-10-02 v0.5.0 ship delivers the write/archive half: three-level hierarchical compaction (30s / 5m / 1h, restore bounded to "a dozen or so files on average"), monotonic TXIDs replacing generations (`litestream wal` → `litestream ltx`), per-page compression + an end-of-file index in the LTX library (the random-access precondition for VFS replicas), CGO removed via `modernc.org/sqlite`, NATS JetStream added as a replica type, one-destination-per-database enforced, and a file-format break from v0.3.x. VFS read-replicas are still proof-of-concept.
- systems/macaroon-superfly — github.com/superfly/macaroon, Fly.io's open-source Go Macaroon library; the substrate underneath tkdb + petsem.
- systems/honeycomb — distributed-tracing backend Fly.io uses; Ptacek's explicit retraction of prior tracing skepticism ("I was wrong") joined with JP Phillips's "I'd have ragequit without OTel."
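The three-level compaction numbers in the Litestream entry above imply the "dozen or so files" restore bound directly. A toy model (assuming the windows behave as fixed 30s / 5m / 1h buckets, a simplification of the real LTX compactor):

```python
# Toy model of three-level hierarchical compaction (Litestream v0.5.0-style
# windows assumed: 30s, 5m, 1h). A restore replays one file from the top
# level, then every completed lower-level file since that level's boundary.

WINDOWS = [3600, 300, 30]  # seconds per level, largest first

def files_to_restore(t: int) -> int:
    """Number of compacted files replayed to reach second `t` of an hour."""
    count, offset = 1, t % WINDOWS[0]           # one top-level (hourly) file
    for coarse, fine in zip(WINDOWS, WINDOWS[1:]):
        count += (offset % coarse) // fine      # completed finer-level files
        offset %= fine
    return count

# Average over every possible restore point within an hour.
avg = sum(files_to_restore(t) for t in range(3600)) / 3600
print(round(avg))  # 11 — "a dozen or so files on average"
```

The worst case (restore at second 3599) is 1 + 11 + 9 = 21 files, so the bound is tight as well as small on average.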
- systems/opentelemetry — Fly.io's tracing standard; context propagation gives single-narrative request traces across primary API → tkdb.
- systems/tigris — globally distributed, S3-compatible object storage (Tigris Data, Inc.), integrated into Fly.io as the `fly storage create` primitive. Three-layer architecture: regional FoundationDB metadata + Fly.io NVMe byte cache + QuiCK-style queuing for distribution to replicas, demand regions, and pluggable third-party backends (including S3).
- systems/fly-machines — Fly.io's Firecracker-micro-VM compute primitive; the building block that Fly Kubernetes Pods map onto. GPUs attach via whole-GPU passthrough (the fractional-GPU / MIG / vGPU path was tried and abandoned). Stateful Machines (with attached Fly Volumes) now migrate via kill → clone → boot per the 2024-07-30 rebuild.
- systems/fly-volumes — Fly.io's locally-attached NVMe block-storage primitive for stateful Machines, encrypted with per-volume LUKS2 keys. The anchor point of the 2024-07-30 migration rebuild: locally-attached NVMe gave Fly.io bus-hop read latency but anchored Machines to a worker physical.
- systems/dm-clone — Linux kernel device-mapper target powering Fly's async-clone-with-background-hydration migration.
- systems/iscsi — Network block protocol Fly uses to expose source Volumes to target workers during migration; settled on after NBD kept getting stuck kernel threads under network disruption.
- systems/nbd — Fly's first attempt at a network block protocol; abandoned.
- systems/dm-crypt-luks2 / systems/cryptsetup — How Fly Volumes are encrypted; cryptsetup version skew across the fleet causes LUKS2 header-size drift (4 MiB vs 16 MiB), which required a metadata RPC in flyd's migration FSM.
- systems/linux-device-mapper — Kernel block-layer proxy; the substrate for dm-clone + dm-crypt.
- systems/corrosion-swim — Fly's SWIM-gossip CRDT-SQLite state distribution system (corrosion2 per the 2025-02-12 exit interview). Rust. "Any component on our fleet can do SQLite queries to get near-real-time information about any Fly Machine around the world." Migration broke its worker-as-source-of-truth invariant and forced a redesign. The 2025-05-28 parking_lot post clarifies the architectural relationship: Corrosion is the RIB to fly-proxy's in-memory Catalog FIB, with update propagation in "millisecond intervals of time" host-to-host. The 2025-10-22 dedicated deep-dive — the "deserves its own post" Fly.io had been promising — fills in the mechanism (OSPF-inspired link-state design, SWIM membership + QUIC reconciliation + systems/cr-sqlite CRDT with LWW-by-logical-timestamp), the explicit anti-consensus posture (concepts/no-distributed-consensus), the rejected alternatives (Consul, rqlite, FoundationDB), the three disclosed outages (contagious deadlock, nullable-column backfill, Consul cert-expiry → uplink saturation), and the five-mitigation response (fleet-wide Tokio watchdogs, Antithesis adoption, checkpoint-backups-to-object-storage as break-glass, eliminate-partial-updates with whole-object republish, and the two-level regional + global state regionalization project). Open-sourced: github.com/superfly/corrosion.
- systems/cr-sqlite — CRDT SQLite extension from vlcn-io; the conflict-resolution substrate under Corrosion. Tracks CRDT-managed table changes in `crsql_changes`; applies updates last-write-wins using logical timestamps. Known failure mode: nullable-column backfill amplification on large tables.
- systems/consul — rejected predecessor to Corrosion; HashiCorp's service-discovery + KV store built on Raft. "Consul is fantastic software. Don't build a global routing system on it." Also the distal trigger of the 2024 uplink-saturation outage (mTLS cert expiry → fleetwide backoff loops → Corrosion write storm).
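The last-write-wins rule named above can be sketched in a few lines. This is a simplified illustration, not cr-sqlite's implementation: real change rows live in `crsql_changes` with per-column versions, and the tuple shape here is invented.

```python
# Minimal last-write-wins merge in the style cr-sqlite/Corrosion apply:
# each change carries a logical timestamp, and ties break deterministically
# on an originating-site id so every node converges to the same state.
from typing import NamedTuple

class Change(NamedTuple):
    key: str      # (table, pk, column) flattened for the sketch
    value: object
    ts: int       # logical timestamp
    site: str     # tie-breaker: originating node id

def apply_lww(state: dict, change: Change) -> dict:
    current = state.get(change.key)
    # Accept the change only if (ts, site) orders strictly after the winner.
    if current is None or (change.ts, change.site) > (current.ts, current.site):
        state[change.key] = change
    return state

state: dict = {}
apply_lww(state, Change("machines.m1.region", "iad", ts=4, site="worker-a"))
apply_lww(state, Change("machines.m1.region", "fra", ts=7, site="worker-b"))
apply_lww(state, Change("machines.m1.region", "ord", ts=7, site="worker-a"))  # loses the tie
print(state["machines.m1.region"].value)  # "fra"
```

Because the merge is commutative and idempotent, gossip can deliver changes in any order and any number of times, which is what lets Corrosion skip consensus entirely.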
- systems/parking-lot-rust — Amanieu's replacement for Rust's `std::sync` locks (Mutex/RwLock/Condvar/Once). Not Fly.io-authored, but load-bearing in fly-proxy's Catalog and the subject of Fly.io's sixth upstream-the-fix contribution (parking_lot PR #466). 64-bit compact lock word (4 signal bits + 60-bit reader counter); `try_write_for(Duration)`; `read_recursive`; deadlock detector. Canonical wiki anchor for the concepts/bitwise-double-free bug class.
- systems/lsvd — Log-structured virtual disk; Fly's stated medium-term storage direction — NVMe-as-cache in front of regional Tigris S3.
- systems/nomad — Fly's pre-flyd orchestrator; referenced as the baseline against which flyd — and the 2024 migration rebuild — are sized ("the biggest thing our team has done since we replaced Nomad with flyd").
- systems/nvidia-a10 / systems/nvidia-l40s / systems/nvidia-a100 / systems/nvidia-h100 — the four NVIDIA GPU models Fly.io stocks. Customer-usage data (2024) revealed the A10 — the least capable — as the most popular by a wide margin; the 2024-08-15 L40S price cut to $1.25/hr (A10 price) was engineered to collapse the choice into a single inference default.
- systems/nvidia-mig — NVIDIA's fractional-GPU primitive; Fly.io tried and abandoned productising it inside Firecracker Machines via IOMMU PCI passthrough.
- systems/fly-kubernetes — Fly.io's managed Kubernetes distribution where every Pod is a Fly Machine (Firecracker micro-VM) orchestrated by flyd rather than containerd/runc.
- systems/flyd — Fly.io's orchestrator that schedules Firecracker-backed Pods. Durable-FSM design (per-step records in BoltDB) lineage-linked to HashiCorp Cadence + Compose.io/MongoHQ "recipes/operations" per JP Phillips's 2025-02-12 exit interview.
- systems/flaps — the Machines-API gateway routing incoming HTTPS into per-host flyd RPCs. Decentralised ("for the most part doesn't require any central coordination"), sub-5-second P90 on `machine create` globally (Johannesburg and Hong Kong excepted). JP's "whole Fly Machines API" framing.
- systems/fly-pilot — 2025 successor to init. OCI-compliant runtime with a defined API for flyd to drive; consolidates the feature-bag init described in the 2024-06-19 AWS-without-Access-Keys post. Third of Fly.io's three Rust services (after fly-proxy + corrosion2).
- systems/boltdb — flyd's state store. Deliberate non-SQL pick for blast-radius safety + scope discipline; JP's 2025-02-12 defence: "I've never lost a second of sleep worried that someone is about to run a SQL update statement on a host, or across the whole fleet."
- systems/cadence — HashiCorp-era durable-workflow engine; direct design-ancestry cite for flyd's FSM design via JP Phillips. Not a Fly.io runtime dependency.
- systems/firecracker — Fly.io runs user workloads on AWS Firecracker micro-VMs. Substrate/context — Fly.io itself is not the primary wiki source for Firecracker (that's AWS Lambda + Figma), but it is named as Fly.io's isolation layer in every Fly.io source.
- systems/intel-cloud-hypervisor — Fly.io's GPU-Machine hypervisor. A "very similar Rust codebase" to Firecracker that supports PCI passthrough; non-GPU Machines run on Firecracker, GPU Machines run on Cloud Hypervisor. First wiki appearance via the 2025-02-14 GPU retrospective.
- systems/qemu — the conventional-hypervisor alternative on Nvidia's driver happy path. Fly.io rejected it on millisecond-boot DX grounds. Wiki touchpoint.
- systems/vmware — the other conventional-hypervisor alternative on Nvidia's driver happy path. Fly.io explicitly rejected it ("Nvidia suggested VMware (heh)"). Wiki touchpoint.
- systems/virtual-kubelet — the CNCF-sandbox Virtual Kubelet project is the pivot that lets FKS run K8s without Nodes; Fly runs a small Go provider alongside K3s.
- systems/k3s — the lightweight K8s distribution FKS uses for the control-plane API surface.
- systems/fly-proxy — Fly's edge / private-network proxy; backs K8s Services under FKS.
- systems/fly-wireguard-mesh — internal IPv6 WireGuard mesh (6PN) that replaces CNI under FKS and connects every Fly Machine across hosts / regions.
- systems/flycast — `*.flycast` private-network hostnames; one of the three FKS Service access paths.
- systems/fly-gateway — regional fleet of servers that terminate external customer WireGuard connections from flyctl. Separate substrate from the internal 6PN mesh; the subject of the 2024-03-12 JIT WireGuard rewrite.
- systems/wggwd — gateway-side daemon that manages WireGuard interfaces; a pull-on-demand peer provisioner post-2024-03-12.
- systems/wireguard — underlying protocol for both the internal (6PN) and external (gateway) meshes.
- systems/fly-flyctl — Fly.io's CLI; conjures a TCP/IP stack per invocation and speaks WireGuard to Fly Machines.
- systems/fly-graphql-api — Fly.io's customer-facing control plane; formerly pushed peer configs to gateways, now serves pull requests on handshake arrival.
- systems/oidc-fly-io — Fly.io's in-house OpenID Connect identity provider (`oidc.fly.io/<org>`). Issues short-lived OIDC JWTs exclusively to Fly Machines, with a structured `sub` claim of shape `<org>:<app>:<machine>`. Lets counterparties (AWS, GCP, Azure, any OIDC-compliant cloud) trust Fly Machines as federated identities without any long-lived credential ever being shared. Canonical wiki instance of workload identity.
- systems/fly-init — the Rust-written process-zero binary in every Fly Machine. Hosts a Unix-socket API proxy at `/.fly/api` (Fly's answer to the EC2 Instance Metadata Service, but SSRF-resistant by design) and acts as the credential broker for AWS OIDC federation: detects `AWS_ROLE_ARN` at boot, fetches an OIDC token, writes it to `/.fly/oidc_token`, and exports `AWS_WEB_IDENTITY_TOKEN_FILE` for the AWS SDK's standard credential-provider chain. As of 2025, succeeded by pilot — a full OCI runtime with a formal flyd-driven API — consolidating init's feature bag.
- systems/fly-proxy — Fly.io's Rust edge router; one of the three Rust services on Fly's platform (alongside corrosion2 and pilot). Edge servers "exist almost solely" to run it. Built on Tokio + tokio-rustls; terminates TLS, handles HTTP routing decisions, forwards to worker-hosted Fly Machines over Fly's Anycast fabric. Canonical wiki Seen-in: the 2025-02 IAD CPU-busy-loop incident traced to a tokio-rustls `TlsStream` Waker bug under `close_notify` with a buffered trailer.
- systems/rustls / systems/tokio-rustls / systems/tokio — the Rust async / TLS stack fly-proxy is built on. Rustls is "an ultra-important, load-bearing function in the Rust ecosystem"; Fly.io contributed the 2025-02 upstream fix (rustls PR #1950) — canonical Rust-ecosystem instance of patterns/upstream-the-fix.
- systems/livebook / FLAME / Nx / BEAM — the Elixir-ecosystem pieces that, with Fly Machines as the executor substrate, compose into notebook-driven elastic GPU compute. FLAME `Flame.call` blocks pool executors on Fly Machines; Livebook drives them from a laptop; Nx/Axon/Bumblebee supply the GPU-backed AI/ML primitives; BEAM's native code distribution makes notebook-defined modules executable across the cluster. The Kubernetes-side FLAME port (Livebook v0.14.1, Michael Ruoss) confirms the pattern is substrate-independent.
- systems/tokenized-tokens — Fly.io's secret-tokenization system (2024 post referenced in the 2025-04-08 "Our Best Customers Are Now Robots" post). Hardware- isolated, "robot-free" Fly Machines hold real OAuth / API credentials and substitute them for placeholder tokens at egress; the LLM client never touches the real secret. Canonical wiki substrate for patterns/tokenized-token-broker and concepts/tokenized-secret.
- systems/model-context-protocol — the open LLM interop protocol Fly.io names in the 2025-04-08 post. Modern MCP uses long-lived SSE connections; multitenant MCP deployments need session-affinity routing; Fly's dynamic request routing is the platform-level answer. Canonical wiki datum on MCP-SSE-as-routing-requirement is concepts/mcp-long-lived-sse.
- systems/flymcp — Fly.io's open-source github.com/superfly/flymcp, a "most basic" MCP server for flyctl. ~90 lines of Go, 2 tools (`fly logs` + `fly status`), MCP `stdio` transport, built in "30 minutes". Canonical wiki instance of patterns/wrap-cli-as-mcp-server — the pattern works because `flyctl --json` was already done in 2020. Demonstration of an agentic-incident-diagnosis loop against unpkg; surfaces concepts/local-mcp-server-risk as the structural concern. (Source: sources/2025-04-10-flyio-30-minutes-with-mcp-and-flyctl.)
- systems/fly-mcp-launch — the `fly mcp launch` flyctl subcommand (shipped flyctl v0.3.125, 2025-05-19). Takes any stdio MCP server command and deploys it as a remote HTTP MCP server running on a Fly Machine, with bearer-token auth on by default on both ends, client-config JSON rewritten in place across 6 supported clients (Claude / Cursor / Neovim / VS Code / Windsurf / Zed), and `--secret` flags piped through to Machine secrets. Canonical wiki instance of patterns/remote-mcp-server-via-platform-launcher. Pairs with flymcp to span the local ↔ remote MCP-server deployment axis. (Source: sources/2025-05-19-flyio-launching-mcp-servers-on-flyio.)
- systems/aws-lambda — positional comparator. The 2025-04-08 post discloses that Fly Machines (non-GPU) run on Lambda's hypervisor — Firecracker. "Not coincidentally, our underlying hypervisor engine is the same as Lambda's." Fly's value-add is Lambda-like start latency plus EC2-like runtime duration + stateful filesystem persistence across stop/start cycles.
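The wrap-cli-as-mcp-server pattern in the flymcp entry reduces to a small dispatch loop over a CLI that already speaks JSON. A hedged sketch of the shape (flymcp itself is Go; the tool table, argv lists, and simplified JSON-RPC framing here are illustrative, not flymcp's actual code, and real MCP adds initialize handshakes, schemas, and error framing on top):

```python
# Sketch of patterns/wrap-cli-as-mcp-server: expose an existing CLI's
# machine-readable output as MCP tools over a stdio transport.
import json
import subprocess

TOOLS = {  # tool name -> argv of the wrapped CLI invocation (illustrative)
    "status": ["flyctl", "status", "--json"],
    "logs":   ["flyctl", "logs", "--json", "--no-tail"],
}

def handle(request: dict, run=subprocess.run) -> dict:
    """Dispatch one JSON-RPC-shaped request to the wrapped CLI."""
    if request["method"] == "tools/list":
        result = {"tools": sorted(TOOLS)}
    else:  # "tools/call"
        argv = TOOLS[request["params"]["name"]]
        out = run(argv, capture_output=True, text=True, check=True).stdout
        result = {"content": [{"type": "text", "text": out}]}
    return {"jsonrpc": "2.0", "id": request["id"], "result": result}

if __name__ == "__main__":
    import sys
    for line in sys.stdin:  # stdio transport: one JSON object per line
        print(json.dumps(handle(json.loads(line))), flush=True)
```

The load-bearing precondition is the `--json` flag on the wrapped CLI: the wrapper adds framing, not parsing.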
Key patterns / concepts¶
Production-incident debugging (2025-02-26 fly-proxy Rust-TLS incident)¶
- patterns/flamegraph-to-upstream-fix — canonical Fly.io instance. Symptom (CPU pegging + HTTP errors in IAD) → flamegraph from an angry fly-proxy → `tracing::Subscriber` hot frames as the busy-loop signature → fully-qualified `Future` type names `tokio_rustls::TlsStream` as the guilty layer → pre-existing upstream issue (tokio-rustls#72) → upstream fix as rustls PR #1950 → partner (Tigris) resumes load test, clean.
- patterns/upstream-the-fix — Rust-ecosystem variant. Fourth ecosystem instance (after V8/Node.js/OpenNext Cloudflare, Node Web-streams Cloudflare-as-maintainer, and Datadog's Go-toolchain quartet). "TlsStream is an ultra-important, load-bearing function in the Rust ecosystem. Everybody uses it" → patch upstream, not inside Fly.io.
- patterns/dependency-update-discipline — Fly.io's self-drawn lesson. "Keeping track of what needs to be updated is valuable work. The updates themselves are pretty fast and simple, but the process and testing infrastructure to confidently metabolize dependency updates is not."
- patterns/spurious-wakeup-metric — explicit instrumentation follow-up Fly.io commits to at the end of the post: "Spurious wakeups should be easy to spot, and triggering a metric when they happen should be cheap, because they're not supposed to happen often."
- concepts/async-rust-future / concepts/rust-waker / concepts/asyncread-contract — the primitives the incident teaches. The Fly.io post is a well-crafted in-blog primer on these, because the bug only makes sense once you see the contract Waker + AsyncRead must satisfy.
- concepts/spurious-wakeup-busy-loop / concepts/cpu-busy-loop-incident — the incident class; a flamegraph dominated by `tracing::Subscriber` is the signature.
- concepts/tls-close-notify — the TLS protocol state whose buffered-trailer edge case triggered the rustls bug.
- concepts/flamegraph-profiling — the diagnostic tool.
- concepts/durable-execution — flyd's per-FSM-step BoltDB records are Fly.io's canonical instance, framed by JP Phillips's 2025-02-12 exit interview as the load-bearing property of "deploy flyd all day, every day." Ancestry-linked to HashiCorp Cadence + Compose.io/MongoHQ "recipes."
- concepts/bolt-vs-sqlite-storage-choice — Fly.io makes the trade both ways in one stack: flyd picks BoltDB for blast-radius-safety on authoritative state; corrosion2 picks CRDT-SQLite for fleet-queryable read-side distribution. Canonical wiki instance of the design decomposition.
- patterns/per-instance-embedded-database — JP's "if I had to do it over" alternate design: one SQLite per Fly Machine, zip the DB to object storage on Machine destroy. Not built at Fly.io (flyd today is one BoltDB per host); schema management is the named open problem.
- concepts/jit-peer-provisioning — 2024-03-12 gateway rewrite keeps kernel-resident WireGuard peer state near zero by installing on first packet and evicting ruthlessly on cron.
- patterns/jit-provisioning-on-first-packet — the architectural pattern behind the gateway rewrite; Fly.io is the canonical wiki instance.
- patterns/initiator-responder-role-inversion — the sub-RTT install trick on the JIT path: install the peer in the initiator role so the kernel sends the next handshake to the client at install-time.
- patterns/bpf-filter-for-api-event-source — BPF-sniff the WireGuard handshake-initiation byte to manufacture the "incoming connection" event that Netlink doesn't expose.
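The BPF trick above leans on a stable fact of the WireGuard wire format: the first four bytes of every message are a one-byte type (1 = handshake initiation) plus three reserved zero bytes. The predicate the filter matches, written out in Python for illustration (the real filter is a BPF program attached to the gateway socket):

```python
# The gateway sniffs WireGuard UDP payloads and treats a handshake
# initiation as the "incoming connection" event Netlink never emits.
# Handshake initiation = message type 1; data packets are type 4.

HANDSHAKE_INITIATION = bytes([1, 0, 0, 0])  # type byte + 3 reserved zeros

def is_handshake_initiation(udp_payload: bytes) -> bool:
    """True iff this UDP payload opens a WireGuard handshake."""
    return udp_payload[:4] == HANDSHAKE_INITIATION

# A handshake-initiation message is 148 bytes; a data packet starts with 4.
print(is_handshake_initiation(bytes([1, 0, 0, 0]) + b"\x00" * 144))  # True
print(is_handshake_initiation(bytes([4, 0, 0, 0]) + b"\x00" * 28))   # False
```

Matching one fixed prefix is exactly why the event source is cheap enough to run on every inbound packet.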
- patterns/pull-on-demand-replacing-push — 2024-era architectural migration at Fly.io, retiring dropped-message NATS pushes on the WireGuard path (and more broadly — flyd went from NATS-driven to HTTP in the same timeframe).
- patterns/state-eviction-cron — cheap because JIT.
- concepts/packet-sniffing-as-event-source — the architectural move generalised.
- concepts/rate-limited-cache — SQLite-backed cache on gateways that shields the Fly API from WireGuard retry storms.
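A minimal sketch of the rate-limited-cache shape (in-memory rather than SQLite-backed as on Fly's gateways, and the 5-second cooldown is an invented number; Fly doesn't disclose the window):

```python
# Serve the cached peer config when a key was fetched recently, and go to
# the upstream API at most once per cooldown window per key, so a
# WireGuard retry storm collapses into a single API call.
import time

class RateLimitedCache:
    def __init__(self, fetch, cooldown=5.0, clock=time.monotonic):
        self.fetch, self.cooldown, self.clock = fetch, cooldown, clock
        self.entries = {}  # key -> (fetched_at, value)

    def get(self, key):
        now = self.clock()
        hit = self.entries.get(key)
        if hit and now - hit[0] < self.cooldown:
            return hit[1]               # retry storm absorbed here
        value = self.fetch(key)         # at most once per cooldown window
        self.entries[key] = (now, value)
        return value
```

Under a storm of N retries inside one window, the upstream sees one request; the other N-1 handshakes get the cached answer.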
- concepts/noise-protocol — the identity-hiding handshake framework underneath WireGuard; forced Fly to implement ~200 lines of crypto on the gateway.
- concepts/kernel-state-capacity-limit — the scale-driving constraint behind the gateway rewrite.
- concepts/nodeless-kubernetes — running a K8s API without any Node object; FKS is the canonical wiki instance.
- concepts/micro-vm-as-pod — Pod as a Firecracker micro-VM rather than a shared-kernel container; FKS instantiates this at the K8s API tier.
- concepts/managed-kubernetes-service — spectrum (self-managed → managed-control-plane → managed-data-plane → nodeless); FKS sits at the nodeless end.
- concepts/ipv6-service-mesh — WireGuard mesh as the CNI substitute; ClusterIPs are IPv6 under FKS.
- patterns/virtual-kubelet-provider — implement a managed-K8s offering by registering a Virtual-Kubelet provider that forwards Pod-creates into a cloud's existing compute API.
- patterns/primitive-mapping-k8s-to-cloud — map each K8s primitive 1:1 to a pre-existing cloud primitive instead of reimplementing the reference stack. FKS's {CRI, CNI, Pod, Service, Secret, DNS, PV} → Fly.io table is the canonical wiki instance.
- patterns/metadata-db-plus-object-cache-tier — the architectural shape Tigris instantiates on Fly.io (metadata in FDB + byte cache in regional NVMe + distribution queue + pluggable origin). Fly.io is the canonical wiki instance.
- patterns/partner-managed-service-as-native-binding — `fly storage create` turns a third-party Tigris service into a first-party-feeling Fly.io primitive, with app secrets auto-injected.
- patterns/unified-billing-across-providers — Tigris, Supabase, PlanetScale, Upstash billed through one Fly.io invoice. "Everything gets charged to your Fly.io bill and you pay one bill per month."
- concepts/demand-driven-replication — the Tigris replication policy for larger objects.
- concepts/immutable-object-storage — Tigris preserves the S3 immutable-objects contract on top of its distributed byte plane.
- concepts/inference-vs-training-workload-shape — Fly.io's canonical statement of the workload-shape distinction: "Training workloads tend to look more like batch jobs, and inference tends to look more like transactions. Batch training jobs aren't that sensitive to networking or even reliability. Live inference jobs responding to end-user HTTP requests are." Basis for the GPU product strategy pivot.
- concepts/inference-compute-storage-network-locality — GPU-instance RAM + object storage + Anycast network combined on one platform, rather than any single axis maximised. Fly.io's thesis for why inference doesn't need frontier GPUs and why hyperscaler GPU instances (with egress) underdeliver.
- concepts/egress-cost — the hyperscaler-surcharge + egress-fee squeeze is Fly.io's framing of why cross-cloud GPU-inference topologies lose.
- concepts/anycast — the network-locality axis of the inference-locality thesis (alongside Tigris + GPU Machines).
- patterns/co-located-inference-gpu-and-object-storage — the pattern the L40S + Tigris + Anycast combination instantiates; Fly.io × Tigris is the canonical wiki instance.
- concepts/workload-identity — Fly Machines as the canonical wiki instance of platform-attested identity (`<org>:<app>:<machine>` in the OIDC `sub` claim), used to obtain credentials to other clouds without sharing any long-lived secret.
- concepts/oidc-federation-for-cloud-access — Fly.io → AWS via systems/oidc-fly-io + systems/aws-sts's `AssumeRoleWithWebIdentity`; the canonical cross-cloud federation wiki source is sources/2024-06-19-flyio-aws-without-access-keys.
- concepts/short-lived-credential-auth — Fly.io's framing: "dead in minutes, sharply limited blast radius, rotate themselves, fail closed" — the canonical wiki line for what STS credentials buy you.
- concepts/machine-metadata-service — Fly's `/.fly/api` Unix socket is self-described as "our answer to the EC2 Instance Metadata Service", but SSRF-resistant by design because a Unix socket is not HTTP-reachable by default.
- concepts/unix-socket-api-proxy — the specific IPC shape Fly uses instead of link-local HTTP; gives the Macaroon attach-on-outbound model its privilege-separation properties.
- patterns/oidc-role-assumption-for-cross-cloud-auth — the end-to-end cross-cloud pattern Fly.io instantiates for AWS (and will instantiate for GCP / Azure when someone asks).
- patterns/init-as-credential-broker — Fly init as the guest-side plumbing that closes the loop between platform-attested identity and the AWS SDK's credential-provider chain with a single environment variable.
- patterns/sub-field-scoping-for-role-trust — Fly's choice of `<org>:<app>:<machine>` as the `sub` shape is what lets AWS trust policies scope a Role to an org, an app, or a single Machine via a `StringLike` match on a prefix.
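A worked illustration of why the `sub` shape composes with AWS `StringLike`, which does shell-style wildcard matching: `fnmatch` mirrors those `*`/`?` semantics, and the org/app/machine values below are invented.

```python
# Scoping a Role trust policy by where the * goes in the sub pattern:
# org-wide, app-scoped, or pinned to a single Machine.
from fnmatch import fnmatchcase

def sub_allowed(sub: str, pattern: str) -> bool:
    """Mimics an AWS StringLike condition on the OIDC sub claim."""
    return fnmatchcase(sub, pattern)

sub = "acme-org:billing-api:machine-90801"
print(sub_allowed(sub, "acme-org:*"))                          # True: org-wide Role
print(sub_allowed(sub, "acme-org:billing-api:*"))              # True: app-scoped Role
print(sub_allowed(sub, "acme-org:billing-api:machine-90801"))  # True: one Machine
print(sub_allowed(sub, "other-org:*"))                         # False: denied
```

The fixed `org:app:machine` ordering is what makes every useful scope a prefix; a flat or reordered `sub` would force enumerating Machines in the trust policy.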
Storage + migration (2024-07-30 rebuild)¶
- concepts/bus-hop-storage-tradeoff — Fly.io's canonical self-assessment: local NVMe trades operational simplicity (drain becomes hard) for bus-hop read latency and startup-era affordability.
- concepts/fleet-drain-operation — The on-call primitive "drain that worker"; the runbook that stateful Machines broke for three years until the 2024 rebuild restored it.
- concepts/kill-copy-boot-migration-tradeoff — The classical stateful-migration dilemma: copy-boot-kill loses data, kill-copy-boot takes too long. Fly.io's `clone` primitive resolves it.
- concepts/block-level-async-clone — The architectural pattern dm-clone instantiates at the kernel block-device tier; shape-parallel to Cloudflare Artifacts' async clone at the Git tier.
- concepts/trim-discard-integration — Using `fstrim` + `DISCARD` to short-circuit hydration of unused blocks on sparse Fly Volumes.
- concepts/heterogeneous-fleet-config-skew — cryptsetup version skew causing LUKS2 header-size drift across Fly's workers; generalises to any aging multi-host fleet.
- concepts/embedded-routing-in-ip-address — Fly's 6PN design; trades distributed-routing-protocol cost for address- stability-on-migration cost.
- concepts/hardcoded-literal-address-antipattern — Fly Postgres cluster configs with literal 6PN IPv6 addresses; the antipattern that forced a fleet-wide config rewrite.
- patterns/async-block-clone-for-stateful-migration — The end-to-end migration recipe: kill → clone → boot with iSCSI + dm-clone + background `kcopyd` hydration + flyd-orchestrated FSMs. Canonical wiki instance.
- patterns/temporary-san-for-fleet-drain — The fleet-level shape: worker physicals become temporary SANs serving Volumes to fresh-booted replicas on target workers, evaporating when hydration completes. Canonical wiki instance.
- patterns/embedded-routing-header-as-address — The general pattern behind 6PN.
- patterns/fsm-rpc-for-config-metadata-transfer — Fly's migration FSM carries LUKS2 header metadata from source to target worker; generalisable defence for any cross-host operation against fleet config skew.
- patterns/feature-gate-pre-migration-network-rewrite — Ship an in-init address-mapping feature first, then do the fleet-wide config rewrite; Fly's handling of hardcoded literal 6PN addresses in Fly Postgres configs.
Notebook-driven elastic GPU compute (2024-09-24 keynote)¶
- patterns/notebook-driven-elastic-compute — End-to-end shape: Livebook cell → ephemeral Fly Machine cluster → streamed-back results → cluster tear-down on disconnect. Canonical wiki instance.
- patterns/framework-managed-executor-pool — FLAME's architectural pattern: the library manages a min/max/concurrency-bounded pool of executor Machines; an inline `Flame.call` replaces function-per-operation decomposition.
- concepts/seconds-scale-gpu-cluster-boot — Fly's platform-level claim that a 64-node GPU cluster comes up in seconds from a Docker image; the load-bearing latency property behind the notebook UX.
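The bounded-pool shape can be sketched without any of FLAME's BEAM machinery. A hedged Python analogue where an "executor" is just a thread and the min/max/concurrency numbers are illustrative, not FLAME's defaults:

```python
# Sketch of patterns/framework-managed-executor-pool: a blocking call()
# that runs work on a bounded pool of executors, FLAME-style. Real FLAME
# boots Fly Machines and ships BEAM modules to them; this only shows the
# pool-bounding and blocking-call shape.
from concurrent.futures import ThreadPoolExecutor
import threading

class Pool:
    def __init__(self, max_executors=4, max_concurrency=2):
        # max_executors caps pool size; the semaphore caps total in-flight
        # calls, mirroring FLAME's per-executor concurrency bound.
        self._pool = ThreadPoolExecutor(max_workers=max_executors)
        self._slots = threading.Semaphore(max_executors * max_concurrency)

    def call(self, fn, *args):
        """Blocks like Flame.call: submit, wait, return the result inline."""
        with self._slots:
            return self._pool.submit(fn, *args).result()

pool = Pool()
print(pool.call(lambda x: x * x, 7))  # 49
```

The point of the pattern is the call site: the caller writes an inline function call, and the framework, not the application, decides where (and on how many executors) it runs.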
- concepts/transparent-cluster-code-distribution — BEAM primitive Livebook exposes: modules defined in a notebook cell run on any executor without a deploy step.
GPU scale-to-zero (single-Machine) — 2024-05-09 image-description walkthrough¶
- patterns/proxy-autostop-for-gpu-cost-control — Canonical Fly.io instance. Fly Proxy owns start/stop of a GPU Fly Machine: stops on idle silence (minutes-scale), starts on next internal request. App tier never decides; proxy does. The load-bearing cost-control primitive that makes single-Machine GPU inference hobby-project-affordable on a cloud.
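A minimal state-machine sketch of proxy-owned autostop. The 5-minute idle window is an assumption (the post says only minutes-scale), and `machine` stands in for the proxy's Machines-API handle:

```python
# Sketch of patterns/proxy-autostop-for-gpu-cost-control: the proxy, not
# the app, owns the Machine lifecycle — stop after an idle window, start
# on the next request.
class AutostopProxy:
    def __init__(self, machine, idle_timeout=300.0):
        self.machine, self.idle_timeout = machine, idle_timeout
        self.last_request = None

    def on_request(self, now: float):
        if not self.machine.running:
            self.machine.start()   # cold start: Machine boot + model load
        self.last_request = now
        return self.machine.handle()

    def on_tick(self, now: float):
        """The proxy's periodic idle sweep."""
        idle = (self.last_request is not None
                and now - self.last_request >= self.idle_timeout)
        if self.machine.running and idle:
            self.machine.stop()    # GPU billing stops here
```

Because the app tier never participates, any unmodified inference server gets scale-to-zero for free; the cost is the cold-start tail described in the next entry.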
- patterns/flycast-scoped-internal-inference-endpoint — The access-scoping pre-requisite that makes autostop's "idle" definition meaningful. Flycast hostname scopes inference-tier access to same-org 6PN traffic only, so random internet scans don't wake the GPU Machine.
- concepts/gpu-scale-to-zero-cold-start — The tail the pattern eats: a three-stage cold-start budget (Machine-start seconds + model-load-into-GPU-RAM tens of seconds + first-response seconds) disclosed as ~45 seconds on `a100-40gb` + LLaVA-34b. Different dominant stage from CPU/serverless cold-start.
Remote development / agentic loops (2025-02-07 commentary)¶
- concepts/remote-development-environment — the architectural space Fly.io's 2025-02-07 post operates in; names the two opposite architectures (Emacs Tramp vs VSCode Remote-SSH) on the same SSH substrate.
- concepts/live-off-the-land — the Tramp posture; canonical wiki instance is now on the wiki via Fly's 2025-02-07 framing.
- concepts/agentic-development-loop — Fly.io's canonical phrasing of the closed-loop LLM coding workflow: "close the loop between the LLM and the execution environment […] a semi-effective antidote to hallucination." The 2025-motivating use-case for why anyone wants disposable-VM dev sandboxes.
- patterns/stager-downloads-agent-for-remote-control — the VSCode Remote-SSH architectural pattern Fly critiques; ships a full Node.js runtime + agent to the target host via a bash stager, exposes a WebSocket RPC over an SSH port-forward, persists across reconnects.
- patterns/disposable-vm-for-agentic-loop — the architectural answer Fly.io's 2025-02-07 post argues for: run the agentic loop on a clean-slate, instant-boot, discardable VM (a Fly Machine, unsurprisingly). Canonical wiki instance.
GPU product retrenchment (2025-02-14 retrospective)¶
- concepts/developers-want-llms-not-gpus — Fly.io's canonical demand-side diagnosis. "Developers don't want GPUs. They don't even want AI/ML models. They want LLMs." The 10,000-vs-5-6-developer credo applied to GPU Machines — GPU workloads land on the 5-6 side. Fly.io's canonical wiki instance.
- concepts/gpu-as-hostile-peripheral — the security framing that shaped GPU Machines' productisation. "A GPU is just about the worst case hardware peripheral: intense multi-directional direct memory transfers, with arbitrary, end-user controlled computation, all operating outside our normal security boundary." Canonical wiki instance.
- concepts/nvidia-driver-happy-path — Fly.io canonically discloses the shape of the Nvidia driver happy path (K8s with shared kernel, or QEMU/VMware) and the cost of deviating. Months of failed Cloud Hypervisor integration work; hex-edited closed-source drivers to impersonate QEMU; no MIG path to thin-sliced GPUs because MIG presents as a UUID not a PCI device. Canonical wiki instance.
- concepts/fast-vm-boot-dx — the DX property Fly.io refused to trade for Nvidia-driver-happy-path compatibility. "We could not have offered our desired Developer Experience on the Nvidia happy-path." Canonical wiki statement.
- concepts/asset-backed-bet — Fly.io's risk-framing: GPUs are tradable assets with durable value, so the downside of being wrong about the GPU bet is partially recoverable via liquidation. Companion to the IPv4 address portfolio framing. Canonical wiki instance.
- concepts/insurgent-cloud-constraints — the broader structural framing for why Fly.io can't compete with OpenAI/Anthropic on the LLM-serving axis. Canonical wiki statement.
- concepts/product-market-fit — the meta-framing: "a startup is a race to learn stuff." Course-correction without shame when a bet doesn't hit PMF.
- patterns/dedicated-host-pool-for-hostile-peripheral — Fly's GPU Machines run on dedicated GPU-only workers on Cloud Hypervisor; non-GPU Machines run on Firecracker workers. Isolation posture: peripheral-class segregation at the placement tier. Canonical wiki instance.
- patterns/independent-security-assessment-for-hardware-peripheral — Fly.io's GPU deployment was cleared by two independent external security audits (Atredis, Tetrel) before productisation. "They were not cheap, and they took time." Canonical wiki instance.
- patterns/platform-retrenchment-without-customer-abandonment — Fly.io's 2025-02-14 announcement "if you're using Fly GPU Machines, don't freak out; we're not getting rid of them. But if you're waiting for us to do something bigger with them, a v2 of the product, you'll probably be waiting awhile." Keep-running + no-v2 = retrenchment without abandonment. Canonical wiki instance.
Robot Experience (RX) / robots-as-customers (2025-04-08)¶
- concepts/robot-experience-rx — Fly.io introduces RX (Robot Experience) as a product-design axis alongside UX and DX in the 2025-04-08 post. "One of our north stars has always been nailing the DX of a public cloud. But the robots aren't going anywhere. It's time to start thinking about what it means to have a good RX. […] We have not yet nailed the RX; nobody has. But it's an interesting question." Canonical wiki instance of the framing.
- concepts/vibe-coding — Fly.io's gloss on the LLM-driven conversational code-generation workflow: bursty-then-idle ("frenzy of activity for a minute or so, but then chill out for minutes, hours, or days"). The canonical wiki workload-shape phrasing.
- concepts/fly-machine-start-vs-create — the lifecycle-split primitive robots consume and humans don't grok. "`start` is lightning fast; substantially faster than booting up even a non-virtualized K8s Pod. This is too subtle a distinction for humans, who (reasonably!) just mash the `create` button to boot apps up in Fly Machines. But the robots are getting a lot of value out of it." Canonical wiki instance.
- concepts/stateful-incremental-vm-build — the robot-workload storage shape that forces filesystem + object-storage primitives over the Postgres-first human default. "As product thinkers, our intuition about storage is 'just give people Postgres'. […] But because LLMs are doing the Cursed and Defiled Root Chalice Dungeon version of app construction, what they really need is a filesystem, the one form of storage we sort of wish we hadn't done. That, and object storage."
- concepts/mcp-long-lived-sse — the networking-tier reason Fly.io's Fly Proxy dynamic request routing is "a robot attractant" — multitenant MCP SSE deployments require session-affinity routing.
- concepts/tokenized-secret — the identity-plane RX primitive: LLMs get a placeholder that a hardware-isolated Fly Machine substitutes for the real OAuth token at egress. Grounded in Fly.io's 2024 tokenized-tokens substrate.
- patterns/start-fast-create-slow-machine-lifecycle — expose two lifecycle paths (slow `create`, fast `start`) so bursty-then-idle workloads resume at invocation latency from idle. Canonical wiki instance.
- patterns/session-affinity-for-mcp-sse — route long-lived MCP SSE connections from a given client back to the same stateful instance. Fly.io instantiates via tenant-controlled dynamic request routing; canonical wiki instance.
- patterns/tokenized-token-broker — hardware-isolated broker substitutes real secrets for placeholders at egress; adjacent to Fly's init-as-credential-broker pattern (STS / OIDC federation variant) and extends it to arbitrary OAuth-style credentials. Canonical wiki instance.
Wrap-CLI-as-MCP / local-MCP-risk (2025-04-10 → 2025-05-07)¶
- patterns/wrap-cli-as-mcp-server — canonical wiki pattern (Fly.io's flymcp: 90 LoC Go, 2 tools, stdio transport, 30 minutes). Viable because `flyctl --json` was already done in 2020; pass-through CLI credentials; no auth/transport layer. Demonstrated generalisable via the unpkg incident-diagnosis flow. 2025-05-07 mutation transition: same server, now full `fly volumes` CRUD; first wiki instance of the pattern crossing the read-only → production-mutation boundary.
- concepts/local-mcp-server-risk — canonical wiki statement of the "giving a cloud LLM the ability to run a native program on my machine" concern. Ptacek: "Local MCP servers are scary." The 2025-05-07 post inherits the posture with mutation authority; "if you ask it to destroy a volume, that operation is not reversable." Mitigation: patterns/disposable-vm-for-agentic-loop (run the wrapped CLI inside a throwaway Fly Machine rather than on the operator's laptop — sketched two months earlier in the 2025-02-07 VSCode-SSH-bananas post).
- concepts/structured-output-reliability — extended with the upstream variant: Fly.io's 2020 `--json` decision is the producer-side instance of structured-output discipline that made LLM-consumer tooling trivially viable five years later. Different shape from the Dash-judge case (LLM as producer there, LLM as consumer here); same underlying lesson.
- concepts/agent-ergonomic-cli — cross-vendor confirmation of the Cloudflare framing: `flyctl`'s structured-output axis predates and survives the LLM era as a general automation property LLMs retroactively weaponise. 2025-05-07 extends with the three-way alternative-rejection framing (API / CLI / dashboard all lose to NL + MCP).
- concepts/natural-language-infrastructure-provisioning — canonical wiki thesis. "Today's state of the art is K8S, Terraform, web based UIs, and CLIs. Those days are numbered." "Make it so" as the target UX.
- patterns/plan-then-apply-agent-provisioning — the aspirational UX the 2025-05-07 post sketches: LLM scans code, presents a plan, human adjusts and approves, agent executes; on failure it examines logs and proposes next steps. Terraform's `plan`/`apply` discipline reimplemented as a conversation. Not yet shipped in flyctl v0.3.117; roadmap target.
- patterns/cli-safety-as-agent-guardrail — the mounted-volume-refusal invariant as the zero-cost guardrail that let Fly.io safely ship a mutation-authority MCP surface. "I would have received an error had I tried to destroy a volume that is currently mounted. Knowing that gave me the confidence to try the command." Mutation-side twin of the `--json`-as-load-bearing observation — mature CLI design pays an AI-integration dividend the original authors never intended.
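The wrap-a-CLI shape and the CLI-invariant guardrail compose naturally: the tool body shells out, and the CLI's own refusals surface to the model as tool errors. A self-contained sketch — `wrap_cli_tool` is a hypothetical helper, and a Python one-liner stands in for `flyctl` so the example runs anywhere:

```python
import json
import subprocess
import sys

def wrap_cli_tool(argv: list[str]) -> dict:
    """Generic MCP-style tool body: run a CLI that already speaks JSON and
    pass its refusal errors through to the model instead of hiding them.
    The CLI's pre-existing invariants (e.g. refusing to destroy a mounted
    volume) become the agent guardrail for free."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    if proc.returncode != 0:
        # the CLI said no; the agent sees the same error a human would
        return {"is_error": True, "content": proc.stderr.strip()}
    return {"is_error": False, "content": json.loads(proc.stdout)}

# Stand-in "CLI" invocations (the real thing would be e.g. `flyctl status --json`):
ok = wrap_cli_tool([sys.executable, "-c",
                    "import json; print(json.dumps({'status': 'passing'}))"])
refused = wrap_cli_tool([sys.executable, "-c",
                         "import sys; sys.exit('volume is mounted; refusing')"])
```

No auth or transport layer appears anywhere: credentials pass through exactly as they would for the human operator, which is the whole economy of the pattern.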
Cloud IDE for coding agents (2025-06-20 Phoenix.new)¶
- patterns/ephemeral-vm-as-cloud-ide — canonical wiki pattern. Productised as Phoenix.new: per-session Fly Machine, browser-delivered VSCode-style UI, root shell shared with agent, `*.phx.run` preview URLs. Substrate realisation of patterns/disposable-vm-for-agentic-loop four months after the 2025-02-07 sketch.
- patterns/agent-driven-headless-browser — canonical colocated-browser instance. Full Chrome inside the session VM the coding agent drives via CDP; sibling of the MoltWorker proxied-browser instance. Three-signal closed loop on Phoenix.new (server logs + browser DOM/JS state + `mix test` output).
- patterns/ephemeral-preview-url-via-port-forward — canonical `*.phx.run` instance. Any port bound in the session VM becomes a publicly-shareable URL automatically; the deploy step collapses to zero. Directly addresses Karpathy's "code was the easy part; getting it online took a week" pain.
- patterns/agentic-pr-triage — canonical wiki pattern. McCord's own usage of Phoenix.new against phoenix-core: "I close my laptop, grab a cup of coffee, and wait for a PR to arrive." Combines ephemeral-VM cloud IDE + `gh` + issue-tracker filter to let the agent pick work asynchronously.
- concepts/cloud-ide — product-category framing. Phoenix.new as agent-centered ephemeral cloud IDE, contrasted with human-centered persistent cloud IDEs (Codespaces / Gitpod).
- concepts/ephemeral-dev-environment — session-scoped dev environment concept; canonical productised instance.
- concepts/agent-with-root-shell — canonical wiki statement. "It owns the whole environment." Coarse-grained perimeter (VM boundary) posture contrasting with fine-grained capability sandbox posture (Cloudflare Project Think / EmDash).
- concepts/agent-driven-browser — canonical wiki statement. "Instead of trying to iterate on screenshots, the agent sees real page content and JavaScript state."
- concepts/ephemeral-preview-url — canonical `*.phx.run` instance.
- concepts/async-agent-workflow — coding-agent specialisation of the 2025-04-08 RX thesis. "The future of development … probably looks less like cracking open a shell and finding a file to edit, and more like popping into a CI environment with agents working away around the clock."
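The preview-URL mechanics reduce to a routing table keyed by bound port. A toy model (the `PortForwarder` class and the `{session}-{port}.phx.run` naming are illustrative, not Phoenix.new's actual scheme):

```python
class PortForwarder:
    """Toy model of the integrated port-forwarding layer: the session VM
    reports each bound port, and each becomes a public URL with no deploy
    step in between."""

    def __init__(self, session: str):
        self.session = session
        self.routes: dict[str, int] = {}   # public URL -> VM-local port

    def on_bind(self, port: int) -> str:
        if not (1 <= port <= 65535):
            raise ValueError("not a TCP port")
        url = f"https://{self.session}-{port}.phx.run"   # illustrative naming
        self.routes[url] = port
        return url

fwd = PortForwarder("sparkling-haze-1234")
url = fwd.on_bind(4000)   # the Phoenix dev server just bound a port
```

Anything the agent starts inside the VM — dev server, docs site, debug dashboard — is shareable the instant it binds, which is what collapses the deploy step to zero.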
Minimal-agent-loop pedagogy (2025-11-06 Ptacek essay)¶
Canonical statements of agent-architecture vocabulary the wiki had in use but un-named. From "You Should Write An Agent":
- concepts/agent-loop-stateless-llm — "The LLM itself is a stateless black box. The conversation we're having is an illusion we cast, on ourselves." Canonical 15-LoC Python statement of the primitive every agent on the wiki composes over.
- concepts/context-window-as-token-budget — "You're allotted a fixed number of tokens in any context window. Each input you feed in, each output you save, each tool you describe, and each tool output eats tokens." Independent confirmation of the context-window-as-budget framing also surfaced at Dropbox Dash (2025-11-17) and Datadog (2026-03-04). Degradation is nondeterministic: "the whole system begins getting nondeterministically stupider."
- concepts/context-engineering — "Turns out: context engineering is a straightforwardly legible programming problem. […] If Context Engineering was an Advent of Code problem, it'd occur mid-December. It's programming." Canonical statement repudiating the "magic spells" framing of prompt engineering.
- concepts/sub-agent — "Just a new context array, another call to the model. Give each call different tools." Demystifies Claude Code's sub-agents primitive; complementary to existing patterns/specialized-agent-decomposition + patterns/coordinator-sub-reviewer-orchestration.
- patterns/tool-call-loop-minimal-agent — ~30-LoC Python teaching shape for every tool-using agent on the wiki. Emergent multi-step planning: "Did you notice where I wrote the loop in this agent to go find and ping multiple Google properties? Yeah, neither did I."
- patterns/context-segregated-sub-agents — security-, budget-, and specialisation-motivated sub-agent pattern. "You can trivially build an agent with segregated contexts, each with specific tools. That makes LLM security interesting."
The essay is also the wiki's first explicit MCP-is-optional framing (extended onto the MCP system page): when you own both the agent and the tools, native tool-schema JSON against the LLM endpoint is sufficient — MCP earns its place as an interop layer for tools consumed by agents someone else built.
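The essay's loop shape, reduced to a sketch with a scripted stand-in model (the names and reply schema here are illustrative, not Ptacek's exact code — a real `llm` would be a chat-completion call with tool support):

```python
import json

def agent(llm, tools, user_msg, max_steps=8):
    """The whole agent: a context list and a loop. `llm` is stateless and
    sees the full context on every call; it returns either a tool call
    {"tool": name, "args": {...}} or a final {"text": ...}."""
    context = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):
        reply = llm(context)
        if "tool" in reply:
            result = tools[reply["tool"]](**reply["args"])
            context.append({"role": "assistant", "content": json.dumps(reply)})
            context.append({"role": "tool", "content": json.dumps(result)})
        else:
            return reply["text"], context
    return None, context

# Scripted "model" that pings two hosts, then answers. The multi-step plan
# lives in the model's replies — we never wrote that loop ourselves.
script = iter([
    {"tool": "ping", "args": {"host": "google.com"}},
    {"tool": "ping", "args": {"host": "8.8.8.8"}},
    {"text": "both hosts answered"},
])
answer, ctx = agent(lambda context: next(script),
                    {"ping": lambda host: {"host": host, "alive": True}},
                    "are google.com and 8.8.8.8 reachable?")
```

A sub-agent, on this shape, is literally just a second `context` list passed to another `agent(...)` call with a different tool dict — which is the demystification the bullets above record.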
Architectural framing¶
Fly.io's self-description positions it explicitly as a compute + networking platform, with adjacent concerns (storage, databases, object storage) delivered via partner integrations rather than in-house builds. Tigris is the canonical example of this model on the object-storage axis: "we partnered with Tigris, so that they can put their full resources into making object storage as magical as Fly.io is." The blog frames this as a Unix-philosophy posture — "you have individual parts that do one thing very well that are then chained together to create a composite result." The customer-facing trade is that all the parts bill through Fly.io.
The Tigris integration also shows Fly.io willing to be the substrate for the partner (Tigris runs on Fly.io's NVMe volumes and regions), not just a customer. That's a different shape from the typical cloud-partnership pattern of "we wire up your SDK" — Fly.io is renting Tigris the hardware.
Recent articles¶
- 2025-11-06 — You Should Write An Agent — Thomas Ptacek's pedagogical essay canonicalising agent-architecture vocabulary the wiki had in use but un-named. An agent is an HTTP client against one endpoint, a Python list as "context", and a `while` loop — "It's incredibly easy." Four demonstrations build up in ~60 lines of Python: a 15-LoC ChatGPT clone exposing the stateless-LLM + replayed-context illusion; an Alph/Ralph truth/lies personality split showing two context arrays cost the same as one; a three-function upgrade to a tool-using agent that "figures out" multi-step probing of `google.com` / `www.google.com` / `8.8.8.8` without the author writing the loop (patterns/tool-call-loop-minimal-agent); and a design-space survey covering sub-agents ("just a new context array, another call to the model"), summarisation-as-compression, and concepts/context-engineering as a "straightforwardly legible programming problem". Canonical statement of concepts/context-window-as-token-budget — tool descriptions, tool outputs, and stored replies all compete for the same token budget; past a threshold "the whole system begins getting nondeterministically stupider." Also the wiki's first explicit MCP-is-optional framing: "we didn't need MCP at all. […] MCP is just a plugin interface for Claude Code and Cursor, a way of getting your own tools into code you don't control. Write your own agent. Be a programmer. Deal in APIs, not plugins." Positions MCP as an interop layer for tools consumed by agents someone else built, not as a fundamental enabling technology — consistent with every production MCP instance on this wiki (systems/flymcp, systems/datadog-mcp-server, systems/unity-ai-gateway, Agent Lee). Four open problems the post flags as "noodle-able solo in a basement": titrating nondeterminism vs. structured programming, connecting agents to ground truth so they can't lie to themselves about early exit, reliable inter-agent intermediate formats (JSON / SQL / markdown summaries), and token allocation + cost containment. Canonicalises: concepts/agent-loop-stateless-llm + concepts/context-window-as-token-budget + concepts/context-engineering + concepts/sub-agent + patterns/tool-call-loop-minimal-agent + patterns/context-segregated-sub-agents. Extends concepts/agentic-development-loop with the minimal-loop foundation; extends patterns/tool-surface-minimization + patterns/specialized-agent-decomposition with the cross-agent sub-agent lever. "Your wackiest idea will probably (1) work and (2) take 30 minutes to code."
- 2025-10-22 — Corrosion — The canonical Corrosion deep-dive Fly.io had been promising for over a year ("Corrosion deserves its own post"). Three outages frame the post: (1) the 2024-09-01 contagious deadlock that took down Anycast globally in seconds (the bug was in fly-proxy's `if let`-over-RWLock consumer; Corrosion was "just a bystander" perfectly amplifying it); (2) a nullable-column DDL that forced cr-sqlite to backfill every row fleet-wide, melting the cluster; (3) a Consul mTLS cert expiry whose backoff loops on every worker saturated Fly's uplinks because the retry path wrote to Corrosion. Core architectural bet — canonical wiki anchor for concepts/no-distributed-consensus: "truly global distributed consensus promises deliciousness while yielding only immolation. Consensus protocols like Raft break down over long distances." Fly.io took cues from link-state routing protocols (OSPF) — workers are sources of truth for their own Machines and responsible for flooding changes; Fly's fully-connected WireGuard mesh gives OSPF-style connectivity for free. Stack: SWIM membership (concepts/gossip-protocol) + QUIC for broadcast/reconciliation + the systems/cr-sqlite CRDT extension for last-write-wins by logical timestamp. Thousands of workers; seconds to converge. Rejected alternatives named explicitly: Consul ("don't build a global routing system on it"), Zookeeper, etcd, Raft, rqlite ("came very close to using"), FoundationDB, S3-backed stores. Mitigations canonicalised: (i) Tokio watchdogs on every service (patterns/watchdog-bounce-on-deadlock); (ii) production adoption of Antithesis — "killer for distributed systems" — first-person confirmation of the investment JP Phillips's 2025-02-12 exit interview flagged as the external-adoption gate; (iii) checkpoint backups on object storage used "ultimately" to reboot the cluster when diagnosis exceeded restore time; (iv) eliminated partial updates in favour of whole-object republish ("we should have done it this way to begin with"); (v) regionalization project (patterns/two-level-regional-global-state) — per-region clusters + a small global app→region cluster, in progress at time of publication, in response to the 2024-09-01 contagious deadlock. Scope discipline reaffirmed: "not every piece of state we manage needs gossip propagation" — systems/tkdb + systems/petsem run on systems/litefs / systems/litestream, not Corrosion. Open source: github.com/superfly/corrosion.
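The last-write-wins rule at the bottom of that stack can be sketched as a merge keyed by (logical timestamp, node id) — an illustration of the idea, not cr-sqlite's implementation; names below are hypothetical:

```python
def lww_merge(local: dict, incoming: list) -> dict:
    """Merge flooded changes into local state, last write wins.
    `local` maps key -> (value, logical_ts, node_id);
    `incoming` is a list of (key, value, logical_ts, node_id) changes."""
    for key, value, ts, node in incoming:
        current = local.get(key)
        # higher logical timestamp wins; node id breaks ties deterministically,
        # so every worker converges on the same value regardless of arrival order
        if current is None or (ts, node) > (current[1], current[2]):
            local[key] = (value, ts, node)
    return local

state = {"app1": ("host-a", 3, "worker-7")}
lww_merge(state, [
    ("app1", "host-b", 5, "worker-2"),   # newer timestamp: wins
    ("app1", "host-c", 5, "worker-1"),   # same timestamp, lower node id: loses
])
# state["app1"] == ("host-b", 5, "worker-2")
```

Because the merge is commutative given the tie-break, workers can flood changes in any order over the mesh and still converge — which is exactly what lets Corrosion skip consensus.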
- 2025-06-20 — Phoenix.new: The Remote AI Runtime for Phoenix — Chris McCord (Phoenix framework creator) introduces Phoenix.new as his Fly.io skunkworks project: a browser-delivered coding agent for Elixir/Phoenix where every session is an ephemeral Fly Machine with a root shell shared between the developer and the agent, a full (not headless-only) Chrome the agent drives to verify UI changes ("instead of trying to iterate on screenshots, the agent sees real page content and JavaScript state"), automatic `*.phx.run` preview URLs from any bound port via integrated port-forwarding, and the GitHub `gh` CLI pre-installed. Canonical productised instance of patterns/ephemeral-vm-as-cloud-ide (four months after the 2025-02-07 VSCode-SSH-bananas sketch), canonical instance of patterns/agent-driven-headless-browser (colocated variant; sibling of the MoltWorker proxied variant), patterns/ephemeral-preview-url-via-port-forward, and patterns/agentic-pr-triage (McCord: "I close my laptop, grab a cup of coffee, and wait for a PR to arrive" against phoenix-core issues). Canonical wiki statements of concepts/cloud-ide, concepts/ephemeral-dev-environment, concepts/agent-with-root-shell ("it owns the whole environment"), concepts/agent-driven-browser, concepts/ephemeral-preview-url, and concepts/async-agent-workflow — the coding-agent specialisation of the 2025-04-08 RX thesis. Three-signal closed loop (server logs + browser DOM/JS state + test output) sharpens concepts/agentic-development-loop's previous two-signal framing. System prompt tuned for Phoenix / LiveView / Channels / Presence / Ecto today; "all languages you care about are already installed" (Rails / Expo / Svelte / Go work out of the box; new framework tuning is roadmap). Tetris demo at ElixirConf EU cited as existence proof for frontier-LLM world knowledge covering surface-pattern gaps in LiveView-specific training data.
- 2025-10-02 — Litestream v0.5.0 is Here — Ben Johnson's shipping-announcement post for the first batch of the 2025-05-20 Litestream redesign. Confirms what actually landed in v0.5.0: the LTX file format replaces raw-WAL shipping; a three-level compaction hierarchy (30s → 5m → 1h windows) gives "a dozen or so files on average" restore cost regardless of retention; the old "generations" abstraction ("Marvel Cinematic Universe parallel dimensions in which your database might be simultaneously living in. Yeah, we didn't like those movies much either") is fully retired in favor of monotonic transaction IDs (TXIDs) (`litestream wal` → `litestream ltx`); the LTX library now compresses per-page with an end-of-file index so individual pages can be plucked out without downloading the whole file (structural precondition for VFS read replicas); CGO is gone — switched from `mattn/go-sqlite3` to `modernc.org/sqlite` (cross-compile-from-Mac-to-Linux Just Works); NATS JetStream joins S3 / GCS / Azure as a replica type; one replica destination per database codified as a hard constraint; file-format break from v0.3.x (cutover — old WAL files preserved for rollback); config file is fully backwards compatible. Pedagogic opening example (the `sandwiches(id INTEGER PRIMARY KEY AUTOINCREMENT, description TEXT, star_rating INTEGER, reviewer_id INTEGER)` table with reviewers dithering between ⭐ and ⭐⭐) illustrates why raw-WAL-shipping restore cost scales with "raw WAL volume", not "distinct logical state." VFS-based read replicas still not shipped — "we already have a proof of concept working"; the read-replica layer is next. HN: 430 points. Canonical wiki instance of LTX compaction's concrete 30s / 5m / 1h ladder.
- 2025-05-28 — parking_lot: ffffffffffffffff… — Thomas Ptacek's long-form debugging retrospective on a weeks-long hunt for why fly-proxy instances in European regions (overwhelmingly WAW) started locking up after Fly broadened lazy-loading of the fly-proxy Catalog — the in-memory aggregation of Corrosion2 routing state that the proxy consults to forward requests. Architectural framing: fly-proxy is the Anycast router and the hard problem is state distribution ("managing millions of connections for millions of apps", state in constant flux); Catalog is the FIB to Corrosion's RIB (updates propagate host-to-host in millisecond intervals); the long-term fix is regionalization to shrink the global broadcast domain of routing updates. Sets up two pairs of production incidents. (2024-era Round 0) a global Anycast deadlock caused by an `if let` read-lock-scope bug — an update about an app nobody used propagated fleet-wide in ms and deadlocked the routing layer; canonical if-let-lock-scope-bug instance. Short-term response: a watchdog on the fly-proxy internal REPL control channel (concepts/watchdog-repl-channel) that bounces the proxy + snaps core dumps — canonical patterns/watchdog-bounce-on-deadlock instance. (2025 Rounds 1-5) broadening lazy-loading exposes writer contention + a suspicious `if let` → Catalog lock refactor: eliminate `if let`-over-locks, replace RAII with explicit closures (patterns/raii-to-explicit-closure-for-lock-visibility), adopt `try_write_for` timeouts + labeled-log telemetry (patterns/lock-timeout-for-contention-telemetry). Still locks up. Lock-timing instrumentation fires just before lockups in benign, quiet applications. parking_lot's deadlock detector finds nothing. Pavel Zakharov reads core dumps: "no thread running inside the critical section… a thread waiting to acquire write lock and a bunch of threads waiting to acquire a read lock." Every single stack trace. Everyone wants the lock; nobody has it. Descent into madness: `miri` (finds UB in tests, fixes don't help); guard pages (never trip); wild theories (Tokio and parking_lot both ruled out); close-reading parking_lot source. Desperation probe: switch `read()` to `read_recursive()` (patterns/read-recursive-as-desperation-probe), which produces a NEW error: `RwLock reader count overflow`. First direct evidence of lock-word corruption. Root cause: parking_lot's RwLock state is packed into a 64-bit word (4 signal bits + 60-bit reader counter); the `try_write_for` timeout path and the reader-release unpark path both try to clear `WRITER_PARKED`; the atomic self-synchronizing clear (via `fetch_add` of the two's-complement inverse) wraps instead of zeroing → the lock word becomes `0xFFFFFFFFFFFFFFFF`. Canonical concepts/bitwise-double-free instance. Thread 1 grabs a read lock; Thread 2 parks with timeout; Thread 1 releases, unparking Thread 2 and clearing `WRITER_PARKED`; Thread 2 wakes thinking its timeout fired and tries to clear `WRITER_PARKED` again — bitwise double-free. Fix: parking_lot PR #466 + issue #465 — Fly.io's sixth patterns/upstream-the-fix instance on the wiki and the second in the Rust ecosystem (first was the 2025-02-26 rustls PR #1950). The WAW-specific timing mystery remains unresolved — "the wax and wane of caribou populations… we'll never know because we fixed the bug." Permanent debugging dividends from the arc: Catalog-wide `if let`-over-locks audit, RAII → closure refactor, labeled slow-write logs, last-and-current writer context tracking. Canonical wiki anchor for: 2024 fly-proxy global Anycast outage + 2025 parking_lot bug + watchdog-bounce safety net + RAII-to-closure refactor + lock-timeout-for-contention telemetry + descent-into-madness debugging + bitwise double-free + parking_lot upstream fix. Tier-3 source; clears the bar on production-incident + distributed-systems-internals + ecosystem-primitive + upstream-the-fix content.
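The wrap itself is reproducible with plain modular arithmetic. A sketch of the bitwise double-free (bit positions and names are illustrative, not parking_lot's exact constants):

```python
MASK64 = (1 << 64) - 1
WRITER_PARKED = 1 << 0   # one of the signal bits in the packed lock word

def fetch_add_clear(word: int, bit: int) -> int:
    """'Clear' a flag the way the buggy path did: add the two's-complement
    inverse of the bit, which only works if exactly one party clears it."""
    return (word + (MASK64 + 1 - bit)) & MASK64

word = WRITER_PARKED                           # a writer is parked; no readers
word = fetch_add_clear(word, WRITER_PARKED)    # first clear: correct, word == 0
word = fetch_add_clear(word, WRITER_PARKED)    # second clear: double-free — wraps
# word == 0xFFFFFFFFFFFFFFFF: a lock nobody holds that nobody can ever take
```

Two independent paths each believing they own the clear is exactly the timeout-vs-unpark race the post describes; the subtraction that should zero a set bit instead underflows the reader counter through all 64 bits, which is where the post's title comes from.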
- 2025-05-19 — Launching MCP Servers on Fly.io — Sam Ruby's "part showing off, part opinion" developer-blog post announcing `fly mcp launch`, a new flyctl subcommand (shipped in flyctl v0.3.125) that takes any existing local/stdio-style MCP server command and one-shots it into a remote HTTP/Streamable-HTTP MCP server running as a Fly Machine, with bearer-token auth on by default on both ends, client-config JSON rewritten in place for six built-in clients (Claude, Cursor, Neovim, VS Code, Windsurf, Zed), `--secret KEY=value` flags piped through to Machine secrets, and all Fly platform knobs (auto-stop, Flycast, Volumes, region, VM size) available. Canonical invocation:
```shell
fly mcp launch "npx -y @modelcontextprotocol/server-slack" \
  --claude --server slack \
  --secret SLACK_BOT_TOKEN=xoxb-... \
  --secret SLACK_TEAM_ID=T01234567
```
The post leads with the three-shape MCP-server taxonomy ("basically two types of MCP servers. One small and nimble that runs as a process on your machine. And one that is a HTTP server that runs presumably elsewhere and is standardizing on OAuth 2.1. And there is a third type, but it is deprecated.") and the client-config fragmentation complaint that names Claude Desktop's `~/Library/Application Support/Claude/claude_desktop_config.json` (`MCPServer` key) vs Zed's `~/.config/zed/settings.json` (`context_servers` key) vs OS-dependent per-tool variants — canonical wiki statement of concepts/mcp-client-config-fragmentation. Canonical wiki instance of systems/fly-mcp-launch + patterns/remote-mcp-server-via-platform-launcher. Pairs with the 2025-04-10 flymcp post to span both axes of MCP-server ergonomics: wrap a local CLI as a local stdio MCP server (patterns/wrap-cli-as-mcp-server) and deploy a local MCP server as a remote MCP server (patterns/remote-mcp-server-via-platform-launcher). The post also enumerates supported deployment shapes (one Machine per server / one container per server / in-process library mode) and access paths (HTTP Authorization / WireGuard / Flycast / reverse proxies). Beta status acknowledged — "examples as shown are thought to work. Maybe."
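The fragmentation complaint reduces to emitting the same server entry under different top-level keys per client. A hedged sketch (key names and schemas here are illustrative, not normative for any client):

```python
def client_config(client: str, name: str, command: str, args: list[str]) -> dict:
    """Emit one stdio MCP server in two clients' config dialects — the
    per-client key divergence is the fragmentation the post complains
    about, and what `fly mcp launch` rewrites in place."""
    entry = {"command": command, "args": args}
    if client == "claude-desktop":
        return {"mcpServers": {name: entry}}        # claude_desktop_config.json
    if client == "zed":
        return {"context_servers": {name: entry}}   # ~/.config/zed/settings.json
    raise ValueError(f"no template for {client}")

cfg = client_config("zed", "slack", "npx",
                    ["-y", "@modelcontextprotocol/server-slack"])
```

Six supported clients means six such templates; the content of the entry barely changes, only the envelope.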
- 2025-05-07 — Provisioning Machines using MCPs — Sam Ruby's short developer-blog post marking the mutation transition of Fly.io's flyctl MCP server: the read-only 2-tool prototype from a month earlier (2025-04-10) now covers the full `fly volumes` subcommand family (create / list / extend / show / fork / snapshots / destroy) — shipped in flyctl v0.3.117. First wiki instance of patterns/wrap-cli-as-mcp-server crossing the read-only → mutating production-resource boundary. "I created my first fly volume using an MCP … and it worked the first time. A few hours later, and with the assistance of GitHub Copilot, i added support for all fly volumes commands." Load-bearing architectural thesis paragraph for concepts/natural-language-infrastructure-provisioning: "Today's state of the art is K8S, Terraform, web based UIs, and CLIs. Those days are numbered." Introduces the "Make it so" target UX — LLM scans code, presents a plan, human adjusts and approves, agent executes; on failure it examines logs and proposes next steps. Canonical wiki statement of patterns/cli-safety-as-agent-guardrail — the CLI's pre-existing human-operator refusal invariant ("can't destroy a volume that is currently mounted") becomes the MCP server's authorization boundary for free: "Since this support is built on flyctl, I would have received an error had I tried to destroy a volume that is currently mounted. Knowing that gave me the confidence to try the command." Emergent resource-hygiene UX: Claude spontaneously noted unattached volumes and offered to delete the oldest on request — the agentic troubleshooting loop extended from diagnosis into provisioning hygiene. Three-way alternative-rejection framing (HTTP Machines API, `flyctl` directly, web dashboard) as the concepts/agent-ergonomic-cli design-pressure confirmation. Gestures at future MCP servers running on Fly's private network — "on separate machines, or in 'sidecar' containers, or even integrated into your app" — pairing with the 2025-04-08 robot-routing / long-lived-SSE framing that gave flymcp its deployment substrate. Local-MCP security posture (concepts/local-mcp-server-risk) continues unchanged; the read-only → mutation transition raises the blast radius accordingly. Caveat: "Just be aware this is not a demo, if you ask it to destroy a volume, that operation is not reversable. Perhaps try this first on a throwaway application." Configuration template: a `claude_desktop_config.json` snippet wires `flyctl mcp server` as the stdio command; MCP Inspector (local port 6274) is named as the agent-free validation surface for server authors iterating on tool schemas.
2025-05-20 — Litestream: Revamped — Ben Johnson's retrospective on the largest Litestream redesign since its 2020 launch. Three ideas imported from LiteFS: (1) the LTX file format — sorted, transaction-aware SQLite page-range changesets — replaces raw-WAL shipping; adjacent LTX files k-way-merge via LSM-style compaction, so restore to any PITR target costs the compacted state size rather than cumulative WAL volume. (2) CASAAS — Compare-and-Swap as a Service — a time-based replication lease implemented via object-store conditional writes (S3's 2024-11 launch; Tigris also supports the primitive). Retires LiteFS's Consul dependency for single-leader enforcement; rolling deploys with overlapping Litestream processes against the same destination are now safe; the "generations" abstraction collapses to a single-generation invariant. (3) VFS-based read replicas — a SQLite Virtual Filesystem extension loaded into the application that fetches and caches pages directly from object storage; no FUSE required (LiteFS's usability wall). Works in WASM and restricted FaaS environments. Explicit trade named: "this approach isn't as efficient as a local SQLite database" — caching and prefetching are the performance knobs. Two secondary consequences: wildcard/directory replication (`/data/*.db`) of hundreds or thousands of SQLite databases from one Litestream process is viable for the first time — previously blocked on WAL-polling cost and slow restores — and an agent-storage framing: "the robots that write LLM code are going to like SQLite too. … coding agents like Phoenix.new want [a] way to try out code on live data, screw it up, and then rollback both the code and the state. These Litestream updates put us in a position to give agents PITR as a primitive. On top of that, you can build both rollbacks and forks." Ties to the 2025-04-08 RX framing and concepts/stateful-incremental-vm-build. Forward-looking — "we're building", "should be possible" — with no production numbers; 452 HN points (item 44045292). Extends the canonical [SQLite + LiteFS + Litestream](../patterns/sqlite-plus-litefs-plus-litestream.md) stack (canonical at tkdb): post-revamp, the three layers architecturally converge on LTX as the shared wire format. Wiki-first disclosures: LTX file format, CASAAS / object-store conditional-write lease, SQLite VFS as replication integration point, shadow WAL as legacy mechanism.
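The CASAAS lease idea above — single-leader enforcement via nothing but conditional writes on the destination object store — can be sketched as follows. This is a minimal illustration, not Litestream's implementation: `ObjectStore` is an in-memory stand-in for a store that supports compare-and-swap on an ETag, and `try_acquire` is a hypothetical name.

```python
import time

class ObjectStore:
    """In-memory stand-in for an object store with conditional writes
    (compare-and-swap on an ETag, the primitive S3 shipped in 2024-11)."""
    def __init__(self):
        self._data = {}   # key -> (etag, value)
        self._etag = 0

    def put_if(self, key, value, expected_etag):
        """Write only if the current ETag matches (None = key must not
        exist). Returns the new ETag on success, None on a lost race."""
        current = self._data.get(key)
        if (current[0] if current else None) != expected_etag:
            return None
        self._etag += 1
        self._data[key] = (self._etag, value)
        return self._etag

    def get(self, key):
        return self._data.get(key)  # (etag, value) or None

def try_acquire(store, owner, ttl=10.0, now=None):
    """Time-based single-writer lease: the CAS on the lease object is
    what makes overlapping replication processes (e.g. during a rolling
    deploy) safe against the same destination."""
    now = time.monotonic() if now is None else now
    current = store.get("lease")
    expected = None
    if current is not None:
        etag, lease = current
        if lease["expires"] > now and lease["owner"] != owner:
            return False            # someone else holds a live lease
        expected = etag             # take over an expired lease via CAS
    new_lease = {"owner": owner, "expires": now + ttl}
    return store.put_if("lease", new_lease, expected) is not None
```

Because every transition goes through the conditional write, two processes racing for an expired lease cannot both win — exactly the property that lets the revamp retire Consul.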
- 2025-04-10 — 30 Minutes With MCP and flyctl — Thomas Ptacek's internal-message-turned-blog post on building flymcp — the "most basic" MCP server for `flyctl` — in 30 minutes, in ~90 lines of Go on mark3labs/mcp-go. Two tools exposed: `fly logs` and `fly status`. Canonical wiki instance of patterns/wrap-cli-as-mcp-server. Load-bearing precondition: Fly.io's 2020 decision to give most `flyctl` commands a `--json` mode "to make them easier to drive from automation" (concepts/structured-output-reliability, concepts/agent-ergonomic-cli) — a five-year-old automation-friendliness decision that retroactively became an AI-integration-readiness decision. Pointed at unpkg (the Fly-hosted npm CDN), Claude reconstructed the 10-Machine regional topology, flagged 2 machines in critical status, correlated `oom_killed: true` events, pulled logs on follow-up, and produced a per-second incident timeline (OOM kill → SIGKILL → reboot → health-check fail → listener up → health-check pass, ~43s end-to-end; Bun process at ~3.7 GB of 4 GB allocated) — a canonical concepts/agentic-troubleshooting-loop instantiation with a deliberately minimal tool surface (patterns/tool-surface-minimization + patterns/allowlisted-read-only-agent-actions). Ptacek: "annoyingly useful … faster than I find problems in apps." The closing caveat is the canonical wiki statement of concepts/local-mcp-server-risk — "Local MCP servers are scary. I don't like that I'm giving a Claude instance in the cloud the ability to run a native program on my machine. I think `fly logs` and `fly status` are safe, but I'd rather know it's safe. It would be, if I was running `flyctl` in an isolated environment and not on my local machine." — gesturing at patterns/disposable-vm-for-agentic-loop (the 2025-02-07 VSCode-SSH-bananas companion) as the answer. Third RX-era post in a ~three-day window: complements 2025-04-08 Our Best Customers Are Now Robots (RX framing + MCP-SSE routing requirement) and ties back to the 2025-02-07 VSCode-SSH-bananas disposable-VM sandbox sketch.
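The load-bearing mechanic of wrap-cli-as-mcp-server is small enough to sketch: a tool handler is nothing more than "shell out to the CLI with `--json` and return parsed stdout." This is an illustrative Python sketch, not the actual Go flymcp code; `make_cli_tool` is a made-up helper and the commented wiring at the bottom only mirrors the post's two-tool surface.

```python
import json
import subprocess

def make_cli_tool(binary, *fixed_args):
    """Wrap a CLI subcommand as a tool callable. The entire 'integration'
    is shelling out with --json and parsing stdout -- which is why a
    machine-readable output mode makes a CLI trivially agent-callable."""
    def tool(**params):
        args = [binary, *fixed_args, "--json"]
        for key, value in params.items():
            args += [f"--{key}", str(value)]
        out = subprocess.run(args, capture_output=True, text=True, check=True)
        return json.loads(out.stdout)
    return tool

# Illustrative wiring (not flymcp's actual handlers):
# fly_status = make_cli_tool("flyctl", "status")
# fly_logs   = make_cli_tool("flyctl", "logs", "--no-tail")
```

An MCP server built this way is mostly schema plumbing around these callables, which is how the whole thing fits in ~90 lines.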
- 2025-03-27 — Operationalizing Macaroons — Thomas Ptacek's deepest architectural disclosure of Fly.io's security-token stack to date, written as Fly.io hands off internal ownership of the Macaroon project. Two years in as "the Internet's largest user of Macaroons," the user-facing pitch (end-user attenuation, emailing scoped tokens to partners) has been a mixed bag — "users don't really take advantage of token features" — but the infrastructure wins have made the token system "one of the nicer parts of our platform." Canonical wiki introduction of `tkdb` (~5,000 lines of Go managing a SQLite database via LiteFS + Litestream on isolated hardware in US/EU/AU; records encrypted with an injected secret; "one of the very few well-behaved" infra-SQLite databases at Fly.io) — canonical instance of patterns/isolated-token-service + patterns/sqlite-plus-litefs-plus-litestream. Transport is HTTP/Noise, with `Noise_IK` on the verification path (TLS-analog, broad verifier set) and `Noise_KK` on the signing path (mTLS-analog, "only a handful" of clients with the keypair). Verification cache hit rate is >98% thanks to the chained-HMAC construction's descendant-inheritance property. Revocation is the canonical feed-subscription pattern: `tkdb` exports a revocation-notifications endpoint; clients poll and prune caches, and dump the whole cache on connectivity loss; blacklist-to-every-region was explicitly rejected ("we certainly don't want to propagate the blacklist database to 35 regions around the globe"). Cosmetic logout is named as the anti-pattern this design prevents. Authorization-vs-authentication split via third-party caveats; service tokens use `tkdb`'s caveat-strip API to remove the authN caveat, and recipients further attenuate locally to bind the token to a specific flyd instance or Fly Machine — "exfiltrating it doesn't do you any good; to use it, you have to control the environment it's intended to be used in." The same caveat-for-privilege-separation pattern runs in reverse at Pet Semetary (Fly's Vault replacement): flyd's read-secret Macaroon has a third-party caveat dischargeable only by proving org permissions through `tkdb`. Explicit secure-design heuristic: concepts/keep-hazmat-away-from-complex-code — "root secrets for Macaroon tokens are hazmat." Telemetry: OpenTelemetry + Honeycomb + a permanent-retention OpenSearch audit trail (concepts/context-propagation-otel + concepts/audit-trail-in-opensearch) — Ptacek's explicit retraction of his prior OTel skepticism. Operational datum: "the `tkdb` code is remarkably stable and there hasn't been an incident intervention with our token system in over a year." Culture disclosure: Fly.io self-describes as "allergic to 'microservices'" — but "a second dedicated security service (Petsem)" alongside `tkdb` has "pulled its weight"; narrow-purpose security services are the carve-out exception. Closing nod to infrastructure-SQLite as a "total victory for LiteFS, Litestream" — an implicit contrast to corrosion's tens-of-gigabytes operational footprint.
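The chained-HMAC construction behind the >98% verification cache hit rate is compact enough to sketch. This is the textbook Macaroon signature chain, not tkdb's wire format; all names here are illustrative.

```python
import hashlib
import hmac

def _chain(key, msg):
    return hmac.new(key, msg.encode(), hashlib.sha256).digest()

def mint(root_key, identifier):
    """A token is an identifier, a caveat list, and a running HMAC."""
    return {"id": identifier, "caveats": [], "sig": _chain(root_key, identifier)}

def attenuate(token, caveat):
    """Anyone holding a token can narrow it offline: the new signature
    is HMAC(old_sig, caveat), so caveats can be added without the root
    key but never removed."""
    return {"id": token["id"],
            "caveats": token["caveats"] + [caveat],
            "sig": _chain(token["sig"], caveat)}

def verify(root_key, token, context):
    """Re-derive the chain from the root key; every caveat must also
    hold in the request context."""
    sig = _chain(root_key, token["id"])
    for caveat in token["caveats"]:
        if caveat not in context:
            return False
        sig = _chain(sig, caveat)
    return hmac.compare_digest(sig, token["sig"])
```

The descendant-inheritance property falls out of the chain: a verifier that has already validated a parent token can validate any locally-attenuated descendant by chaining forward from the parent's cached signature, without another round-trip to the token service.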
- 2025-04-08 — Our Best Customers Are Now Robots — Thomas Ptacek retrospective naming LLM-driven coding agents as the dominant growth driver on Fly.io over the prior ~6 months and introducing Robot Experience (RX) as a product-design axis alongside UX and DX. Four platform-primitive disclosures that make Fly.io "robot bait" without intending to: (1) compute lifecycle — the `start`-vs-`create` split (`start` is "lightning fast … substantially faster than booting up even a non-virtualized K8s Pod"; "too subtle a distinction for humans" but "the robots are getting a lot of value out of it"); pairs with the first wiki disclosure that non-GPU Machines run on Lambda's hypervisor ("Not coincidentally, our underlying hypervisor engine is the same as Lambda's" — Firecracker) and the Lambda-EC2 hybrid positioning ("start like it's spring-loaded, in double-digit millis […] can stick around as long as you want it to"). (2) Storage — LLMs build Machines incrementally; they want a filesystem plus object storage, not Postgres (concepts/stateful-incremental-vm-build; "what they really need is a filesystem, the one form of storage we sort of wish we hadn't done. That, and object storage."). Fly Volumes + Tigris. (3) Networking — MCP's long-lived SSE connections in multitenant deployments need session-affinity routing (concepts/mcp-long-lived-sse); Fly's dynamic request routing is "possibly a robot attractant" for exactly this shape. Canonical patterns/session-affinity-for-mcp-sse instance. (4) Secrets — tokenized OAuth tokens (concepts/tokenized-secret) let the LLM hold a placeholder while a "hardware-isolated, robot-free Fly Machine" substitutes the real credential at egress (patterns/tokenized-token-broker); grounded in Fly.io's 2024 tokenized-tokens substrate. Forward-looking note: "it should be easy to MCP our API" (not shipped at publication). DX is still primary ("the most important engineering work happening today at Fly.io is still DX, not RX; it's managed Postgres (MPG)") but RX is now a first-order concern. Canonical wiki datum for concepts/robot-experience-rx + concepts/vibe-coding + patterns/start-fast-create-slow-machine-lifecycle. Platform-demand-side companion to the 2025-02-14 GPU retrospective (concepts/developers-want-llms-not-gpus). Eighth Fly.io ingest on the wiki.
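The tokenized-token-broker shape from disclosure (4) above reduces to a small substitution step at egress. A minimal sketch under stated assumptions: `EgressBroker` and its methods are invented names, and a real broker would run in the isolated Machine, not in-process with the agent.

```python
class EgressBroker:
    """Holds real credentials in an environment the agent can't reach;
    the agent only ever sees opaque placeholders."""
    def __init__(self):
        self._vault = {}

    def tokenize(self, real_secret):
        placeholder = f"tok_{len(self._vault)}"   # opaque handle for the LLM
        self._vault[placeholder] = real_secret
        return placeholder

    def forward(self, headers):
        """At egress, swap a placeholder bearer token for the real
        secret. Exfiltrating the placeholder is useless outside this
        path -- the substitution only happens here."""
        out = dict(headers)
        auth = out.get("Authorization", "")
        if auth.startswith("Bearer tok_"):
            placeholder = auth.removeprefix("Bearer ")
            out["Authorization"] = f"Bearer {self._vault[placeholder]}"
        return out
```

The design point is that the LLM's context window never contains anything worth stealing; the binding between placeholder and secret lives only on the broker's side of the egress boundary.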
- 2024-05-09 — Picture This: Open Source AI for Image Description — Fly.io Machines-team developer-enablement walkthrough of a weekend-scale open-source image-description service (Ollama + LLaVA + PocketBase + LangChainGo) hosted on Fly Machines. The product framing is accessibility (AI-generated alt text for blind users, screen-reader integration via NVDA); the architectural substance is the GPU scale-to-zero recipe with a disclosed cold-start number. An Ollama Fly Machine on the `a100-40gb` preset runs LLaVA-34b behind Flycast; Fly Proxy autostart/autostop stops the Machine after ~minutes of idle silence and starts it on the next internal request from the PocketBase app-tier Machine. Canonical instance of patterns/proxy-autostop-for-gpu-cost-control + patterns/flycast-scoped-internal-inference-endpoint. Disclosed cold-start latency: "starting it up took another handful of seconds, followed by several tens of seconds to load the model into GPU RAM. The total time from cold start to completed description was about 45 seconds." — the canonical wiki datum for concepts/gpu-scale-to-zero-cold-start (three-stage budget: Machine start in seconds, model load into GPU RAM in tens of seconds, first response in seconds). The post also names two model-payload options on a stopped GPU Machine: a Fly Volume for model weights, or baking the model into the Docker image. Side notes: context-window blow-out on the simple follow-up chain ("you'll see the quality of responses get poorer — possibly incoherent — as the context exceeds the context window"); modularity claim (swap model and prompt for sentiment, joke-generation, or other tasks). Scope disposition: Tier-3 borderline — hobby-project framing, but on-scope as a GPU-inference scale-to-zero production recipe with a real cold-start number. Sibling to the 2024-09-24 Livebook/FLAME cluster scale-to-zero post (different shape: notebook-driven cluster of 64 L40S Machines vs. the single-Machine proxy-autostop here).
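The proxy autostart/autostop mechanic in the entry above is just an idle timer plus a restart-on-request rule. A minimal sketch, with invented names (`AutostopMachine` is not Fly Proxy's implementation); the caller of `handle_request` is the one who pays the disclosed ~45-second three-stage cold-start budget whenever it returns `True`.

```python
class AutostopMachine:
    """Scale-to-zero shape: the proxy stops a GPU Machine after an idle
    window and restarts it on the next internal request, trading a
    multi-second cold start for zero idle GPU-hours."""
    def __init__(self, idle_timeout=300.0):
        self.idle_timeout = idle_timeout
        self.running = False
        self.last_request = None

    def handle_request(self, now):
        """Returns True when this request hits a cold start."""
        cold = not self.running
        self.running = True       # autostart on first request after a stop
        self.last_request = now
        return cold

    def tick(self, now):
        """Periodic reaper: stop the Machine once idle past the timeout."""
        if self.running and now - self.last_request > self.idle_timeout:
            self.running = False
```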
- 2025-02-26 — Taming a Voracious Rust Proxy — Production-incident retrospective on `fly-proxy`. Two IAD edge hosts pegged CPU and spiked HTTP errors over "some number of hours." Bouncing `fly-proxy` cleared it; it came back. Pavel pulled a flamegraph — dominated by Rust's `tracing::Subscriber` (which is supposed to be very fast), the signature of a spurious-wakeup busy-loop. The fully-qualified `Future` type in the flamegraph pointed past Fly's own wrappers (`Duplex`, `ReusableReader`, `PeekableReader`, `MeteredIo`, `PermittedTcpStream`) to `tokio_rustls::server::TlsStream` — pre-existing upstream issue tokio-rustls#72 documented exactly this pattern: on an orderly TLS `close_notify` shutdown with still-buffered bytes on the underlying socket, `TlsStream` mishandles its Waker → 100% CPU. Trigger: Tigris Data load-testing — "tens of thousands of connections" with small HTTP bodies terminating early enough to set up the `close_notify`-before-EOF scenario. Fix: upstreamed as rustls PR #1950 — canonical Rust-ecosystem instance of patterns/upstream-the-fix, plus the full diagnostic workflow patterns/flamegraph-to-upstream-fix. Self-drawn lessons: (1) patterns/dependency-update-discipline — "Keep your dependencies updated. Unless you shouldn't…" — the value is in the process and test infrastructure, not the updates themselves; (2) patterns/spurious-wakeup-metric — "Spurious wakeups should be easy to spot, and triggering a metric when they happen should be cheap, because they're not supposed to happen often." Also the canonical one-line statement of `fly-proxy`'s edge-router role: "Edges exist almost solely to run a Rust program called `fly-proxy`, the router at the heart of our Anycast network." First Fly.io production-incident retrospective ingested; complements the prior architectural (Making Machines Move, JIT WireGuard) and identity (AWS without access keys) posts.
- 2025-02-14 — We Were Wrong About GPUs — Retrospective / course-correction post by Thomas Ptacek on Fly.io's 2022-era bet on productising GPU Fly Machines. "We're not getting rid of them" + "you'll probably be waiting awhile [for a v2]" — the canonical patterns/platform-retrenchment-without-customer-abandonment instance. Three load-bearing disclosures: (1) Hypervisor split — non-GPU Machines on Firecracker, GPU Machines on Intel Cloud Hypervisor (PCI passthrough); Fly rejected QEMU (ms-boot DX required) and VMware (institutional fit). (2) Nvidia driver happy-path disclosure — the supported path is K8s-shared-kernel or QEMU/VMware; Fly burned months (and ultimately failed) getting virtualized-GPU drivers working on Cloud Hypervisor, including hex-editing closed-source drivers to impersonate QEMU (concepts/nvidia-driver-happy-path). (3) Demand-side diagnosis — developers don't want GPUs, they want LLMs; insurgent clouds can't compete with OpenAI / Anthropic on tokens-per-second (concepts/insurgent-cloud-constraints). The security posture is load-bearing: GPUs are just about the worst-case peripheral, mitigated by dedicated GPU-only worker hosts (patterns/dedicated-host-pool-for-hostile-peripheral) plus two independent external audits from Atredis and Tetrel (patterns/independent-security-assessment-for-hardware-peripheral). MIG thin-slicing remained unreachable because MIG "gives you a UUID, not a PCI device." The L40S customer segment persists as the one SKU that found fit. Asset-backed-bet framing (concepts/asset-backed-bet): GPU hardware is liquidatable like Fly's IPv4 portfolio, so the downside is partially recoverable. Parallel drawn to Fly.io's earlier JS-edge-runtime course-correction: "we were wrong about Javascript edge functions, and I think we were wrong about GPUs." Paired with the JP Phillips exit interview two days earlier as the honest-retrospective half of Fly.io's 2025-Q1 blog posture.
- 2025-02-12 — The Exit Interview: JP Phillips — Q-and-A with the engineer who led flyd — Fly.io's Fly Machines orchestrator — over four years. Architectural retrospective disclosures: (1) flyd's FSM-plus-durable-steps design is ancestry-linked to HashiCorp Cadence + Compose.io/MongoHQ "recipes/operations" (concepts/durable-execution); deploy-tolerance ("pick back up where it left off, post-deploy") is the load-bearing property; (2) JP defends BoltDB over SQLite for flyd's state store — the blast-radius-of-an-ad-hoc-SQL-statement argument, plus "limiting the storage interface kept flyd's scope managed" (concepts/bolt-vs-sqlite-storage-choice); (3) the alternate design JP would consider: one SQLite per Fly Machine, with schema management as the named open problem (patterns/per-instance-embedded-database); (4) pilot — Fly's new OCI-compliant init — consolidates the feature-bag init and gives flyd a formal driving API; (5) flaps — the Machines-API gateway — is named as decentralised and hits a sub-5-second P90 on `machine create` globally (Johannesburg / Hong Kong excepted); (6) corrosion2 — Fly's SWIM-gossip CRDT-SQLite state-distribution system — is JP's "most impressive thing someone else built," with TLA+/Antithesis validation named as the investment gate for external adoption (patterns/formal-methods-before-shipping); (7) OpenTelemetry + Honeycomb are load-bearing ("without oTel it'd be a disaster … I'd have ragequit trying"). Also cultural content (2023 over-hiring, GPU distraction — sibling to sources/2024-08-15-flyio-were-cutting-l40s-prices-in-half) and a stated platform-completeness framing: "The Fly Machines platform is more or less finished … My original desire to join Fly.io was to make Machines a product that would rid us of HashiCorp Nomad, and I feel like that's been accomplished."
- 2025-02-07 — VSCode's SSH Agent Is Bananas — Architectural critique of VSCode Remote-SSH from the vantage of integrating Fly Machines into VSCode's remote-editing flow. Contrasts Emacs Tramp (lives off the land, no agent deployed) with VSCode Remote (bash stager → downloads a Node.js binary + agent → WebSocket RPC over an SSH port-forward → persists across reconnects; "murid in nature" — the RAT shape, as Fly calls it). Names the agentic development loop as the 2025-motivating use-case: "close the loop between the LLM and the execution environment […] a semi-effective antidote to hallucination." Argues for disposable-vm-for-agentic-loop as the answer — "a clean-slate Linux instance that spins up instantly and that can't screw you over in any way" — with Fly Machines as the implied substrate. A productisation post is deferred.
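flyd's FSM-plus-durable-steps design from the exit interview above — journal each completed step so a post-deploy restart "picks back up where it left off" — can be sketched in a few lines. A minimal illustration under stated assumptions: `DurableSteps` and the demo step names are invented, not flyd's actual operations, and a real journal would be persisted (flyd uses BoltDB), not a dict.

```python
class DurableSteps:
    """FSM-plus-durable-steps shape: each completed step is journaled,
    so a restarted process replays the journal and resumes instead of
    re-running side effects."""
    def __init__(self, journal=None):
        self.journal = journal if journal is not None else {}

    def run(self, steps):
        for name, action in steps:
            if name in self.journal:       # already done before the restart
                continue
            self.journal[name] = action()  # record completion, then advance

# Illustrative machine-create flow (step names are made up, not flyd's):
def demo(journal, log):
    steps = [
        ("reserve_capacity", lambda: log.append("reserve")),
        ("write_config",     lambda: log.append("config")),
        ("boot",             lambda: log.append("boot")),
    ]
    DurableSteps(journal).run(steps)
```

Running `demo` a second time against the same journal performs no new side effects — which is exactly the deploy-tolerance property the interview names as load-bearing.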
- 2024-09-24 — AI GPU Clusters, From Your Laptop, With Livebook — ElixirConf 2024 keynote recap. Livebook + FLAME + the Nx stack let a notebook running on a laptop drive elastic GPU-cluster compute on Fly Machines. Canonical demos: Llama-on-L40S summarising video-stills pipeline, and 64 L40S Fly Machines hyperparameter-tuning different BERT variants with per-node loss curves streamed back to the notebook in real time. The whole cluster terminates on notebook disconnect (scale-to-zero). Platform-level claim: "start a cluster of GPUs in seconds rather than minutes, and all it requires is a Docker image" (concepts/seconds-scale-gpu-cluster-boot). Same runtime+FLAME integration now also runs on Kubernetes (Livebook v0.14.1, Michael Ruoss). Canonical instance of patterns/notebook-driven-elastic-compute and patterns/framework-managed-executor-pool.
- 2024-08-15 — We're Cutting L40S Prices In Half — GPU strategy retrospective + price cut. Customer data surprised Fly.io: the least capable GPU (A10) is the most popular by a wide margin. Fractional-A100 via MIG / vGPU + IOMMU PCI passthrough failed ("a project so cursed"); pivot to whole-GPU attachment. L40S cut to $1.25/hr (= A10 price) to collapse the inference-GPU choice. Architectural thesis: inference = transaction, training = batch; for transaction-shaped inference, the combination of GPU + RAM + Tigris + Anycast beats a bigger GPU. Canonical instance of GPU + object-storage co-location.
- 2024-07-30 — Making Machines Move — Year-long rebuild of fleet-drain for stateful Fly Machines with attached Fly Volumes. Introduces a `clone` primitive (kill → clone → boot) built on the Linux kernel's `dm-clone` device-mapper target; clone returns immediately, a new Machine boots on the target worker, reads of un-hydrated blocks fall through to the source worker over iSCSI (NBD was tried first and abandoned — kernel threads got stuck under network disruption), and `kcopyd` rehydrates in the background. Gnarly complications: cryptsetup version skew → LUKS2 header-size drift (4 MiB vs 16 MiB) → an RPC in the migration FSM to carry metadata; 6PN address-embeds-routing → migration changes addresses → Fly Postgres configs hardcoded literal addresses → an in-init address-mapping bridge plus a fleet-wide config rewrite; Corrosion (SWIM-gossip SQLite)'s worker-is-source-of-truth invariant breaks. Ends with a nod to LSVD (NVMe cache + object store) as the medium-term direction. "This is the biggest thing our team has done since we replaced Nomad with flyd."
- 2024-06-19 — AWS without Access Keys — Fly.io's OIDC IdP (`oidc.fly.io/<org>`) + `AssumeRoleWithWebIdentity` → Fly Machines get AWS S3 access with no keypair ever stored; Fly init detects `AWS_ROLE_ARN`, fetches an OIDC token via `/.fly/api`, writes it to `/.fly/oidc_token`, and exports `AWS_WEB_IDENTITY_TOKEN_FILE` for the AWS SDK. Macaroon-scoped per-Machine identity; SSRF-resistant Unix-socket API proxy; `<org>:<app>:<machine>` `sub`-field scoping for trust policies. Canonical wiki source for OIDC federation for cloud access and patterns/oidc-role-assumption-for-cross-cloud-auth.
- 2024-03-12 — JIT WireGuard — Gateway-fleet WireGuard peer provisioning flipped from NATS-push to pull-on-first-packet. BPF-sniffs handshake initiations, runs ~200 lines of Noise crypto to identify the peer, pulls config from the Fly API, installs via Netlink. Kernel stale-peer count: ~550k → "rounds to none." Also documents Fly.io's broader migration off NATS for internal RPCs (flyd is now HTTP).
- 2024-03-07 — Fly Kubernetes does more now — Fly Kubernetes launched in public beta; Pods as Firecracker micro-VMs; the flyd orchestrator; integration surfaces across Fly Machines.
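The dm-clone read path from Making Machines Move — the destination serves reads immediately, un-hydrated blocks fall through to the source, and a background copier fills in the rest — can be sketched at the block level. This is an illustrative model, not the kernel target: `CloneTarget` is an invented name, the dicts stand in for block devices, and `hydrate_step` plays kcopyd's role.

```python
class CloneTarget:
    """dm-clone-shaped migration: the new worker serves reads at once;
    blocks not yet hydrated fall through to the source (iSCSI in the
    post), while a background copier rehydrates the remainder."""
    def __init__(self, source_blocks):
        self.source = source_blocks        # stand-in for the old worker
        self.local = {}                     # hydrated blocks on the new worker
        self.pending = set(source_blocks)   # what the copier still has to do

    def read(self, block):
        if block not in self.local:         # miss: fall through to source
            self.local[block] = self.source[block]
            self.pending.discard(block)
        return self.local[block]

    def hydrate_step(self):
        """One unit of background copy; False once fully hydrated."""
        if not self.pending:
            return False
        block = self.pending.pop()
        self.local[block] = self.source[block]
        return True
```

This is why `clone` can return immediately: correctness never depends on hydration having finished, only on the fall-through path staying reachable — which is exactly what makes the network-disruption behaviour of the transport (NBD vs. iSCSI) load-bearing.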
- 2024-02-15 — Globally Distributed Object Storage with Tigris — Tigris public beta; architectural framing (FoundationDB + NVMe + QuiCK-style queue + S3 backend); `fly storage create` CLI; unified-billing framing.
Notes on tier¶
Fly.io is a Tier-3 source on the sysdesign-wiki. Per AGENTS.md, Tier-3 ingests require the post to clearly cover distributed-systems internals, scaling trade-offs, infrastructure architecture, production incidents, or similar — product-PR and feature-announcement posts are skipped. The Tigris public-beta post is borderline (it's a launch announcement) but is ingested because the three-layer architectural statement (FDB + NVMe-cache + QuiCK-queue) is load-bearing distributed-systems content.