
Fly.io

Fly.io is an edge/region-first application platform: "we transmute containers into VMs, running them on our hardware around the world with the power of Firecracker alchemy." Tier-3 source on the sysdesign-wiki — the blog is a mix of partnership announcements, product posts, and occasional architectural retrospectives. On-scope ingests cover the architectural ones (this wiki filters out the pure-product launches per AGENTS.md Tier-3 guidance).

Key systems

  • systems/phoenix-new — Fly.io's browser-delivered coding agent for Elixir/Phoenix (2025-06-20, Chris McCord). Per-session Fly Machine with root shell shared between developer and a Phoenix-tuned agent, full Chrome the agent drives, *.phx.run preview URLs via integrated port-forwarding, gh CLI pre-installed. Canonical productised instance of patterns/ephemeral-vm-as-cloud-ide, patterns/agent-driven-headless-browser (colocated variant), and patterns/agentic-pr-triage.
  • systems/phoenix-framework — Elixir web framework the Phoenix.new agent is tuned for (Channels, Presence, LiveView, Ecto). Also a hosting target for deployed Phoenix apps on Fly.io; adjacent to systems/livebook and systems/flame-elixir on the BEAM-on-Fly.io stack.
  • systems/gh-cli — GitHub CLI pre-installed in Phoenix.new session VMs; makes the "close laptop and wait for a PR" async-agent workflow executable without a GitHub-specific agent tool schema.
  • systems/tkdb — Fly.io's isolated Macaroon token authority. ~5000 lines of Go, SQLite-backed, replicated US/EU/AU via LiteFS with Litestream PITR. HTTP/Noise RPC (patterns/noise-over-http). Canonical wiki instance of patterns/isolated-token-service. DB size: "a couple dozen megs"; client verification cache hit rate >98%; 0 incident interventions in over a year.
  • systems/petsem — "Pet Semetary". Fly's in-house Vault replacement; its own Macaroon authority. Uses third-party caveats for privilege separation between flyd and user secrets. Explicitly not merged with tkdb — "Rule #10 and all that."
  • systems/litefs — primary/replica distributed SQLite; subsecond cross-region replication + primary-failover substrate underneath tkdb. Works with unmodified SQLite libraries. Post-2025-05-20 its LTX format and LiteVFS extension are both imported into Litestream — the two tools architecturally converge on LTX as the shared wire format.
  • systems/litestream — streaming WAL-to-object-storage PITR for SQLite; tkdb's DR substrate. Seconds-scale restore. 2025-05-20 revamp imports LiteFS's LTX file format + LSM-style compaction + a CASAAS conditional-write lease (replacing Consul for single-leader enforcement) + SQLite-VFS-based read replicas (FUSE-free). Unlocks wildcard /data/*.db replication and positions Litestream as a PITR / rollback / fork primitive for agentic coding platforms. 2025-10-02 v0.5.0 ship delivers the write/archive half: three-level hierarchical compaction (30s / 5m / 1h, restore bounded to "a dozen or so files on average"), monotonic TXIDs replacing generations (litestream wal → litestream ltx), per-page compression + end-of-file index in the LTX library (random-access precondition for VFS replicas), CGO removed via modernc.org/sqlite, NATS JetStream added as a replica type, one-destination-per-database enforced, file-format break from v0.3.x. VFS read-replicas still proof-of-concept.
  • systems/macaroon-superfly — github.com/superfly/macaroon, Fly.io's open-source Go Macaroon library; the substrate underneath tkdb + petsem.
  • systems/honeycomb — distributed-tracing backend Fly.io uses; Ptacek's explicit retraction of prior tracing skepticism ("I was wrong") joined with JP Phillips's "I'd have ragequit without oTel."
  • systems/opentelemetry — Fly.io's tracing standard; context propagation gives single-narrative request traces across primary API → tkdb.
  • systems/tigris — globally distributed, S3-compatible object storage (Tigris Data, Inc.), integrated into Fly.io as the fly storage create primitive. Three-layer architecture: regional FoundationDB metadata + Fly.io NVMe byte cache + QuiCK-style queuing for distribution to replicas, demand regions, and pluggable third-party backends (including S3).
  • systems/fly-machines — Fly.io's Firecracker-micro-VM compute primitive; the building block that Fly Kubernetes Pods map onto. GPUs attach via whole-GPU passthrough (fractional-GPU / MIG / vGPU path was tried and abandoned). Stateful Machines (with attached Fly Volumes) now migrate via kill → clone → boot per the 2024-07-30 rebuild.
  • systems/fly-volumes — Fly.io's locally-attached NVMe block-storage primitive for stateful Machines, encrypted with per-volume LUKS2 keys. The anchor point of the 2024-07-30 migration rebuild: locally-attached NVMe gave Fly.io bus-hop read latency but anchored Machines to a physical worker.
  • systems/dm-clone — Linux kernel device-mapper target powering Fly's async-clone-with-background-hydration migration.
  • systems/iscsi — Network block protocol Fly uses to expose source Volumes to target workers during migration; settled on after NBD kept getting stuck kernel threads under network disruption.
  • systems/nbd — Fly's first attempt at a network block protocol; abandoned.
  • systems/dm-crypt-luks2 / systems/cryptsetup — How Fly Volumes are encrypted; cryptsetup version skew across the fleet causes LUKS2 header-size drift (4 MiB vs 16 MiB), which required a metadata RPC in flyd's migration FSM.
  • systems/linux-device-mapper — Kernel block-layer proxy; the substrate for dm-clone + dm-crypt.
  • systems/corrosion-swim — Fly's SWIM-gossip CRDT-SQLite state distribution system (corrosion2 per the 2025-02-12 exit interview). Rust. "Any component on our fleet can do SQLite queries to get near-real-time information about any Fly Machine around the world." Migration broke its worker-as-source-of-truth invariant and forced a redesign. The 2025-05-28 parking_lot post clarifies the architectural relationship: Corrosion is the RIB to fly-proxy's in-memory Catalog FIB, with update propagation in "millisecond intervals of time" host-to-host. The 2025-10-22 dedicated deep-dive — the "deserves its own post" Fly.io had been promising — fills in the mechanism (OSPF-inspired link-state design, SWIM membership + QUIC reconciliation + systems/cr-sqlite CRDT with LWW-by-logical-timestamp), the explicit anti-consensus posture (concepts/no-distributed-consensus), the rejected alternatives (Consul, rqlite, FoundationDB), the three disclosed outages (contagious deadlock, nullable-column backfill, Consul cert-expiry → uplink saturation), and the five-mitigation response (fleet-wide Tokio watchdogs, Antithesis adoption, checkpoint-backups-to-object-storage as break-glass, eliminate-partial-updates with whole-object republish, and the two-level regional + global state regionalization project). Open-sourced: github.com/superfly/corrosion.
  • systems/cr-sqlite — CRDT SQLite extension from vlcn-io; the conflict-resolution substrate under Corrosion. Tracks CRDT-managed table changes in crsql_changes; applies updates last-write-wins using logical timestamps. Known failure mode: nullable-column backfill amplification on large tables.
  • systems/consul — rejected predecessor to Corrosion; HashiCorp's service-discovery + KV store built on Raft. "Consul is fantastic software. Don't build a global routing system on it." Also the distal trigger of the 2024 uplink-saturation outage (mTLS cert expiry → fleetwide backoff loops → Corrosion write storm).
  • systems/parking-lot-rust — Amanieu's replacement for Rust std::sync locks (Mutex / RwLock / Condvar / Once). Not Fly.io-authored, but load-bearing in fly-proxy's Catalog and the subject of Fly.io's sixth upstream-the-fix contribution (parking_lot PR #466). 64-bit compact lock word (4 signal bits + 60-bit reader counter); try_write_for(Duration); read_recursive; deadlock detector. Canonical wiki anchor for concepts/bitwise-double-free bug class.
  • systems/lsvd — Log-structured virtual disk; Fly's stated medium-term storage direction — NVMe-as-cache in front of regional Tigris S3.
  • systems/nomad — Fly's pre-flyd orchestrator; referenced as the baseline against which flyd — and the 2024 migration rebuild — are sized ("the biggest thing our team has done since we replaced Nomad with flyd").
  • systems/nvidia-a10 / systems/nvidia-l40s / systems/nvidia-a100 / systems/nvidia-h100 — the four NVIDIA GPU models Fly.io stocks. Customer-usage data (2024) revealed the A10 — the least capable — as the most popular by a wide margin; the 2024-08-15 L40S price cut to $1.25/hr (A10 price) was engineered to collapse the choice into a single inference default.
  • systems/nvidia-mig — NVIDIA's fractional-GPU primitive; Fly.io tried and abandoned productising it inside Firecracker Machines via IOMMU PCI passthrough.
  • systems/fly-kubernetes — Fly.io's managed Kubernetes distribution where every Pod is a Fly Machine (Firecracker micro-VM) orchestrated by flyd rather than containerd/runc.
  • systems/flyd — Fly.io's orchestrator that schedules Firecracker-backed Pods. Durable-FSM design (per-step records in BoltDB) lineage-linked to Cadence + Compose.io/MongoHQ "recipes/operations" per JP Phillips's 2025-02-12 exit interview.
  • systems/flaps — the Machines-API gateway routing incoming HTTPS into per-host flyd RPCs. Decentralised ("for the most part doesn't require any central coordination"), sub-5-second P90 on machine create globally (Johannesburg and Hong Kong excepted). JP's "whole Fly Machines API" framing.
  • systems/fly-pilot — 2025 successor to init. OCI-compliant runtime with a defined API for flyd to drive; consolidates the feature-bag init described in the 2024-06-19 AWS-without-Access-Keys post. Third of Fly.io's three Rust services (after fly-proxy + corrosion2).
  • systems/boltdb — flyd's state store. Deliberate non-SQL pick for blast-radius safety + scope discipline; JP's 2025-02-12 defence: "I've never lost a second of sleep worried that someone is about to run a SQL update statement on a host, or across the whole fleet."
  • systems/cadence — Uber-originated durable-workflow engine; direct design-ancestry cite for flyd's FSM design via JP Phillips. Not a Fly.io runtime dependency.
  • systems/firecracker — Fly.io runs user workloads on AWS Firecracker micro-VMs. Substrate/context — Fly.io itself is not the primary wiki source for Firecracker (that's AWS Lambda + Figma), but it is named as Fly.io's isolation layer in every Fly.io source.
  • systems/intel-cloud-hypervisor — Fly.io's GPU-Machine hypervisor. A "very similar Rust codebase" to Firecracker that supports PCI passthrough; non-GPU Machines run on Firecracker, GPU Machines run on Cloud Hypervisor. First wiki appearance via the 2025-02-14 GPU retrospective.
  • systems/qemu — the conventional-hypervisor alternative on Nvidia's driver happy path. Fly.io rejected it on millisecond-boot DX grounds. Wiki touchpoint.
  • systems/vmware — the other conventional-hypervisor alternative on Nvidia's driver happy path. Fly.io explicitly rejected it ("Nvidia suggested VMware (heh)"). Wiki touchpoint.
  • systems/virtual-kubelet — CNCF-sandbox Virtual Kubelet provider is the pivot that lets FKS run K8s without Nodes; Fly runs a small Go provider alongside K3s.
  • systems/k3s — the lightweight K8s distribution FKS uses for the control-plane API surface.
  • systems/fly-proxy — Fly's edge / private-network proxy; backs K8s Services under FKS.
  • systems/fly-wireguard-mesh — internal IPv6 WireGuard mesh (6PN) that replaces CNI under FKS and connects every Fly Machine across hosts / regions.
  • systems/flycast — *.flycast private-network hostnames; one of the three FKS Service access paths.
  • systems/fly-gateway — regional fleet of servers that terminate external customer WireGuard connections from flyctl. Separate substrate from the internal 6PN mesh; the subject of the 2024-03-12 JIT WireGuard rewrite.
  • systems/wggwd — gateway-side daemon that manages WireGuard interfaces; pull-on-demand peer provisioner post-2024-03-12.
  • systems/wireguard — underlying protocol for both the internal (6PN) and external (gateway) meshes.
  • systems/fly-flyctl — Fly.io's CLI; conjures a TCP/IP stack per invocation and speaks WireGuard to Fly Machines.
  • systems/fly-graphql-api — Fly.io's customer-facing control plane; formerly pushed peer configs to gateways, now serves pull requests on handshake arrival.
  • systems/oidc-fly-io — Fly.io's in-house OpenID Connect identity provider (oidc.fly.io/<org>). Issues short-lived OIDC JWTs exclusively to Fly Machines, with a structured sub claim of shape <org>:<app>:<machine>. Lets counterparties (AWS, GCP, Azure, any OIDC-compliant cloud) trust Fly Machines as federated identities without any long-lived credential ever being shared. Canonical wiki instance of workload identity.
  • systems/fly-init — the Rust-written process-zero binary in every Fly Machine. Hosts a Unix-socket API proxy at /.fly/api (Fly's answer to the EC2 Instance Metadata Service, but SSRF-resistant by design) and acts as the credential broker for AWS OIDC federation: detects AWS_ROLE_ARN at boot, fetches an OIDC token, writes it to /.fly/oidc_token, and exports AWS_WEB_IDENTITY_TOKEN_FILE for the AWS SDK's standard credential-provider chain. As of 2025, succeeded by pilot — a full OCI runtime with a formal flyd-driven API — consolidating init's feature bag.
  • systems/fly-proxy — Fly.io's Rust edge router; one of the three Rust services on Fly's platform (alongside corrosion2 and pilot). Edge servers "exist almost solely" to run it. Built on Tokio + tokio-rustls; terminates TLS, handles HTTP routing decisions, forwards to worker-hosted Fly Machines over Fly's Anycast fabric. Canonical wiki Seen-in: the 2025-02 IAD CPU-busy-loop incident traced to a tokio-rustls TlsStream Waker bug under close_notify with buffered trailer.
  • systems/rustls / systems/tokio-rustls / systems/tokio — the Rust async / TLS stack fly-proxy is built on. Rustls is "an ultra-important, load-bearing function in the Rust ecosystem"; Fly.io contributed the 2025-02 upstream fix (rustls PR #1950) — canonical Rust-ecosystem instance of patterns/upstream-the-fix.
  • systems/livebook / FLAME / Nx / BEAM — the Elixir-ecosystem pieces that, with Fly Machines as the executor substrate, compose into notebook-driven elastic GPU compute. FLAME Flame.call blocks pool executors on Fly Machines; Livebook drives them from a laptop; Nx/Axon/Bumblebee supply the GPU-backed AI/ML primitives; BEAM's native code distribution makes notebook-defined modules executable across the cluster. The Kubernetes-side runtime FLAME port (Livebook v0.14.1, Michael Ruoss) confirms the pattern is substrate-independent.
  • systems/tokenized-tokens — Fly.io's secret-tokenization system (2024 post referenced in the 2025-04-08 "Our Best Customers Are Now Robots" post). Hardware-isolated, "robot-free" Fly Machines hold real OAuth / API credentials and substitute them for placeholder tokens at egress; the LLM client never touches the real secret. Canonical wiki substrate for patterns/tokenized-token-broker and concepts/tokenized-secret.
  • systems/model-context-protocol — the open LLM interop protocol Fly.io names in the 2025-04-08 post. Modern MCP uses long-lived SSE connections; multitenant MCP deployments need session-affinity routing; Fly's dynamic request routing is the platform-level answer. Canonical wiki datum on MCP-SSE-as-routing-requirement is concepts/mcp-long-lived-sse.
  • systems/flymcp — Fly.io's open-source github.com/superfly/flymcp "most basic" MCP server for flyctl. ~90 lines of Go, 2 tools (fly logs + fly status), MCP stdio transport, built in "30 minutes". Canonical wiki instance of patterns/wrap-cli-as-mcp-server — the pattern works because flyctl --json was already done in 2020. Demonstration of an agentic-incident-diagnosis loop against unpkg; surfaces concepts/local-mcp-server-risk as the structural concern. (Source: sources/2025-04-10-flyio-30-minutes-with-mcp-and-flyctl.)
  • systems/fly-mcp-launch — fly mcp launch flyctl subcommand (shipped flyctl v0.3.125, 2025-05-19). Takes any stdio MCP server command and deploys it as a remote HTTP MCP server running on a Fly Machine, with bearer-token auth on by default on both ends, client-config JSON rewritten in place across 6 supported clients (Claude / Cursor / Neovim / VS Code / Windsurf / Zed), and --secret flags piped through to Machine secrets. Canonical wiki instance of patterns/remote-mcp-server-via-platform-launcher. Pairs with flymcp to span local ↔ remote MCP-server deployment axis. (Source: sources/2025-05-19-flyio-launching-mcp-servers-on-flyio.)
  • systems/aws-lambda — positional comparator. The 2025-04-08 post discloses that Fly Machines (non-GPU) run on Lambda's hypervisor — Firecracker. "Not coincidentally, our underlying hypervisor engine is the same as Lambda's." Fly's value-add is Lambda-like start latency plus EC2-like runtime duration + stateful filesystem persistence across stop/start cycles.
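The last-write-wins-by-logical-timestamp resolution that cr-sqlite supplies underneath Corrosion can be sketched in a few lines. The `Change` shape and `lww_apply` helper below are illustrative assumptions, not cr-sqlite's actual schema (which tracks changes in crsql_changes); the site-id tiebreaker is a standard LWW convention, assumed here for determinism:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Change:
    """One replicated cell update: (table, pk, column) tagged with a
    logical timestamp, plus a site id as a deterministic tiebreaker."""
    table: str
    pk: str
    column: str
    value: object
    logical_ts: int   # Lamport-style logical clock, not wall time
    site_id: str      # originating node; breaks logical-ts ties

def lww_apply(state: dict, change: Change) -> bool:
    """Apply `change` last-write-wins. Returns True if it won."""
    key = (change.table, change.pk, change.column)
    incumbent = state.get(key)
    if incumbent is None or (change.logical_ts, change.site_id) > (
        incumbent.logical_ts,
        incumbent.site_id,
    ):
        state[key] = change
        return True
    return False  # stale update: lower (ts, site_id) loses

# Two nodes write the same cell; the higher logical timestamp wins
# regardless of arrival order.
state = {}
a = Change("machines", "m1", "status", "started", logical_ts=7, site_id="iad")
b = Change("machines", "m1", "status", "stopped", logical_ts=9, site_id="ams")
lww_apply(state, b)          # newer write arrives first
won = lww_apply(state, a)    # older write arrives late -> rejected
```

Because the merge is commutative, every node converges on the same value no matter what order the gossiped changes arrive in, which is what lets Corrosion skip consensus entirely.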

Key patterns / concepts

Production-incident debugging (2025-02-26 fly-proxy Rust-TLS incident)

Storage + migration (2024-07-30 rebuild)

Notebook-driven elastic GPU compute (2024-09-24 keynote)

GPU scale-to-zero (single-Machine) — 2024-05-09 image-description walkthrough

  • patterns/proxy-autostop-for-gpu-cost-control — Canonical Fly.io instance. Fly Proxy owns start/stop of a GPU Fly Machine: stops on idle silence (minutes-scale), starts on next internal request. App tier never decides; proxy does. The load-bearing cost-control primitive that makes single-Machine GPU inference hobby-project-affordable on a cloud.
  • patterns/flycast-scoped-internal-inference-endpoint — The access-scoping pre-requisite that makes autostop's "idle" definition meaningful. Flycast hostname scopes inference-tier access to same-org 6PN traffic only, so random internet scans don't wake the GPU Machine.
  • concepts/gpu-scale-to-zero-cold-start — The tail the pattern eats: three-stage cold-start budget (Machine-start seconds + model-load-into-GPU-RAM tens of seconds + first-response seconds) disclosed as ~45 seconds on a100-40gb + LLaVA-34b. Different dominant stage from CPU/serverless cold-start.
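The proxy-autostop pattern is configured declaratively rather than in app code. A hypothetical fly.toml sketch using Fly.io's documented [http_service] keys; the app name, region, and port are invented, and the exact key values are illustrative rather than taken from the source post:

```toml
# Hypothetical fly.toml for a scale-to-zero GPU inference app.
app = "llava-inference"
primary_region = "ord"

[http_service]
  internal_port = 8000
  auto_stop_machines = true    # Fly Proxy stops the Machine after idle silence
  auto_start_machines = true   # ...and starts it on the next request
  min_machines_running = 0     # allow full scale-to-zero
```

The app tier contains no start/stop logic at all; the proxy owns the lifecycle, which is the point of the pattern.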

Remote development / agentic loops (2025-02-07 commentary)

GPU product retrenchment (2025-02-14 retrospective)

  • concepts/developers-want-llms-not-gpus — Fly.io's canonical demand-side diagnosis. "Developers don't want GPUs. They don't even want AI/ML models. They want LLMs." The 10,000-vs-5-6-developer credo applied to GPU Machines — GPU workloads land on the 5-6 side. Fly.io's canonical wiki instance.
  • concepts/gpu-as-hostile-peripheral — the security framing that shaped GPU Machines' productisation. "A GPU is just about the worst case hardware peripheral: intense multi-directional direct memory transfers, with arbitrary, end-user controlled computation, all operating outside our normal security boundary." Canonical wiki instance.
  • concepts/nvidia-driver-happy-path — Fly.io canonically discloses the shape of the Nvidia driver happy path (K8s with shared kernel, or QEMU/VMware) and the cost of deviating. Months of failed Cloud Hypervisor integration work; hex-edited closed-source drivers to impersonate QEMU; no MIG path to thin-sliced GPUs because MIG presents as a UUID not a PCI device. Canonical wiki instance.
  • concepts/fast-vm-boot-dx — the DX property Fly.io refused to trade for Nvidia-driver-happy-path compatibility. "We could not have offered our desired Developer Experience on the Nvidia happy-path." Canonical wiki statement.
  • concepts/asset-backed-bet — Fly.io's risk-framing: GPUs are tradable assets with durable value, so the downside of being wrong about the GPU bet is partially recoverable via liquidation. Companion to the IPv4 address portfolio framing. Canonical wiki instance.
  • concepts/insurgent-cloud-constraints — the broader structural framing for why Fly.io can't compete with OpenAI/Anthropic on the LLM-serving axis. Canonical wiki statement.
  • concepts/product-market-fit — the meta-framing: "a startup is a race to learn stuff." Course-correction without shame when a bet doesn't hit PMF.
  • patterns/dedicated-host-pool-for-hostile-peripheral — Fly's GPU Machines run on dedicated GPU-only workers on Cloud Hypervisor; non-GPU Machines run on Firecracker workers. Isolation posture: peripheral-class segregation at the placement tier. Canonical wiki instance.
  • patterns/independent-security-assessment-for-hardware-peripheral — Fly.io's GPU deployment was cleared by two independent external security audits (Atredis, Tetrel) before productisation. "They were not cheap, and they took time." Canonical wiki instance.
  • patterns/platform-retrenchment-without-customer-abandonment — Fly.io's 2025-02-14 announcement "if you're using Fly GPU Machines, don't freak out; we're not getting rid of them. But if you're waiting for us to do something bigger with them, a v2 of the product, you'll probably be waiting awhile." Keep-running + no-v2 = retrenchment without abandonment. Canonical wiki instance.

Robot Experience (RX) / robots-as-customers (2025-04-08)

  • concepts/robot-experience-rx — Fly.io introduces RX (Robot Experience) as a product-design axis alongside UX and DX in the 2025-04-08 post. "One of our north stars has always been nailing the DX of a public cloud. But the robots aren't going anywhere. It's time to start thinking about what it means to have a good RX. […] We have not yet nailed the RX; nobody has. But it's an interesting question." Canonical wiki instance of the framing.
  • concepts/vibe-coding — Fly.io's gloss on the LLM-driven conversational code-generation workflow: bursty-then-idle ("frenzy of activity for a minute or so, but then chill out for minutes, hours, or days"). The canonical wiki workload-shape phrasing.
  • concepts/fly-machine-start-vs-create — the lifecycle-split primitive robots consume and humans don't grok. "Start is lightning fast; substantially faster than booting up even a non-virtualized K8s Pod. This is too subtle a distinction for humans, who (reasonably!) just mash the create button to boot apps up in Fly Machines. But the robots are getting a lot of value out of it." Canonical wiki instance.
  • concepts/stateful-incremental-vm-build — the robot-workload storage shape that forces filesystem + object-storage primitives over the Postgres-first human default. "As product thinkers, our intuition about storage is 'just give people Postgres'. […] But because LLMs are doing the Cursed and Defiled Root Chalice Dungeon version of app construction, what they really need is a filesystem, the one form of storage we sort of wish we hadn't done. That, and object storage."
  • concepts/mcp-long-lived-sse — the networking-tier reason Fly.io's Fly Proxy dynamic request routing is "a robot attractant" — multitenant MCP SSE deployments require session-affinity routing.
  • concepts/tokenized-secret — the identity-plane RX primitive: LLMs get a placeholder that a hardware-isolated Fly Machine substitutes for the real OAuth token at egress. Grounded in Fly.io's 2024 tokenized-tokens substrate.
  • patterns/start-fast-create-slow-machine-lifecycle — expose two lifecycle paths (slow create, fast start) so bursty-then-idle workloads resume at invocation latency from idle. Canonical wiki instance.
  • patterns/session-affinity-for-mcp-sse — route long-lived MCP SSE connections from a given client back to the same stateful instance. Fly.io instantiates via tenant-controlled dynamic request routing; canonical wiki instance.
  • patterns/tokenized-token-broker — hardware-isolated broker substitutes real secrets for placeholders at egress; adjacent to Fly's init-as-credential-broker pattern (STS / OIDC federation variant) and extends it to arbitrary OAuth-style credentials. Canonical wiki instance.
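The start-fast/create-slow split reduces to two distinct request shapes against the public Fly Machines API surface (api.machines.dev). The paths below follow the shape of Fly's published Machines API, but treat them as an illustrative assumption; the app and Machine identifiers are invented:

```python
# Sketch of the two lifecycle paths a robot client consumes.

BASE = "/v1/apps/{app}/machines"

def create_machine(app: str, image: str) -> tuple[str, str, dict]:
    """Slow path: allocate and boot a brand-new Machine (seconds).
    Done once, when the workload first materialises."""
    return ("POST", BASE.format(app=app), {"config": {"image": image}})

def start_machine(app: str, machine_id: str) -> tuple[str, str, dict]:
    """Fast path: resume an existing stopped Machine. This is the call
    bursty-then-idle (vibe-coding-shaped) workloads should loop on."""
    path = f"{BASE.format(app=app)}/{machine_id}/start"
    return ("POST", path, {})

# Create once, then start/stop across idle gaps:
method, path, body = create_machine("vibe-app", "registry/app:latest")
_, start_path, _ = start_machine("vibe-app", "e28650dc")
```

Humans "mash the create button"; the value robots extract is in only ever paying the slow path once.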

Wrap-CLI-as-MCP / local-MCP-risk (2025-04-10 → 2025-05-07)

  • patterns/wrap-cli-as-mcp-server — canonical wiki pattern (Fly.io's flymcp: 90 LoC Go, 2 tools, stdio transport, 30 minutes). Viable because flyctl --json was already done in 2020; pass-through CLI credentials; no auth/transport layer. Demonstrated generalisable via the unpkg incident-diagnosis flow. 2025-05-07 mutation transition: same server, now full fly volumes CRUD; first wiki instance of the pattern crossing the read-only → production-mutation boundary.
  • concepts/local-mcp-server-risk — canonical wiki statement of the "giving a cloud LLM the ability to run a native program on my machine" concern. Ptacek: "Local MCP servers are scary." The 2025-05-07 post inherits the posture with mutation authority; "if you ask it to destroy a volume, that operation is not reversable." Mitigation: patterns/disposable-vm-for-agentic-loop (run the wrapped CLI inside a throwaway Fly Machine rather than on the operator's laptop — sketched 2 months earlier in the 2025-02-07 VSCode-SSH-bananas post).
  • concepts/structured-output-reliability — extended with the upstream variant: Fly.io's 2020 --json decision is the producer-side instance of structured-output discipline that made LLM-consumer tooling trivially viable 5 years later. Different shape from the Dash-judge case (LLM as producer there, LLM as consumer here), same underlying lesson.
  • concepts/agent-ergonomic-cli — cross-vendor confirmation of the Cloudflare framing: flyctl's structured-output axis predates and survives the LLM era as a general automation property LLMs retroactively weaponise. 2025-05-07 extends with the three-way alternative-rejection framing (API / CLI / dashboard all lose to NL + MCP).
  • concepts/natural-language-infrastructure-provisioning — canonical wiki thesis. "Today's state of the art is K8S, Terraform, web based UIs, and CLIs. Those days are numbered." "Make it so" as the target UX.
  • patterns/plan-then-apply-agent-provisioning — the aspirational UX the 2025-05-07 post sketches — LLM scans code, presents a plan, human adjusts + approves, agent executes, on failure examines logs and proposes next steps. Terraform's plan / apply discipline reimplemented as a conversation. Not yet shipped in flyctl v0.3.117; roadmap target.
  • patterns/cli-safety-as-agent-guardrail — the mounted-volume-refusal invariant as the zero-cost guardrail that let Fly.io safely ship a mutation-authority MCP surface. "I would have received an error had I tried to destroy a volume that is currently mounted. Knowing that gave me the confidence to try the command." Mutation-side twin of the --json-as-load-bearing observation — mature CLI design pays an AI-integration dividend the original authors never intended.
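The wrap-CLI-as-MCP pattern reduces to "run the CLI in its JSON mode, parse stdout, hand the result to the agent." A minimal Python sketch, with `echo` standing in for flyctl so it runs anywhere; the tool name and shapes are assumptions for illustration, not flymcp's actual Go code:

```python
import json
import subprocess

def wrap_cli_tool(argv: list[str]) -> dict:
    """Run a CLI that emits JSON and return the parsed result.
    The whole pattern rests on the CLI already having a structured-output
    mode (flyctl's --json flag, shipped back in 2020); credentials pass
    through from the operator's existing CLI login."""
    out = subprocess.run(argv, capture_output=True, text=True, check=True)
    return json.loads(out.stdout)

# The tool table an MCP server (or a hand-rolled agent) would expose.
# `fly status --json` is the real flymcp tool; `echo` stands in here so
# the sketch runs without flyctl installed.
TOOLS = {
    "fly_status": lambda app: wrap_cli_tool(
        ["echo", json.dumps({"app": app, "status": "running"})]
    ),
}

result = TOOLS["fly_status"]("my-app")
```

Note that the guardrails live in the wrapped CLI, not the wrapper: if the underlying command refuses to destroy a mounted volume, the agent inherits that refusal for free.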

Cloud IDE for coding agents (2025-06-20 Phoenix.new)

Minimal-agent-loop pedagogy (2025-11-06 Ptacek essay)

Canonical statements of agent-architecture vocabulary the wiki had in use but un-named. From "You Should Write An Agent":

  • concepts/agent-loop-stateless-llm — "The LLM itself is a stateless black box. The conversation we're having is an illusion we cast, on ourselves." Canonical 15-LoC Python statement of the primitive every agent on the wiki composes over.
  • concepts/context-window-as-token-budget — "You're allotted a fixed number of tokens in any context window. Each input you feed in, each output you save, each tool you describe, and each tool output eats tokens." Independent confirmation of the context-window-as-budget framing also surfaced at Dropbox Dash (2025-11-17) and Datadog (2026-03-04). Degradation is nondeterministic: "the whole system begins getting nondeterministically stupider."
  • concepts/context-engineering — "Turns out: context engineering is a straightforwardly legible programming problem. […] If Context Engineering was an Advent of Code problem, it'd occur mid-December. It's programming." Canonical statement repudiating the "magic spells" framing of prompt engineering.
  • concepts/sub-agent — "Just a new context array, another call to the model. Give each call different tools." Demystifies Claude Code's sub-agents primitive; complementary to existing patterns/specialized-agent-decomposition + patterns/coordinator-sub-reviewer-orchestration.
  • patterns/tool-call-loop-minimal-agent — ~30-LoC Python teaching shape for every tool-using agent on the wiki. Emergent multi-step planning: "Did you notice where I wrote the loop in this agent to go find and ping multiple Google properties? Yeah, neither did I."
  • patterns/context-segregated-sub-agents — security-, budget-, and specialisation-motivated sub-agent pattern. "You can trivially build an agent with segregated contexts, each with specific tools. That makes LLM security interesting."
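The ~30-LoC teaching shape can be reproduced with the model stubbed out. Everything below (`fake_llm`, the message dicts, the `ping` stand-in) is an assumption substituting for a real LLM endpoint and a real probe, not the essay's exact code; the load-bearing part is that the agent is just the while loop appending tool output back into the context:

```python
def ping(host: str) -> str:
    """Stand-in for a real ICMP probe tool."""
    return f"{host}: ok"

TOOLS = {"ping": ping}

def fake_llm(context: list[dict]) -> dict:
    """Stub model: requests one tool call, then answers once it sees
    a tool result in the context. A real endpoint returns the same two
    shapes (tool_call vs final text), just nondeterministically."""
    if not any(m["role"] == "tool" for m in context):
        return {"tool": "ping", "args": {"host": "google.com"}}
    return {"answer": "google.com is reachable"}

def agent(user_msg: str) -> str:
    context = [{"role": "user", "content": user_msg}]   # the whole "memory"
    while True:                                         # the agent IS this loop
        reply = fake_llm(context)
        if "answer" in reply:
            return reply["answer"]
        result = TOOLS[reply["tool"]](**reply["args"])  # execute the tool call
        context.append({"role": "tool", "content": result})

final = agent("is google up?")
```

Multi-step planning falls out of the loop, not out of any planning code: the model keeps asking for tools until the context satisfies it, which is the "neither did I" observation.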

The essay is also the wiki's first explicit MCP-is-optional framing (extended onto the MCP system page): when you own both the agent and the tools, native tool-schema JSON against the LLM endpoint is sufficient — MCP earns its place as an interop layer for tools consumed by agents someone else built.

Architectural framing

Fly.io's self-description positions it explicitly as a compute + networking platform, with adjacent concerns (storage, databases, object storage) delivered via partner integrations rather than in-house builds. Tigris is the canonical example of this model on the object-storage axis: "we partnered with Tigris, so that they can put their full resources into making object storage as magical as Fly.io is." The blog frames this as a Unix-philosophy posture — "you have individual parts that do one thing very well that are then chained together to create a composite result." The customer-facing trade is that all the parts bill through Fly.io.

The Tigris integration also shows Fly.io willing to be the substrate for the partner (Tigris runs on Fly.io's NVMe volumes and regions), not just a customer. That's a different shape from the typical cloud-partnership pattern of "we wire up your SDK" — Fly.io is renting Tigris the hardware.

Recent articles

  • 2025-11-06 — You Should Write An Agent — Thomas Ptacek's pedagogical essay canonicalising agent-architecture vocabulary the wiki had in use but un-named. An agent is an HTTP client against one endpoint, a Python list as "context", and a while loop — "It's incredibly easy." Four demonstrations build up in ~60 lines of Python: a 15-LoC ChatGPT clone exposing the stateless-LLM + replayed-context illusion; an Alph / Ralph truth/lies personality split showing two context arrays cost the same as one; a three-function upgrade to a tool-using agent that "figures out" multi-step ping probing of google.com / www.google.com / 8.8.8.8 without the author writing the loop (patterns/tool-call-loop-minimal-agent); and a design-space survey covering sub-agents ("just a new context array, another call to the model"), summarisation-as-compression, and concepts/context-engineering as a "straightforwardly legible programming problem". Canonical statement of concepts/context-window-as-token-budget — tool descriptions, tool outputs, and stored replies all compete for the same token budget; past a threshold "the whole system begins getting nondeterministically stupider." Also the wiki's first explicit MCP-is-optional framing: "we didn't need MCP at all. […] MCP is just a plugin interface for Claude Code and Cursor, a way of getting your own tools into code you don't control. Write your own agent. Be a programmer. Deal in APIs, not plugins." Positions MCP correctly as an interop layer for tools consumed by agents-someone-else-built, not as a fundamental enabling technology — consistent with every production MCP instance on this wiki (systems/flymcp, systems/datadog-mcp-server, systems/unity-ai-gateway, Agent Lee). Four open problems the post flags as "noodle-able solo in a basement": titrating nondeterminism vs.
structured programming, connecting agents to ground truth so they can't lie to themselves about early-exit, reliable inter-agent intermediate formats (JSON / SQL / markdown summaries), and token allocation + cost containment. Canonicalises: concepts/agent-loop-stateless-llm + concepts/context-window-as-token-budget + concepts/context-engineering + concepts/sub-agent + patterns/tool-call-loop-minimal-agent + patterns/context-segregated-sub-agents. Extends concepts/agentic-development-loop with the minimal-loop foundation; extends patterns/tool-surface-minimization + patterns/specialized-agent-decomposition with the cross-agent sub-agent lever. "Your wackiest idea will probably (1) work and (2) take 30 minutes to code."
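The essay's core shape — one context list, one while loop, a tool table — is compact enough to sketch directly. In this sketch `call_model` is a stub standing in for the essay's single HTTP endpoint, so the loop runs offline; the tool table and message shapes are illustrative:

```python
# Minimal tool-call-loop agent in the post's shape: the "memory" is a plain
# list of messages replayed to a stateless model on every turn.

def call_model(context):
    """Stub LLM: requests a tool until a tool result is in context, then answers."""
    if any(m["role"] == "tool" for m in context):
        return {"type": "answer", "text": "ping ok"}
    return {"type": "tool_call", "name": "ping", "args": {"host": "8.8.8.8"}}

TOOLS = {
    "ping": lambda args: f"64 bytes from {args['host']}",  # fake probe
}

def run_agent(user_msg, max_steps=5):
    context = [{"role": "user", "content": user_msg}]  # the entire agent state
    for _ in range(max_steps):
        reply = call_model(context)
        if reply["type"] == "answer":
            return reply["text"], context
        # Tool call: run it, append the result, loop again with replayed context.
        result = TOOLS[reply["name"]](reply["args"])
        context.append({"role": "tool", "content": result})
    raise RuntimeError("step budget exhausted")

answer, ctx = run_agent("is 8.8.8.8 up?")
```

Everything the post names — sub-agents, summarisation, token budgeting — is a variation on what goes into (or is dropped from) that `context` list.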

  • 2025-10-22 — Corrosion — The canonical Corrosion deep-dive Fly.io had been promising for over a year ("Corrosion deserves its own post"). Three outages frame the post: (1) the 2024-09-01 contagious deadlock that took down Anycast globally in seconds (the bug was in fly-proxy's if let-over-RWLock consumer; Corrosion was "just a bystander" perfectly amplifying it); (2) a nullable-column DDL that forced cr-sqlite to backfill every row fleet-wide, melting the cluster; (3) a Consul mTLS cert expiry whose backoff loops on every worker saturated Fly's uplinks because the retry path wrote to Corrosion. Core architectural bet — canonical wiki anchor for concepts/no-distributed-consensus: "truly global distributed consensus promises deliciousness while yielding only immolation. Consensus protocols like Raft break down over long distances." Fly.io took cues from link-state routing protocols (OSPF) — workers are sources of truth for their own Machines + responsible for flooding changes; Fly's fully-connected WireGuard mesh gives OSPF-style connectivity for free. Stack: SWIM membership (concepts/gossip-protocol) + QUIC for broadcast/reconciliation + systems/cr-sqlite CRDT extension for last-write-wins by logical timestamp. Thousands of workers; seconds convergence. Rejected alternatives named explicitly: Consul ("don't build a global routing system on it"), Zookeeper, etcd, Raft, rqlite ("came very close to using"), FoundationDB, S3-backed stores. Mitigations canonicalised: (i) Tokio watchdogs on every service (patterns/watchdog-bounce-on-deadlock); (ii) production adoption of Antithesis — "killer for distributed systems" — first-person confirmation of the investment JP Phillips's 2025-02-12 exit interview flagged as the external-adoption gate; (iii) checkpoint backups on object storage used "ultimately" to reboot the cluster when diagnosis exceeded restore time; (iv) eliminated partial updates in favour of whole-object republish ("we should have done it this way to begin with"); (v) regionalization project (patterns/two-level-regional-global-state) — per-region clusters + small global app→region cluster, in-progress at time of publication, response to 2024-09-01 contagious deadlock. Scope discipline reaffirmed: "not every piece of state we manage needs gossip propagation" — systems/tkdb + systems/petsem run on systems/litefs / systems/litestream, not Corrosion. Open source: github.com/superfly/corrosion.
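The last-write-wins layer is easy to state concretely. A toy sketch of LWW-by-logical-timestamp merging, in the spirit of the cr-sqlite layer described above (the key/value shapes are illustrative, not Corrosion's actual schema):

```python
# Each entry carries (value, logical_timestamp, node_id); on replication the
# highest timestamp wins, with node_id as a deterministic tie-breaker so
# every replica converges on the same answer.

def lww_merge(local, remote):
    merged = dict(local)
    for key, entry in remote.items():
        # entry[1:] is (logical_ts, node_id); tuple comparison gives
        # timestamp-then-tiebreak ordering.
        if key not in merged or entry[1:] > merged[key][1:]:
            merged[key] = entry
    return merged

a = {"app1/machine9": ("started", 4, "worker-ord")}
b = {"app1/machine9": ("stopped", 7, "worker-waw"),
     "app2/machine1": ("started", 2, "worker-waw")}
converged = lww_merge(a, b)
```

The convergence property the CRDT buys is that merge order does not matter: `lww_merge(a, b) == lww_merge(b, a)`, which is what lets every worker flood its changes without coordination.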
  • 2025-06-20 — Phoenix.new: The Remote AI Runtime for Phoenix — Chris McCord (Phoenix framework creator) introduces Phoenix.new as his Fly.io skunkworks project: a browser-delivered coding agent for Elixir / Phoenix where every session is an ephemeral Fly Machine with a root shell shared between the developer and the agent, a full (not headless-only) Chrome the agent drives to verify UI changes ("instead of trying to iterate on screenshots, the agent sees real page content and JavaScript state"), automatic *.phx.run preview URLs from any bound port via integrated port-forwarding, and the GitHub gh CLI pre-installed. Canonical productised instance of patterns/ephemeral-vm-as-cloud-ide (four months after the 2025-02-07 VSCode-SSH-bananas sketch), canonical instance of patterns/agent-driven-headless-browser (colocated variant; sibling of the MoltWorker proxied variant), patterns/ephemeral-preview-url-via-port-forward, and patterns/agentic-pr-triage (McCord: "I close my laptop, grab a cup of coffee, and wait for a PR to arrive" against phoenix-core issues). Canonical wiki statements of concepts/cloud-ide, concepts/ephemeral-dev-environment, concepts/agent-with-root-shell ("it owns the whole environment"), concepts/agent-driven-browser, concepts/ephemeral-preview-url, and concepts/async-agent-workflow — the coding-agent specialisation of the 2025-04-08 RX thesis. Three-signal closed loop (server logs + browser DOM/JS state + test output) sharpens concepts/agentic-development-loop's previous two-signal framing.
System prompt tuned for Phoenix / LiveView / Channels / Presence / Ecto today; "all languages you care about are already installed" (Rails / Expo / Svelte / Go work out of the box; new framework tuning is roadmap). Tetris demo at ElixirConfEU cited as existence proof for frontier-LLM world knowledge covering surface-pattern gaps in LiveView-specific training data.

  • 2025-10-02 — Litestream v0.5.0 is Here — Ben Johnson's shipping-announcement post for the first batch of the 2025-05-20 Litestream redesign. Confirms what actually landed in v0.5.0: the LTX file format replaces raw-WAL shipping; a three-level compaction hierarchy (30s → 5m → 1h windows) gives "a dozen or so files on average" restore cost regardless of retention; the old "generations" abstraction ("Marvel Cinematic Universe parallel dimensions in which your database might be simultaneously living in. Yeah, we didn't like those movies much either") is fully retired in favor of monotonic transaction IDs (TXIDs) (litestream wal → litestream ltx); the LTX library now compresses per-page with an end-of-file index so individual pages can be plucked out without downloading the whole file (structural precondition for VFS read replicas); CGO is gone — switched from mattn/go-sqlite3 to modernc.org/sqlite (cross-compile-from-Mac-to-Linux Just Works); NATS JetStream joins S3 / GCS / Azure as a replica type; one-replica-destination per database codified as a hard constraint; file-format break from v0.3.x (cutover — old WAL files preserved for rollback); config file is fully backwards compatible. Pedagogic opening example (the sandwiches(id INTEGER PRIMARY KEY AUTOINCREMENT, description TEXT, star_rating INTEGER, reviewer_id INTEGER) table with reviewers dithering between ⭐ and ⭐⭐) illustrates why raw-WAL-shipping restore cost scales with "raw WAL volume" not "distinct logical state." VFS-based read replicas still not shipped — "we already have a proof of concept working"; the read-replica layer is next. HN 430 points. Canonical wiki instance of LTX compaction's concrete 30s / 5m / 1h ladder.
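A back-of-envelope model shows why the ladder bounds restore cost. The window sizes come from the post; the file-counting logic below is purely illustrative, not Litestream's actual layout:

```python
# Model of the v0.5.0 compaction ladder: raw LTX files are rolled up into
# 30s, 5m, and 1h windows, so a restore reads roughly one file per covering
# window instead of one file per raw 30-second segment.

LEVELS = [30, 300, 3600]  # window sizes in seconds, per the post

def files_to_restore(target_s):
    """Count files a restore to `target_s` touches under the ladder model:
    whole 1h windows, then whole 5m windows, then whole 30s windows, plus
    at most one partially-filled tail at the finest level."""
    remaining, count = target_s, 0
    for window in reversed(LEVELS):
        count += remaining // window
        remaining %= window
    return count + (1 if remaining else 0)

# Restoring a full day touches ~24 files under the ladder, versus ~2880
# thirty-second segments under raw-WAL shipping:
day_files = files_to_restore(86_400)
```

This is the shape behind "a dozen or so files on average" regardless of retention: restore cost tracks the number of covering windows, not cumulative WAL volume.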

  • 2025-05-28 — parking_lot: ffffffffffffffff… — Thomas Ptacek long-form debugging retrospective on a weeks-long hunt for why fly-proxy instances in European regions (overwhelmingly WAW) started locking up after Fly broadened lazy-loading of the fly-proxy Catalog — the in-memory aggregation of Corrosion2 routing state that the proxy consults to forward requests. Architectural framing: fly-proxy is the Anycast router and the hard problem is the state-distribution problem ("managing millions of connections for millions of apps", state in constant flux); Catalog is the FIB to Corrosion's RIB (updates propagate host-to-host in millisecond intervals); the long-term fix is regionalization to shrink the global broadcast domain of routing updates. Sets up two pairs of production incidents: (2024-era Round 0) a global Anycast deadlock caused by an if let read-lock-scope bug — an update about an app nobody used propagated fleet-wide in ms and deadlocked the routing layer. Canonical if-let-lock-scope-bug instance. Short-term response: a watchdog on the fly-proxy internal REPL control channel (concepts/watchdog-repl-channel) that bounces the proxy and snaps core dumps. Canonical patterns/watchdog-bounce-on-deadlock instance. (2025 Rounds 1-5) broadening lazy-loading exposes writer-contention + suspicious if let → Catalog lock refactor: eliminate if let-over-locks, replace RAII with explicit closures (patterns/raii-to-explicit-closure-for-lock-visibility), adopt try_write_for timeouts + labeled-log telemetry (patterns/lock-timeout-for-contention-telemetry). Still locks up. Lock timing instrumentation fires just before lockups in benign, quiet applications. parking_lot's deadlock detector finds nothing. Pavel Zakharov reads core dumps: "no thread running inside the critical section… a thread waiting to acquire write lock and a bunch of threads waiting to acquire a read lock." Every single stack trace. Everyone wants the lock; nobody has it. Descent into madness: miri (finds UB in tests, fixes don't help); guard pages (never trip); wild theories (Tokio and parking_lot both ruled out); close-reading parking_lot source. Desperation probe: switch read() to read_recursive() (patterns/read-recursive-as-desperation-probe), which produces a NEW error: RwLock reader count overflow. First direct evidence of lock-word corruption. Root cause: parking_lot's RwLock state is packed into a 64-bit word (4 signal bits + 60-bit reader counter); try_write_for timeout path and reader-release unpark path both try to clear WRITER_PARKED; the atomic self-synchronizing clear (via fetch_add of two's-complement inverse) wraps instead of zeroing → lock word becomes 0xFFFFFFFFFFFFFFFF. Canonical concepts/bitwise-double-free instance. Thread 1 grabs a read lock; Thread 2 parks with timeout; Thread 1 releases, unparking Thread 2 and clearing WRITER_PARKED; Thread 2 wakes thinking its timeout fired and tries to clear WRITER_PARKED again — bitwise double-free. Fix: parking_lot PR #466 + issue #465 — Fly.io's sixth patterns/upstream-the-fix instance on the wiki + the second in the Rust ecosystem (first was the 2025-02-26 rustls PR #1950). The WAW-specific timing mystery remains unresolved — "the wax and wane of caribou populations… we'll never know because we fixed the bug." Permanent debugging dividends from the arc: Catalog-wide if let-over-locks audit, RAII → closure refactor, labeled slow-write logs, last-and-current writer context tracking. Canonical wiki anchor for: 2024 fly-proxy global anycast outage + 2025 parking_lot bug + watchdog-bounce safety net + RAII-to-closure refactor + lock-timeout-for-contention telemetry + descent-into-madness debugging + bitwise double-free + parking_lot upstream fix. Tier-3 source; clears bar on production-incident + distributed-systems-internals + ecosystem-primitive + upstream-the-fix content.
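The root cause is compact enough to model directly. A sketch of the wrapping bit-clear (bit position and word layout simplified here; parking_lot packs 4 signal bits plus a 60-bit reader counter):

```python
# Toy model of the lock-word corruption: the lock clears a signal bit by
# adding its two's-complement inverse to the packed 64-bit word. Clearing a
# bit that is already clear does not no-op -- it borrows out of the adjacent
# reader counter, which is the "bitwise double-free".

MASK64 = (1 << 64) - 1
WRITER_PARKED = 1 << 0  # illustrative bit position, not parking_lot's real layout

def fetch_add(word, delta):
    """Wrapping 64-bit add, like an atomic fetch_add on the lock word."""
    return (word + delta) & MASK64

def clear_writer_parked(word):
    # two's-complement inverse of the bit, as in the buggy path
    return fetch_add(word, (-WRITER_PARKED) & MASK64)

state = WRITER_PARKED                 # one parked writer, zero readers
state = clear_writer_parked(state)    # first clear: fine, word goes to 0
state = clear_writer_parked(state)    # second clear: underflow floods the word
```

Two threads each believing they must clear the same bit is all it takes; the second clear turns an empty lock word into `0xFFFFFFFFFFFFFFFF`, which is exactly the value in the post's title.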

  • 2025-05-19 — Launching MCP Servers on Fly.io — Sam Ruby's "part showing off, part opinion" developer-blog announcing fly mcp launch, a new flyctl subcommand (shipped in flyctl v0.3.125) that takes any existing local / stdio-style MCP server command and one-shots it into a remote HTTP/Streamable-HTTP MCP server running as a Fly Machine, with bearer-token auth on by default on both ends, client-config JSON rewritten in place for six built-in clients (Claude, Cursor, Neovim, VS Code, Windsurf, Zed), --secret KEY=value flags piped through to Machine secrets, and all Fly platform knobs (auto-stop, Flycast, Volumes, region, VM size) available. Canonical invocation:

fly mcp launch "npx -y @modelcontextprotocol/server-slack" \
  --claude --server slack \
  --secret SLACK_BOT_TOKEN=xoxb-... \
  --secret SLACK_TEAM_ID=T01234567

The post leads with the three-shape MCP-server taxonomy ("basically two types of MCP servers. One small and nimble that runs as a process on your machine. And one that is a HTTP server that runs presumably elsewhere and is standardizing on OAuth 2.1. And there is a third type, but it is deprecated.") and the client-config fragmentation complaint that names Claude Desktop's ~/Library/Application Support/Claude/claude_desktop_config.json (mcpServers key) vs Zed's ~/.config/zed/settings.json (context_servers key) vs OS-dependent per-tool variants — canonical wiki statement of concepts/mcp-client-config-fragmentation. Canonical wiki instance of systems/fly-mcp-launch + patterns/remote-mcp-server-via-platform-launcher. Pairs with the 2025-04-10 flymcp post to span both axes of MCP-server ergonomics: wrap a local CLI as a local stdio MCP server (patterns/wrap-cli-as-mcp-server) and deploy a local MCP server as a remote MCP server (patterns/remote-mcp-server-via-platform-launcher). Post also enumerates supported deployment shapes (one Machine per server / one container per server / in-process library-mode) and access paths (HTTP Authorization / WireGuard / Flycast / reverse proxies). Beta status acknowledged — "examples as shown are thought to work. Maybe."
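The in-place config rewrite is mechanically simple once a client's key layout is known. A sketch against Claude Desktop's mcpServers shape — the headers field and exact entry layout are assumptions here, since each client fragments differently (which is the post's complaint):

```python
# Sketch of the per-client config rewriting `fly mcp launch` performs:
# load the client's JSON config, add or replace the named server entry,
# and write it back, with bearer-token auth on by default.
import json

def add_server(config_text, name, url, token):
    config = json.loads(config_text or "{}")
    servers = config.setdefault("mcpServers", {})   # Claude Desktop's key
    servers[name] = {
        "url": url,
        # assumed field layout for a remote, bearer-authed server entry
        "headers": {"Authorization": f"Bearer {token}"},
    }
    return json.dumps(config, indent=2)

updated = add_server("{}", "slack", "https://my-mcp.fly.dev", "tok_123")
round_trip = json.loads(updated)
```

A real launcher needs one such adapter per client (Zed's `context_servers` key, per-OS paths, and so on), which is exactly the fragmentation the post complains about.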

  • 2025-05-07 — Provisioning Machines using MCPs — Sam Ruby's short developer-blog post marking the mutation transition of Fly.io's flyctl MCP server: the read-only 2-tool prototype from a month earlier (2025-04-10) now covers the full fly volumes subcommand family (create / list / extend / show / fork / snapshots / destroy) — shipped in flyctl v0.3.117. First wiki instance of patterns/wrap-cli-as-mcp-server crossing the read-only → mutating production-resource boundary. "I created my first fly volume using an MCP … and it worked the first time. A few hours later, and with the assistance of GitHub Copilot, i added support for all fly volumes commands." Load-bearing architectural thesis paragraph for concepts/natural-language-infrastructure-provisioning: "Today's state of the art is K8S, Terraform, web based UIs, and CLIs. Those days are numbered." Introduces the "Make it so" target UX — LLM scans code, presents a plan, human adjusts, approves, agent executes, on failure examines logs and proposes next steps. Canonical wiki statement of patterns/cli-safety-as-agent-guardrail — the CLI's pre-existing human-operator refusal invariant ("can't destroy a volume that is currently mounted") becomes the MCP server's authorization boundary for free: "Since this support is built on flyctl, I would have received an error had I tried to destroy a volume that is currently mounted. Knowing that gave me the confidence to try the command." Emergent resource-hygiene UX: Claude spontaneously noted unattached volumes and offered to delete the oldest on request — the agentic troubleshooting loop extended from diagnosis into provisioning-hygiene. Three-way alternative-rejection framing (HTTP Machines API, flyctl directly, web dashboard) as the concepts/agent-ergonomic-cli design-pressure confirmation. 
Gestures at future MCP servers running on Fly's private network — "on separate machines, or in 'sidecar' containers, or even integrated into your app" — pairing with the 2025-04-08 robot-routing / long-lived-SSE framing that gave flymcp its deployment substrate. Local-MCP security posture (concepts/local-mcp-server-risk) continues unchanged; the read-only → mutation transition raises the blast radius accordingly. Caveat: "Just be aware this is not a demo, if you ask it to destroy a volume, that operation is not reversable. Perhaps try this first on a throwaway application." Configuration template: claude_desktop_config.json snippet wires flyctl mcp server as the stdio command; MCP Inspector (local port 6274) is named as the agent-free validation surface for server authors iterating on tool schemas.
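The guardrail pattern is worth sketching: the tool layer re-uses the CLI's refusal rather than re-implementing authorization. `fake_flyctl` below is a hypothetical stand-in for the real binary so the sketch runs anywhere:

```python
# patterns/cli-safety-as-agent-guardrail: the MCP tool passes the request
# through to the CLI and surfaces its exit code + message unchanged, so the
# CLI's pre-existing invariant ("can't destroy a mounted volume") becomes
# the agent's authorization boundary for free.

def fake_flyctl(args, volumes):
    """Hypothetical stand-in for `flyctl volumes destroy <id>`."""
    vol_id = args[-1]
    vol = volumes.get(vol_id)
    if vol is None:
        return 1, f"Error: volume {vol_id} not found"
    if vol["mounted"]:
        return 1, f"Error: volume {vol_id} is currently mounted"  # the invariant
    del volumes[vol_id]
    return 0, f"destroyed volume {vol_id}"

def destroy_volume_tool(vol_id, volumes):
    """MCP-tool shape: no re-implemented checks, just pass-through."""
    code, msg = fake_flyctl(["volumes", "destroy", vol_id], volumes)
    return {"ok": code == 0, "output": msg}

vols = {"vol_a": {"mounted": True}, "vol_b": {"mounted": False}}
blocked = destroy_volume_tool("vol_a", vols)  # refused by the CLI invariant
allowed = destroy_volume_tool("vol_b", vols)
```

The design choice the post highlights is that the refusal lives in one place — the CLI — so the agent path inherits every human-operator safety check without duplicating it.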

  • 2025-05-20 — Litestream: Revamped — Ben Johnson's retrospective on the largest Litestream redesign since its 2020 launch. Three ideas imported from LiteFS: (1) the LTX file format — sorted, transaction-aware SQLite page-range changesets — replaces raw-WAL shipping; adjacent LTX files k-way-merge via LSM-style compaction, so restore to any PITR target costs the compacted state size rather than cumulative WAL volume. (2) CASAAS — Compare-and-Swap as a Service — a time-based replication lease implemented via object-store conditional writes (S3's 2024-11 launch; Tigris also supports the primitive). Retires LiteFS's Consul dependency for single-leader enforcement; rolling deploys with overlapping Litestream processes against the same destination are now safe; the "generations" abstraction is collapsed to a single-generation invariant. (3) VFS-based read replicas — a SQLite Virtual Filesystem extension loaded into the application that fetches and caches pages directly from object storage; no FUSE required (LiteFS's usability wall). Works in WASM + restricted FaaS environments. Explicit trade named: "this approach isn't as efficient as a local SQLite database" — caching + prefetching are the performance knobs. Two secondary consequences: wildcard / directory replication (/data/*.db) of hundreds or thousands of SQLite databases from one Litestream process is viable for the first time — previously blocked on WAL-polling cost + slow restores. Agent-storage framing: "the robots that write LLM code are going to like SQLite too. … coding agents like Phoenix.new want [a] way to try out code on live data, screw it up, and then rollback both the code and the state. These Litestream updates put us in a position to give agents PITR as a primitive. On top of that, you can build both rollbacks and forks." Ties to the 2025-04-08 RX framing + concepts/stateful-incremental-vm-build.
Forward-looking — "we're building", "should be possible" — with no production numbers; 452 HN points (item 44045292). Extends the canonical [SQLite + LiteFS + Litestream](<../patterns/sqlite-plus-litefs-plus-litestream.md>) stack (canonical at tkdb): post-revamp the three layers architecturally converge on LTX as the shared wire format. Wiki-first disclosures: LTX file format, CASAAS / object-store conditional-write lease, SQLite VFS as replication integration point, shadow WAL as legacy mechanism.
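The CASAAS lease can be sketched with nothing but a conditional write. The in-memory bucket and field names below are illustrative; only the idea of a version-guarded put mirroring the object store's conditional-write primitive comes from the post:

```python
# Time-based primary lease enforced with compare-and-swap on an object's
# version, so two overlapping Litestream processes (e.g. during a rolling
# deploy) cannot both act as primary against the same destination.
import time

class FakeBucket:
    """Object store exposing only a version-conditional put."""
    def __init__(self):
        self.obj, self.version = None, 0

    def put_if_version(self, data, expected_version):
        if expected_version != self.version:
            return None               # precondition failed: lost the race
        self.version += 1
        self.obj = data
        return self.version

def acquire_lease(bucket, node, ttl_s, now=None):
    now = time.time() if now is None else now
    current = bucket.obj
    if current and current["expires"] > now and current["holder"] != node:
        return False                  # live lease held elsewhere
    lease = {"holder": node, "expires": now + ttl_s}
    return bucket.put_if_version(lease, bucket.version) is not None

b = FakeBucket()
first = acquire_lease(b, "litestream-old", ttl_s=30, now=1000.0)
second = acquire_lease(b, "litestream-new", ttl_s=30, now=1005.0)      # refused
after_expiry = acquire_lease(b, "litestream-new", ttl_s=30, now=1040.0)
```

Because the store itself rejects a stale-version put, no Consul (or any other coordination service) is needed — which is exactly the dependency the redesign retires.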

  • 2025-04-10 — 30 Minutes With MCP and flyctl — Thomas Ptacek's internal-message-turned-blog post on building flymcp — the "most basic" MCP server for flyctl — in 30 minutes, in ~90 lines of Go on mark3labs/mcp-go. Two tools exposed: fly logs + fly status. Canonical wiki instance of patterns/wrap-cli-as-mcp-server. Load-bearing precondition: Fly.io's 2020 decision to give most flyctl commands a --json mode "to make them easier to drive from automation" (concepts/structured-output-reliability, concepts/agent-ergonomic-cli) — five-year-old automation-friendliness decision retroactively became an AI-integration-readiness decision. Pointed at unpkg (Fly-hosted npm CDN), Claude reconstructed the 10-Machine regional topology, flagged 2 machines in critical status, correlated oom_killed: true events, pulled logs on follow-up, and produced a per-second incident timeline (OOM kill → SIGKILL → reboot → health-check fail → listener up → health-check pass, ~43s end-to-end; Bun process at ~3.7 GB of 4 GB allocated) — canonical concepts/agentic-troubleshooting-loop instantiation with a deliberately-minimal tool surface (patterns/tool-surface-minimization + patterns/allowlisted-read-only-agent-actions). Ptacek: "annoyingly useful … faster than I find problems in apps." Closing caveat is the canonical wiki statement of concepts/local-mcp-server-risk — "Local MCP servers are scary. I don't like that I'm giving a Claude instance in the cloud the ability to run a native program on my machine. I think fly logs and fly status are safe, but I'd rather know it's safe. It would be, if I was running flyctl in an isolated environment and not on my local machine." — gesturing at patterns/disposable-vm-for-agentic-loop (the 2025-02-07 VSCode-SSH-bananas companion) as the answer. Third RX-era post in a ~three-day window: complements 2025-04-08 Our Best Customers Are Now Robots (RX framing + MCP-SSE routing requirement) and ties back to the 2025-02-07 VSCode-SSH-bananas disposable-VM sandbox sketch.
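The two levers this post leans on — a deliberately minimal read-only allowlist and --json CLI output the agent can parse — compose into very little code. A sketch with a stubbed runner in place of a real subprocess; the command table and sample output are illustrative:

```python
# patterns/tool-surface-minimization + patterns/allowlisted-read-only-agent-
# actions: only read-only commands are reachable, and --json output parses
# straight into something an agent can reason over.
import json

ALLOWED = {
    "status": ["fly", "status", "--json"],   # read-only only:
    "logs":   ["fly", "logs", "--json"],     # no deploy, no destroy
}

def run_tool(name, runner):
    if name not in ALLOWED:
        raise PermissionError(f"tool {name!r} is not on the allowlist")
    return json.loads(runner(ALLOWED[name]))

# stand-in for subprocess.run(...).stdout against a real flyctl:
fake_runner = lambda argv: '{"machines": [{"id": "m1", "state": "critical"}]}'

status = run_tool("status", fake_runner)
critical = [m["id"] for m in status["machines"] if m["state"] == "critical"]
```

The load-bearing part is the 2020 --json decision: structured output is what turns "pulled logs on follow-up" into a reliable machine loop rather than screen-scraping.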

  • 2025-03-27 — Operationalizing Macaroons — Thomas Ptacek's deepest architectural disclosure of Fly.io's security-token stack to date, written as Fly.io hands off internal ownership of the Macaroon project. Two years in as "the Internet's largest user of Macaroons," the user-facing pitch (end-user attenuation, emailing scoped tokens to partners) has been a mixed bag — "users don't really take advantage of token features" — but the infrastructure wins have made the token system "one of the nicer parts of our platform." Canonical wiki introduction of tkdb (~5,000 lines of Go managing a SQLite database via LiteFS + Litestream on isolated hardware in US / EU / AU; records encrypted with an injected secret; "one of the very few well-behaved" infra-SQLite databases at Fly.io) — canonical instance of patterns/isolated-token-service + patterns/sqlite-plus-litefs-plus-litestream. Transport is HTTP/Noise with Noise_IK on the verification path (TLS-analog, broad verifier set) and Noise_KK on the signing path (mTLS-analog, "only a handful" of clients with the keypair). Verification cache hit rate >98% thanks to the chained-HMAC construction's descendant-inheritance property. Revocation is the canonical feed-subscription pattern: tkdb exports a revocation-notifications endpoint, clients poll + prune caches, dump the whole cache on connectivity loss; blacklist-to-every-region explicitly rejected ("we certainly don't want to propagate the blacklist database to 35 regions around the globe"). Cosmetic logout named as the anti-pattern this design prevents. Authorization-vs-authentication split via third-party caveats; service tokens use tkdb's caveat-strip API to remove the authN caveat, and recipients further attenuate locally to bind the token to a specific flyd instance or Fly Machine — "exfiltrating it doesn't do you any good; to use it, you have to control the environment it's intended to be used in."
Same caveat-for-privilege-separation pattern runs in reverse at Pet Semetary (Fly's Vault replacement): flyd's read-secret Macaroon has a third-party caveat dischargeable only by proving org permissions through tkdb. Explicit secure-design heuristic: concepts/keep-hazmat-away-from-complex-code — "root secrets for Macaroon tokens are hazmat." Telemetry: OpenTelemetry + Honeycomb + permanent-retention OpenSearch audit trail (concepts/context-propagation-otel + concepts/audit-trail-in-opensearch) — Ptacek's explicit retraction of prior OTel skepticism. Operational datum: "the tkdb code is remarkably stable and there hasn't been an incident intervention with our token system in over a year." Culture disclosure: Fly.io self-described "allergic to 'microservices'" — but "a second dedicated security service (Petsem)" alongside tkdb has "pulled its weight"; narrow-purpose security services are the carve-out exception. Closing nod to infrastructure-SQLite as a "total victory for LiteFS, Litestream" — implicit contrast to corrosion's tens-of-gigabytes operational footprint.
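The chained-HMAC construction behind the >98% cache hit rate and the holder-side attenuation is standard Macaroon machinery and small enough to sketch (field names here are illustrative, not tkdb's wire format):

```python
# Each caveat is folded in with sig = HMAC(old_sig, caveat): anyone holding
# a token can attenuate it further without secret material, but can never
# widen it, and a verifier holding the root key recomputes the whole chain.
import hmac, hashlib

def _hmac(key, msg):
    return hmac.new(key, msg.encode(), hashlib.sha256).digest()

def mint(root_key, token_id):
    return {"id": token_id, "caveats": [], "sig": _hmac(root_key, token_id)}

def attenuate(token, caveat):
    """Holder-side: narrows the token; needs no secret material."""
    return {"id": token["id"],
            "caveats": token["caveats"] + [caveat],
            "sig": _hmac(token["sig"], caveat)}

def verify(root_key, token):
    sig = _hmac(root_key, token["id"])
    for caveat in token["caveats"]:
        sig = _hmac(sig, caveat)
    return hmac.compare_digest(sig, token["sig"])

root = b"hazmat-root-secret"
t = attenuate(attenuate(mint(root, "tok1"), "org=acme"), "machine=m123")
```

This is also why binding a service token to a specific flyd instance works: the recipient appends the binding caveat locally, and a stolen copy fails verification anywhere the caveat does not hold.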

  • 2025-04-08 — Our Best Customers Are Now Robots — Thomas Ptacek retrospective naming LLM-driven coding agents as the dominant growth driver on Fly.io over the prior ~6 months and introducing Robot Experience (RX) as a product-design axis alongside UX and DX. Four platform-primitive disclosures that make Fly.io "robot bait" without intending to: (1) compute lifecycle — start vs create split (start is "lightning fast … substantially faster than booting up even a non-virtualized K8s Pod"; "too subtle a distinction for humans" but "the robots are getting a lot of value out of it"); pairs with first wiki disclosure that non-GPU Machines run on Lambda's hypervisor ("Not coincidentally, our underlying hypervisor engine is the same as Lambda's" — Firecracker); Lambda-EC2 hybrid positioning ("start like it's spring-loaded, in double-digit millis […] can stick around as long as you want it to"). (2) Storage — LLMs build Machines incrementally, want a filesystem + object storage, not Postgres (concepts/stateful-incremental-vm-build; "what they really need is a filesystem, the one form of storage we sort of wish we hadn't done. That, and object storage."). Fly Volumes + Tigris. (3) Networking — MCP's long-lived SSE connections in multitenant deployments need session-affinity routing (concepts/mcp-long-lived-sse); Fly's dynamic request routing is "possibly a robot attractant" for exactly this shape. Canonical patterns/session-affinity-for-mcp-sse instance. (4) Secrets — tokenized OAuth tokens (concepts/tokenized-secret) let the LLM hold a placeholder while a "hardware-isolated, robot-free Fly Machine" substitutes the real credential at egress (patterns/tokenized-token-broker); grounded in Fly.io's 2024 tokenized-tokens substrate. Forward-looking note: "it should be easy to MCP our API" (not shipped at publication). DX still primary ("the most important engineering work happening today at Fly.io is still DX, not RX; it's managed Postgres (MPG)") but RX is now a first-order concern.
Canonical wiki datum for concepts/robot-experience-rx + concepts/vibe-coding + patterns/start-fast-create-slow-machine-lifecycle. Platform-demand-side companion to the 2025-02-14 GPU retrospective (concepts/developers-want-llms-not-gpus). Eighth Fly.io ingest on the wiki.
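The tokenized-secret flow in (4) reduces to a placeholder swap at the egress hop. A sketch with an in-process dict standing in for the hardware-isolated broker:

```python
# patterns/tokenized-token-broker: the agent only ever sees a placeholder;
# an egress hop (the "robot-free" Machine in the post) substitutes the real
# credential just before the request leaves. The dict-backed vault and
# request shape here are illustrative.

VAULT = {"tok_placeholder_42": "xoxb-real-secret"}  # broker-side mapping

def agent_build_request(placeholder):
    """Agent side: never touches the real secret."""
    return {"url": "https://slack.example/api",
            "headers": {"Authorization": f"Bearer {placeholder}"}}

def egress_substitute(request, vault):
    """Egress side: swap the placeholder for the real credential."""
    scheme, token = request["headers"]["Authorization"].split(" ", 1)
    real = vault.get(token)
    if real is None:
        raise ValueError("unknown placeholder; refusing to egress")
    request = dict(request)
    request["headers"] = dict(request["headers"],
                              Authorization=f"{scheme} {real}")
    return request

req = agent_build_request("tok_placeholder_42")
wire = egress_substitute(req, VAULT)
```

The property the post is after: exfiltrating the agent's context leaks only the placeholder, which is useless outside the broker's egress path.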

  • 2024-05-09 — Picture This: Open Source AI for Image Description — Fly.io Machines-team developer-enablement walkthrough of a weekend-scale open-source image-description service (Ollama + LLaVA + PocketBase + LangChainGo) hosted on Fly Machines. Product framing is accessibility (AI-generated alt text for blind users, screen-reader integration via NVDA); architectural substance is the GPU scale-to-zero recipe with a disclosed cold-start number. An Ollama Fly Machine on the a100-40gb preset runs LLaVA-34b behind Flycast; Fly Proxy autostart/autostop stops the Machine after ~minutes of idle silence and starts it on the next internal request from the PocketBase app-tier Machine. Canonical instance of patterns/proxy-autostop-for-gpu-cost-control + patterns/flycast-scoped-internal-inference-endpoint. Disclosed cold-start latency: "starting it up took another handful of seconds, followed by several tens of seconds to load the model into GPU RAM. The total time from cold start to completed description was about 45 seconds." — canonical wiki datum for concepts/gpu-scale-to-zero-cold-start (three-stage budget: Machine-start seconds + model-load-into-GPU-RAM tens of seconds + first-response seconds). Post also names two model payload options on a stopped GPU Machine: Fly Volume for model weights, or bake the model into the Docker image. Side notes: context-window blow-out on the simple follow-up chain ("you'll see the quality of responses get poorer — possibly incoherent — as the context exceeds the context window"); modularity claim (swap model + prompt for sentiment / joke-generation / other tasks). Scope disposition: Tier-3 borderline — hobby-project framing but on-scope as GPU-inference scale-to-zero production recipe with a real cold-start number. Sibling to the 2024-09-24 Livebook/FLAME cluster scale-to-zero post (different shape: notebook-driven cluster of 64 L40S Machines vs. single-Machine proxy-autostop here).

  • 2025-02-26 — Taming a Voracious Rust Proxy — Production-incident retrospective on fly-proxy. Two IAD edge hosts pegged CPU + spiked HTTP errors over "some number of hours." Bouncing fly-proxy cleared it; it came back. Pavel pulled a flamegraph — dominated by Rust tracing::Subscriber (which is supposed to be very fast), signature of a spurious-wakeup busy-loop. The fully-qualified Future type in the flamegraph pointed past Fly's own wrappers (Duplex, ReusableReader, PeekableReader, MeteredIo, PermittedTcpStream) to tokio_rustls::server::TlsStream — pre-existing upstream issue tokio-rustls#72 documented exactly this pattern: on orderly TLS close_notify shutdown with still-buffered bytes on the underlying socket, TlsStream mishandles its Waker → 100% CPU. Trigger: Tigris Data load-testing — "tens of thousands of connections" with small HTTP bodies terminating early enough to set up the close_notify-before-EOF scenario. Fix: upstreamed as rustls PR #1950 — canonical Rust-ecosystem instance of patterns/upstream-the-fix + full diagnostic workflow patterns/flamegraph-to-upstream-fix. Self-drawn lessons: (1) patterns/dependency-update-discipline — "Keep your dependencies updated. Unless you shouldn't…" — the value is in the process + test infrastructure, not the updates themselves; (2) patterns/spurious-wakeup-metric — "Spurious wakeups should be easy to spot, and triggering a metric when they happen should be cheap, because they're not supposed to happen often." Also canonical one-line statement of fly-proxy's edge-router role: "Edges exist almost solely to run a Rust program called fly-proxy, the router at the heart of our Anycast network." First Fly.io production-incident retrospective ingested; complements the prior architectural (Making Machines Move, JIT WireGuard) + identity (AWS without access keys) posts.
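The second lesson suggests an almost trivially cheap instrumentation shape: count poll wakeups that make no progress. A sketch (the poll-loop framing is a stand-in, not fly-proxy's actual future machinery):

```python
# patterns/spurious-wakeup-metric: a wakeup that returns Pending without
# making progress should be rare, so a plain counter is cheap to keep and
# loud when a busy-loop like the TlsStream bug starts spinning.

def poll_until_ready(poll_fn, budget=1000):
    """Drive a poll function; count wakeups that return no progress."""
    spurious = 0
    for _ in range(budget):
        state, made_progress = poll_fn()
        if state == "ready":
            return {"done": True, "spurious_wakeups": spurious}
        if not made_progress:
            spurious += 1  # increment the metric; alert past a threshold
    return {"done": False, "spurious_wakeups": spurious}

# The buggy-TlsStream shape: wakeups that never make progress, then done.
buggy = iter([("pending", False)] * 5 + [("ready", True)])
result = poll_until_ready(lambda: next(buggy))
```

Had such a counter existed, the IAD episode would have surfaced as a metric spike rather than a mystery flamegraph hours in.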

  • 2025-02-14 — We Were Wrong About GPUs — Retrospective / course-correction post by Thomas Ptacek on Fly.io's 2022-era bet on productising GPU Fly Machines. "We're not getting rid of them" + "you'll probably be waiting awhile [for a v2]" — canonical patterns/platform-retrenchment-without-customer-abandonment instance. Three load-bearing disclosures: (1) Hypervisor split — non-GPU Machines on Firecracker, GPU Machines on Intel Cloud Hypervisor (PCI passthrough). Fly rejected QEMU (ms-boot DX required) and VMware (institutional fit). (2) Nvidia driver happy-path disclosure — supported path is K8s-shared-kernel or QEMU/VMware; Fly burned months (and ultimately failed) getting virtualized-GPU drivers working on Cloud Hypervisor, including hex-editing closed-source drivers to impersonate QEMU (concepts/nvidia-driver-happy-path). (3) Demand-side diagnosis — developers don't want GPUs, they want LLMs; insurgent clouds can't compete with OpenAI / Anthropic on tokens-per-second (concepts/insurgent-cloud-constraints). Security posture load-bearing: GPUs are just-about-the-worst-case peripheral, mitigated by dedicated GPU-only worker hosts (patterns/dedicated-host-pool-for-hostile-peripheral) + two independent external audits from Atredis and Tetrel (patterns/independent-security-assessment-for-hardware-peripheral). MIG thin-slicing remained unreachable because MIG "gives you a UUID, not a PCI device." L40S customer segment persists as the one SKU that found fit. Asset-backed-bet framing (concepts/asset-backed-bet): GPU hardware is liquidatable like Fly's IPv4 portfolio, so downside is partially recoverable. Parallel drawn to Fly.io's earlier JS-edge-runtime course-correction: "we were wrong about Javascript edge functions, and I think we were wrong about GPUs." Paired with the JP Phillips exit interview two days earlier as the honest-retrospective half of Fly.io's 2025-Q1 blog posture.
  • 2025-02-12 — The Exit Interview: JP Phillips — Q-and-A with the engineer who led flyd — Fly.io's Fly Machines orchestrator — over four years. Architectural retrospective disclosures: (1) flyd's FSM-plus-durable-steps design is ancestry-linked to HashiCorp Cadence + Compose.io/MongoHQ "recipes/operations" (concepts/durable-execution); deploy-tolerance ("pick back up where it left off, post-deploy") is the load-bearing property; (2) JP defends BoltDB over SQLite for flyd's state store — the blast-radius-of-an-ad-hoc-SQL-statement argument plus "limiting the storage interface kept flyd's scope managed" (concepts/bolt-vs-sqlite-storage-choice); (3) alternate design JP would consider: one SQLite per Fly Machine, with schema management as the named open problem (patterns/per-instance-embedded-database); (4) pilot — Fly's new OCI-compliant init — consolidates the feature-bag init and gives flyd a formal driving API; (5) flaps — the Machines-API gateway — is named as decentralised + hits sub-5-second P90 on machine create globally (Johannesburg / Hong Kong excepted); (6) corrosion2 — Fly's SWIM-gossip CRDT-SQLite state distribution system — is JP's "most impressive thing someone else built," with TLA+/Antithesis validation named as the investment gate for external adoption (patterns/formal-methods-before-shipping); (7) OpenTelemetry + Honeycomb are load-bearing ("without oTel it'd be a disaster … I'd have ragequit trying"). Also cultural content (2023 over-hiring, GPU distraction — sibling to sources/2024-08-15-flyio-were-cutting-l40s-prices-in-half) and a stated platform-completeness framing: "The Fly Machines platform is more or less finished … My original desire to join Fly.io was to make Machines a product that would rid us of HashiCorp Nomad, and I feel like that's been accomplished."
  • 2025-02-07 — VSCode's SSH Agent Is Bananas — Architectural critique of VSCode Remote-SSH from the vantage of integrating Fly Machines into VSCode's remote-editing flow. Contrasts Emacs Tramp (lives off the land, no agent deployed) with VSCode Remote (bash stager → downloads Node.js binary + agent → WebSocket RPC over SSH port-forward → persists across reconnects; Fly calls the shape "murid in nature" — a RAT). Names the agentic development loop as the 2025-motivating use-case: "close the loop between the LLM and the execution environment […] a semi-effective antidote to hallucination." Argues for disposable-VM-for-agentic-loop as the answer — "a clean-slate Linux instance that spins up instantly and that can't screw you over in any way" — with Fly Machines as the implied substrate. Productisation post deferred.
  • 2024-09-24 — AI GPU Clusters, From Your Laptop, With Livebook — ElixirConf 2024 keynote recap. Livebook + FLAME + the Nx stack let a notebook running on a laptop drive elastic GPU-cluster compute on Fly Machines. Canonical demos: a Llama-on-L40S pipeline summarising video stills, and 64 L40S Fly Machines hyperparameter-tuning different BERT variants with per-node loss curves streamed back to the notebook in real time. The whole cluster terminates on notebook disconnect (scale-to-zero). Platform-level claim: "start a cluster of GPUs in seconds rather than minutes, and all it requires is a Docker image" (concepts/seconds-scale-gpu-cluster-boot). The same runtime+FLAME integration now also runs on Kubernetes (Livebook v0.14.1, Michael Ruoss). Canonical instance of patterns/notebook-driven-elastic-compute and patterns/framework-managed-executor-pool.
  • 2024-08-15 — We're Cutting L40S Prices In Half — GPU strategy retrospective + price cut. Customer data surprised Fly.io: the least capable GPU (A10) is the most popular by a wide margin. Fractional-A100 via MIG / vGPU + IOMMU PCI passthrough failed ("a project so cursed"); pivot to whole-GPU attachment. L40S cut to $1.25/hr (= A10 price) to collapse the inference-GPU choice. Architectural thesis: inference = transaction, training = batch; for transaction-shaped inference, the combination of GPU + RAM + Tigris + Anycast beats a bigger GPU. Canonical instance of GPU + object-storage co-location.
  • 2024-07-30 — Making Machines Move — Year-long rebuild of fleet-drain for stateful Fly Machines with attached Fly Volumes. Introduces a clone primitive (kill → clone → boot) built on the Linux kernel's dm-clone device-mapper target; clone returns immediately, a new Machine boots on the target worker, reads of un-hydrated blocks fall through to the source worker over iSCSI (NBD was tried first and abandoned — kernel threads got stuck under network disruption), and kcopyd rehydrates in the background. Gnarly complications: cryptsetup version skew → LUKS2 header-size drift (4 MiB vs 16 MiB) → RPC in the migration FSM to carry metadata; 6PN address-embeds-routing → migration changes addresses → Fly Postgres configs hardcoded literal addresses → in-init address-mapping bridge + fleet-wide config rewrite; migration also breaks the worker-is-source-of-truth invariant of Corrosion (SWIM-gossip SQLite). Ends with a nod to LSVD (NVMe-cache + object-store) as the medium-term direction. "This is the biggest thing our team has done since we replaced Nomad with flyd."
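The dm-clone read path that makes the migration work can be modelled in a few lines of Go. Purely conceptual: `CloneDev`, `Read`, and `HydrateAll` are hypothetical names; the real mechanism is the kernel's dm-clone target, with the fall-through happening over iSCSI and the background pass done by kcopyd:

```go
package main

import "fmt"

// CloneDev models a destination volume cloned from a remote source:
// a per-block hydrated bitmap decides whether a read is served locally
// or falls through to the source worker.
type CloneDev struct {
	src, dst []byte // src stands in for the remote (iSCSI-attached) volume
	hydrated []bool
}

func NewClone(src []byte) *CloneDev {
	return &CloneDev{
		src:      src,
		dst:      make([]byte, len(src)),
		hydrated: make([]bool, len(src)),
	}
}

// Read serves a block immediately: from local storage if hydrated,
// otherwise by falling through to the source and hydrating as a side
// effect. This is why the cloned Machine can boot right away.
func (c *CloneDev) Read(i int) byte {
	if !c.hydrated[i] {
		c.dst[i] = c.src[i] // network round-trip to the source worker
		c.hydrated[i] = true
	}
	return c.dst[i]
}

// HydrateAll models the background rehydration pass (kcopyd's role);
// once it completes, the source volume can be released.
func (c *CloneDev) HydrateAll() {
	for i := range c.src {
		c.Read(i)
	}
}

func main() {
	c := NewClone([]byte("volume"))
	b := c.Read(2) // served before background hydration completes
	c.HydrateAll()
	fmt.Println(string(b), c.hydrated[0]) // prints l true
}
```

The key property is that the destination is readable from the first block: hydration is a latency optimisation, not a correctness precondition.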
  • 2024-06-19 — AWS without Access Keys — Fly.io's OIDC IdP (oidc.fly.io/<org>) + AssumeRoleWithWebIdentity → Fly Machines get AWS S3 access with no keypair ever stored; Fly init detects AWS_ROLE_ARN, fetches an OIDC token via /.fly/api, writes it to /.fly/oidc_token, exports AWS_WEB_IDENTITY_TOKEN_FILE for the AWS SDK. Macaroon-scoped per-Machine identity; SSRF-resistant Unix-socket API proxy; <org>:<app>:<machine> sub-field scoping for trust policies. Canonical wiki source for OIDC federation for cloud access and patterns/oidc-role-assumption-for-cross-cloud-auth.
  • 2024-03-12 — JIT WireGuard — Gateway-fleet WireGuard peer provisioning flipped from NATS-push to pull-on-first-packet. Gateways BPF-sniff handshake initiations, run ~200 lines of Noise crypto to identify the peer, pull config from the Fly API, and install it via Netlink. Kernel stale-peer count: ~550k → "rounds to none." Also documents Fly.io's broader migration off NATS for internal RPCs (flyd now HTTP).
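The pull-on-first-packet control flow reduces to a small sketch. All types and names here are hypothetical; the real gateway sniffs initiations with BPF, runs enough Noise crypto to recover the peer identity, and installs the peer via Netlink:

```go
package main

import "fmt"

// PeerConfig is a hypothetical stand-in for a WireGuard peer entry.
type PeerConfig struct{ PubKey, AllowedIP string }

type Gateway struct {
	installed map[string]PeerConfig // peers currently configured in the kernel
	// lookup stands in for the pull from the Fly API.
	lookup func(pubKey string) (PeerConfig, bool)
}

// OnHandshakeInitiation fires when a sniffed initiation has been
// attributed to pubKey. An unknown-but-legitimate peer triggers a config
// pull and install; the initiator's handshake retransmit then succeeds.
func (g *Gateway) OnHandshakeInitiation(pubKey string) bool {
	if _, ok := g.installed[pubKey]; ok {
		return true // already provisioned, nothing to do
	}
	cfg, ok := g.lookup(pubKey)
	if !ok {
		return false // not one of ours: drop silently
	}
	g.installed[pubKey] = cfg // real system: Netlink install
	return true
}

func main() {
	gw := &Gateway{
		installed: map[string]PeerConfig{},
		lookup: func(pk string) (PeerConfig, bool) {
			if pk == "peer-A" {
				return PeerConfig{pk, "fdaa::1/128"}, true
			}
			return PeerConfig{}, false
		},
	}
	fmt.Println(gw.OnHandshakeInitiation("peer-A"), gw.OnHandshakeInitiation("stranger")) // prints true false
}
```

The stale-peer win follows directly: peers are materialised only when they actually send a handshake, so the kernel's peer table tracks live demand instead of the whole fleet's configuration.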
  • 2024-03-07 — Fly Kubernetes does more now — Fly Kubernetes launched in public beta; Pods-as-Firecracker-micro-VMs; flyd orchestrator; integration surfaces across Fly Machines.
  • 2024-02-15 — Globally Distributed Object Storage with Tigris — Tigris public beta; architectural framing (FoundationDB + NVMe + QuiCK-style queue + S3 backend); fly storage create CLI; unified-billing framing.

Notes on tier

Fly.io is a Tier-3 source on the sysdesign-wiki. Per AGENTS.md, Tier-3 ingests require the post to clearly cover distributed-systems internals, scaling trade-offs, infrastructure architecture, production incidents, or similar — product-PR and feature-announcement posts are skipped. The Tigris public-beta post is borderline (it's a launch announcement) but is ingested because the three-layer architectural statement (FDB + NVMe-cache + QuiCK-queue) is load-bearing distributed-systems content.

Last updated · 200 distilled / 1,178 read