
FLYIO 2025-02-12


Fly.io — The Exit Interview: JP Phillips

Summary

Exit-interview blog post (2025-02-12) with JP Phillips, the engineer who led flyd, Fly.io's in-house orchestrator for Fly Machines (Firecracker micro-VMs), over his 4-year tenure. The interview is a Q-and-A with Thomas Ptáček and is off-hand in voice, but it is substantively an architectural retrospective on the flyd design, the infrastructure JP is proudest of, and the corners he would revisit.

Four reusable pieces of wiki content:

  1. The flyd FSM design is rooted in earlier work on Compose.io / MongoHQ "recipes/operations" and with Cadence at HashiCorp: "once I understood what the product needed to do and look like, having a way to perform deterministic and durable execution felt like a good design".
  2. BoltDB was the right choice for flyd's state store: "I've never lost a second of sleep worried that someone is about to run a SQL update statement on a host, or across the whole fleet, and then mangled all our state data. And limiting the storage interface, by not using SQL, kept flyd's scope managed".
  3. Per-Fly-Machine SQLite is the alternate design JP would consider if doing it over: "with per-Machine SQLite, once a Fly Machine is destroyed, we can just zip up the database and stash it in object storage. The biggest hold-up I have about it is how we'd manage the schemas".
  4. pilot is Fly's new init, an OCI-compliant runtime with an API for flyd to drive. It consolidates features that had accreted onto the original init binary and gives flyd, for the first time, a defined contract with process-zero inside a Fly Machine.

Additional substance: flaps, the API gateway for the Machines API, has a sub-5-second P90 on machine create for pretty much every region except Johannesburg and Hong Kong. JP names corrosion2 (Fly's SWIM-gossip CRDT-SQLite state-distribution system) as the most impressive piece of infrastructure he saw someone else build at Fly, and explicitly flags its potential value to external companies "if we invested in Antithesis or TLA+ testing." Thomas attributes Fly's OpenTelemetry adoption to JP, and JP frames it as load-bearing ("without oTel, it'd be a disaster trying to troubleshoot the system; I'd have ragequit trying").

Three cultural observations are recorded on ambiguous bottom-up management: it is too easy to lose sight of value, communication is inconsistent, and direction changes too often. JP gives Kurt a 2★ rating for 2023, citing "we hired too many people, too quickly, and didn't have the guardrails and structure in place for everybody to be successful" and the GPU distraction.

The post is an interview, not a deep-dive: architectural content is concentrated but not exhaustive.

Key takeaways

  1. The flyd FSM/durable-execution pattern is ancestry-linked to Cadence (Temporal's predecessor) and to Compose.io/MongoHQ "recipes". "I think the FSM stuff is a result of work I did at Compose.io / MongoHQ (where it was called 'recipes' / 'operations') and the work I did at HashiCorp using Cadence. … Cadence is the child of AWS Step Functions and the predecessor to Temporal (the company)." The motivating constraint was continuous-deploy: "if flyd was in the middle of doing some work, it needed to pick back up right where it left off, post-deploy." Canonical wiki-link from durable execution to systems/flyd; a new canonical link from systems/flyd back to systems/cadence + systems/temporal as lineage, not dependency.
  2. BoltDB over SQLite for an orchestrator state store — JP stands by the choice. "I still believe Bolt was the right choice. I've never lost a second of sleep worried that someone is about to run a SQL update statement on a host, or across the whole fleet, and then mangled all our state data. And limiting the storage interface, by not using SQL, kept flyd's scope managed. On the engine side of the platform, which is what flyd is, I still believe SQL is too powerful for what flyd does." This is the inverse of the reasoning that drove corrosion to SQLite — corrosion wants generic queryability because its consumers want SQL. The argument generalises: pick your state-store's query surface by your blast radius for an ad-hoc query, not by feature surface.
  3. Per-Fly-Machine SQLite is the alternate design JP would pick if doing it again, but only if schema management were solved. "I'd maybe consider a SQLite database per-Fly-Machine. Then the scope of danger is about as small as it could possibly be. … Yeah, with per-Machine SQLite, once a Fly Machine is destroyed, we can just zip up the database and stash it in object storage. The biggest hold-up I have about it is how we'd manage the schemas." Canonical wiki instance of the per-instance embedded database pattern (sibling to Cloudflare's Durable Objects one-SQLite-per-object model and to per-tenant search indices); schema management is the specific open problem named.
  4. pilot is Fly's next-generation init — OCI-compliant, with a defined API for flyd to drive. "pilot is our new init. When we launch a Fly Machine, init is our foothold in the machine … pilot consolidates those features, and, more importantly, is itself a complete OCI runtime; pilot can natively run containers inside of Fly Machines. Before pilot, there really wasn't any contract between flyd and init. And init was just 'whatever we wanted init to be'. That limited its ability to serve us. Having pilot be an OCI-compliant runtime with an API for flyd to drive is a big win for the future of the Fly Machines API." Two wiki implications: (a) a new pilot system page (distinct from the existing systems/fly-init page, which stays as the historical record of the Rust init described in 2024-06-19); (b) the init-contract-to-flyd framing is a reusable OCI-init-contract pattern — don't let your init drift into an undocumented feature bag; pin it to a formal runtime contract.
  5. flaps is the Machines-API gateway, and it's not the whole Machines API. "The flaps API server, the flyd RPCs it calls, the flyd finite state machine system, the interface to running VMs." is JP's tour of what he considers "the whole Fly Machines API." flaps serves the API documented at https://docs.machines.dev, makes RPCs into per-host flyd processes across the global fleet, and "for the most part doesn't require any central coordination." Disclosed number: P90 for Fly Machine create calls is sub-5-seconds for "pretty much every region" except Johannesburg and Hong Kong. Canonical wiki instance for a distributed-orchestrator API gateway without a central scheduler.
  6. corrosion2 is Fly's SWIM-gossip CRDT-SQLite state distribution system, and JP would open-source it harder. "corrosion2 is our state distribution system. While flyd runs individual Fly Machines for users, each instance is solely responsible for its own state; there's no global scheduler. But we have platform components, most obviously fly-proxy, our Anycast router, that need to know what's running where. corrosion2 is a Rust service that does SWIM gossip to propagate information from each worker into a CRDT-structured SQLite database. corrosion2 essentially means any component on our fleet can do SQLite queries to get near-real-time information about any Fly Machine around the world." The post also flags the investment gap for external adoption: "If we invested in Antithesis or TLA+ testing, I think there's potential for other companies to get value out of corrosion2." Supersedes the "corrosion" lineage (the 2024-07-30 post called it "Corrosion, the SWIM-gossip SQLite database" — the 2 suffix is new disclosure here). Structurally: same SWIM-gossip + SQLite substrate, new emphasis on the CRDT structure of the table data.
  7. OpenTelemetry is load-bearing at Fly.io. JP: "Without oTel, it'd be a disaster trying to troubleshoot the system. I'd have ragequit trying." Thomas: "I basically attribute oTel at Fly.io to you." Honeycomb is the trace backend cited: "we didn't have the best track record running a logs/metrics cluster at this fidelity. It was worth the money to pay someone else to manage tracing data." Adjacent claim: "I think the next big part of oTel is going to be auto-instrumentation, for profiling." Canonical wiki citation for Fly.io's instrumentation posture.
  8. The Go vs. Rust vs. Elixir cocktail: honest take. "Most of our backend is in Go, but fly-proxy, corrosion2, and pilot are in Rust." JP's 3 nice things about Rust: Option / match / serde macros. On the mix: "Three's a crowd, Elixir can stay home." On whether Ruby is staying: "Ruby is functionally dead here, and Elixir is ascendant" (an editorial note from Thomas, which cuts against JP's wish to leave Elixir out of the mix). Wiki value: confirms which Fly.io services are Rust (canonical list for 2025: fly-proxy, corrosion2, pilot, and per sources/2024-06-19-flyio-aws-without-access-keys also fly-init).
  9. Culture: bottom-up ambiguity + 2023 overhiring + GPU distraction are the named negatives. JP's Kurt ratings (on a 4-star scale): 2022 ★★★★, 2023 ★★, 2024 ★★✩ (2.5), 2025 ★★★✩ (3.5). "We hired too many people, too quickly, and didn't have the guardrails and structure in place for everybody to be successful." "GPUs were a killer distraction." Also: "We struggle a lot with consistent communication. We change direction a little too often." Sibling to the Fly.io GPU retrospective already in the wiki (sources/2024-08-15-flyio-were-cutting-l40s-prices-in-half), which also cops to the distraction — but from the product side. The exit interview corroborates from the engineering side.
  10. Why JP is leaving: "it didn't really match up with where we're currently heading. Specifically, with our new focus on MPG [Managed Postgres] and [llm]. … More directly positioned as a cloud provider, rather than a platform-as-a-service." The Fly Machines platform itself is framed as "more or less finished, in the sense of being capable of supporting the next iteration of our products. My original desire to join Fly.io was to make Machines a product that would rid us of HashiCorp Nomad, and I feel like that's been accomplished." Wiki value: 2025 framing of Fly's self-described platform completeness. Machines + flyd are done; future product work is vertical (MPG, LLM inference, cloud-provider positioning), not horizontal (more orchestration primitives).

Architectural details

flyd — finite-state-machine orchestrator with BoltDB durability

Consistent with the 2024-07-30 Making Machines Move source. The exit interview restates it from the architect's mouth:

  • Structure: "flyd runs independently without any central coordination on thousands of 'worker' servers around the globe. It's structured as an API server for a bunch of finite state machine invocations, where an FSM might be something like 'start a Fly Machine' or 'create a new Fly Machine' or 'cordon off a Fly Machine so we can update it'. Each FSM invocation is comprised of a bunch of steps, each of those steps has callbacks into the flyd code, and each step is logged in a BoltDB database."
  • Design ancestry: Compose.io/MongoHQ "recipes/operations" + HashiCorp Cadence. The durable-execution idea is not novel to flyd; the specific BoltDB-not-SQL embodiment is.
  • Deploy tolerance: the durability property is continuous-deploy-survivability. flyd is deployed constantly; any in-flight FSM must resume at its last recorded step post-redeploy.
  • Scope discipline via storage: JP credits the non-SQL interface for keeping flyd's scope contained. "limiting the storage interface, by not using SQL, kept flyd's scope managed." A full-SQL store invites ad-hoc queries, ad-hoc update statements, and ad-hoc schema changes — corrosion-style features that suit a read-side state-distribution plane but not an engine.
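
The resume-after-deploy property described in the bullets above can be sketched as follows. This is a minimal illustration, not flyd's code: the `StepLog` class, the JSON file standing in for BoltDB, the step names, and the `crash_before` knob are all assumptions made for the example. The store deliberately exposes only `get` and `put`, mirroring the narrow, non-SQL interface JP credits.

```python
import json
import os
import tempfile


class StepLog:
    """Tiny key-value step log persisted as JSON (a stand-in for BoltDB)."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        if os.path.exists(path):
            with open(path) as f:
                self.data = json.load(f)

    def get(self, run_id):
        # Index of the next step to run for this invocation (0 if unseen).
        return self.data.get(run_id, 0)

    def put(self, run_id, next_step):
        self.data[run_id] = next_step
        # Write to a temp file, then rename, so a crash never leaves a torn log.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.data, f)
        os.replace(tmp, self.path)


def run_fsm(log, run_id, steps, crash_before=None):
    """Run steps in order, recording progress after each one.

    crash_before simulates a redeploy: stop just before that step index.
    Returns the names of the steps executed during this call.
    """
    executed = []
    for i in range(log.get(run_id), len(steps)):
        if crash_before is not None and i == crash_before:
            return executed
        name, callback = steps[i]
        callback()
        log.put(run_id, i + 1)
        executed.append(name)
    return executed


def demo():
    path = os.path.join(tempfile.mkdtemp(), "steps.json")
    # Hypothetical step names with no-op callbacks.
    steps = [(n, lambda: None) for n in ("reserve", "fetch_image", "boot")]
    first = run_fsm(StepLog(path), "start-machine-1", steps, crash_before=1)
    # A fresh StepLog models the new process after a redeploy.
    second = run_fsm(StepLog(path), "start-machine-1", steps)
    return first, second
```

Calling `demo()` runs one step, "redeploys", and then finishes the remaining two from the recorded position. Because progress is recorded after a step's callback returns, a crash between the two would re-run that step, so callbacks in a design like this need to be idempotent.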

corrosion2 — SWIM-gossip CRDT-SQLite

New wiki disclosure this post: the 2 suffix + explicit CRDT structure of the SQLite tables. Prior wiki material (the 2024-07-30 post) introduced "Corrosion, the SWIM-gossip SQLite database"; JP now separates the Corrosion-2 redesign from the v1 lineage: "we deployed corrosion, learned from it, and were able to make significant and valuable improvements — and then migrate to the new system in a short period of time."

Architecturally:

  • SWIM gossip: propagation substrate (wiki: Noise-adjacent but separate; Fly.io's other SWIM use is unrelated).
  • CRDT-structured SQLite: each node's SQLite holds conflict-free merge-friendly table data; merges are mechanical.
  • API surface for consumers: "any component on our fleet can do SQLite queries to get near-real-time information about any Fly Machine around the world." SQL is the consumer interface, not the consistency substrate.
  • External-adoption framing: "Having a 'just SQLite' interface, for async replicated changes around the world in seconds, it's pretty powerful." JP's suggested investment for external viability is Antithesis or TLA+ testing, i.e. a validation story driven by deterministic simulation or model checking. This matches patterns/formal-methods-before-shipping, already canonical in this wiki (MongoDB Raft, systems/tla-plus).
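
To make "CRDT-structured SQLite" concrete, here is a minimal sketch of one of the simplest CRDT shapes, a last-writer-wins register per row, stored in plain SQLite tables. The post does not describe corrosion2's actual merge rules, so the table layout, the `(version, node)` tie-break, and every function name below are assumptions for illustration only.

```python
import sqlite3

SCHEMA = """
CREATE TABLE machines (
    id      TEXT PRIMARY KEY,
    state   TEXT NOT NULL,
    version INTEGER NOT NULL,  -- logical clock of the last write
    node    TEXT NOT NULL      -- writer id, used only to break ties
)
"""


def new_replica():
    db = sqlite3.connect(":memory:")
    db.execute(SCHEMA)
    return db


def apply_row(db, row):
    """Keep the incoming row only if it wins on (version, node)."""
    machine_id, state, version, node = row
    cur = db.execute(
        "SELECT version, node FROM machines WHERE id = ?", (machine_id,)
    ).fetchone()
    if cur is None or (version, node) > tuple(cur):
        db.execute(
            "INSERT OR REPLACE INTO machines VALUES (?, ?, ?, ?)",
            (machine_id, state, version, node),
        )


def write(db, node, machine_id, state, version):
    # A local write goes through the same rule as a merged remote row.
    apply_row(db, (machine_id, state, version, node))


def merge(dst, src):
    """Apply every row from src to dst; repeatable and order-independent."""
    for row in src.execute("SELECT id, state, version, node FROM machines"):
        apply_row(dst, row)


def snapshot(db):
    return db.execute(
        "SELECT id, state, version, node FROM machines ORDER BY id"
    ).fetchall()


def demo():
    a, b = new_replica(), new_replica()
    write(a, "worker-a", "m1", "started", 1)
    write(b, "worker-b", "m1", "stopped", 2)  # newer write on another node
    write(b, "worker-b", "m2", "started", 1)
    merge(a, b)
    merge(b, a)
    return snapshot(a), snapshot(b)
```

After the two merges both replicas hold identical rows, and a consumer reads them with an ordinary `SELECT`. That is the sense in which SQL is the consumer interface while the merge rule, not SQL, carries consistency.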

Substrate comparison: corrosion2 is the read-side state distribution tier; flyd's BoltDB is the authoritative per-worker state tier. Different storage picks because different query surfaces.

pilot — next-generation init with OCI-runtime + flyd contract

JP's answer to "The rest of the platform, you're fine with?":

I'm happier now that we have pilot. pilot is our new init. When we launch a Fly Machine, init is our foothold in the machine; this is unlike a normal OCI runtime, where "pid 1" is often the user's entrypoint program. Our original init was so simple people dunked on it and said it might as well have been a bash script; over time, init has sprouted a bunch of new features. pilot consolidates those features, and, more importantly, is itself a complete OCI runtime; pilot can natively run containers inside of Fly Machines.

Before pilot, there really wasn't any contract between flyd and init. And init was just "whatever we wanted init to be". That limited its ability to serve us.

Having pilot be an OCI-compliant runtime with an API for flyd to drive is a big win for the future of the Fly Machines API.

Delta vs. Fly init (Rust, 2024-06-19):

| Dimension | init (pre-pilot) | pilot (2025) |
| --- | --- | --- |
| Role in Machine | PID 1, feature-bag | PID 1, OCI runtime |
| Contract with flyd | Ad-hoc / absent | Formal API |
| Runs containers | No (launches a customer entrypoint) | Yes, natively |
| Consolidation | N/A | Absorbs prior init features |

This is the runc / containerd-for-Fly-Machines move: before, flyd relied on whatever ad-hoc behaviour init happened to have; after, flyd calls a defined OCI-runtime API on pilot inside the Machine.
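
As a rough illustration of what "a defined contract" buys the caller, the sketch below types an init against the five operations the OCI runtime specification names (state, create, start, kill, delete). pilot's real API is not disclosed in the post, so `InitContract`, `FakeInit`, and `launch` are hypothetical names, and the in-memory implementation only mimics the spec's lifecycle checks.

```python
from typing import Protocol


class InitContract(Protocol):
    """Hypothetical contract, named after the OCI runtime-spec operations."""

    def create(self, container_id: str, bundle: str) -> None: ...
    def start(self, container_id: str) -> None: ...
    def kill(self, container_id: str, signal: int) -> None: ...
    def delete(self, container_id: str) -> None: ...
    def state(self, container_id: str) -> str: ...


class FakeInit:
    """In-memory implementation, for illustration only."""

    def __init__(self):
        self._status = {}

    def create(self, container_id, bundle):
        if container_id in self._status:
            raise ValueError("container id already in use")
        self._status[container_id] = "created"

    def start(self, container_id):
        if self._status[container_id] != "created":
            raise RuntimeError("start requires a created container")
        self._status[container_id] = "running"

    def kill(self, container_id, signal):
        if self._status[container_id] not in ("created", "running"):
            raise RuntimeError("kill requires a created or running container")
        self._status[container_id] = "stopped"

    def delete(self, container_id):
        if self._status[container_id] != "stopped":
            raise RuntimeError("delete requires a stopped container")
        del self._status[container_id]

    def state(self, container_id):
        return self._status[container_id]


def launch(init: InitContract, container_id: str, bundle: str) -> str:
    """The orchestrator side depends only on the contract, not on init internals."""
    init.create(container_id, bundle)
    init.start(container_id)
    return init.state(container_id)
```

The design point is that `launch` would work unchanged against any implementation that honours the contract, which is what an undocumented, feature-bag init cannot offer.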

flaps — Machines API gateway

"Yes, all of it. The flaps API server, the flyd RPCs it calls, the flyd finite state machine system, the interface to running VMs." — JP's tour of the "whole Fly Machines API."

  • Scope: flaps = the API server behind the Machines API documented at docs.machines.dev. It fronts the API as a whole; machine create is simply the one call with a disclosed latency figure.
  • Decentralisation: "I like that it for the most part doesn't require any central coordination."
  • Performance: "the P90 for Fly Machine create calls is sub-5-seconds for pretty much every region except for Johannesburg and Hong Kong."
  • Implementation: not disclosed whether flaps is Go or Rust; not enumerated in JP's Rust-services list (fly-proxy, corrosion2, pilot), so likely Go.
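
For readers unfamiliar with the metric, a P90 of under 5 seconds means 90% of create calls finished at or below that value. The sketch below computes a nearest-rank P90 per region and lists regions that miss a 5-second budget. The latency samples and region labels are made up for the example; the post discloses no raw data.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Smallest sample with at least pct percent of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Made-up create latencies in seconds, keyed by made-up region labels.
SAMPLES = {
    "region-a": [1.2, 1.4, 1.1, 2.0, 1.8, 1.3, 1.6, 4.2, 1.5, 1.7],
    "region-b": [3.9, 4.4, 6.1, 5.2, 4.8, 7.3, 4.1, 5.9, 4.6, 5.0],
}


def regions_over_budget(samples_by_region, budget_s=5.0):
    """Regions whose P90 create latency is at or above the budget."""
    return sorted(
        region
        for region, samples in samples_by_region.items()
        if percentile_nearest_rank(samples, 90) >= budget_s
    )
```

With these samples, `region-a` has a P90 of 2.0 s despite one 4.2 s outlier, while `region-b` lands at 6.1 s and is the only region reported over budget.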

Per-Fly-Machine SQLite — the "if I had to do it over"

JP's alternate design:

But, I'd maybe consider a SQLite database per-Fly-Machine. Then the scope of danger is about as small as it could possibly be. … Yeah, with per-Machine SQLite, once a Fly Machine is destroyed, we can just zip up the database and stash it in object storage. The biggest hold-up I have about it is how we'd manage the schemas.

Wiki shape: a per-instance embedded database pattern, sibling to Cloudflare's Durable Objects one-SQLite-per-object model and to per-tenant search indices.

What JP names as the specific open problem: "how we'd manage the schemas" — schema migration across a fleet of per-instance databases that come and go with Machine lifetime is the pattern's unsolved hard part.
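
The "zip up the database and stash it" lifecycle can be sketched with the standard library alone. This is an illustration under stated assumptions: the file layout, the `events` table, and the local directory standing in for object storage are invented for the example. SQLite's built-in `PRAGMA user_version` is used to stamp each file with a schema version, which records what an archive was written with but does not by itself solve fleet-wide schema management.

```python
import os
import sqlite3
import tempfile
import zipfile

SCHEMA_VERSION = 1  # hypothetical; stored in SQLite's user_version slot


def open_machine_db(root, machine_id):
    """Open (or create) the database that belongs to exactly one machine."""
    db = sqlite3.connect(os.path.join(root, f"{machine_id}.db"))
    if db.execute("PRAGMA user_version").fetchone()[0] == 0:
        db.execute("CREATE TABLE events (seq INTEGER PRIMARY KEY, what TEXT)")
        db.execute(f"PRAGMA user_version = {SCHEMA_VERSION}")
        db.commit()
    return db


def destroy_machine(root, archive_dir, machine_id):
    """Zip the machine's database into archive_dir, then delete the live file."""
    src = os.path.join(root, f"{machine_id}.db")
    dst = os.path.join(archive_dir, f"{machine_id}.zip")
    with zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(src, arcname=f"{machine_id}.db")
    os.remove(src)
    return dst


def demo():
    # archive is a local directory standing in for object storage.
    root, archive = tempfile.mkdtemp(), tempfile.mkdtemp()
    db = open_machine_db(root, "m1")
    db.execute("INSERT INTO events (what) VALUES ('created')")
    db.commit()
    db.close()
    path = destroy_machine(root, archive, "m1")
    with zipfile.ZipFile(path) as zf:
        names = zf.namelist()
    return names, os.path.exists(os.path.join(root, "m1.db"))
```

The blast radius of any mistake is one file, and destroying a machine leaves nothing behind on the host. The hard part JP names remains: every live file may sit at a different `user_version`, and archived files never get migrated at all.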

Numbers disclosed

  • P90 machine create latency: sub-5-seconds for "pretty much every region" except Johannesburg and Hong Kong.
  • Kurt rating histogram (4-star scale): 2022 ★★★★ / 2023 ★★ / 2024 ★★✩ / 2025 ★★★✩.

Numbers not disclosed

  • flyd replica / worker count. ("thousands of worker servers" is qualitative.)
  • corrosion2 node count, table-row count, gossip-convergence latency.
  • BoltDB size on a typical flyd host.
  • pilot rollout completion (full-fleet? beta?).
  • Honeycomb spend numbers.
  • Ratio of Go to Rust LOC across Fly.io.

Caveats

  • Interview voice, not a deep-dive — architectural content is dense but not exhaustive. Where a claim contradicts prior wiki coverage, defer to the deep-dive sources (2024-07-30 Making Machines Move for flyd + dm-clone migration; 2024-06-19 AWS without Access Keys for init + OIDC + Macaroons; 2024-03-12 JIT WireGuard for fly-gateway + NATS retirement).
  • JP's one-person-sample opinions are recorded as such — e.g. the Kurt rating histogram, the "Elixir can stay home" line, the Bolt-over-SQLite defence. These are JP's views, not Fly.io's official position.
  • Ruby status — editorial note (Thomas, not JP) says "Ruby is functionally dead here, and Elixir is ascendant." Wiki caveat: this is Thomas's note, not JP's statement; no deprecation timeline or affected-codebase enumeration given.
  • [llm] redaction — the post literally redacts one of the two named strategic directions ("our new focus on MPG and [llm]"). Fly.io has since publicly discussed LLM-related products, but the redaction in this source limits what can be wiki'd.
  • MPG — Managed Postgres is named but not architected in this post. Wiki policy: create no MPG stub from this source; wait for a dedicated MPG post.
  • corrosion2 vs. corrosion — the 2 suffix is disclosed here; no migration-mechanics detail ("short period of time" is all JP says). The wiki systems/corrosion-swim page is updated to note the v2 designation.
  • Antithesis / TLA+ — mentioned as investments Fly.io has not made on corrosion2; not a product endorsement.
  • No Corrosion blog post exists yet — JP teases "deserves its own post" echoing the 2024-07-30 framing; still not delivered.

Relationship to existing wiki

  • Extends systems/flyd: adds the FSM-lineage line (Compose.io / MongoHQ "recipes" + HashiCorp Cadence) and JP's canonical Bolt-over-SQLite defence. The 2024-07-30 Making Machines Move coverage of flyd remains the deep architecture; this source supplements with the author's own framing of why.
  • Extends systems/corrosion-swim: adds the 2 suffix + CRDT-structured-SQLite framing + Antithesis/TLA+-investment-needed external-adoption framing. JP's "most impressive thing someone else built here" elevates corrosion2 on the wiki.
  • Creates systems/fly-pilot: distinct from systems/fly-init (the pre-pilot Rust init from 2024-06-19). pilot is the OCI-compliant successor with a formal flyd contract.
  • Links systems/cadence + systems/temporal to systems/flyd as lineage, not runtime dependency. Both Cadence and Temporal are already canonical wiki pages (MongoDB + broader durable-execution coverage).
  • Confirms systems/boltdb choice for flyd via JP's explicit defence: the first wiki-quality defence of Bolt-because-it's-not-SQL-so-nobody-can-foot-gun-the-fleet.
  • Confirms patterns/pull-on-demand-replacing-push era at Fly.io — flyd "went from NATS-driven to HTTP" already cited from 2024-03-12; JP reaffirms the cultural move.
  • Cross-refs concepts/durable-execution — flyd now has a canonical ancestry citation (via Cadence) on top of the MongoDB + Cloudflare Workflows anchors already there.
  • Cross-refs patterns/formal-methods-before-shipping — JP's Antithesis / TLA+ framing on corrosion2 is the canonical wiki acknowledgement of what validation would graduate a Fly.io internal component to external viability.
