FLYIO 2025-02-14 Tier 3

We Were Wrong About GPUs

Retrospective / course-correction post by Thomas Ptacek on Fly.io's 2022-era bet on productising GPU Fly Machines: Firecracker-shaped, hardware-virtualized Fly Machines with PCI-passthrough Nvidia GPUs, running on Intel Cloud Hypervisor. Framed as "we're not getting rid of them, but we're scaling back", and paired with the JP Phillips exit interview from two days earlier as the honest-retrospective half of the Fly.io blog's 2025-Q1 posture. Three load-bearing disclosures clear the Tier-3 bar:

  1. A concrete enumeration of what productising GPU micro-VMs cost Fly.io engineering — dedicated hardware pool, two independent security assessments, months of (ultimately failed) work to map virtualized Nvidia GPUs into Intel Cloud Hypervisor.
  2. The Nvidia "happy path" disclosure — Nvidia's driver support is engineered for K8s-with-shared-kernel or QEMU/VMware; micro-VM hypervisors are off the supported path, and Fly.io's developer-experience requirement (millisecond boot) forced the off-path choice.
  3. A demand-side diagnosis: "developers don't want GPUs. They don't even want AI/ML models. They want LLMs." The insurgent cloud can't compete with OpenAI / Anthropic on tokens-per-second for transaction-shaped developer workloads.

Summary

Fly.io bet in 2022 that application developers shipping apps would want transaction-shaped inference near their compute. The Fly GPU Machine was the product: a Fly Machine with a hardware-mapped Nvidia GPU, running on Cloud Hypervisor (Firecracker's sibling that supports PCI passthrough) rather than Firecracker itself. The engineering lift was substantial — Nvidia's driver ecosystem isn't geared to micro-VM hypervisors; the security team treated GPUs as "just about the worst case hardware peripheral"; Fly had to stand up dedicated GPU-only hardware (reducing utilisation vs non-GPU machines), fund two external security assessments (Atredis + Tetrel), and burn months failing to map virtualized GPUs into Cloud Hypervisor. At one point Fly hex-edited Nvidia's closed-source drivers to trick them into thinking the hypervisor was QEMU. Two years in, the bet isn't paying off. The diagnosis the post offers is demand-side: the developers who would use Fly Machines mostly want LLM access, not GPU access, and OpenAI / Anthropic's APIs are fast enough that millisecond inference-compute locality doesn't matter for their workloads. What remains: a workable L40S customer segment ("a bunch of these", not a core-business driver) and a useful learning. Explicit parallel drawn to Fly.io's earlier edge JavaScript runtime pivot: "we were wrong about Javascript edge functions, and I think we were wrong about GPUs."

Key takeaways

  1. "Developers don't want GPUs. They don't even want AI/ML models. They want LLMs." The demand-side framing for why Fly.io's GPU bet isn't paying off. "When a software developer shipping an app comes looking for a way for their app to deliver prompts to an LLM, you can't just give them a GPU." For an insurgent cloud, competing with OpenAI / Anthropic on tokens-per-second is not plausible. (Source: sources/2025-02-14-flyio-we-were-wrong-about-gpus)

  2. GPU micro-VM virtualisation is off Nvidia's supported path. "The Nvidia ecosystem is not geared to supporting micro-VM hypervisors." Nvidia's happy path is K8s-with-shared-kernel (Fly's customers share worker hosts, not kernels) or a conventional hypervisor (VMware / QEMU). Fly.io burned months trying (and ultimately failing) to get Nvidia's host drivers working against Cloud Hypervisor — including hex-editing closed-source drivers to impersonate QEMU. The MIG / vGPU thin-slicing market segment remained inaccessible. (Source: sources/2025-02-14-flyio-we-were-wrong-about-gpus)

  3. Fly Machine developer experience vs the Nvidia happy path forced the off-path choice. QEMU would have been security-defensible and compatible with Nvidia's drivers — but Fly Machines boot in milliseconds, and QEMU couldn't deliver that DX. Fly chose Cloud Hypervisor + the driver-integration cost over QEMU + the boot-latency cost. "We could not have offered our desired Developer Experience on the Nvidia happy-path." (Source: sources/2025-02-14-flyio-we-were-wrong-about-gpus)

  4. GPUs were "just about the worst case hardware peripheral" for a multi-tenant security team. "Intense multi-directional direct memory transfers, with arbitrary, end-user controlled computation, all operating outside our normal security boundary." In common configurations, GPUs talk to each other, not just host↔GPU. Fly mitigated by (a) dedicated GPU servers — no mixed GPU / non-GPU workloads on the same box — and (b) two large external security assessments from Atredis and Tetrel. Dedicated hardware was a secondary cost: the only reason a Fly Machine was scheduled on a GPU worker was to claim a PCI BDF, and there's a bounded number per box, so GPU servers ran drastically less utilised than general workers. Canonical wiki instance of patterns/dedicated-host-pool-for-hostile-peripheral and patterns/independent-security-assessment-for-hardware-peripheral. (Source: sources/2025-02-14-flyio-we-were-wrong-about-gpus)

  5. The "cheap thin-sliced GPU for developers" market remained unreachable. Fly.io thinks there is probably a market for developers doing lightweight ML on tiny GPUs (MIG slices one big GPU into arbitrarily small virtual ones), but "for fully-virtualized workloads, it's not baked; we can't use it." The MIG instance presents as a UUID to the host driver, not a PCI device — which breaks the PCI-passthrough model Fly depends on. Can't prove the customers exist because the segment was never reached. (Source: sources/2025-02-14-flyio-we-were-wrong-about-gpus)

  6. The inference-locality thesis survives but isn't market-decisive yet. "We have app servers, GPUs, and object storage all under the same top-of-rack switch. But inference latency just doesn't seem to matter yet, so the market doesn't care." Developers shipping on AWS will tolerate outsourcing to a GPU-specialist cloud and paying egress on gigabytes of model weights from S3 — because tokens-per-second is what matters, not milliseconds. (Source: sources/2025-02-14-flyio-we-were-wrong-about-gpus)

  7. Serious-AI customers want scale Fly.io can't give them. "People doing serious AI work want galactically huge amounts of GPU compute. A whole enterprise A100 is a compromise position for them; they want an SXM cluster of H100s." The Fly.io product doesn't reach that ceiling, and the thin-sliced-GPU floor is driver-gated. What remains is the middle: L40S customers — "there are a bunch of these!" — who are the beneficiaries of the 2024-08-15 L40S price cut. Persisting, but not core-business-driving. (Source: sources/2025-02-14-flyio-we-were-wrong-about-gpus)

  8. Asset-backed bets are structurally cheaper to be wrong about. GPUs will liquidate. Parallel drawn to Fly.io's IPv4 address portfolio — another tradable-with-durable-value asset. "I'm even more comfortable making bets backed by tradable assets with durable value." Canonical wiki instance of concepts/asset-backed-bet. (Source: sources/2025-02-14-flyio-we-were-wrong-about-gpus)

  9. Fly.io's "design for 10,000, not 5-6" credo as a product-fit filter. "We design for 10,000 developers, not for 5-6. It took a minute, but the credo wins here: GPU workloads for the 10,001st developer are a niche thing." The credo is how Fly eventually diagnosed the mismatch. (Source: sources/2025-02-14-flyio-we-were-wrong-about-gpus)

  10. Retrenchment without customer abandonment. "If you're using Fly GPU Machines, don't freak out; we're not getting rid of them." No v2 product, no new hardware investment, existing workloads stay online. Canonical wiki instance of patterns/platform-retrenchment-without-customer-abandonment. Complements the JP Phillips exit interview's "Machines is finished" framing: Fly.io is in a scope-consolidation posture in early 2025. (Source: sources/2025-02-14-flyio-we-were-wrong-about-gpus)
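The utilisation cost in takeaway 4 follows mechanically from the scheduling constraint: the only reason to place a Machine on a GPU worker is to claim one of a bounded number of PCI BDFs. A minimal sketch of that constraint — all names are hypothetical; the post discloses the constraint, not any scheduler code:

```go
package main

import "fmt"

// Worker models a GPU-only host: a bounded pool of PCI BDFs
// (bus/device/function addresses), one per passthrough GPU.
type Worker struct {
	Name     string
	FreeBDFs []string // e.g. "0000:3b:00.0"
}

// Claim hands out a BDF if one is free. A Machine that doesn't
// need a GPU should never be scheduled on this worker at all.
func (w *Worker) Claim() (string, bool) {
	if len(w.FreeBDFs) == 0 {
		return "", false
	}
	bdf := w.FreeBDFs[0]
	w.FreeBDFs = w.FreeBDFs[1:]
	return bdf, true
}

func main() {
	w := &Worker{Name: "gpu-worker-1", FreeBDFs: []string{"0000:3b:00.0", "0000:5e:00.0"}}
	for i := 0; i < 3; i++ {
		if bdf, ok := w.Claim(); ok {
			fmt.Printf("machine %d -> %s\n", i, bdf)
		} else {
			fmt.Printf("machine %d -> no GPU slot free\n", i)
		}
	}
}
```

Once both BDFs are claimed, the worker is "full" for GPU purposes even if its CPU and RAM sit mostly idle — the qualitative utilisation delta the post describes.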

Operational numbers

  • Engineering time lost to Nvidia-driver integration on Cloud Hypervisor: "burned months trying (and ultimately failing)". No specific headcount or calendar span disclosed beyond "months".
  • Security assessments: two, from Atredis and Tetrel. "They were not cheap, and they took time." No dollar figure or time range disclosed.
  • Utilisation delta — dedicated GPU servers vs general Fly workers: "drastically less utilised and thus less cost-effective." Qualitative only.
  • Year of GPU bet: "a couple years back" / "when we embarked down this path in 2022".
  • Hypervisor choice: Intel Cloud Hypervisor for GPU Machines (PCI passthrough support); Firecracker for non-GPU Machines. Both are "very similar Rust codebase[s]."
  • Hex-edit hack disclosed: Fly at one point hex-edited Nvidia's closed-source host drivers to trick them into thinking the hypervisor was QEMU. No details on which driver binary or what surface was patched.
  • GPU SKUs still sold: L40S remains useful; enterprise A100 and SXM H100 clusters are the training-side ceiling Fly can't match.
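The hypervisor-per-workload-class split above reduces to a single predicate: does the Machine need PCI passthrough? A hypothetical sketch of the dispatch (the post discloses the split, not flyd's actual code; all identifiers are invented):

```go
package main

import "fmt"

type Hypervisor string

const (
	Firecracker     Hypervisor = "firecracker"      // millisecond boot, no PCI passthrough
	CloudHypervisor Hypervisor = "cloud-hypervisor" // similar Rust VMM, adds PCI passthrough
)

type MachineSpec struct {
	App     string
	GPUKind string // "" means no GPU requested; e.g. "l40s"
}

// pickVMM encodes the split the post describes: a GPU Machine needs a
// hardware-mapped PCI device, which Firecracker doesn't surface.
func pickVMM(m MachineSpec) Hypervisor {
	if m.GPUKind != "" {
		return CloudHypervisor
	}
	return Firecracker
}

func main() {
	fmt.Println(pickVMM(MachineSpec{App: "web"}))                    // firecracker
	fmt.Println(pickVMM(MachineSpec{App: "infer", GPUKind: "l40s"})) // cloud-hypervisor
}
```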

Extracted systems

  • systems/intel-cloud-hypervisor — Rust KVM-based VMM used for Fly GPU Machines because it supports PCI passthrough. Described as "a very similar Rust codebase" to Firecracker. First wiki appearance; new system page.
  • systems/firecracker — extended: Fly.io uses Firecracker for non-GPU Machines but not for GPU Machines; the GPU path requires PCI passthrough which Firecracker doesn't surface. This post is the cleanest wiki disclosure of the hypervisor-per-workload-class split.
  • systems/fly-machines — extended: the GPU Machine variant runs on Cloud Hypervisor instead of Firecracker; this post is the canonical disclosure of that split.
  • systems/nvidia-l40s — extended: the one SKU that found a product-market fit in Fly's inventory ("a bunch of these").
  • systems/nvidia-a100 / systems/nvidia-h100 — extended: the serious-AI ceiling Fly can't reach ("an SXM cluster of H100s").
  • systems/nvidia-mig — extended: thin-sliced GPU for developers remained inaccessible because MIG presents as a UUID to the host driver, not a PCI device.
  • systems/qemu / systems/vmware — named as the conventional-hypervisor alternatives on Nvidia's happy path; Fly rejected both: QEMU over millisecond-boot DX, VMware over institutional fit. Minimal new system pages — wiki touchpoints, not deep dives.
  • systems/flyd — extended: the orchestrator carries the engineering cost of Machine-vs-GPU-Machine split and the Nvidia-driver-shaping in the root filesystem path.
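The MIG dead-end in the systems notes above is an addressing mismatch: PCI passthrough maps a device by its bus/device/function address on the host bus, while a MIG instance surfaces only as a driver-level UUID. A hedged illustration — the identifier formats follow real PCI and Nvidia conventions, but the validation helper is invented for this sketch:

```go
package main

import (
	"fmt"
	"regexp"
)

// A VFIO/PCI-passthrough path can only map devices that have a
// PCI bus/device/function address like "0000:3b:00.0".
var bdfRe = regexp.MustCompile(`^[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]$`)

// passthroughable is a hypothetical check: true only for identifiers
// that name an actual PCI function the hypervisor could map in.
func passthroughable(dev string) bool {
	return bdfRe.MatchString(dev)
}

func main() {
	// A whole GPU: a PCI function a hypervisor can hand to the guest.
	fmt.Println(passthroughable("0000:3b:00.0")) // true
	// A MIG slice: a driver UUID, invisible to the PCI-passthrough model.
	fmt.Println(passthroughable("MIG-GPU-6a8a7b1c-0000-0000-0000-000000000000")) // false
}
```

This is the shape of "it's not baked; we can't use it": nothing in the passthrough pipeline can consume a UUID-addressed slice.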

Extracted concepts

  • concepts/developers-want-llms-not-gpus — Fly.io's canonical demand-side framing for why the GPU bet is not working. New concept page.
  • concepts/gpu-as-hostile-peripheral — the security-side framing. "Intense multi-directional direct memory transfers with arbitrary, end-user-controlled computation, all operating outside our normal security boundary." New concept page.
  • concepts/nvidia-driver-happy-path — the shape of Nvidia's supported-path (K8s-with-shared-kernel or QEMU/VMware) and the cost of deviating from it. New concept page.
  • concepts/fast-vm-boot-dx — the millisecond-boot DX Fly wasn't willing to give up for GPU Machines, which forced the off-happy-path hypervisor choice. New concept page.
  • concepts/asset-backed-bet — how tradable hardware / IPv4 portfolios change the risk profile of infra bets. New concept page.
  • concepts/insurgent-cloud-constraints — Fly.io framing for why an insurgent cloud can't beat OpenAI / Anthropic at tokens-per-second for transaction-shape workloads. New concept page.
  • concepts/product-market-fit — this post is the wiki's cleanest statement of course-correcting when you don't find it ("a startup is a race to learn stuff"). New concept page.
  • concepts/inference-compute-storage-network-locality — existing concept; the thesis survives per this post ("we really like the point in the solution space we found") but the market hasn't yet valued it for transaction-shaped developer inference.

Extracted patterns

  • patterns/dedicated-host-pool-for-hostile-peripheral — segregate worker hardware so a risky peripheral (GPU, TPM, FPGA, custom NIC) can't affect non-peripheral tenants. Fly.io GPU Machines run on GPU-only hosts — not because GPU workloads need it, but because mixing them with non-GPU tenants on one box is a multi-tenant security posture the team wasn't comfortable defending. New pattern page.
  • patterns/independent-security-assessment-for-hardware-peripheral — before productising a hostile peripheral on a multi-tenant platform, fund one or more independent external security assessments. Fly.io's GPU deployment had two (Atredis, Tetrel). New pattern page.
  • patterns/platform-retrenchment-without-customer-abandonment — when a product bet doesn't pay off, scale back the forward investment ("a v2 of the product, you'll probably be waiting awhile") while keeping existing customers whole ("don't freak out; we're not getting rid of them"). New pattern page.

Caveats

  • Retrospective, not architecture paper. The substantive disclosures (hypervisor choice, dedicated-hardware pool, security assessments, driver integration failure) are stated, not walked end-to-end. No code / FSM / boot-sequence description; no comparative benchmark numbers.
  • Nvidia-driver failure not diagnosed. The post says the Cloud-Hypervisor-to-virtualized-Nvidia-GPU integration didn't work, but doesn't say at what layer it failed (driver probe? memory mapping? mmio passthrough?).
  • No quantitative data on the GPU-Machines business. No revenue, utilisation, customer count, or SKU-mix (L40S vs A10 vs A100) breakdown. "Not a hit" is the only level of specificity.
  • The insurgent-cloud-vs-LLM-API thesis is argued, not proven. It's unclear which workloads it holds for; Fly.io concedes that some inference customers for whom locality does matter still exist.
  • Hex-editing Nvidia drivers is named but not characterised — no surface, no CVE-ish shape, no portability claim. It's anecdotal evidence of how far Fly pushed, not a reusable technique.
  • Product framing. "This makes us sad because we really like the point in the solution space we found." The post reads as course-correction-with-regret, not neutral post-mortem. Treat the architectural disclosures as load-bearing and the framing as editorial.
  • "We were wrong" analogy to JS edge functions is invoked but not elaborated; the JS-edge-runtime history is not on the wiki yet.

Source

  • sources/2025-02-14-flyio-we-were-wrong-about-gpus