
CONCEPT

GPU as hostile peripheral

Definition

The security-posture framing that treats a consumer/enterprise GPU as the worst-case hardware peripheral for a multi-tenant platform. The canonical wiki statement (Fly.io, 2025-02-14):

GPUs [terrified our security team]. A GPU is just about the worst case hardware peripheral: intense multi-directional direct memory transfers … with arbitrary, end-user controlled computation, all operating outside our normal security boundary.

(Not merely bidirectional, either: in common configurations, GPUs also talk directly to each other.)

(Source: sources/2025-02-14-flyio-we-were-wrong-about-gpus)

The framing is a deliberate generalisation: GPU is instance-zero, but the same argument applies to any PCI-passthrough accelerator (FPGA, TPU, custom NIC, cryptographic HSM) with DMA capability and an Internet-exposed driver / firmware surface.

The four properties that make a GPU hostile

  1. DMA-capable. The peripheral reads/writes host memory directly. An exploit in the driver or firmware translates to arbitrary kernel memory access — i.e. escape from the VM.
  2. End-user-controlled computation. The tenant ships arbitrary CUDA / shader code to the device. Unlike, say, a NIC (where the firmware interprets packets from a limited protocol) or a disk controller (where the firmware interprets SCSI / NVMe commands), the tenant's compute runs on the peripheral.
  3. Multi-directional peripheral-to-peripheral I/O. GPUs talk to each other over NVLink, NVSwitch, or PCIe P2P on the same host. A compromised device can attack neighbouring GPUs, widening the blast radius beyond the per-VM boundary.
  4. Closed-source, rapidly-evolving driver/firmware surface. Nvidia's driver stack is proprietary, large, and ships new features (and CVEs) on a regular cadence. The platform operator can't audit the source.
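Property 1 is what VFIO-style passthrough leans on the IOMMU for: with the IOMMU on, a device can only DMA through its group's mappings, and a whole IOMMU group must be handed to a guest (or withheld) as a unit. A minimal sketch of enumerating that group, assuming standard Linux sysfs paths (the BDF below is a placeholder, not a real device):

```python
#!/usr/bin/env python3
"""Sketch: enumerate the IOMMU group for a PCI device (e.g. a GPU).

The IOMMU is what stands between a DMA-capable peripheral and host
memory; VFIO passthrough hands an entire IOMMU group to one guest.
"""
from pathlib import Path


def iommu_group_members(bdf: str) -> list[str]:
    """Return all PCI BDFs sharing an IOMMU group with `bdf`.

    Every device in the group can effectively DMA on behalf of the
    others, so the group is the real unit of isolation, not the device.
    """
    group_link = Path(f"/sys/bus/pci/devices/{bdf}/iommu_group")
    if not group_link.exists():
        raise RuntimeError(f"no IOMMU group for {bdf}: IOMMU off or bad BDF")
    group = group_link.resolve()  # e.g. /sys/kernel/iommu_groups/42
    return sorted(p.name for p in (group / "devices").iterdir())


if __name__ == "__main__":
    # Placeholder BDF; a real GPU worker would enumerate its GPUs here.
    try:
        print(iommu_group_members("0000:3b:00.0"))
    except RuntimeError as e:
        print(e)
```

If a GPU shares its group with, say, a PCIe switch or another function, the operator must pass through (or quarantine) the lot, which is part of why per-host topology matters for the posture below.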

Google Project Zero's 2020 "Attacking the Qualcomm Adreno GPU" series is the explicit reference Fly.io links — a class of well-documented GPU-driver exploits.

Mitigation patterns

Fly.io's posture on top of Intel's Cloud Hypervisor micro-VMs:

  • Dedicated host pool. GPU workers do not run non-GPU tenants. "Fly Machines [not assigned a GPU] weren't mixed" on the GPU workers. Reduces the blast radius to "GPU tenants on this host" rather than "all tenants on this host". See patterns/dedicated-host-pool-for-hostile-peripheral.
  • PCI-BDF-bounded placement. A Fly Machine lands on a GPU worker only because it needs a PCI BDF for a GPU; each host has a bounded number of BDFs, so machines are scheduled there only when they demand the resource.
  • Independent security assessments. Fly funded two external audits (Atredis, Tetrel) — both expensive, both time-consuming. See patterns/independent-security-assessment-for-hardware-peripheral.
  • Micro-VM isolation boundary. Each GPU tenant gets its own Cloud Hypervisor VM; no container-level tenancy on GPU hosts.

Trade-offs

  • Utilisation cost is direct. "Those GPU servers were drastically less utilised and thus less cost-effective than our ordinary servers." Dedicated hardware and PCI-BDF bounding together give you worse packing.
  • Audit cost is direct. Two independent audits are five-to-six-figure engagements each. "They were not cheap, and they took time."
  • Indirect cost via driver integration. The security-aware path (micro-VM + PCI passthrough) puts the platform off Nvidia's driver happy path. Fly.io spent months (and ultimately failed) to get virtualized-GPU drivers working on Cloud Hypervisor. The happy-path alternative (K8s with shared kernel) weakens the isolation posture — different tenants share a Linux kernel, which a GPU exploit could pivot through.
  • Thin-slicing becomes hard. NVIDIA MIG / vGPU — the fractional-GPU surface — wants driver-level cooperation that's absent on a micro-VM hypervisor. The security-first path forecloses the thin-slice market segment.

Implications

  • "Just pass through the GPU" is not a small security decision. If the platform is seriously multi-tenant, the decision pulls in dedicated hardware, external audits, months of driver-integration work, and a probably-inaccessible thin-slicing market segment.
  • A shared-kernel K8s GPU cluster trades isolation for driver-compatibility. Many cloud GPU offerings work because they accept this trade. Fly.io's posture is that the trade is wrong for a platform whose customers come for micro-VM isolation.
  • Serious-AI workloads may not care. Customers running single-tenant training clusters aren't paying for multi-tenant isolation. They buy bare-metal or dedicated hosts anyway.

Caveats

  • Nvidia's driver path is not the only hostile surface. The GPU's firmware is a separate attack surface that the platform operator also can't audit. Reset / clean-up of GPU state between tenants is its own problem (not elaborated in the Fly.io post).
  • Peripheral-to-peripheral paths matter beyond the one VM. The NVLink / PCIe-P2P direction is what makes a per-host-isolation posture necessary — without it, a per-VM posture would be enough.
  • Mitigation is not elimination. Dedicated hardware + audits reduce risk; they don't remove the fundamental exposure of a DMA-capable peripheral running tenant code.

