CONCEPT
GPU as hostile peripheral¶
Definition¶
The security-posture framing that treats a consumer/enterprise GPU as the worst-case hardware peripheral for a multi-tenant platform. The canonical wiki statement (Fly.io, 2025-02-14):
GPUs [terrified our security team]. A GPU is just about the worst case hardware peripheral: intense multi-directional direct memory transfers … with arbitrary, end-user controlled computation, all operating outside our normal security boundary.
("Multi-directional" is deliberate — not merely bidirectional: in common configurations, GPUs also talk directly to each other.)
The framing is a deliberate generalisation: the GPU is instance zero, but the same argument applies to any DMA-capable PCI-passthrough accelerator (FPGA, TPU, custom NIC, cryptographic HSM) with an Internet-exposed driver/firmware surface.
The four properties that make a GPU hostile¶
- DMA-capable. The peripheral reads/writes host memory directly. An exploit in the driver or firmware translates to arbitrary kernel memory access — i.e. escape from the VM.
- End-user-controlled computation. The tenant ships arbitrary CUDA / shader code to the device. Unlike, say, a NIC (where the firmware interprets packets from a limited protocol) or a disk controller (where the firmware interprets SCSI / NVMe commands), the tenant's compute runs on the peripheral.
- Multi-directional peripheral-to-peripheral I/O. GPUs talk to each other over NVLink/NVSwitch/PCIe P2P on the same host. A compromised device can attack neighbouring GPUs on that host, widening the blast radius beyond the per-VM boundary.
- Closed-source, rapidly-evolving driver/firmware surface. Nvidia's driver stack is proprietary, large, and ships new features (and CVEs) on a regular cadence. The platform operator can't audit the source.
Google Project Zero's 2020 "Attacking the Qualcomm Adreno GPU" series is the explicit reference Fly.io links — evidence that GPU-driver exploits are a well-documented class.
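The blast-radius point in the list above can be made concrete with a toy model: treat each GPU on a host as a graph node and each peer-to-peer link as an edge; a compromised device can then reach every GPU in its connected component. A minimal sketch (topologies and names are hypothetical, not from the Fly.io post):

```python
# Toy model: blast radius of a compromised GPU on one host.
# Each GPU is a node; each peer-to-peer link (NVLink / PCIe P2P) is an edge.
# A compromised device can attack every GPU reachable over those links.

def blast_radius(links: dict[str, set[str]], compromised: str) -> set[str]:
    """Return all GPUs reachable from `compromised` via P2P links."""
    seen, stack = set(), [compromised]
    while stack:
        gpu = stack.pop()
        if gpu in seen:
            continue
        seen.add(gpu)
        stack.extend(links.get(gpu, ()))
    return seen

# Without P2P links, a compromise stays confined to one device (per-VM boundary).
isolated = {f"gpu{i}": set() for i in range(8)}
assert blast_radius(isolated, "gpu0") == {"gpu0"}

# With an NVSwitch-style all-to-all fabric, one exploit reaches every GPU on the host.
fabric = {f"gpu{i}": {f"gpu{j}" for j in range(8) if j != i} for i in range(8)}
assert len(blast_radius(fabric, "gpu0")) == 8
```

This is why the mitigations below reason per-host rather than per-VM: the P2P fabric makes the host, not the VM, the effective isolation unit.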
Mitigation patterns¶
Fly.io's posture on top of Intel Cloud Hypervisor micro-VMs:
- Dedicated host pool. GPU workers do not run non-GPU tenants. "Fly Machines [not assigned a GPU] weren't mixed" on the GPU workers. Reduces the blast radius to "GPU tenants on this host" rather than "all tenants on this host". See patterns/dedicated-host-pool-for-hostile-peripheral.
- PCI-BDF-bounded placement. The only reason a Fly Machine lands on a GPU worker is that it needs a PCI BDF (bus/device/function address) for a GPU — and there's a bounded number of those per host, so a machine gets scheduled there only when it demands the resource, and only while a BDF is free.
- Independent security assessments. Fly funded two external audits (Atredis, Tetrel) — both expensive, both time-consuming. See patterns/independent-security-assessment-for-hardware-peripheral.
- Micro-VM isolation boundary. Each GPU tenant gets its own Cloud Hypervisor VM; no container-level tenancy on GPU hosts.
Trade-offs¶
- Utilisation cost is direct. "Those GPU servers were drastically less utilised and thus less cost-effective than our ordinary servers." Dedicated hardware and PCI-BDF bounding together give you worse packing.
- Audit cost is direct. Two independent audits are five-to-six-figure engagements each. "They were not cheap, and they took time."
- Indirect cost via driver integration. The security-aware path (micro-VM + PCI passthrough) puts the platform off Nvidia's driver happy path. Fly.io spent months trying (and ultimately failing) to get virtualized-GPU drivers working on Cloud Hypervisor. The happy-path alternative (K8s with a shared kernel) weakens the isolation posture — different tenants share a Linux kernel, which a GPU exploit could pivot through.
- Thin-slicing becomes hard. NVIDIA MIG / vGPU — the fractional-GPU surface — wants driver-level cooperation that's absent on a micro-VM hypervisor. The security-first path forecloses the thin-slice market segment.
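The utilisation trade-off in the list above is plain arithmetic: an ordinary worker packs tenants until CPU/RAM run out, while a dedicated GPU worker is additionally capped at one tenant per BDF and admits no non-GPU tenants to fill the slack. A sketch with hypothetical numbers (not Fly.io's actual fleet figures):

```python
# Hypothetical numbers: why a dedicated GPU pool packs worse than a general pool.

def tenants_on_ordinary_host(host_cores: int, cores_per_tenant: int) -> int:
    """An ordinary worker packs tenants until compute runs out."""
    return host_cores // cores_per_tenant

def tenants_on_gpu_host(host_cores: int, cores_per_tenant: int, gpu_bdfs: int) -> int:
    """A GPU worker is also capped at one tenant per passthrough BDF."""
    return min(host_cores // cores_per_tenant, gpu_bdfs)

cores, per_tenant = 64, 2
ordinary = tenants_on_ordinary_host(cores, per_tenant)    # 32 tenants
gpu = tenants_on_gpu_host(cores, per_tenant, gpu_bdfs=8)  # 8 tenants
assert (ordinary, gpu) == (32, 8)
# The 24-tenant gap is idle capacity the operator still pays for —
# the "drastically less utilised" cost quoted above.
```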
Implications¶
- "Just pass through the GPU" is not a small security decision. If the platform is seriously multi-tenant, the decision pulls in dedicated hardware, external audits, months of driver-integration work, and a probably-inaccessible thin-slicing market segment.
- A shared-kernel K8s GPU cluster trades isolation for driver-compatibility. Many cloud GPU offerings work because they accept this trade. Fly.io's posture is that the trade is wrong for a platform whose customers come for micro-VM isolation.
- Serious-AI workloads may not care. Customers running single-tenant training clusters aren't paying for multi-tenant isolation. They buy bare-metal or dedicated hosts anyway.
Caveats¶
- Nvidia's driver path is not the only hostile surface. The GPU's firmware is a separate attack surface that the platform operator also can't audit. Reset / clean-up of GPU state between tenants is its own problem (not elaborated in the Fly.io post).
- Peripheral-to-peripheral paths matter beyond the one VM. The NVLink / PCIe-P2P direction is what makes a per-host-isolation posture necessary — without it, a per-VM posture would be enough.
- Mitigation is not elimination. Dedicated hardware + audits reduce risk; they don't remove the fundamental exposure of a DMA-capable peripheral running tenant code.
Seen in (wiki)¶
- sources/2025-02-14-flyio-we-were-wrong-about-gpus — Fly.io's canonical "worst case hardware peripheral" framing; enumerates the mitigations and their costs.
Related¶
- concepts/nvidia-driver-happy-path — the trade-off on the other axis: fast-path driver support vs isolation posture.
- concepts/micro-vm-isolation — the isolation primitive that the hostile-peripheral posture builds on.
- systems/firecracker — without PCI passthrough; safer but can't run GPUs.
- systems/intel-cloud-hypervisor — with PCI passthrough; enables GPU integration at the cost of Nvidia-driver off-path.
- systems/nvidia-mig — the thin-slicing primitive the hostile-peripheral posture forecloses.
- patterns/dedicated-host-pool-for-hostile-peripheral — the hardware-layout pattern.
- patterns/independent-security-assessment-for-hardware-peripheral — the process pattern.
- companies/flyio — canonical wiki source.