
Dedicated host pool for hostile peripheral

Pattern

Segregate worker hardware so that VMs using a risky hardware peripheral (GPU, FPGA, DPU, custom NIC, HSM) never share a physical host with VMs that don't use that peripheral. The peripheral is treated as a blast-radius expander: an exploit in the peripheral's driver or firmware can escape the per-VM boundary and reach co-resident tenants on the same host. Dedicating a pool of hosts to peripheral-using tenants bounds that blast radius to other tenants of the same peripheral, accepting worse bin-packing and lower utilisation in exchange for a cleaner isolation story.

Canonical instance: Fly.io GPU Machines

Fly.io, 2025-02-14:

We did a couple expensive things to mitigate the risk. We shipped GPUs on dedicated server hardware, so that GPU- and non-GPU workloads weren't mixed. Because of that, the only reason for a Fly Machine to be scheduled on a GPU machine was that it needed a PCI BDF for an Nvidia GPU, and there's a limited number of those available on any box. Those GPU servers were drastically less utilized and thus less cost-effective than our ordinary servers.

(Source: sources/2025-02-14-flyio-we-were-wrong-about-gpus)

Every Fly GPU Machine runs on a GPU-only worker host under Cloud Hypervisor. The scheduler lands a Machine on a GPU worker only if it requests a PCI BDF for a GPU. General-compute Machines run on Firecracker-based workers, which never share hardware with GPU hosts.

When to use

  • Multi-tenant platform + risky hardware peripheral. DMA-capable peripherals with proprietary drivers or firmware (GPU, FPGA, custom NIC, DPU) whose exploit surface the platform operator can't fully audit.
  • Per-VM isolation posture is a product claim. The platform promises isolation; regressing to shared-kernel-for-the-peripheral-class contradicts the claim.
  • The peripheral has peripheral-to-peripheral I/O paths. NVLink / PCIe-P2P / other accelerator-to-accelerator fabrics mean a compromised device can attack neighbours; per-host isolation matters, not just per-VM.
  • Security-posture audit is part of the lifecycle. Regulated industries (FedRAMP / IL-tiers, HIPAA), customers with strong-isolation SLAs, or insurance-driven security postures.

When not to use

  • Single-tenant clusters (HPC, bare-metal research). No co-resident tenants to isolate from.
  • Shared-kernel K8s GPU clusters where the cloud vendor has already accepted the shared-kernel trade-off — the pattern isn't applicable; the vendor has chosen a different isolation posture.
  • Peripherals without DMA or without tenant-controlled compute. A plain NIC with firmware that only interprets packets isn't a hostile peripheral in the same sense.

Structural parts

  • Worker-class labels. The platform's scheduler knows which hosts are GPU-enabled. Fly.io's scheduler respects this at placement time.
  • Peripheral-bounded placement. A VM that doesn't request a peripheral is never placed on a peripheral host. A VM that does is only placed on one. The PCI BDF count per host bounds concurrency.
  • Workload-class-separated billing. GPU workers cost more per hour; customers pay the premium when they claim a GPU.
  • Independent security assessment per peripheral class — see patterns/independent-security-assessment-for-hardware-peripheral, the companion process-level pattern.
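The first two structural parts (worker-class labels and peripheral-bounded placement) can be sketched as a scheduler predicate. This is a hypothetical illustration, not Fly.io's actual scheduler code; the names (`Host`, `Machine`, `free_bdfs`, `wants_gpu`) are invented for the sketch.

```python
from dataclasses import dataclass


@dataclass
class Host:
    worker_class: str    # "gpu" or "general" -- the worker-class label
    free_bdfs: int = 0   # unclaimed GPU PCI BDFs left on this host


@dataclass
class Machine:
    wants_gpu: bool      # did the VM request a GPU PCI BDF?


def eligible(host: Host, machine: Machine) -> bool:
    """Peripheral-bounded placement: a Machine lands on a GPU worker
    iff it requests a BDF and one is free; a non-GPU Machine is never
    placed on a GPU host. The BDF count bounds per-host concurrency."""
    if machine.wants_gpu:
        return host.worker_class == "gpu" and host.free_bdfs > 0
    return host.worker_class == "general"
```

The predicate is deliberately symmetric: the GPU pool is closed in both directions, which is what keeps non-peripheral tenants off the peripheral's exploit surface.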

Trade-offs

| Axis | Cost | Benefit |
|---|---|---|
| Bin-packing | Worse: GPU host is empty when not fully claimed | Non-GPU tenants are never co-resident with GPU exploit surface |
| Utilisation | Lower: Fly.io's "drastically less utilized" | Bounded blast radius |
| Capex | Higher per effective vCPU | Isolation posture is clean |
| Placement complexity | Scheduler carries workload-class labels | Peripheral class doesn't affect non-peripheral tenants |
| Upgrade cadence | Independent per peripheral class | Peripheral-driver-version churn doesn't destabilise the general-compute fleet |

Known uses

  • Fly.io GPU Machines (canonical wiki instance). GPU-only workers on Cloud Hypervisor; non-GPU workers on Firecracker. 2025-02-14 retrospective disclosed the utilisation cost.
  • AWS P-instances / G-instances — separate instance families from the general-purpose M-class; hardware segregation below the surface is widely assumed but not directly disclosed.
  • Hyperscaler "bare metal" GPU tiers (AWS EC2 Bare Metal, GCP A3-Ultra) — the extreme form of the pattern: single tenant per host, no bin-packing at all.

Architectural neighbours

  • patterns/minimize-vm-permissions — Figma's Lambda sandboxing approach. Same isolation-by-design logic at a different boundary: minimise what the VM can reach, rather than segregate where the VM runs. Composable.
  • concepts/micro-vm-isolation — the per-VM isolation primitive this pattern sits on top of. Hostile-peripheral dedicated-host-pool is the host-level pattern; micro-VM is the VM-level pattern; capability-sandbox is the runtime-level pattern.
  • concepts/gpu-as-hostile-peripheral — the framing this pattern operationalises.

Caveats

  • Doesn't eliminate risk — only bounds it. A GPU-to-GPU exploit on a dedicated GPU host still affects other tenants of that host.
  • Reset/scrub between tenants is a separate problem. GPU state (VRAM, driver state) needs to be cleaned between VMs on the same host; this pattern doesn't solve that.
  • Utilisation cost scales with peripheral density. Few PCI BDFs per host = low density = worse utilisation. More PCI BDFs per host = higher density = better utilisation but more tenants sharing the peripheral-to-peripheral fabric.
  • The pattern bills through. Customers pay the premium for dedicated-host-pool utilisation. For price-sensitive workloads (see concepts/developers-want-llms-not-gpus), this can make the platform uncompetitive vs hyperscalers that have absorbed the utilisation cost at scale.
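The density caveat above can be made concrete with back-of-the-envelope arithmetic: a dedicated host bills its full cost whether or not every BDF is claimed, so effective cost per claimed GPU is whole-host cost amortised over claimed BDFs. All numbers below are invented assumptions, not Fly.io or vendor figures.

```python
def effective_cost_per_gpu_hour(host_cost_per_hour: float,
                                bdfs_per_host: int,
                                claimed_bdfs: int) -> float:
    """Whole-host hourly cost amortised over the BDFs actually claimed.
    Unclaimed BDFs still cost money; that is the utilisation penalty."""
    if not 0 < claimed_bdfs <= bdfs_per_host:
        raise ValueError("claimed BDFs must be between 1 and the host's BDF count")
    return host_cost_per_hour / claimed_bdfs


# Hypothetical 8-BDF host at $20/hour:
# fully packed  -> $2.50 per claimed GPU-hour
# half packed   -> $5.00 per claimed GPU-hour
```

The same arithmetic shows the tension in the density caveat: more BDFs per host improves the best-case amortisation but widens the set of tenants sharing the peripheral-to-peripheral fabric.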
