Independent security assessment for hardware peripheral

Pattern

Before productising a new class of hardware peripheral on a multi-tenant platform, commission one or more independent external security assessments — from specialist firms, not from the vendor of the peripheral or of the hypervisor. Accept the cost (five-to-six figures per engagement, months of elapsed time) as a fixed cost of productisation, not as an optional extra. Use the assessments' findings to shape the productisation scope, the isolation boundary, and the dedicated-hardware layout.

Canonical instance: Fly.io × Atredis + Tetrel

Fly.io, 2025-02-14:

We funded two very large security assessments, from Atredis and Tetrel, to evaluate our GPU deployment. Matt Braun is writing up those assessments now. They were not cheap, and they took time.

(Source: sources/2025-02-14-flyio-we-were-wrong-about-gpus)

Two independent firms, parallel engagements. Fly.io's internal security team and both external firms converged on the same productisation posture — dedicated host pool + Cloud Hypervisor micro-VM + PCI passthrough — which is the specific shape the assessments cleared as productisable.
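The cleared shape is concrete enough to sketch. A minimal host-side configuration for PCI passthrough into a Cloud Hypervisor micro-VM might look like the following; the PCI address, vendor/device IDs, and image paths are illustrative, not Fly.io's actual layout:

```shell
# Detach the GPU from any host driver so the host kernel never
# runs vendor driver code against it (address is illustrative).
echo "0000:41:00.0" > /sys/bus/pci/devices/0000:41:00.0/driver/unbind

# Hand the device to vfio-pci (vendor/device IDs are illustrative).
echo "10de 20b5" > /sys/bus/pci/drivers/vfio-pci/new_id

# Boot a Cloud Hypervisor micro-VM with the whole GPU passed through;
# --device path=<sysfs path> is Cloud Hypervisor's VFIO passthrough flag.
cloud-hypervisor \
    --kernel ./vmlinux \
    --disk path=./rootfs.img \
    --cpus boot=8 \
    --memory size=32G \
    --device path=/sys/bus/pci/devices/0000:41:00.0/
```

Passing through the whole function (rather than a vGPU slice) keeps the audit target small: one device, one VM, one IOMMU boundary.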

When to use

  • New class of hardware peripheral being attached to a multi-tenant platform. GPU, FPGA, DPU, custom NIC, HSM, peripheral with a proprietary driver, peripheral with firmware you can't audit.
  • Peripheral has DMA or tenant-controlled compute. See concepts/gpu-as-hostile-peripheral. These are the properties that make the peripheral worth an independent audit.
  • Isolation posture is a product claim. If the platform sells per-VM / per-tenant isolation, an exploit via a peripheral is a product-integrity incident, not just a security bug.
  • In-house security expertise doesn't cover the peripheral's domain. Most internal security teams don't have deep Nvidia driver / CUDA / IOMMU / SR-IOV / vGPU expertise. External firms that have seen multiple GPU integrations bring pattern-recognition the internal team doesn't have.
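The DMA property above is checkable on the host: a peripheral's DMA reach is bounded by its IOMMU group, so one early input to any such audit is whether the device sits alone in its group. A small diagnostic sketch (standard Linux sysfs paths; output depends on the machine):

```shell
# Walk every IOMMU group and print its member devices. A DMA-capable
# peripheral handed to tenants should ideally sit in a group by itself;
# group-mates share an isolation boundary.
for dev in /sys/kernel/iommu_groups/*/devices/*; do
    [ -e "$dev" ] || continue   # host has no IOMMU groups exposed
    group=$(basename "$(dirname "$(dirname "$dev")")")
    printf 'group %s: %s\n' "$group" "$(basename "$dev")"
done
```

A device that shares a group with, say, a PCI bridge or a sibling function cannot be passed through in isolation, which feeds directly into the dedicated-hardware layout decision.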

When not to use

  • Well-understood / vendor-audited peripherals. A standard SATA controller or Intel NIC on hardware you've shipped before.
  • Single-tenant deployments. No cross-tenant blast radius to assess against.
  • Too-early product stage. If the peripheral integration doesn't have a productisable shape yet, an audit is premature — audit the shape once it exists.

Structural parts

  • Two independent firms in parallel. Fly.io's specific choice. Different teams find different classes of issues; convergent findings across two firms raise confidence; divergent findings surface gaps.
  • Productisation-blocking scope. The assessments are gating the GA launch, not confirming a launched product.
  • Public acknowledgment. Fly.io discloses the firm names and the commitment to publish writeups. Externalising the audit turns it into a reputational asset (customers trust the product; other operators learn from the writeups).
  • Dedicated hardware + micro-VM isolation + PCI passthrough as the audit target. The audit assesses a specific shape, not GPUs-in-general. Scoping is load-bearing.
  • Cost accepted as a line item. Fly.io frames it explicitly: "They were not cheap, and they took time." The audits weren't a contingency; they were a planned cost.

Trade-offs

| Axis | Cost | Benefit |
|---|---|---|
| Calendar | Months of elapsed time | Higher confidence at GA |
| Budget | Five-to-six figures per engagement | Third-party attestation that the shape is defensible |
| Scope | Limited to a specific productisation shape | Findings feed directly into the productisation decision |
| Organisational | Internal security team must brief and run the engagements | Internal team builds domain expertise from the external-firm collaboration |
| Disclosure | Public writeups require editorial work | Reputational uplift for the platform |

Architectural neighbours

  • patterns/dedicated-host-pool-for-hostile-peripheral — the hardware-layout pattern the audits scope. Bin-packing choices, isolation boundary, peripheral-to-peripheral mitigation are all audit-input + audit-output.
  • Bug bounties — a different shape. Audits scope a productisation cycle; bounties continuously probe the shipped product. The two stack — audits at GA, bounties ongoing.
  • Compliance / certification programmes (SOC2, FedRAMP, ISO 27001) — different shape. Compliance programmes assess process; this pattern assesses a specific technical integration.

Caveats

  • Clean-audit is not clean-for-all-time. Nvidia driver updates, peripheral-firmware updates, new CVEs, and scale expansion all shift the surface. A single audit clears a specific point-in-time shape, not an ongoing one.
  • Expensive at small scale. An early-stage startup shipping a peripheral product may not be able to afford two parallel five-to-six-figure engagements. Fly.io could; most can't.
  • Firm selection matters. The audit is only as good as the firm. Atredis and Tetrel are both reputable hardware- and low-level-security specialists; a generic pen-test firm might miss GPU-domain classes of issues.
  • "Independent" is not binary. If two firms staff from the same small expert community, the "independence" is narrower than it looks. Some divergence in methodology / toolchain / team composition is part of what makes two-firms-parallel meaningful.
  • Audit cost as a fraction of product spend must make sense. Fly.io's overall GPU-product engineering burn was large; the audit fraction was a sensible slice of it. For a product with a small engineering burn, the same audit cost could be disproportionate.

Known uses

  • Fly.io × Atredis + Tetrel for GPU Fly Machines (canonical wiki instance).
  • The pattern is reportedly common at hyperscalers for new accelerator productisations (Graviton, Inferentia, Trainium, TPU, etc.) — but specific disclosures are rare. Fly.io's public naming of the firms is unusual.
