
Co-located inference GPU and object storage

Pattern

Run inference GPUs and the object storage that holds their model weights / datasets on the same platform, in the same regions, over the platform's internal fabric — rather than inference on one cloud (e.g. a specialist GPU provider) and object storage on another (e.g. S3 behind CloudFront).

The pattern eliminates two cost + latency boundaries at once:

  1. GPU-instance surcharges on hyperscalers that run GPU VMs but don't specialise in them.
  2. Egress fees paid every time an inference job pulls model weights / datasets cross-cloud.

Combined with Anycast ingress, all three locality axes — GPU + object storage + end-user network — collapse onto one platform.

When to use

  • Transaction-shaped inference (per-HTTP-request model invocations). Sensitivity to WAN round-trips + egress per request is high. (inference-vs-training workload shape)
  • Globally-distributed inference where requests arrive from every geography and the objective is to land the request, the model weights, and the answer all inside one region.
  • Cost-sensitive inference workloads where hyperscaler GPU + egress combined price exceeds the platform's flat rate.
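The cost-sensitivity bullet reduces to a small break-even calculation. A minimal sketch: the $1.25/hr L40S rate is quoted later on this page, while the hyperscaler GPU rate, the request volume, and the ~$0.09/GB egress price are illustrative assumptions, not quotes.

```python
# Break-even arithmetic for "cost-sensitive inference workloads".
# Only the $1.25/hr L40S rate comes from this page; all other
# numbers are illustrative assumptions.

def per_request_cost(gpu_hourly: float, req_per_hour: float,
                     egress_gb_per_req: float = 0.0,
                     egress_per_gb: float = 0.09) -> float:
    """GPU time amortised per request, plus any cross-cloud egress."""
    return gpu_hourly / req_per_hour + egress_gb_per_req * egress_per_gb

# Co-located: flat GPU-hour, no egress line item.
colocated = per_request_cost(1.25, req_per_hour=3600)

# Separated: assumed $2/hr hyperscaler GPU + 10 MB of cross-cloud
# transfer per request (weights/dataset reads amortised per call).
separated = per_request_cost(2.00, req_per_hour=3600,
                             egress_gb_per_req=0.01)
```

At these assumed numbers the egress term alone ($0.0009/request) dwarfs the amortised GPU time, which is the point of the pattern: the per-GB line item scales with traffic while the flat rate does not.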

When not to use

  • Training workloads. Batch-shaped; egress + WAN round-trips amortise over the job, so they are better served by hyperscaler GPU capacity plus cross-cloud dataset replication (see patterns/cross-cloud-replica-cache).
  • Ultra-large-model inference requiring H100 SXM fleets / NVLink-ganged topologies the platform doesn't provide.
  • Workloads that require the hyperscaler's adjacent managed services (e.g. AWS Bedrock, proprietary model APIs) — moving compute away forces an identity / network boundary recrossing.

Canonical instance: Fly.io × Tigris

Fly.io explicitly describes this as "pretty killer":

  • GPU compute: A10 / L40S / A100 / H100 attached to Fly Machines via whole-GPU passthrough.
  • Object storage: Tigris, the Fly.io-native S3-compatible object store whose byte plane is cached on the same Fly.io NVMe volumes that GPU Machines run on.
  • Network: Fly.io's Anycast edge.
  • Billing: one invoice; no per-request egress line item at steady state.

Fly.io's 2024-08-15 L40S price cut to $1.25/hr is the market move that makes this pattern economically the default on the platform, not a power-user configuration. (Source: sources/2024-08-15-flyio-were-cutting-l40s-prices-in-half)

Structural parts

  • GPU compute primitive — Fly Machine. Whole-GPU passthrough per instance (fractional-GPU path via MIG / vGPU was tried and abandoned on Fly.io — the whole-GPU path is what works.)
  • Object storage primitive — Tigris. Regional-first; metadata on FoundationDB; byte cache on Fly.io NVMe; demand-driven replication via a QuiCK-style queue; pluggable S3 origin.
  • Network primitive — Anycast at the Fly edge; private WireGuard mesh (6PN) between Machines.
  • Integration glue — fly storage create as a single command that injects S3-compat secrets so application code speaks to Tigris identically to how it would speak to S3.
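The integration-glue point can be sketched as follows. A hedged sketch assuming the standard AWS-style environment variable names (AWS_ENDPOINT_URL_S3, AWS_ACCESS_KEY_ID, etc.); the exact names injected by fly storage create should be checked against the Tigris docs. Application code passes these kwargs to its usual S3 client and talks to Tigris exactly as it would to S3.

```python
# Sketch of the "integration glue": after `fly storage create`, Machines
# see ordinary AWS-style env vars, so any S3-compatible client works
# unchanged. Env var names here are assumptions, not verified output.
import os

def tigris_client_kwargs() -> dict:
    """Keyword args for an S3-compatible client, e.g.
    boto3.client("s3", **tigris_client_kwargs()).
    Nothing Tigris-specific beyond the endpoint URL."""
    return {
        "endpoint_url": os.environ.get("AWS_ENDPOINT_URL_S3",
                                       "https://fly.storage.tigris.dev"),
        "aws_access_key_id": os.environ.get("AWS_ACCESS_KEY_ID", ""),
        "aws_secret_access_key": os.environ.get("AWS_SECRET_ACCESS_KEY", ""),
        "region_name": os.environ.get("AWS_REGION", "auto"),
    }
```

The design choice worth noting: because the only delta from stock S3 is the endpoint URL, the same code runs against AWS S3 by simply not setting the endpoint override.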

Trade-offs vs separation

| Axis | Co-location (this pattern) | Separation (hyperscaler GPU + remote object store) |
| --- | --- | --- |
| Per-request cost | Flat GPU-hour | GPU-hour + per-GB egress |
| Cold-start weight fetch | Regional NVMe read | Cross-cloud fetch |
| Failure-isolation blast radius | One platform | Two platforms |
| Supply flexibility | Bounded to the platform's GPU fleet | Any hyperscaler's inventory |
| Managed-service access | Platform-local only | Full hyperscaler catalogue |
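The cold-start row is the one that bites first in practice. A hedged back-of-envelope, with every number an illustrative assumption (local NVMe reads in the GB/s range versus a cross-cloud fetch in the tens of MB/s):

```python
# Back-of-envelope for the "cold-start weight fetch" row.
# All numbers are illustrative assumptions, not platform measurements.

def fetch_seconds(weights_gb: float, throughput_gb_per_s: float) -> float:
    """Time to pull model weights at a sustained throughput."""
    return weights_gb / throughput_gb_per_s

WEIGHTS_GB = 14.0                        # e.g. a 7B model in fp16 (assumption)
nvme = fetch_seconds(WEIGHTS_GB, 2.0)    # regional NVMe read at ~2 GB/s
wan = fetch_seconds(WEIGHTS_GB, 0.05)    # cross-cloud fetch at ~50 MB/s
# nvme: 7.0 s of cold start; wan: ~280 s — and the WAN path also
# accrues per-GB egress on every cold start, per the cost row above.
```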

Known uses

  • Fly.io × Tigris — the canonical wiki instance. Anchored by the 2024-02-15 Tigris public-beta post and the 2024-08-15 L40S price-cut post.
  • Shape-adjacent: hyperscaler bundles such as AWS EC2 G-instances + S3 same-region + CloudFront; Azure AKS + Blob + Front Door; GCP GKE + GCS + Cloud CDN. These realise the same pattern in principle but bill the three axes separately and do not publish an "inference locality" narrative around them.