PATTERN Cited by 2 sources
Co-located inference GPU and object storage¶
Pattern¶
Run inference GPUs and the object storage that holds their model weights / datasets on the same platform, in the same regions, over the platform's internal fabric — rather than inference on one cloud (e.g. a specialist GPU provider) and object storage on another (e.g. S3 behind CloudFront).
The pattern eliminates two cost + latency boundaries at once:
- GPU-instance surcharges on hyperscalers that run GPU VMs but don't specialise in them.
- Egress fees paid every time an inference job pulls model weights / datasets cross-cloud.
Combined with Anycast ingress, all three locality axes — GPU + object storage + end-user network — collapse onto one platform.
When to use¶
- Transaction-shaped inference (per-HTTP-request model invocations), where sensitivity to WAN round-trips and egress on every request is high (see concepts/inference-vs-training-workload-shape).
- Globally-distributed inference where requests arrive from every geography and the objective is to land the request, the model weights, and the answer all inside one region.
- Cost-sensitive inference workloads where hyperscaler GPU + egress combined price exceeds the platform's flat rate.
When not to use¶
- Training workloads. Batch-shaped: egress and WAN round-trips amortise over the job, so they are better served by hyperscaler GPU capacity plus cross-cloud dataset replication (see patterns/cross-cloud-replica-cache).
- Ultra-large-model inference requiring H100 SXM fleets / NVLink-ganged topologies the platform doesn't provide.
- Workloads that require the hyperscaler's adjacent managed services (e.g. AWS Bedrock, proprietary model APIs) — moving compute away forces an identity / network boundary recrossing.
Canonical instance: Fly.io × Tigris¶
Fly.io explicitly describes this as "pretty killer":
- GPU compute: A10 / L40S / A100 / H100 attached to Fly Machines via whole-GPU passthrough.
- Object storage: Tigris, the Fly.io-native S3-compatible object store whose byte plane is cached on the same Fly.io NVMe volumes that GPU Machines run on.
- Network: Fly.io's Anycast edge.
- Billing: one invoice; no per-request egress line item at steady state.
Fly.io's 2024-08-15 L40S price cut to $1.25/hr is the market move that makes this pattern economically the default on the platform, not a power-user configuration. (Source: sources/2024-08-15-flyio-were-cutting-l40s-prices-in-half)
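The economics can be sketched with back-of-envelope arithmetic. The $1.25/hr L40S rate is from the 2024-08-15 price-cut post; the hyperscaler GPU-hour rate, the $0.09/GB egress figure, the weight size, and the cold-start count below are illustrative assumptions, not quoted prices:

```python
# Back-of-envelope monthly cost: co-located vs split GPU + object storage.
# Only the $1.25/hr L40S rate comes from the source; every other number
# (hyperscaler GPU rate, egress $/GB, weight size, cold-start count) is
# an illustrative assumption.

def monthly_cost(gpu_hourly, egress_per_gb, weights_gb, cold_starts, hours=730):
    """Flat GPU-hours plus any cross-cloud weight-fetch egress."""
    return gpu_hourly * hours + egress_per_gb * weights_gb * cold_starts

# Co-located: flat GPU-hour, weight fetches stay on the regional fabric.
colocated = monthly_cost(gpu_hourly=1.25, egress_per_gb=0.0,
                         weights_gb=14, cold_starts=500)

# Split: GPU on one cloud, weights behind another cloud's egress meter.
split = monthly_cost(gpu_hourly=2.00, egress_per_gb=0.09,
                     weights_gb=14, cold_starts=500)

print(f"co-located: ${colocated:,.2f}/mo")  # → co-located: $912.50/mo
print(f"split:      ${split:,.2f}/mo")      # → split:      $2,090.00/mo
```

The per-GB egress term scales with cold-start frequency, which is exactly the term the pattern zeroes out.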
Structural parts¶
- GPU compute primitive — Fly Machine. Whole-GPU passthrough per instance (the fractional-GPU path via MIG / vGPU was tried and abandoned on Fly.io; the whole-GPU path is what works).
- Object storage primitive — Tigris. Regional-first; metadata on FoundationDB; byte cache on Fly.io NVMe; demand-driven replication via a QuiCK-style queue; pluggable S3 origin.
- Network primitive — Anycast at the Fly edge; private WireGuard mesh (6PN) between Machines.
- Integration glue — `fly storage create` as a single command that injects S3-compatible secrets, so application code speaks to Tigris identically to how it would speak to S3.
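The glue above means no Tigris-specific SDK: a stock S3 client just needs its endpoint and credentials redirected. A minimal sketch, assuming the env-var names `fly storage create` injects follow the conventional `AWS_*` scheme and that `fly.storage.tigris.dev` is the default endpoint (both are assumptions, not confirmed by the source):

```python
# Sketch of the integration glue: build the kwargs a stock S3 SDK
# (e.g. boto3.client("s3", **cfg)) would take, sourced from the env vars
# that `fly storage create` is assumed to inject. Env-var names and the
# default endpoint are assumptions for illustration.
import os

TIGRIS_ENDPOINT = "https://fly.storage.tigris.dev"  # assumed default

def s3_client_config(env=os.environ):
    """Return S3-SDK client kwargs pointed at Tigris instead of AWS."""
    return {
        "endpoint_url": env.get("AWS_ENDPOINT_URL_S3", TIGRIS_ENDPOINT),
        "aws_access_key_id": env.get("AWS_ACCESS_KEY_ID"),
        "aws_secret_access_key": env.get("AWS_SECRET_ACCESS_KEY"),
    }
```

Application code that already speaks S3 picks this config up unchanged, which is the whole point of the "native binding" framing.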
Trade-offs vs separation¶
| Axis | Co-location (this pattern) | Separation (hyperscaler GPU + remote object store) |
|---|---|---|
| Per-request cost | Flat GPU-hour | GPU-hour + per-GB egress |
| Cold-start weight fetch | Regional NVMe read | Cross-cloud fetch |
| Failure-isolation blast radius | One platform | Two platforms |
| Supply flexibility | Bounded to the platform's GPU fleet | Any hyperscaler's inventory |
| Managed-service access | Platform-local only | Full hyperscaler catalogue |
Known uses¶
- Fly.io × Tigris — the canonical wiki instance. Anchored by the 2024-02-15 Tigris public-beta post and the 2024-08-15 L40S price-cut post.
- Shape-adjacent: hyperscaler bundles such as AWS EC2 G-instances + same-region S3 + CloudFront; Azure AKS + Blob + Front Door; GCP GKE + GCS + Cloud CDN. These realize the same pattern in principle but bill the three axes separately and do not publish an "inference locality" narrative around them.
Related¶
- concepts/inference-vs-training-workload-shape — why the co-location matters.
- concepts/inference-compute-storage-network-locality — the concept this pattern instantiates.
- concepts/anycast / concepts/egress-cost — the network / cost axes.
- patterns/metadata-db-plus-object-cache-tier — Tigris's internal shape.
- patterns/partner-managed-service-as-native-binding — how Tigris is surfaced as a first-party Fly.io primitive.