CONCEPT Cited by 3 sources
Inference compute–storage–network locality¶
Definition¶
For transaction-shaped inference workloads, the load-bearing architectural axis is the combination of GPU compute + instance RAM + fast object storage for model parameters/datasets + fast network-to-end-user — all co-located on one platform — not the absolute performance of any single component.
The canonical statement (Fly.io, 2024-08-15):
If you're trying to do something GPU-accelerated in response to an HTTP request, the right combination of GPU, instance RAM, fast object storage for datasets and model parameters, and networking is much more important than getting your hands on an H100.
(Source: sources/2024-08-15-flyio-were-cutting-l40s-prices-in-half)
The four axes¶
- GPU compute — "capable enough", not the biggest. For a large class of inference workloads, an A10 or L40S hits the threshold where the GPU is no longer the bottleneck.
- Instance RAM — "VM instances that have enough memory to actually run real frameworks on". Models, KV caches, and batch queues all want host DRAM; undersized VMs force thrashing back to disk/network.
- Fast object storage — model weights, datasets, and artefacts co-resident in the same region as the GPU. Canonically Tigris on Fly.io; generally any object store whose physical bytes live next to the compute.
- Fast network to the end user — Anycast-advertised endpoints so the HTTP request lands at the nearest inference POP; the round-trip budget is spent on the model, not on routing.
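The four axes can be read as a per-request latency budget: three of the four terms are plumbing, and the thesis is that co-location makes the plumbing terms small enough that GPU compute dominates. A minimal sketch, with every number an illustrative assumption rather than a figure from the post:

```python
from dataclasses import dataclass

@dataclass
class InferencePath:
    """Illustrative latency model for one transaction-shaped request.
    All figures are assumptions for the sketch, not measured values."""
    anycast_rtt_ms: float   # user <-> nearest POP (axis 4: network)
    weight_fetch_ms: float  # object store -> host RAM (axis 3: storage)
    host_to_gpu_ms: float   # DRAM -> VRAM staging (axis 2: instance RAM)
    gpu_compute_ms: float   # the model itself (axis 1: GPU compute)

    def total_ms(self) -> float:
        return (self.anycast_rtt_ms + self.weight_fetch_ms
                + self.host_to_gpu_ms + self.gpu_compute_ms)

# Co-located platform: every non-compute term is small, so the
# "capable enough" GPU dominates the budget.
local = InferencePath(anycast_rtt_ms=15, weight_fetch_ms=2,
                      host_to_gpu_ms=3, gpu_compute_ms=80)

# Split topology: cross-cloud weight/data fetches swamp the budget
# even if a frontier GPU halves the compute term.
split = InferencePath(anycast_rtt_ms=15, weight_fetch_ms=400,
                      host_to_gpu_ms=3, gpu_compute_ms=40)
```

The point of the sketch is the inequality, not the numbers: `split` loses despite the faster GPU, which is the "more important than getting your hands on an H100" claim in miniature.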
Why each axis matters (and why combining them matters more)¶
- A big GPU with slow storage stalls on weight loading and dataset fetches. Weight loading dominates cold-start time, and can dominate steady state if the working set exceeds combined GPU and host memory.
- Fast storage without a fast network still leaves the end-to-end WAN round trip in the request path; transaction shape means every request pays that round trip.
- Fast network without local storage burns the saved milliseconds on a second hop to pull model bytes cross-cloud.
- The combination collapses these hops into a single platform's internal fabric.
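The storage-stall point can be made concrete with back-of-envelope cold-start arithmetic; model size and link speeds here are assumptions for the sketch, not figures disclosed in the post:

```python
def cold_start_seconds(weight_bytes: float, link_bytes_per_sec: float) -> float:
    """Lower bound on cold-start weight load: bytes over the slowest link."""
    return weight_bytes / link_bytes_per_sec

GB = 1e9
weights = 140 * GB  # assumption: a 70B-parameter model in fp16

# Same-region object store over the platform's internal fabric
# (assumed ~10 GB/s effective).
local = cold_start_seconds(weights, 10 * GB)          # -> 14.0 s

# Cross-cloud pull over a ~1 Gbps WAN path (0.125 GB/s).
cross_cloud = cold_start_seconds(weights, 0.125 * GB)  # -> 1120.0 s
```

Under these assumptions the cross-cloud cold start is ~80x slower, entirely independent of which GPU sits at the far end, which is why no single-axis upgrade fixes a split topology.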
Fly.io explicitly positions this as the reason hyperscaler-hosted inference underperforms: hyperscalers charge "GPU instance surcharges, and then … egress fees for object storage data when those customers try to outsource the GPU stuff to GPU providers" — surcharges tax axes 1-2, egress taxes axes 3-4 at the boundary between the two clouds, so the combination is economically unviable before it's latency-viable. (Source: sources/2024-08-15-flyio-were-cutting-l40s-prices-in-half; complement: concepts/egress-cost)
Contrast with compute–storage separation¶
concepts/compute-storage-separation is the OLAP-shaped story: elastic compute over shared durable storage, because aggregations are the expensive axis and data scans are bandwidth-bound. Inference locality is the opposite shape. The "shared durable storage" axis is still there (Tigris, R2, S3) but a regional-local cache of the object-storage bytes sits next to the GPU and absorbs the per-request reads. The two concepts don't contradict; they point at different workload shapes.
Architectural consequences¶
- Platform shape: compute + object storage + edge network are productised as one offer, not three. Fly.io's $1.25/hour L40S + Tigris object storage + Anycast network is the instantiation. AWS EKS on G-instances + S3 + CloudFront is a more fragmented equivalent where each axis is billed and operated separately.
- Model distribution: model weights are published into the same-region object store; cold-start is a regional read, not a cross-region / cross-cloud read.
- Routing: Anycast gets the request to the right POP; the POP has the GPU and the weights.
- Pricing: a flat GPU-per-hour rate dominates the bill; egress does not appear as a line item at steady state.
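The pricing consequence can be sketched as bill-shape arithmetic. Only the $1.25/hour L40S rate comes from the post; the egress rate, model size, and cold-start count below are illustrative assumptions:

```python
L40S_HOURLY = 1.25   # Fly.io's stated flat rate (from the post)
HOURS_PER_MONTH = 24 * 30

# Co-located: the GPU line is the bill; same-platform storage reads
# incur no egress charge.
colocated_monthly = L40S_HOURLY * HOURS_PER_MONTH  # -> 900.0

# Split topology: hyperscaler egress taxes every byte of model weights
# pulled out to an external GPU provider.
EGRESS_PER_GB = 0.09        # assumption: typical internet-egress rate
weight_gb = 140             # assumption: fp16 70B-class model
pulls_per_month = 500       # assumption: redeploys / scale-from-zero events
egress_only_monthly = EGRESS_PER_GB * weight_gb * pulls_per_month
```

Under these assumptions the egress line alone exceeds the entire co-located GPU bill, which is the "economically unviable before it's latency-viable" point restated as arithmetic.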
Caveats¶
- "Capable enough" is workload-dependent. Ultra-large models (Llama 3.1 405B, long-context, large-batch throughput) will pull the GPU axis back toward frontier parts. Fly.io's own list of L40S-fitted workloads (Llama 3.1 70B, Flux, Whisper) has a ceiling; workloads above that ceiling aren't on the L40S.
- The thesis is stated, not architected, in the Fly.io post. Per-request hydration mechanics of model weights from Tigris to a Fly Machine's GPU on cold start are not disclosed.
- Anycast gets you to a POP, not to a GPU. Within-POP routing from Anycast ingress to a GPU-hosting Machine is additional infrastructure the post doesn't describe.
Seen in (wiki)¶
- sources/2024-08-15-flyio-were-cutting-l40s-prices-in-half — canonical statement of the concept.
- sources/2024-02-15-flyio-globally-distributed-object-storage-with-tigris — Tigris is the object-storage axis; the two Fly.io posts are complementary halves of the same locality thesis.
- sources/2025-02-14-flyio-we-were-wrong-about-gpus — Fly.io's 2025-02 retrospective affirms the thesis but discloses the market hasn't yet valued it. "We have app servers, GPUs, and object storage all under the same top-of-rack switch. But inference latency just doesn't seem to matter yet, so the market doesn't care." Developers shipping inference on AWS tolerate cross-cloud egress to specialist GPU providers because tokens-per-second is the speed axis that matters, not milliseconds. The locality thesis is architecturally correct and technically deliverable on Fly.io; it just isn't market-decisive for the Fly-shaped insurgent. See concepts/developers-want-llms-not-gpus for the demand-side framing and concepts/insurgent-cloud-constraints for the broader framing.
Related¶
- concepts/inference-vs-training-workload-shape — the shape that creates the locality requirement.
- concepts/anycast — the network-locality primitive.
- concepts/egress-cost — the shaping force against cross-cloud inference topologies.
- concepts/compute-storage-separation — the contrasting OLAP-shape concept.
- concepts/cache-locality / concepts/locality-aware-scheduling — adjacent locality primitives.
- patterns/co-located-inference-gpu-and-object-storage — the pattern this concept underwrites.
- systems/tigris / systems/nvidia-l40s / systems/fly-machines — three of the four axes of Fly.io's instantiation (concepts/anycast covers the fourth).