We're Cutting L40S Prices In Half¶
Pricing-announcement post for NVIDIA L40S GPUs on Fly.io, whose hourly rate drops to $1.25, the same price as the A10. The substance under the pricing headline is a customer-data-driven retrospective on which GPU shape Fly.io's users actually want and why, and the architectural thesis that falls out of it: inference workloads have a fundamentally different shape from training workloads, and the right infrastructure answer is compute plus object-storage plus Anycast-network locality, not a bigger GPU.
Summary¶
Fly.io offers four NVIDIA GPU models in increasing order of performance: A10, L40S, A100 40G PCIe, and A100 80G SXM. Fly.io's prior product intuition was that the "biggest GPU problem we could solve was selling fractional A100 slices". That effort went through MIG and vGPU experiments via IOMMU PCI passthrough and was eventually abandoned ("a project so cursed that Thomas has forsworn ever programming again"), followed by whole A100s with NVLink-ganged clusters for training, then a chase for H100s. A year later the data showed that the least capable GPU in the catalogue, the A10, was the most popular by a wide margin. The post reframes the product strategy around that datum: the A10 is "capable enough" for inference workloads like Mistral Nemo, Stable Diffusion, and generic GenAI tasks, and Fly.io cannot restock it fast enough. The L40S, an AI-optimized L40 (the data-centre RTX 4090, "two 4090s stapled together"), delivers A100-class AI compute at a shape that can now be priced at A10 levels. The price cut is Fly.io's way of "making [the L40S] official" as the default inference GPU, deliberately collapsing the choice between "A10 or something bigger" into a single recommendation.
Key takeaways¶
- Customer data overrode product intuition. Fly.io expected fractional-A100 slicing + big-A100 training clusters + H100 scarcity to be the load-bearing GPU problems. Actual usage showed the cheapest, weakest GPU (A10) dominating the mix. "We guessed wrong … the most popular GPU in our inventory is the A10." (Source: sources/2024-08-15-flyio-were-cutting-l40s-prices-in-half)
- Inference and training are different workload shapes, and the shape dictates the infra. "Training workloads tend to look more like batch jobs, and inference tends to look more like transactions. Batch training jobs aren't that sensitive to networking or even reliability. Live inference jobs responding to end-user HTTP requests are." — load-bearing statement for the inference-vs-training-workload-shape concept on this wiki.
- For inference, the winning shape is locality, not raw GPU performance. "If you're trying to do something GPU-accelerated in response to an HTTP request, the right combination of GPU, instance RAM, fast object storage for datasets and model parameters, and networking is much more important than getting your hands on an H100." This is the architectural thesis for compute-storage-network locality for inference and the co-located inference GPU + object-storage pattern.
- Hyperscaler economics squeeze inference workloads on two axes: "GPU instance surcharges, and then … egress fees for object storage data when those customers try to outsource the GPU stuff to GPU providers." The Fly.io pitch is a platform where GPU compute and Tigris object storage are co-resident so neither bill applies. Links to egress-cost as a shaping force on inference architecture, and to Fly.io's Anycast network as the third locality axis.
- The L40S is the "Volkswagen GTI" of the lineup. The L40S is positioned as an A100-class performer for AI workloads on a card designed for a rack (not a tower), with more memory and less power draw than a 4090, and it keeps the full rendering hardware (so it also does 3D / video). "Long story short, the L40S is an A100-performer that we can price for A10 customers." The post lists inference workloads that fit: Llama 3.1 70B, Flux (Black Forest Labs image-gen), Whisper (ASR), SegAlign (whole-genome alignment), and graphics workloads like DOOM Eternal.
- Fractional-GPU slicing failed at Fly.io. The 2023-era push to sell fractional A100 slices via NVIDIA MIG or vGPUs through IOMMU PCI passthrough inside Fly Machines was abandoned as unworkable. Meaningful datum for the wiki on how IOMMU PCI-passthrough-based GPU virtualisation interacts with Firecracker-style micro-VM isolation — it doesn't, at least not in a way Fly could productise. The eventual pivot was whole-GPU attach to Fly Machines.
- GPU selection is not just "what's biggest". The post enumerates real reasons the L40S sits where it does: it's an L40 (data-centre RTX 4090) with AI-compute uplift, designed for rack density and cooling rather than a tower case's thermal envelope. Relevant operator knob for anyone choosing an inference GPU: rack density, power envelope, and rendering-hardware retention matter alongside raw compute.
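The two-axis squeeze can be made concrete with back-of-envelope arithmetic. The per-GB egress rate and checkpoint size below are illustrative assumptions, not figures from the post; only the $1.25/hour L40S price comes from the source.

```python
# Back-of-envelope illustration of the hyperscaler egress squeeze.
# Assumptions (not from the Fly.io post): ~$0.09/GB is a typical
# hyperscaler internet-egress list price, and ~140 GB approximates
# an fp16 70B-parameter checkpoint (70e9 params * 2 bytes).

EGRESS_PER_GB = 0.09   # assumed hyperscaler egress price, $/GB
MODEL_GB = 140         # assumed fp16 70B checkpoint size, GB

# Shipping the weights out of hyperscaler object storage to an external
# GPU provider pays the egress toll on every full cold-start hydration.
egress_per_hydration = EGRESS_PER_GB * MODEL_GB
print(f"one full hydration: ${egress_per_hydration:.2f}")

# Compare with simply renting the co-located GPU for a while.
L40S_HOURLY = 1.25     # Fly.io's post-cut L40S price (from the post)
hours_equivalent = egress_per_hydration / L40S_HOURLY
print(f"equivalent L40S hours: {hours_equivalent:.1f}")
```

Under these assumptions a single cross-provider weight transfer costs as much as roughly ten hours of L40S time, which is the shape of the argument for co-resident storage.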
Operational numbers¶
- L40S price on Fly.io (post-cut): $1.25 / hour — same as the A10 ("the A10 price").
- GPU catalogue (Fly.io, as of 2024-08-15): 4 models in increasing AI-compute order — A10, L40S, A100 40G PCIe, A100 80G SXM.
- Most popular by a wide margin: A10. Supply is the bottleneck — "we can't get new A10s in fast enough for our users."
- L40S positioning: "an A100-performer" for AI workloads (F32/F16 caveat not broken down in the post).
- Fractional-GPU history: roughly one quarter of engineering effort spent on MIG / vGPU / IOMMU PCI passthrough, then abandoned.
- Compatible inference workloads named: Llama 3.1 70B, Flux, Whisper, SegAlign, DOOM Eternal.
Extracted systems¶
- systems/nvidia-a10 — older-generation NVIDIA GPU, "capable enough" for most inference workloads (Mistral Nemo, Stable Diffusion, mid-sized GenAI). Fly.io's volume leader.
- systems/nvidia-l40s — AI-optimized L40 (which is the data-centre version of the RTX 4090). A100-class AI compute on a card designed for rack deployment. Now Fly.io's default inference GPU at $1.25/hr.
- systems/nvidia-a100 — covers both variants Fly.io stocks (40G PCIe + 80G SXM). Training-first card at Fly.io; NVLink ganging for distributed-training.
- systems/nvidia-h100 — referenced as the scarce-frontier GPU Fly.io was chasing before realising inference customers didn't need it. Extends the existing H100 page with this negative-space datum.
- systems/nvidia-mig — NVIDIA's partitioning mechanism for fractional GPU; tried and abandoned at Fly.io.
- systems/fly-machines — Fly.io's Firecracker-micro-VM compute primitive. GPUs attach to Machines via whole-device passthrough.
- systems/tigris — Fly.io's co-resident object storage (via Tigris Data Inc.); the "object-storage" axis of the inference locality thesis.
- Llama 3.1 / Flux / Whisper / SegAlign — named workloads the L40S can host.
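Whole-GPU attach to a Fly Machine is expressed through the Machine's guest configuration. The sketch below builds such a request body; the field names (`guest`, `gpu_kind`, `cpu_kind`) follow the Fly Machines API as the author recalls it and should be treated as assumptions, verified against the current Machines API reference before use.

```python
import json

# Sketch of a Fly Machines create payload attaching a whole L40S.
# Field names are assumptions based on the public Machines API docs;
# instance sizes here are purely illustrative.

def l40s_machine_config(image: str) -> dict:
    """Build a Machine-create payload with a whole-GPU L40S attach."""
    return {
        "config": {
            "image": image,
            "guest": {
                "cpu_kind": "performance",
                "cpus": 8,
                "memory_mb": 32768,
                "gpu_kind": "l40s",  # whole-device passthrough, no slicing
            },
        },
    }

payload = json.dumps(l40s_machine_config("registry.fly.io/my-app:latest"))
```

A POST of this body to the Machines API's machine-create endpoint would request the GPU; note there is no fractional option, consistent with the abandoned MIG/vGPU effort described above.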
Extracted concepts¶
- concepts/inference-vs-training-workload-shape — inference = transaction, training = batch. Different sensitivity to networking and reliability.
- concepts/inference-compute-storage-network-locality — the combination of GPU + instance RAM + fast object storage + fast network is what wins for inference, not any single axis.
- concepts/egress-cost — hyperscaler egress fees as a shaping force pushing customers off the hyperscaler for GPU inference. Extended with this datum.
- concepts/anycast — "plugged into an Anycast network that's fast everywhere in the world" is the third locality axis. Extended with this datum.
- concepts/training-serving-boundary — the post gives the workload-shape-divergence half of the boundary (training = batch-shape, inference = transaction-shape), complementing the existing compute-convergence framing from Vogels / SageMaker.
Extracted patterns¶
- patterns/co-located-inference-gpu-and-object-storage — GPU compute, model-parameter / dataset object storage, and edge network co-resident on one platform. Fly.io × Tigris is the canonical wiki instance.
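At runtime this pattern typically reduces to a cache-then-hydrate step on cold start: check a local weights cache, and only on a miss read from the co-resident object store. The stdlib sketch below illustrates that step with an injectable `fetch` callable standing in for any S3-compatible Tigris client; it is an illustration of the pattern, not Fly.io's implementation, and all names are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Callable

# Minimal sketch of cold-start weight hydration for a co-located
# inference-GPU + object-storage setup. `fetch` stands in for any
# S3-compatible client call against the co-resident store (e.g. Tigris).

def hydrate(key: str, cache_dir: Path, fetch: Callable[[str], bytes]) -> Path:
    """Return a local path for `key`, fetching from the store on a miss."""
    local = cache_dir / key.replace("/", "_")
    if not local.exists():             # cold start: one co-located read,
        local.write_bytes(fetch(key))  # no egress fee when storage is local
    return local                       # warm start: serve from disk

# Demo with a fake in-memory "bucket" so the sketch is self-contained.
bucket = {"models/nemo.safetensors": b"\x00fake-weights"}
calls = []

def fake_fetch(key: str) -> bytes:
    calls.append(key)
    return bucket[key]

cache = Path(tempfile.mkdtemp())
p1 = hydrate("models/nemo.safetensors", cache, fake_fetch)
p2 = hydrate("models/nemo.safetensors", cache, fake_fetch)
```

The second `hydrate` call returns the same path without touching the store, which is why the object store and the GPU being co-resident matters mainly for the cold-start path.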
Caveats¶
- This is a pricing post, not an architecture paper. The customer-workload-shape + compute/storage/network-locality thesis is the substantive content; pricing is the framing.
- No per-GPU throughput / latency / QPS disclosed for L40S vs A10 vs A100 on the named workloads. "An A100-performer" is as specific as the post gets, with the explicit caveat "without us getting into the details of F32 vs. F16 models."
- MIG / vGPU failure is described, not diagnosed. The post doesn't say why IOMMU PCI passthrough + fractional GPU failed inside Fly Machines — only that it did.
- No disclosure of L40S memory / TDP / supply numbers, rack density, or inventory mix. No interconnect topology disclosed (L40S is PCIe, not NVLink/NVSwitch — can't be ganged the same way A100/H100 SXM parts can; this is implicit, not stated).
- Anycast-to-inference path is claimed, not architected. The post asserts the combination is "pretty killer" but doesn't describe how inference requests are routed from the Anycast edge into a GPU Machine, or how model parameters are hydrated from Tigris on cold-start.
- Marketing framing. Tier-3 source; ingested for the workload-shape thesis + fractional-GPU disclosure + locality argument, not for pricing. Consistent with the AGENTS.md Tier-3 rule — the architectural content clears the bar.
Source¶
- Original: https://fly.io/blog/cutting-prices-for-l40s-gpus-in-half/
- Raw markdown: raw/flyio/2024-08-15-were-cutting-l40s-prices-in-half-a3d993ac.md
Related¶
- companies/flyio — Fly.io company page; this is the GPU / inference-locality datum for it.
- systems/tigris — the object-storage half of the locality thesis.
- concepts/anycast — the networking half.
- concepts/inference-vs-training-workload-shape — the workload-shape concept anchored by this post.
- patterns/co-located-inference-gpu-and-object-storage — the pattern derived from this post.