SYSTEM Cited by 4 sources
NVIDIA L40S
The NVIDIA L40S is the AI-optimized variant of the L40, which is itself the data-centre edition of the GeForce RTX 4090 gaming GPU — "resembling two 4090s stapled together", per Fly.io's framing. The L40S delivers AI-compute performance "comparable to that of the A100" (Fly.io's summary, with an explicit caveat that F32 vs F16 comparisons differ), while retaining the full rendering pipeline and gaming-card cost base.
Seen in (wiki)
- Fly.io 2024-08-15 — "Volkswagen GTI" framing. Fly.io cut L40S pricing to $1.25/hour — the same price as the A10 — and made the L40S the default recommendation for inference. Named workloads: Llama 3.1 70B, Flux (Black Forest Labs image-gen), Whisper (ASR), SegAlign (whole-genome alignment), DOOM Eternal (showcasing retained graphics hardware). (Source: sources/2024-08-15-flyio-were-cutting-l40s-prices-in-half)
- Fly.io 2024-09-24 — 64-node hyperparameter-tuning cluster. ElixirConf 2024 keynote demo (recapped by Fly.io): Chris Grainger (Amplified) generates a cluster of 64 L40S Fly Machines, compiles a different BERT variant on each, fine-tunes on the same patent corpus, and streams per-node loss curves back to a Livebook in real time — driven by FLAME + the Nx stack. Cluster terminates on notebook disconnect. Fly.io's platform claim: "start a cluster of GPUs in seconds rather than minutes, and all it requires is a Docker image" (concepts/seconds-scale-gpu-cluster-boot). (Source: sources/2024-09-24-flyio-ai-gpu-clusters-from-your-laptop-with-livebook)
- Fly.io 2025-02-14 — "the L40S customer segment persists." In Fly.io's We Were Wrong About GPUs retrospective, the L40S is named as the one SKU that found a developer-shaped product-market fit in Fly's GPU inventory. "That leaves the L40S customers. There are a bunch of these! We dropped L40S prices last year, not because we were sour on GPUs but because they're the one part we have in our inventory people seem to get a lot of use out of. We're happy with them. But they're just another kind of compute that some apps need; they're not a driver of our core business. They're not the GPU bet paying off." The L40S is the customer base that Fly.io's retrenchment protects — forward investment pauses, but existing workloads and pricing stay. (Source: sources/2025-02-14-flyio-we-were-wrong-about-gpus)
- Databricks × Superhuman 2026-05-08 — L40S as the pre-migration GPU class for 200K-QPS LLM serving. A joint Databricks Model Serving / Superhuman post documents the L40S as the GPU class behind Superhuman's pre-migration DIY vLLM serving stack, which ran at 200,000+ QPS peak with sub-1-second p99. It is the largest publicly disclosed L40S inference deployment in the wiki. The L40S stack "supported a massive scale, but several pain points were compounding when serving large language models": each new model iteration "required months of manual performance tuning to onboard", and the lean ML-platform team carried the capacity-planning, performance-tuning, and autoscaling burden. After the migration from L40S to H100, per-pod throughput reached 1,200 QPS post-optimisation, up from 750 QPS pre-optimisation on H100 (a 60% gain), driven primarily by the H100's Transformer Engine FP8 path. The post is careful to note: "Some of the throughput improvement attributed to software optimisations is enabled by the H100's faster Transformer-Engine FP8 path; the L40S baseline is not separately re-tested with the new optimisations." This is the wiki's canonical datum that the L40S is sufficient as a 200K-QPS LLM serving substrate but that its per-pod throughput ceiling is hardware-bounded: the H100 Transformer Engine FP8 path, not the software changes alone, is what unlocks the 60% per-pod improvement. The deployment complements the Fly.io framings: where Fly.io positions the L40S as the developer-grade inference SKU, Superhuman shows the same hardware class running at 200K+ QPS production scale before hitting the FP8-throughput ceiling that motivated the H100 upgrade. (Source: sources/2026-05-08-databricks-how-superhuman-and-databricks-built-a-200k-qps-inference-platform-together)
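The throughput figures in the Databricks × Superhuman entry can be checked with a few lines of arithmetic. The sketch below uses only the three numbers quoted above (200,000 QPS peak, 750 and 1,200 QPS per pod on H100). The pod counts it derives are an illustration, not figures from the source: they assume perfectly even load balancing and no headroom, and the helper names are hypothetical.

```python
import math

# Figures quoted in the Databricks x Superhuman entry above.
PEAK_QPS = 200_000          # peak load of the serving stack
QPS_PER_POD_BEFORE = 750    # H100 pod, pre-optimisation
QPS_PER_POD_AFTER = 1_200   # H100 pod, post-optimisation


def relative_gain(before: float, after: float) -> float:
    """Fractional improvement of `after` over `before`."""
    return (after - before) / before


def pods_needed(peak_qps: int, qps_per_pod: int) -> int:
    """Minimum pods to cover peak load.

    Assumption (not from the source): load is spread perfectly
    evenly and no spare capacity is reserved.
    """
    return math.ceil(peak_qps / qps_per_pod)


gain = relative_gain(QPS_PER_POD_BEFORE, QPS_PER_POD_AFTER)
pods_before = pods_needed(PEAK_QPS, QPS_PER_POD_BEFORE)
pods_after = pods_needed(PEAK_QPS, QPS_PER_POD_AFTER)

print(f"per-pod gain: {gain:.0%}")              # 60%
print(f"pods at 750 QPS/pod:   {pods_before}")  # 267
print(f"pods at 1,200 QPS/pod: {pods_after}")   # 167
```

Under these assumptions the 60% per-pod gain corresponds to roughly 100 fewer pods at peak. A real deployment would provision extra headroom, so the absolute counts should be read as lower bounds only.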
Why it matters
- An A100-class inference card at a gaming-GPU cost basis. The L40S inherits the RTX 4090's core design (Ada Lovelace), so NVIDIA can manufacture at high volume against a consumer-card BOM while charging an enterprise markup — a structurally different economic shape from the HBM-based A100 / H100.
- Data-centre form factor. The L40 family is designed for rack power and cooling envelopes, not tower PCs: more memory than a 4090, lower TDP, and denser packing. These are the design changes Fly.io enumerates when explaining why a 4090 "sucks in a data center rack".
- Kept the rendering hardware. Unlike the A100/H100 (AI-only compute cards), the L40S retains the full rasterisation pipeline, so it can serve 3D graphics and video-processing workloads that a pure compute card can't. Fly.io's pitch that a customer could "build the Stadia that Google couldn't pull off" rests on this retained capability.
- No NVLink / NVSwitch. The L40S is PCIe-only — it cannot be ganged into the tightly-coupled multi-GPU training domains that A100 / H100 SXM parts form. That is not a limitation for inference, which is exactly why the L40S works as the inference default while the SXM-class parts remain the training default.
Architectural position (per Fly.io)
"Long story short, the L40S is an A100-performer that we can price for A10 customers; the Volkswagen GTI of our lineup." The pricing move is engineered to collapse the choice between "A10 or step up to something bigger" into a single default. Fly.io's broader thesis is that for inference, the load-bearing axis is compute-storage-network locality — GPU + instance RAM + Tigris object storage + Anycast network — not the GPU alone.
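To put the pricing move in concrete terms, the sketch below estimates GPU spend for a cluster at the $1.25/hour L40S price cited in the 2024-08-15 entry. It is a rough illustration only: it assumes cost scales linearly with GPU count and run time, it covers the GPU charge alone (no instance CPU/RAM, storage, or bandwidth), and the 30-minute run length is a made-up example. The function name is hypothetical.

```python
# L40S price from the Fly.io 2024-08-15 entry above.
L40S_USD_PER_HOUR = 1.25


def cluster_cost_usd(
    gpus: int,
    hours: float,
    usd_per_gpu_hour: float = L40S_USD_PER_HOUR,
) -> float:
    """GPU-only cost of running `gpus` cards for `hours`.

    Assumption (not from the source): billing is prorated
    linearly, with no minimum charge or volume discount.
    """
    return gpus * hours * usd_per_gpu_hour


# A 64-GPU cluster, the size used in the ElixirConf 2024 demo.
per_hour = cluster_cost_usd(gpus=64, hours=1.0)
half_hour_run = cluster_cost_usd(gpus=64, hours=0.5)  # hypothetical run length

print(f"64 x L40S for 1 hour:     ${per_hour:.2f}")       # $80.00
print(f"64 x L40S for 30 minutes: ${half_hour_run:.2f}")  # $40.00
```

Because the L40S and the A10 share the same hourly price in the 2024-08-15 entry, the same calculation gives the A10 figure, which is the sense in which the choice collapses to a single default.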
Related
- systems/nvidia-a10 — price anchor; L40S now at A10 price.
- systems/nvidia-a100 — AI-compute baseline the L40S claims parity with.
- systems/nvidia-h100 — frontier part; L40S is an explicit downmarket-for-inference alternative.
- systems/fly-machines — L40S attaches to a Fly Machine via whole-GPU passthrough.
- systems/llama-3-1 — named workload the L40S serves.
- concepts/inference-vs-training-workload-shape — why an inference-shaped card (PCIe, graphics-retained, modest interconnect) is the right shape.
- patterns/co-located-inference-gpu-and-object-storage — Fly.io's L40S + Tigris architectural pitch.