
CONCEPT Cited by 1 source

Developers want LLMs, not GPUs

Definition

The demand-side observation that the developer audience for "AI-enabling their app" overwhelmingly wants an LLM API (prompt in, tokens out) — not a GPU, not a model, not a framework, not a CUDA runtime. The canonical wiki statement (Fly.io, 2025-02-14):

The biggest problem: developers don't want GPUs. They don't even want AI/ML models. They want LLMs. System engineers may have smart, fussy opinions on how to get their models loaded with CUDA, and what the best GPU is. But software developers don't care about any of that. When a software developer shipping an app comes looking for a way for their app to deliver prompts to an LLM, you can't just give them a GPU.

(Source: sources/2025-02-14-flyio-we-were-wrong-about-gpus)

The three-level abstraction gap

  • GPU. Hardware accelerator. Requires CUDA, driver stack, scheduling, memory management, idle-cost management.
  • Model. Pre-trained weights + inference runtime. Requires picking the model, loading weights, and managing batch size, KV cache, context window, fine-tuning.
  • LLM API. HTTP endpoint. POST prompt → stream tokens. The developer thinks in prompts, context, and response text. No GPU, no CUDA, no model choice — or a model choice expressed as a string name.

Fly.io's 2022-era GPU bet sat at level 1 (GPU). The developer market sat at level 3 (LLM API). The gap between those two levels is filled by OpenAI, Anthropic, Replicate, RunPod, Fireworks, etc. — API-over-hosted-inference providers.
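A minimal sketch of what level 3 looks like from the developer's side. The payload shape is OpenAI-style chat, and the model name is a placeholder string — both are illustrative, not any provider's exact API — but the point stands: the entire interface is a JSON body over HTTP, and "model choice" is one string field.

```python
import json

def build_chat_request(prompt: str, model: str = "claude-sonnet") -> dict:
    # Level 3: the developer's whole mental model is "prompt in, text out".
    # No CUDA, no driver stack, no weights — the model is just a string name.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,  # tokens stream back as they are generated
    }

# This dict is what actually crosses the wire in a POST body.
payload = json.dumps(build_chat_request("Summarize this support ticket"))
```

Everything below this line — GPU scheduling, CUDA versions, batch sizes — is the API provider's problem, which is precisely why levels 1 and 2 never reach this developer.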

Why the gap is stable

  • Token throughput vs. latency. Developers think of speed in tokens per second, not milliseconds. An LLM API's round-trip time (hundreds of ms to a frontier model) is dominated by token generation, not network transit. Sub-millisecond ingress locality doesn't move the user-visible number.

    "For those developers, who probably make up most of the market, it doesn't seem plausible for an insurgent public cloud to compete with OpenAI and Anthropic. Their APIs are fast enough, and developers thinking about performance in terms of 'tokens per second' aren't counting milliseconds." (sources/2025-02-14-flyio-we-were-wrong-about-gpus)

  • Model-choice cost. An API provider hosts N models; the developer picks by string. A GPU provider hosts a GPU; the developer builds the inference stack themselves.
  • Operational cost. GPU = idle-cost exposure. API = per-token cost. For bursty workloads, per-token dominates, and the developer doesn't want to think about it.
  • Platform scarcity. The frontier-model providers (OpenAI, Anthropic) will not let an insurgent cloud run their models — the most valuable endpoint shapes cannot be replicated, which forecloses competitive entry on the API axis.
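The back-of-envelope arithmetic behind the throughput-vs-latency point, using made-up but plausible numbers (completion length, generation speed, and round-trip times are all assumptions for illustration):

```python
# Why "tokens per second" thinking makes network locality irrelevant.
completion_tokens = 400             # a typical completion length (assumed)
tokens_per_second = 50              # plausible frontier-API generation speed
generation_ms = completion_tokens / tokens_per_second * 1000  # 8,000 ms

network_ms = 80                     # round trip to a distant hosted API
same_rack_ms = 2                    # GPU under the same top-of-rack switch

total_remote = generation_ms + network_ms    # 8,080 ms end to end
total_local = generation_ms + same_rack_ms   # 8,002 ms end to end
saving = (total_remote - total_local) / total_remote  # under 1%
```

With these numbers, moving the GPU under the same switch shaves under 1% off the user-visible response time — which is why "fast enough" APIs win and sub-ms locality doesn't register.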

Implications

  • Inference-as-GPU-service is a niche for insurgent clouds. Transaction-shape inference where the app developer wants GPU-near compute (not LLM API access) is a smaller market than "developers wanting LLMs". Fly.io's credo: "We design for 10,000 developers, not for 5-6. It took a minute, but the credo wins here: GPU workloads for the 10,001st developer are a niche thing."
  • The compute-storage-network-locality thesis survives but doesn't drive growth. Fly.io's self-assessment: "We have app servers, GPUs, and object storage all under the same top-of-rack switch. But inference latency just doesn't seem to matter yet, so the market doesn't care."
  • L40S customer segment persists. For the developers who do want GPU-near compute — video/image/audio pipelines, fine-tuning, custom models, non-LLM AI — the L40S lineup at Fly stays useful. "But they're just another kind of compute that some apps need; they're not a driver of our core business."
  • "AI-enabling" = API calls, not GPU rental. The Fly.io framing: "for most software developers, 'AI-enabling' their app is best done with API calls to things like Claude and GPT, Replicate and RunPod."

Caveats

  • This is Fly.io's demand-side assessment — it's a learned lesson, not a universal. Serious-AI customers (SXM H100 clusters for training, large-batch inference for product features, custom-model companies) remain a real market; they're just not developer-shaped demand.
  • The frontier could shift. If frontier-model license terms open up, or if open-weight models (Llama, Mistral, DeepSeek) reach functional parity with closed frontier models, the "host-my-own-model-on-GPU-near-compute" shape could re-expand. Fly.io's 2022-era mental model ("a diversity of mainstream models") was this expectation; the 2025 reality is that Cursor-class interfaces concentrated demand onto closed APIs instead.
  • Scoped to insurgent clouds. This concept is framed against the question "can a small cloud compete at the LLM tier"; the same question for hyperscalers (who host OpenAI and Anthropic as customers) is a different one.
