PATTERN
Proxy Autostop for GPU Cost Control¶
Proxy autostop for GPU cost control delegates start/stop of an expensive GPU inference Machine to the layer-7 proxy in front of it. The application tier never decides when to wake or stop the GPU; the proxy does — on inbound request (wake) and on configured idle silence (stop). Idle GPUs cost nothing because they are not running; the next legitimate request eats a cold-start tail and proceeds.
Shape¶
- Expensive tier — one or more GPU Machines running the model (Ollama + LLaVA, vLLM + Llama, TGI + Mistral, etc.).
- Proxy — the L7 gateway between the cheap tier and the GPU tier. Owns the start/stop lifecycle.
- Cheap tier — the app server, API gateway, or per-request scheduler that initiates internal RPCs into the GPU tier.
- Trigger semantics:
- Wake on inbound request. Proxy detects a request for a stopped Machine, starts it, buffers / holds the request until the Machine is up (or proxies through a warm replacement if available), proceeds.
- Stop on idle silence. Proxy watches the per-Machine traffic and, if no requests have landed for a configured idle window (minutes-scale), sends a stop to the Machine.
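The two triggers above amount to a small state machine owned by the proxy. A minimal sketch in Python (illustrative only, not Fly Proxy's actual implementation; `machine` is a hypothetical handle exposing `start()`/`stop()`/`running`):

```python
import time


class ProxyAutostop:
    """Proxy-owned lifecycle sketch: wake on inbound request,
    stop after `idle_window` seconds of silence."""

    def __init__(self, machine, idle_window: float):
        self.machine = machine          # hypothetical platform handle
        self.idle_window = idle_window  # seconds of silence before stop
        self.last_request = time.monotonic()

    def handle_request(self, forward):
        """Called per inbound request; `forward` proxies it upstream."""
        self.last_request = time.monotonic()
        if not self.machine.running:
            self.machine.start()        # cold-start tail is paid here
        return forward()

    def tick(self):
        """Called periodically by the proxy's idle sweeper."""
        idle_for = time.monotonic() - self.last_request
        if self.machine.running and idle_for > self.idle_window:
            self.machine.stop()         # releases CPU, GPU, and RAM
```

The key property is that the application's request path (`handle_request`) never references the platform API directly; the lifecycle decision lives entirely in the proxy's sweep loop.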
Canonical instance — Fly Proxy on Fly Machines¶
Fly.io's 2024-05-09 image-description walkthrough is the reference instance: an Ollama Machine is stopped when idle and started on internal request via Fly Proxy autostart/autostop. "If there haven't been any requests for a few minutes, the Fly Proxy stops the Ollama Machine, which releases the CPU, GPU, and RAM allocated to it." (Source: sources/2024-05-09-flyio-picture-this-open-source-ai-for-image-description)
The recipe has three ingredients:
- Flycast to scope access to the GPU tier — patterns/flycast-scoped-internal-inference-endpoint. Without the scope-to-internal-only step, random internet traffic would wake the GPU Machine on every scan.
- Autostart / autostop enabled on the Fly App hosting Ollama. Fly Proxy owns the Machine lifecycle.
- Autostop idle window (default "a few minutes", configurable) — trades tail-latency-on-first-request for idle-hours-saved.
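On Fly.io the second and third ingredients are app configuration. A sketch of the relevant `fly.toml` stanza, assuming current Fly config key names (the idle window itself is managed by the platform, not set here; Flycast scoping is done by allocating a private address rather than in this stanza):

```toml
app = "ollama"  # hypothetical app name

[http_service]
  internal_port = 11434        # Ollama's default port
  auto_start_machines = true   # Fly Proxy starts a stopped Machine on request
  auto_stop_machines = "stop"  # Fly Proxy stops it after the idle window
  min_machines_running = 0     # allow scale-to-zero for the GPU tier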
Why this belongs at the proxy layer, not the app layer¶
The proxy is the only component with:
- Full visibility of per-Machine request traffic (so it can detect "idle"), and
- Authority to start/stop Machines (API credentials for the platform's Machine lifecycle).
Putting the decision in the app layer couples every app to the platform API and scatters stop-decision logic across the fleet. Putting it in the proxy keeps the app blissfully unaware: it makes a request to the GPU app's Flycast address (e.g. http://ollama.flycast), and whether that Machine was stopped 30 seconds ago or running the whole time is the proxy's problem.
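From the cheap tier's side the call is plain HTTP. A sketch assuming Ollama's `/api/generate` endpoint; the hostname and port are assumptions that depend on the app's Flycast/service config:

```python
import json
from urllib import request

# Hypothetical Flycast-scoped address: reachable only inside the private
# network, and routed through the proxy so a stopped Machine gets woken.
OLLAMA_URL = "http://ollama.flycast:11434/api/generate"


def build_describe_request(prompt: str, model: str = "llava") -> request.Request:
    # No platform lifecycle calls here: if the Machine is stopped, the
    # proxy starts it and holds this request until the model answers.
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    return request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )


# A generous client timeout absorbs the cold-start tail on the first request:
# resp = request.urlopen(build_describe_request("Describe this image."), timeout=120)
```

The only concession the app makes to the pattern is that generous timeout; everything else is indistinguishable from calling an always-on service.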
Trade-offs¶
- The first request after idle pays the full GPU cold-start tail. On Fly.io with LLaVA-34b + a100-40gb that is ~45 seconds.
- Queuing semantics during cold start become load-bearing — what happens to request two that arrives at second 5 of the cold start? Providers resolve this differently (queue behind the first, pre-start a sibling Machine, fail fast).
- Idle window tuning is a direct trade between idle cost and cold-start tail frequency. A 5-minute window is common; shorter saves more money but exposes more users to tails.
- Only works when access is scoped to internal traffic. If the GPU tier is public, every internet scan wakes it — see patterns/flycast-scoped-internal-inference-endpoint.
- Breaks long-running streaming requests if the stop triggers mid-stream. Providers typically hold the stop until connections drain, but this must be verified per platform.
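The idle-window trade can be put in rough numbers. Under a simple model (my assumption, not from the source: Poisson request arrivals, service time neglected), a request finds the Machine stopped with probability exp(-rate × window), and the Machine is billed for roughly the complementary fraction of wall-clock time:

```python
import math


def cold_start_fraction(rate_per_min: float, window_min: float) -> float:
    # P(gap since the previous request exceeds the idle window) under
    # Poisson arrivals: these requests find the Machine stopped.
    return math.exp(-rate_per_min * window_min)


def billed_fraction(rate_per_min: float, window_min: float) -> float:
    # Expected fraction of time the Machine runs (idle-window tail after
    # each request dominates; busy time neglected).
    return 1.0 - cold_start_fraction(rate_per_min, window_min)


# One request every 2 minutes with a 5-minute window: ~8% of requests
# eat the ~45 s tail, and the Machine is billed ~92% of the time.
# Shrinking the window to 1 minute cuts billing to ~39% of the time
# but exposes ~61% of requests to the tail.
```

Illustrative only, but it makes the shape of the trade concrete: the window is a single knob moving cost and tail frequency in opposite directions.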
Adjacent patterns¶
- patterns/flycast-scoped-internal-inference-endpoint — the access-scoping step without which autostop can't cleanly distinguish "idle" from "receiving scanner traffic".
- concepts/scale-to-zero — the general principle this pattern realises at the GPU tier with proxy-managed lifecycle.
- systems/aws-lambda — the serverless analogue; difference is Lambda's cold-start tail is driven by runtime init, not model-weight load.
- Cloud Run GPUs / Modal / Runpod serverless each implement a version of this pattern with their own proxy layer.
Seen in¶
- sources/2024-05-09-flyio-picture-this-open-source-ai-for-image-description
— Canonical source. Fly Proxy autostart/autostop on an Ollama Machine behind Flycast. Disclosed ~45 s cold-start on a100-40gb + LLaVA-34b as the tail the pattern chooses to eat.