
PATTERN

Proxy Autostop for GPU Cost Control

Proxy autostop for GPU cost control delegates start/stop of an expensive GPU inference Machine to the layer-7 proxy in front of it. The application tier never decides when to wake or stop the GPU; the proxy does — on inbound request (wake) and on configured idle silence (stop). Idle GPUs cost nothing because they are not running; the next legitimate request eats a cold-start tail and proceeds.

Shape

  • Expensive tier — one or more GPU Machines running the model (Ollama + LLaVA, vLLM + Llama, TGI + Mistral, etc.).
  • Proxy — the L7 gateway between the cheap tier and the GPU tier. Owns the start/stop lifecycle.
  • Cheap tier — the app server, API gateway, or per-request scheduler that initiates internal RPCs into the GPU tier.
  • Trigger semantics:
      • Wake on inbound request. The proxy detects a request for a stopped Machine, starts it, buffers/holds the request until the Machine is up (or proxies through a warm replacement if one is available), then proceeds.
      • Stop on idle silence. The proxy watches per-Machine traffic and, if no requests have landed for a configured idle window (minutes-scale), sends a stop to the Machine.
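The trigger semantics above can be sketched as proxy-side state for a single backend Machine. This is a minimal illustration, not Fly Proxy's implementation; `start_machine` and `stop_machine` are hypothetical stand-ins for the platform's Machine lifecycle API:

```python
import time

class AutostopController:
    """Sketch of proxy-side autostop state for one backend Machine.

    start_machine/stop_machine are hypothetical callables standing in
    for the platform's Machine lifecycle API.
    """

    def __init__(self, idle_window_s, start_machine, stop_machine,
                 clock=time.monotonic):
        self.idle_window_s = idle_window_s
        self.start_machine = start_machine
        self.stop_machine = stop_machine
        self.clock = clock
        self.running = False
        self.last_request_at = None

    def on_request(self):
        # Wake on inbound request: start the Machine if stopped (the
        # proxy holds the request while it boots), then record the time
        # for the idle watcher.
        if not self.running:
            self.start_machine()
            self.running = True
        self.last_request_at = self.clock()

    def tick(self):
        # Stop on idle silence: called periodically by the proxy; stops
        # the Machine once the idle window has elapsed with no traffic.
        if self.running and self.last_request_at is not None:
            if self.clock() - self.last_request_at >= self.idle_window_s:
                self.stop_machine()
                self.running = False
```

The point of the sketch is the ownership: both the wake path and the stop path live in one place that sees every request.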

Canonical instance — Fly Proxy on Fly Machines

Fly.io's 2024-05-09 image-description walkthrough is the reference instance: an Ollama Machine is stopped when idle and started on internal request via Fly Proxy autostart/autostop. "If there haven't been any requests for a few minutes, the Fly Proxy stops the Ollama Machine, which releases the CPU, GPU, and RAM allocated to it." (Source: sources/2024-05-09-flyio-picture-this-open-source-ai-for-image-description)

The recipe has three ingredients:

  1. Flycast to scope access to the GPU tier — patterns/flycast-scoped-internal-inference-endpoint. Without the scope-to-internal-only step, random internet traffic would wake the GPU Machine on every scan.
  2. Autostart / autostop enabled on the Fly App hosting Ollama. Fly Proxy owns the Machine lifecycle.
  3. Autostop idle window (default "a few minutes", configurable) — trades tail-latency-on-first-request for idle-hours-saved.
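On Fly.io, ingredients 2 and 3 surface as plain `fly.toml` service settings. A sketch with illustrative values (the app name is hypothetical; the idle window itself is managed by Fly Proxy rather than set as a field here):

```toml
# fly.toml for the Ollama app (illustrative; app name is hypothetical)
app = "ollama-demo"

[http_service]
  internal_port = 11434          # Ollama's default listen port
  auto_start_machines = true     # Fly Proxy wakes a stopped Machine on request
  auto_stop_machines = "stop"    # "stop" releases CPU, GPU, and RAM
  min_machines_running = 0       # allow the GPU tier to scale to zero
```

Ingredient 1 (Flycast) is configured separately, by allocating a private address and removing public IPs so that only internal traffic can trigger a wake.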

Why this belongs at the proxy layer, not the app layer

The proxy is the only component with:

  • Full visibility of per-Machine request traffic (so it can detect "idle"), and
  • Authority to start/stop Machines (API credentials for the platform's Machine lifecycle).

Putting the decision in the app layer couples every app to the platform API and scatters stop-decision logic across the fleet. Putting it in the proxy keeps the app blissfully unaware: it makes a request to http://ollama.internal:11434, and whether that Machine was stopped 30 seconds ago or running the whole time is the proxy's problem.
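"Blissfully unaware" still implies the cheap tier budgets for the cold-start tail. One hedged sketch of a client-side wrapper (the `send` callable and the retry policy are assumptions for illustration, not a Fly client API):

```python
import time

def call_with_cold_start_budget(send, attempts=3, backoff_s=5.0):
    """Call the GPU tier through the proxy, tolerating a cold-start tail.

    `send` is any zero-argument callable performing the request (e.g. a
    wrapped HTTP POST to http://ollama.internal:11434). The caller never
    touches the Machine lifecycle; it only retries while the proxy wakes
    the backend.
    """
    last_exc = None
    for attempt in range(attempts):
        try:
            return send()
        except ConnectionError as exc:
            # A failed connection may just mean the Machine is still
            # booting; back off and let the proxy finish the wake.
            last_exc = exc
            time.sleep(backoff_s * (attempt + 1))
    raise last_exc
```

The app-side change is a timeout/retry budget, not lifecycle logic, which is the whole point of keeping the decision in the proxy.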

Trade-offs

  • The first request after idle pays the full GPU cold-start tail. On Fly.io with LLaVA-34b + a100-40gb that is ~45 seconds.
  • Queuing semantics during cold start become load-bearing — what happens to a second request that arrives five seconds into the cold start? Providers resolve this differently (queue it behind the first, pre-start a sibling Machine, fail fast).
  • Idle window tuning is a direct trade between idle cost and cold-start tail frequency. A 5-minute window is common; shorter saves more money but exposes more users to tails.
  • Only works when access is scoped to internal traffic. If the GPU tier is public, every internet scan wakes it — see patterns/flycast-scoped-internal-inference-endpoint.
  • Breaks long-running streaming requests if the stop triggers mid-stream. Providers typically hold the stop until connections drain, but this must be verified per platform.
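The idle-window trade above can be made concrete under a toy Poisson-arrival assumption (real LLM traffic is burstier; this is an illustration of the shape of the trade, not a sizing tool):

```python
import math

def cold_start_fraction(req_per_hour, idle_window_min):
    """Fraction of requests expected to pay the cold-start tail,
    assuming Poisson arrivals: a request pays iff the gap since the
    previous request exceeded the window W, probability exp(-lambda*W)."""
    lam_per_min = req_per_hour / 60.0
    return math.exp(-lam_per_min * idle_window_min)

def idle_hours_per_day(req_per_hour, idle_window_min):
    """Expected hours/day the Machine sits stopped, same model: each
    gap longer than W contributes (gap - W) of stopped time, and
    E[(gap - W)+] = exp(-lambda*W) / lambda per request interval."""
    lam_per_min = req_per_hour / 60.0
    stopped_min_per_req = math.exp(-lam_per_min * idle_window_min) / lam_per_min
    return stopped_min_per_req * req_per_hour * 24 / 60.0
```

At 60 requests/hour, shrinking the window from 5 minutes to 1 minute buys more stopped hours but raises the cold-start fraction from roughly e⁻⁵ to e⁻¹ of requests — the knob moves both numbers at once.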

Adjacent patterns

  • patterns/flycast-scoped-internal-inference-endpoint — the access-scoping step without which autostop can't cleanly distinguish "idle" from "receiving scanner traffic".
  • concepts/scale-to-zero — the general principle this pattern realises at the GPU tier with proxy-managed lifecycle.
  • systems/aws-lambda — the serverless analogue; the difference is that Lambda's cold-start tail is driven by runtime init, not model-weight load.
  • Cloud Run GPUs / Modal / Runpod serverless each implement a version of this pattern with their own proxy layer.
