Picture This: Open Source AI for Image Description¶
Fly.io developer-facing post by Nolan (Fly Machines team) walking through a weekend-scale open-source image-description service built from Ollama (serving LLaVA, an Apache-licensed multimodal model), PocketBase (a Firebase-like SQLite + Go + auth scaffold, extended in Go via LangChainGo), and a thin Python client that can run inside the NVDA screen reader. The product framing is accessibility — AI image descriptions as empowering for a blind internet user who has spent three decades fighting flaky alt-text — but the architectural substance, and the reason this post clears the AGENTS.md Tier-3 filter, is the production GPU scale-to-zero recipe on Fly Machines with a disclosed cold-start number.
Summary¶
The hobby stack is deliberately modular: Ollama on a GPU Fly
Machine serves LLaVA inference; PocketBase on a shared-cpu-1x
Fly Machine authenticates users, stores chat history in SQLite (on a
small persistent volume), and hooks PocketBase collection-event
callbacks into Ollama via LangChainGo; a Python client on the
user's device uploads an image + followup questions via the
PocketBase API. The cost discipline that makes this viable on a
cloud GPU is a platform primitive, not an app feature: Fly Proxy's
autostart / autostop stops the Ollama Machine after a few idle
minutes, and Flycast scopes Ollama access to internal 6PN
requests from PocketBase only — users never talk to the GPU Machine
directly, so its running state is the proxy's concern, not the
client's. The disclosed production cold-start on an
a100-40gb preset with 34b-parameter LLaVA is ~45 seconds total:
a few seconds to boot the stopped Machine, tens of seconds to load
the 34b model into GPU RAM, then several seconds per generated
response. The post also lists two deployment choices for the
Ollama model payload: a persistent volume (model stored on NVMe,
re-mounted across starts) or baking the model into the Docker
image (larger image but no runtime fetch).
Key takeaways¶
- GPU scale-to-zero via Fly Proxy autostart/autostop is the load-bearing cost mechanism. "GPU compute is expensive! It's important to take steps to ensure you're not paying for a massive GPU 24/7." Fly.io's recipe: restrict Ollama access to internal Flycast requests from the PocketBase app, then enable autostart / autostop on the Fly Proxy so "if there haven't been any requests for a few minutes, the Fly Proxy stops the Ollama Machine, which releases the CPU, GPU, and RAM allocated to it." Canonical wiki instance of patterns/proxy-autostop-for-gpu-cost-control — idle GPU Machines cost nothing because the proxy stops them on silence and starts them on inbound request. (Source: sources/2024-05-09-flyio-picture-this-open-source-ai-for-image-description)
- Disclosed cold-start number: ~45 seconds on a100-40gb with 34b LLaVA. "On the a100-40gb Fly Machine preset, the 34b-parameter LLaVA model took several seconds to generate each response. If the Machine was stopped when the request came in, starting it up took another handful of seconds, followed by several tens of seconds to load the model into GPU RAM. The total time from cold start to completed description was about 45 seconds." Decomposition of the ~45 s budget: Machine-start (seconds) + model-load-into-GPU-RAM (tens of seconds) + per-response-generation (several seconds). Canonical wiki instance of concepts/gpu-scale-to-zero-cold-start with a real number attached — prior concepts/cold-start + concepts/scale-to-zero pages had mostly CPU-class / serverless framing; this entry anchors the GPU-inference-specific tail. (Source: same)
- Flycast scopes Ollama to internal-only — so users don't keep the GPU busy directly. "It's important to take steps to ensure you're not paying for a massive GPU 24/7. On Fly.io, at the time of writing, you'd achieve this with the autostart and autostop functions of the Fly Proxy, restricting Ollama access to internal requests over Flycast from the PocketBase app." The access-scoping is structural, not just polite: if the Ollama endpoint were public, every curl on the internet would wake the GPU Machine; behind Flycast the only way to send traffic is from another Machine on the same Fly org's WireGuard mesh — so "idle" is well-defined. Canonical wiki instance of patterns/flycast-scoped-internal-inference-endpoint. (Source: same)
- PocketBase-as-app-server layer: SQLite-backed Firebase clone with Go extensibility + event hooks. The business logic — auth, per-user API rules so one user can't read another user's chats, a followups collection chaining prior messages for context, hook-triggered calls into Ollama — is written as a Go extension of PocketBase using collection event hooks and API rules. The PocketBase binary is a single static Go executable on a small persistent volume, running happily on shared-cpu-1x alongside the expensive GPU tier. Not a novel architecture, but representative of the *cheap-frontend + expensive-stoppable-GPU + idempotent-internal-RPC* partition Fly's autostart model rewards. (Source: same)
- Model payload: persistent volume vs. baked-into-Docker-image. "If you're running Ollama in the cloud, you likely want to put the model onto storage that's persistent, so you don't have to download it repeatedly. You could also build the model into a Docker image ahead of deployment." Two options, both covered by Fly.io's existing primitives: (a) Fly Volume holding the model weights — re-mounted across Machine starts, pays for storage but not re-fetch; (b) larger Docker image with the model inside — no runtime fetch, pulls a larger rootfs on each Machine creation. The post doesn't numerically compare — just flags both as valid. (Source: same)
- Context-window blow-out is the limiter on the simple followup-question design. "This is a super simple hack to handle followup questions, and it'll let you keep adding followups until something breaks. You'll see the quality of responses get poorer — possibly incoherent — as the context exceeds the context window." Honest note on the naive chain-prior-messages approach — no summarisation, no windowing, no retrieval. An architectural caveat stated in the post itself, not a hidden limit. (Source: same)
- Modularity framing: swap the model and prompt, get a different service. "If sentiment analysis or joke creation is your thing, you can swap out image description for that and have something going in, like, a weekend. … If image descriptions aren't your thing, this business logic is easily swappable for joke generation, extracting details from text, any other simple task you might want to throw at an LLM. Just slot the best model into Ollama (LLaVA is pretty OK as a general starting point too), and match the PocketBase schema and pre-set prompts to your application." Post's explicit design claim: the PocketBase-schema + hook-chain + Ollama-model triple is a template, not a committed architecture. (Source: same)
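The load-bearing autostart/autostop recipe above is mostly configuration on the Ollama app. A minimal fly.toml sketch, assuming Fly.io's documented [http_service] fields at the time of writing — the app name and image are illustrative, and the idle threshold itself is managed by the Fly Proxy rather than set here:

```toml
# fly.toml for the Ollama app (illustrative sketch, not from the post).
app = "ollama-inference"            # hypothetical app name

[build]
  image = "ollama/ollama"           # stock Ollama image

[http_service]
  internal_port = 11434             # Ollama's default listen port
  auto_stop_machines = true         # Fly Proxy stops the Machine after a few idle minutes
  auto_start_machines = true        # ...and restarts it on the next inbound request
  min_machines_running = 0          # allow true scale-to-zero

[[vm]]
  size = "a100-40gb"                # the GPU preset named in the post
```

The Flycast scoping is done outside fly.toml: allocate only a private address for the app (`fly ips allocate-v6 --private`) and no public IPs, so the only traffic that can wake the Machine comes from other apps on the same org's 6PN network.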
Architectural details¶
Stack (three processes on two kinds of Machine)¶
- GPU tier — a100-40gb Fly Machine. Runs the Ollama Docker image with LLaVA-34b loaded into GPU RAM on first request. Behind Flycast — reachable only from the PocketBase app on the same Fly org. Fly Proxy stops it after the configured idle period; the next internal request wakes it.
- Control-plane tier — shared-cpu-1x Fly Machine with a small persistent volume. PocketBase single-binary Go process; stores users, sessions, and the images and followups collections in SQLite on the volume. Extended in Go with LangChainGo-mediated Ollama calls wired to collection event hooks.
- Client tier — Python script on the user's laptop / screen reader. Uses the PocketBase Python SDK to upload image.jpg and exchange followup text; speaks only to PocketBase, never to Ollama directly (that's the point of the Flycast scoping).
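The client tier can be sketched against PocketBase's REST API directly, which makes the "speaks only to PocketBase" boundary concrete. This is a stdlib-only sketch (the post uses the PocketBase Python SDK instead); the base URL and the images collection's file field name are assumptions:

```python
"""Minimal client-tier sketch: upload an image to PocketBase, which
hook-triggers the Ollama call server-side. Stdlib only; URL, field and
collection names are illustrative assumptions, not from the post."""
import json
import urllib.request
import uuid


def build_multipart(fields: dict, file_field: str, filename: str, file_bytes: bytes):
    """Encode form fields plus one file as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\nContent-Type: application/octet-stream\r\n\r\n'.encode()
        + file_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def upload_image(base_url: str, token: str, path: str) -> dict:
    """POST the image into the images collection; the PocketBase hook
    takes over from there and calls Ollama on the private Flycast address."""
    with open(path, "rb") as f:
        body, content_type = build_multipart({}, "file", path, f.read())
    req = urllib.request.Request(
        f"{base_url}/api/collections/images/records",
        data=body,
        headers={"Authorization": token, "Content-Type": content_type},
    )
    return json.load(urllib.request.urlopen(req))
```

Note that the client never holds an Ollama address at all — whether the GPU Machine is running is the Fly Proxy's problem, invisible at this layer.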
Hook flow (image upload → description → followups)¶
- User uploads an image via the Python client → the PocketBase images collection receives a new record.
- The PocketBase collection hook fires → the Go extension sends the image to the Ollama Flycast endpoint with the system prompt "You are a helpful assistant describing images for blind screen reader users. Please describe this image."
- Ollama's response is stored in the followups collection and returned via the PocketBase API to the Python client.
- A user-initiated followup message written to followups fires another hook → the Go extension chains the new question with prior context into a new Ollama request.
- Loop until followups grows past the context window and answers degrade (stated limit; see the context-window takeaway above).
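The chaining step in that loop is where the stated context-window limit comes from. The post implements it in Go via LangChainGo; a language-agnostic Python sketch of the same naive behaviour (function and variable names are illustrative, not the post's):

```python
"""Sketch of the naive followup chaining the hooks implement: every new
question is appended to the full prior transcript, with no summarisation
or windowing, so prompts grow until the model's context window overflows.
(The post does this in Go via LangChainGo; names here are illustrative.)"""

SYSTEM_PROMPT = ("You are a helpful assistant describing images for "
                 "blind screen reader users. Please describe this image.")


def build_prompt(history: list[tuple[str, str]], new_question: str) -> str:
    """Chain every prior (question, answer) pair ahead of the new question."""
    lines = [SYSTEM_PROMPT]
    for question, answer in history:
        lines.append(f"User: {question}")
        lines.append(f"Assistant: {answer}")
    lines.append(f"User: {new_question}")
    return "\n".join(lines)


def exceeds_context(prompt: str, context_tokens: int, chars_per_token: int = 4) -> bool:
    """Crude length estimate. The post has no such guard at all — which is
    exactly why long followup chains eventually degrade into incoherence."""
    return len(prompt) / chars_per_token > context_tokens
```

A production build would summarise or window `history` instead of replaying it whole; the post explicitly leaves that out.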
Concrete numbers (disclosed in post)¶
- GPU preset used: a100-40gb (Fly.io's A100 40GB SKU).
- Model size: LLaVA, 34b parameters.
- Cold-start latency on a stopped Machine: ~45 seconds total — "several seconds" Machine start + "several tens of seconds" model load + "several seconds" response generation.
- Warm response latency: "several seconds" per response.
- PocketBase Machine: shared-cpu-1x — the cheapest Fly shared-CPU preset; a deliberately cheap front-end to an expensive, stoppable GPU back-end.
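The cold-start arithmetic is worth pinning down, since only the ~45 s total is disclosed. The per-phase split below is an assumption consistent with the post's qualitative wording, not disclosed numbers:

```python
"""Illustrative decomposition of the disclosed ~45 s cold start.
Only the total is disclosed; the per-phase values are assumptions that
match the post's qualitative budget ("several seconds" Machine start,
"several tens of seconds" model load, "several seconds" generation)."""

COLD_START_PHASES_S = {
    "machine_start": 5,        # boot the stopped Fly Machine (assumed)
    "model_load_gpu_ram": 35,  # load 34b LLaVA weights into GPU RAM (assumed)
    "generation": 5,           # produce the first description (assumed)
}


def cold_start_total() -> int:
    """Full budget when the request lands on a stopped Machine."""
    return sum(COLD_START_PHASES_S.values())


def warm_response() -> int:
    """A warm Machine skips boot and model load entirely."""
    return COLD_START_PHASES_S["generation"]
```

The point of the decomposition: the dominant term is model load into GPU RAM, which is exactly what the volume-vs-baked-image payload choice does not eliminate — both options only avoid the network fetch, not the load into GPU memory.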
Numbers not disclosed¶
- No warm-start number broken out from the 45-second cold total (only the qualitative "several seconds" per response).
- No dollar-per-request or per-month cost; only the qualitative claim that stopping the GPU "saves you a bunch of money and some carbon footprint".
- No autostop idle-threshold value (post says "a few minutes"); the threshold is configurable via Fly Proxy settings but isn't quoted inline.
- No QPS ceiling before cold starts become user-facing (single-user hobby deployment).
- No context-window limit number for the LLaVA model chosen; post describes the degradation qualitatively.
- No concurrency / request-queuing behaviour under the autostart-from-stopped path (what happens to two requests that land during a cold start).
- No model-image size numbers for the Docker-image-with-model option vs. volume option — size / rebuild-time tradeoff not quantified.
- No PocketBase scaling guidance beyond "runs fine on a shared-cpu-1x Machine".
Caveats¶
- Product-adjacent voice, weekend-project scope. Post is a hobby-build walkthrough, not a production-incident retrospective or a platform deep-dive. Author is on the Fly Machines team but the post is consumer-developer-facing. Architectural substance is the GPU scale-to-zero recipe + cold-start number; everything else is stack walkthrough.
- Accessibility framing is the hook. The product value (AI-generated alt text for blind users; links to Be My AI and Seeing AI as existing products; NVDA add-on AI-content-describer) is real and the author's lived motivation, but the wiki-scope content is the Fly platform recipe.
- LLaVA quality disclosed modestly. "Is it a stellar description? Maybe not." The demo couldn't identify leafless tree species. Model-quality claims are not a takeaway — the takeaway is the infra recipe.
- Prompt / context management is naive. No summarisation, no RAG, no windowing — the followup loop eventually blows the context window (stated). A production build would need concepts/agent-memory or at least a summarising step.
- Modularity claim is platform-level not workload-level. The "swap the model for jokes / sentiment" framing is accurate at the PocketBase-schema-plus-hook level but doesn't address per-workload tuning (quantisation choice, GPU sizing, context size, concurrency).
- Not a comparison with alternative hosting paths. Post doesn't benchmark Fly's cold-start against self-hosted on-prem GPU, Runpod, Modal, Replicate, AWS SageMaker async endpoints, etc. — it's a walkthrough, not a competitive analysis.
Relationship to existing wiki¶
- Extends concepts/scale-to-zero with a GPU-inference instance that has a real cold-start number — prior instances on that page (Lambda tenet, Cloudflare Artifacts storage-tier, Livebook/FLAME GPU cluster) span serverless-compute, storage, and notebook-driven GPU; this adds the proxy-managed always-present-but-stopped single-Machine inference shape.
- Extends systems/fly-proxy with the autostart/autostop role, previously covered only as a FKS Service implementer.
- Extends systems/flycast with the internal-only inference-endpoint usage pattern — a private-networking scope discipline that in turn makes the autostop model well-defined.
- Extends systems/fly-machines with the explicit a100-40gb preset name + stopped-Machine semantics + cold-start timing decomposition.
- Extends systems/fly-volumes with the "hold model weights to avoid re-downloading on cold start" use case — a different use shape than the stateful-app-data framing from the 2024-07-30 Making Machines Move post.
- Extends systems/nvidia-a100 with a concrete customer hobby workload + cold-start number on the a100-40gb SKU.
- New concepts/gpu-scale-to-zero-cold-start captures the three-component cold-start budget (Machine-start + model-load-into-GPU-RAM + first-response) as a reusable concept — applies equally to Runpod serverless, Modal, Replicate, SageMaker async endpoints, Cloud Run GPUs.
- New patterns/proxy-autostop-for-gpu-cost-control captures the idle-stop-on-silence / wake-on-request pattern with proxy ownership of Machine lifecycle.
- New patterns/flycast-scoped-internal-inference-endpoint captures the private-networking-scoped-inference shape that makes the autostop model well-defined.
Source¶
- Original: https://fly.io/blog/llm-image-description/
- Raw markdown: raw/flyio/2024-05-09-picture-this-open-source-ai-for-image-description-3b1569d4.md
Related¶
- companies/flyio — eighth Fly.io ingest.
- systems/fly-machines — the compute primitive.
- systems/fly-proxy — owns the autostart/autostop lifecycle.
- systems/flycast — scopes Ollama to internal-only.
- systems/fly-volumes — option (a) for model weight storage.
- systems/nvidia-a100 — a100-40gb preset used in the demo.
- concepts/scale-to-zero — sibling GPU-inference instance.
- concepts/cold-start — parent concept.
- concepts/gpu-scale-to-zero-cold-start — new, canonical.
- patterns/proxy-autostop-for-gpu-cost-control — new, the recipe.
- patterns/flycast-scoped-internal-inference-endpoint — new.
- sources/2024-09-24-flyio-ai-gpu-clusters-from-your-laptop-with-livebook — sibling Fly.io GPU post: notebook-driven cluster scale-to-zero via FLAME vs. single-Machine proxy-autostop here.
- sources/2024-08-15-flyio-were-cutting-l40s-prices-in-half — sibling Fly.io GPU post: pricing / SKU strategy behind the A100.
- sources/2024-07-30-flyio-making-machines-move — Fly Volume framing and the migration-anchoring caveat that the model-on-volume option inherits.