PATTERN Cited by 1 source

Flycast-Scoped Internal Inference Endpoint

The Flycast-scoped internal inference endpoint is the Fly.io instantiation of a general pattern: expose an inference service only to callers on the private network, never to the public internet, so that "idle" is well-defined and autostop decisions are meaningful.

Shape

The inference service (Ollama, vLLM, TGI, or a custom model server) runs on a GPU Fly Machine.
The Fly App hosting the service has no public anycast IP and no public Fly Proxy edge. Instead, access is via Flycast: a private address (<app>.flycast) reachable only over the Fly org's 6PN WireGuard mesh. Flycast traffic still passes through Fly Proxy, which is what lets autostart and autostop apply. The <app>.internal name also stays on the private network, but it resolves directly to Machines and bypasses the proxy, so it is not a Flycast hostname and will not wake a stopped Machine.
The consumers are other Fly Machines in the same org, typically an app-server / API gateway tier that translates user-facing requests into internal inference calls.
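As a minimal sketch of the consumer side, the snippet below shows how an app-tier process might call an Ollama backend over Flycast using only the Python standard library. The app name `my-ollama` and the model tag `llava` are hypothetical placeholders, not values from the source. The request shape follows Ollama's `/api/generate` endpoint with streaming disabled.

```python
import json
import urllib.request

# Hypothetical values: the app name "my-ollama" is a placeholder, not taken
# from the source. The URL only resolves from inside the Fly org's network.
FLYCAST_BASE_URL = "http://my-ollama.flycast"


def build_generate_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llava", timeout_s: float = 120.0) -> str:
    """Send the request over the private network and return the generated text.

    The generous timeout is an assumption: it leaves room for a stopped
    Machine to start and load the model before it answers.
    """
    request = build_generate_request(FLYCAST_BASE_URL, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read())["response"]
```

Run from a laptop, `generate(...)` would fail at name resolution, which is the point of the pattern: only Machines on the org's private network can reach the backend.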

Why access-scoping matters for GPU scale-to-zero

With a public endpoint on a GPU Machine participating in proxy-autostop:

Any internet scan wakes the GPU. Shodan / Censys-style probes, curl-based benchmarks, credential-stuffing attempts, and CVE scanners all count as "traffic" from the proxy's point of view.
"Idle" ceases to be well-defined. If the proxy reads any request as activity, a Machine under routine internet background noise is never idle, which defeats the cost model.
Attack surface for unauthenticated inference. LLM inference endpoints without auth are attractive targets (token-burning-as-DoS, prompt-extraction attacks, proxying abuse).

Scoping to Flycast eliminates all three: the only path into the GPU Machine is from app servers on the same WireGuard mesh, which are themselves authenticated and rate-limited at their own public edge.
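The cost argument above can be made concrete with a toy model. The function below is a sketch under a stated assumption: every request keeps the Machine running for a fixed `idle_timeout_s` afterwards. This is an illustrative rule, not Fly Proxy's actual autostop algorithm, and it ignores start-up time and request duration.

```python
def billed_seconds(request_times: list[float], idle_timeout_s: float, window_s: float) -> float:
    """Seconds a Machine stays running within [0, window_s] under a toy idle rule.

    Toy rule (an assumption, not Fly Proxy's real algorithm): each request
    starts the Machine if it is stopped and keeps it running for
    idle_timeout_s after that request.
    """
    total = 0.0
    run_start = run_end = None
    for t in sorted(request_times):
        if t >= window_s:
            break
        if run_end is not None and t <= run_end:
            run_end = t + idle_timeout_s  # still running: push the stop time out
        else:
            if run_end is not None:
                total += min(run_end, window_s) - run_start  # close the previous run
            run_start, run_end = t, t + idle_timeout_s  # a new run begins
    if run_end is not None:
        total += min(run_end, window_s) - run_start
    return total


# Three real requests in the first 20 seconds of an hour.
internal = [0.0, 10.0, 20.0]
# Background probes every 4 minutes, as a public endpoint might see.
probes = [float(t) for t in range(0, 3600, 240)]

print(billed_seconds(internal, 300.0, 3600.0))           # 320.0
print(billed_seconds(internal + probes, 300.0, 3600.0))  # 3600.0
```

With a 5-minute idle timeout, the internal-only Machine runs for 320 seconds of the hour. Adding one probe every 4 minutes keeps it running for the full 3600 seconds, even though the real workload is unchanged.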

Canonical instance: Fly.io image-description service

Fly.io's 2024-05-09 walkthrough: "On Fly.io, at the time of writing, you'd achieve this with the autostart and autostop functions of the Fly Proxy, restricting Ollama access to internal requests over Flycast from the PocketBase app." Ollama behind Flycast; PocketBase on the app tier reaches it via the internal Flycast URL. Users interact only with PocketBase; the GPU is not even addressable from their devices. (Source: sources/2024-05-09-flyio-picture-this-open-source-ai-for-image-description)

Composition with other patterns

Upstream pairing: patterns/proxy-autostop-for-gpu-cost-control. Flycast makes autostop's "idle" definition meaningful; autostop makes the internal-scoping economically valuable (without it, you'd scope and still pay 24/7 for an idle GPU).
Sibling on the public-facing edge: the Fly App serving human users handles auth, rate-limiting, per-user quotas, and translation to internal RPCs. Canonical cheap-frontend / expensive-stoppable-backend split.
Not specific to GPU. The same pattern applies to any Machine that is expensive to run and where public exposure would cause unwanted cost or risk, e.g. internal search indexes, admin tools, data-export workers.
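The cheap-frontend / expensive-backend split described above can be sketched as follows. This is a minimal illustration, not the source's implementation: `Gateway`, `TokenBucket`, and the `forward` callable are hypothetical names, and the token-set check stands in for whatever authentication the app tier really uses. Requests that fail auth or exceed the per-user rate are rejected at the app tier, so they never reach the internal endpoint and never wake the GPU.

```python
import time
from typing import Callable


class TokenBucket:
    """Token bucket allowing bursts of `capacity`, refilled at `rate_per_s`."""

    def __init__(self, capacity: float, rate_per_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.rate_per_s = rate_per_s
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def allow(self) -> bool:
        """Refill for the elapsed time, then spend one token if available."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate_per_s)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class Gateway:
    """App-tier front door: authenticate, rate-limit, then call the backend."""

    def __init__(self, valid_tokens: set[str], forward: Callable[[str], str],
                 capacity: float = 5.0, rate_per_s: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.valid_tokens = valid_tokens
        self.forward = forward  # hypothetical: performs the internal inference call
        self.capacity = capacity
        self.rate_per_s = rate_per_s
        self.clock = clock
        self.buckets: dict[str, TokenBucket] = {}

    def handle(self, api_token: str, prompt: str) -> tuple[int, str]:
        """Return an (HTTP-style status, body) pair for one user request."""
        if api_token not in self.valid_tokens:
            return 401, "unauthorized"
        if api_token not in self.buckets:
            self.buckets[api_token] = TokenBucket(self.capacity, self.rate_per_s, self.clock)
        if not self.buckets[api_token].allow():
            return 429, "rate limit exceeded"
        return 200, self.forward(prompt)  # only now does traffic reach the backend
```

Passing the clock in as a parameter is a design choice for testability; in production the default `time.monotonic` would be used.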

Generalisation beyond Fly.io

AWS: VPC endpoint / PrivateLink into an internal Network Load Balancer fronting the GPU tier; equivalent scoping via security-group-based access.
Kubernetes: ClusterIP Service (no LoadBalancer) + a NetworkPolicy restricting ingress to app-tier pods.
Cloud Run: --no-allow-unauthenticated + internal-only ingress.
Fly Kubernetes: Flycast is one of the three equivalent access paths for K8s Services under FKS (alongside direct IPv6 and CoreDNS), so the pattern maps cleanly onto FKS Services.
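For the Kubernetes variant listed above, the ingress restriction can be sketched as a NetworkPolicy manifest. The helper below builds it as a plain Python dict and prints JSON, which `kubectl apply -f` accepts alongside YAML. The policy name, namespace, labels, and port are hypothetical placeholders; 11434 is Ollama's default port and would differ for other servers.

```python
import json


def inference_network_policy(namespace: str, inference_labels: dict[str, str],
                             caller_labels: dict[str, str], port: int) -> dict:
    """Build a NetworkPolicy allowing ingress to inference pods only from caller pods.

    A podSelector inside `from` with no namespaceSelector matches pods in the
    policy's own namespace only.
    """
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "inference-ingress-from-app-tier", "namespace": namespace},
        "spec": {
            # Pods this policy protects.
            "podSelector": {"matchLabels": inference_labels},
            "policyTypes": ["Ingress"],
            # The only allowed ingress: app-tier pods, on the inference port.
            "ingress": [{
                "from": [{"podSelector": {"matchLabels": caller_labels}}],
                "ports": [{"protocol": "TCP", "port": port}],
            }],
        },
    }


# Hypothetical labels and namespace, for illustration only.
policy = inference_network_policy(
    namespace="ml",
    inference_labels={"app": "inference"},
    caller_labels={"tier": "app"},
    port=11434,
)
print(json.dumps(policy, indent=2))
```

A NetworkPolicy only takes effect when the cluster's network plugin enforces it, so this sketch assumes such a plugin is installed.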

Trade-offs

Adds a private-networking hop. Requests from the app tier traverse the WireGuard mesh; negligible latency in practice but the dependency is real. A degraded 6PN mesh breaks the inference path.
Debugging is harder: curl from a laptop doesn't work, so the developer needs a WireGuard peer or a Fly SSH session into a Machine in the org.
No unauthenticated external access, full stop. Projects that want public inference (e.g. open demo endpoints) need a different pattern (public Fly App fronted by aggressive rate-limiting + auth).
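Because the private-network hop is a real dependency, callers on the app tier may want to absorb brief failures rather than surface each one to users. The helper below is one possible sketch, assuming that network errors surface as `OSError` (which covers `urllib.error.URLError`, connection errors, and socket timeouts in Python) and that retrying an inference request is acceptable for the workload. The name `call_with_backoff` and its defaults are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(call: Callable[[], T], attempts: int = 4,
                      base_delay_s: float = 0.5, max_delay_s: float = 8.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `call`, retrying on OSError with capped exponential backoff.

    The last failure is re-raised so a persistently broken path still
    surfaces to the caller instead of being hidden.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of retries: let the caller see the failure
            sleep(min(max_delay_s, base_delay_s * 2 ** attempt))
    raise AssertionError("unreachable")
```

Keeping `attempts` small matters here: a long retry loop against a stopped or unreachable backend ties up app-tier workers without fixing the underlying network problem.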

Seen in

sources/2024-05-09-flyio-picture-this-open-source-ai-for-image-description — Canonical source. Ollama on a GPU Fly Machine, access scoped to internal 6PN traffic from the PocketBase app via Flycast, so Fly Proxy autostop can manage Machine lifecycle on a well-defined notion of "idle".