PATTERN Cited by 1 source

BYO model via container

Definition

BYO model via container is the pattern of an inference platform productising "bring your own custom or fine-tuned model" as a container-image push. The customer packages the model weights + inference code into a platform-defined container format and pushes it to the platform; the platform deploys and serves it on managed GPUs, exposing it through the same catalog + binding + gateway as first-party and third-party-hosted models.

Why containers, not model files

A naive "upload your .safetensors + a Python script" BYO model surface is brittle:

  • CUDA versions drift per platform.
  • Python / PyTorch versions conflict with platform-managed inference runtimes.
  • Custom kernels / ops (Triton, FlashAttention variants) need matching userland.
  • Model I/O pre-processing (image decoding, tokeniser setup) has library dependencies.
  • Warm-loading weights into GPU memory at container start requires controlled initialisation code.

A container captures the entire userland — CUDA, Python, requirements, inference code, weights — as one versioned, reproducible artefact. The platform's contract is a stable HTTP predict interface at the container's edge.
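Concretely, a Cog-built container serves predictions over HTTP from inside the image, so the platform only needs to POST JSON to a fixed endpoint. A minimal client sketch (endpoint path, port, and body shape follow the open-source Cog docs; treat the details as illustrative, not a platform contract):

```python
import json
import urllib.request

def predict_payload(inputs: dict) -> bytes:
    # Body shape a Cog HTTP server expects: {"input": {...}}.
    return json.dumps({"input": inputs}).encode()

def call_predict(host: str, inputs: dict) -> dict:
    # Cog containers expose POST /predictions (conventionally on port 5000).
    req = urllib.request.Request(
        f"http://{host}/predictions",
        data=predict_payload(inputs),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# e.g. call_predict("localhost:5000",
#                   {"image": "https://example.com/in.png", "scale": 2.0})
```

Everything behind that endpoint (CUDA, kernels, weights-loading) is the customer's problem inside the image; everything in front of it is the platform's.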

Mechanism

The container format requires the customer to supply:

  1. A build manifest (cog.yaml in Replicate's case) declaring the Python version, OS + CUDA dependencies, and a requirements.txt of Python packages.
  2. A Predictor class with two methods:
      • setup() — called once when the container warms up; loads weights into memory.
      • predict(...) — called per inference request; inputs are typed via cog.Input(...), outputs by Python type annotation.

cog build produces a Docker image. cog push (or the platform's native push command) ships it to the platform, where a managed fleet of GPU hosts can pull-and-run the image on demand.

From the 2026-04-16 Cloudflare post:

# cog.yaml
build:
  python_version: "3.13"
  python_requirements: requirements.txt
predict: "predict.py:Predictor"
# predict.py
from cog import BasePredictor, Path, Input
import torch

class Predictor(BasePredictor):
    def setup(self):
        self.net = torch.load("weights.pth")

    def predict(self,
            image: Path = Input(description="Image to enlarge"),
            scale: float = Input(description="Factor to scale image by", default=1.5)
    ) -> Path:
        output = self.net(image, scale)
        return output

Then (future API): cog build && wrangler ai push ./my-model.

Integration with the unified catalog

Once the container is pushed and accepted, the BYO model appears in the platform's unified catalog alongside first-party and third-party models. Callers invoke it through the same unified binding:

await env.AI.run("my-org/my-custom-model-v1", { image: src });

— and it inherits the same gateway features: observability, per-request metadata, rate limiting, and (in principle) failover to a fallback model if the custom container is unavailable.
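The "in principle" failover can be sketched from the caller's side. Here `run_model` is a hypothetical stand-in for the platform binding (the env.AI.run call above), and both model names are illustrative; note that for a BYO model the fallback is necessarily a *different* model, since there is no same-model-on-another-provider twin:

```python
def run_with_fallback(run_model, primary: str, fallback: str, inputs: dict):
    """Try the custom container first; fall back to a catalog model."""
    try:
        return run_model(primary, inputs)
    except Exception:
        # The BYO container is a single point of failure, so failover
        # means accepting a different model's output.
        return run_model(fallback, inputs)
```

Whether that swap is acceptable is an application-level judgment the gateway cannot make for you.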

Cold-start problem

A major caveat of container-packaged BYO models is cold start. Loading multi-GB model weights from storage into GPU memory on every container start is prohibitive if the container scales to zero between requests.
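A back-of-envelope calculation shows why (the numbers are illustrative, not from the post):

```python
def weight_load_seconds(weights_gb: float, bandwidth_gbit_s: float) -> float:
    """Lower bound on cold-start time spent streaming weights into GPU
    memory: gigabytes converted to gigabits, divided by link bandwidth."""
    return weights_gb * 8 / bandwidth_gbit_s

# A 14 GB checkpoint over a 10 Gbit/s path needs at least
# weight_load_seconds(14, 10) == 11.2 seconds before the first request
# can be served — before any Python import or CUDA-context cost.
```

Seconds-to-tens-of-seconds of startup latency is what makes scale-to-zero unusable for interactive traffic without a mitigation.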

The 2026-04-16 post names Cloudflare's in-progress mitigation:

"We're working on some big projects to be able to bring this to more customers, like customer-facing APIs and wrangler commands so that you can push your own containers, as well as faster cold starts through GPU snapshotting."

GPU snapshotting ≈ persisting the loaded-weights state of the GPU after setup() completes so subsequent container starts can "restore" into a warm-memory state without re-loading from disk. Not shipped; no latency numbers.
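The shape of the idea can be shown with a process-level analogy: cache the expensive setup() result so a restarted worker restores instead of re-loading. This is only an analogy with hypothetical names; real GPU snapshotting persists device memory, not a pickle file:

```python
import pickle
from pathlib import Path

SNAPSHOT = Path("snapshot.pkl")  # stand-in for persisted post-setup() state

def setup_with_snapshot(load_weights):
    """Pay the expensive load once; later starts take the restore path."""
    if SNAPSHOT.exists():
        return pickle.loads(SNAPSHOT.read_bytes())   # warm path: restore
    state = load_weights()                           # cold path: full load
    SNAPSHOT.write_bytes(pickle.dumps(state))        # snapshot after setup()
    return state
```

The hard parts the analogy hides are exactly the open questions: capturing live device memory consistently, and invalidating the snapshot when the container image changes.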

Trade-offs

  • Platform operator carries the GPU-pool economics. The customer builds the container; the platform reserves GPU capacity, handles cold-start cost, manages eviction on idle. Pricing model is platform-specific — Cloudflare hasn't disclosed BYO-model pricing beyond "dedicated instances for Enterprise customers" today.
  • Platform operator carries the compatibility bar. The container format evolves (new CUDA, new Python, new inference runtimes); old customer containers can bitrot. The platform has to either keep old kernels indefinitely or force rebuilds.
  • No provider-swap guarantees. Unlike catalog models where failover exploits same-model-across-providers, a BYO model has no equivalent outside the customer's own container. Reliability sits fully with the single container + the platform's scheduler.
  • Observability asymmetry. Custom containers expose platform-controllable metrics (latency, error rate) but the model's internal behaviour (token probabilities, tool-call success) is opaque to the platform unless the customer emits telemetry.
  • Security boundary. A customer-authored container runs arbitrary Python + CUDA code on platform GPUs. The platform has to sandbox network egress, filesystem access, and side-channel risks — same concerns as serverless containers but on shared GPU hardware where micro-architectural isolation is weaker.

Seen in

  • sources/2026-04-16-cloudflare-ai-platform-an-inference-layer-designed-for-agents — canonical wiki instance. Workers AI exposes BYO-model via Replicate Cog containers. Current scope: Enterprise customers with dedicated instances + an external design-partner cohort. Roadmap: customer-facing push APIs, wrangler CLI, GPU-snapshotting-based cold-start acceleration. Strategic frame: the Replicate team joined Cloudflare's AI Platform team ("we don't even consider ourselves separate teams anymore"), bringing their open-source Cog format in as the Workers AI BYO substrate.

Contrast with sibling patterns

  • Managed-model-file upload (platform accepts .safetensors or .gguf and handles serving): sidesteps the CUDA / Python-version complexity by controlling the serving runtime, at the cost of locking out custom inference code, custom tokenisers, or exotic model families.
  • Dedicated inference endpoint (platform reserves GPUs for a specific customer model, exposed as a private URL): the legacy "Enterprise only" shape that this pattern productises more broadly.
  • patterns/managed-sidecar / patterns/pluggable-component-architecture — the platform-absorbs-substrate-work framing is the same generic pattern as this; BYO-via-container specialises it to the GPU-serving case.