PATTERN Cited by 1 source
BYO model via container¶
Definition¶
BYO model via container is the pattern of an inference platform productising "bring your own custom or fine-tuned model" as a container-image push — the customer packages the model weights + inference code into a platform-defined container format, pushes it to the platform, and the platform deploys and serves it on managed GPUs, exposing it through the same catalog + binding + gateway as first-party and third-party-hosted models.
Why containers, not model files¶
A naive "upload your .safetensors + a Python script" BYO model surface is brittle:
- CUDA versions drift per platform.
- Python / PyTorch versions conflict with platform-managed inference runtimes.
- Custom kernels / ops (Triton, FlashAttention variants) need matching userland.
- Model I/O pre-processing (image decoding, tokeniser setup) has library dependencies.
- Warm-loading weights into GPU memory at container start requires controlled initialisation code.
A container captures the entire userland — CUDA, Python, requirements, inference code, weights — as one versioned, reproducible artefact. The platform's contract is a stable HTTP predict interface at the container's edge.
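That edge contract can be sketched concretely. A minimal sketch, assuming Cog's open-source HTTP shape (`POST /predictions` with an `{"input": …}` JSON body); the host, port, and example inputs are illustrative:

```typescript
// Sketch of the container-edge contract: Cog containers serve an HTTP
// predict interface (POST /predictions with an {"input": ...} JSON body).
// Host, port, and the example inputs are illustrative assumptions.
interface PredictionRequest {
  url: string;
  method: string;
  headers: Record<string, string>;
  body: string;
}

function buildPredictionRequest(inputs: Record<string, unknown>): PredictionRequest {
  return {
    url: "http://localhost:5000/predictions",
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ input: inputs }),
  };
}

// A client (or the platform's gateway) would fetch() this request and read
// the prediction output from the JSON response.
const req = buildPredictionRequest({ image: "https://example.com/photo.png", scale: 2 });
```

Everything behind that boundary (CUDA, kernels, tokenisers) is the customer's problem inside the image; everything in front of it is the platform's.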
Mechanism¶
The container format requires the customer to supply:
- A build manifest (`cog.yaml` in Replicate's case) declaring the Python version, OS + CUDA dependencies, and a `requirements.txt` of Python packages.
- A `Predictor` class with two methods:
  - `setup()` — called once when the container warms up; loads weights into memory.
  - `predict(...)` — called per inference request; inputs are typed via `cog.Input(...)`, outputs by Python type annotation.
`cog build` produces a Docker image. `cog push` (or the platform's native push command) ships it to the platform, where a managed fleet of GPU hosts can pull and run the image on demand.
From the 2026-04-16 Cloudflare post:
```yaml
# cog.yaml
build:
  python_version: "3.13"
  python_requirements: requirements.txt
predict: "predict.py:Predictor"
```

```python
# predict.py
from cog import BasePredictor, Path, Input
import torch

class Predictor(BasePredictor):
    def setup(self):
        # Runs once at container start: load weights into memory.
        self.net = torch.load("weights.pth")

    def predict(
        self,
        image: Path = Input(description="Image to enlarge"),
        scale: float = Input(description="Factor to scale image by", default=1.5),
    ) -> Path:
        output = self.net(image)
        return output
```
Then (future API): `cog build && wrangler ai push ./my-model`.
Integration with the unified catalog¶
Once the container is pushed and accepted, the BYO model appears in the platform's unified catalog alongside first-party and third-party models. Callers invoke it through the same unified binding:
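A sketch of that call from a Worker. The `env.AI.run(model, input)` shape follows the Workers AI binding; the model name `@acme/my-upscaler` and its inputs are hypothetical:

```typescript
// Hypothetical BYO model invoked through the same unified binding as
// first-party catalog models. The binding shape (env.AI.run) follows
// Workers AI; the model name and inputs are made up for illustration.
interface AiBinding {
  run(model: string, input: Record<string, unknown>): Promise<unknown>;
}

async function enlarge(env: { AI: AiBinding }, imageUrl: string): Promise<unknown> {
  // The pushed container is addressed like any other catalog model.
  return env.AI.run("@acme/my-upscaler", { image: imageUrl, scale: 2 });
}
```

In a real Worker, `env` is supplied by the runtime; a stub binding is enough to exercise the call shape locally.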
It also inherits the same gateway features: observability, per-request metadata, rate limiting, and (in principle) failover to a fallback model if the custom container is unavailable.
Cold-start problem¶
A major caveat of container-packaged BYO models is cold start. Loading multi-GB model weights from storage into GPU memory on every container start is prohibitive if the container scales to zero between requests.
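Rough arithmetic makes the point. The figures below are illustrative assumptions, not platform measurements:

```typescript
// Back-of-envelope cold-start cost: time to stream a checkpoint from
// storage into GPU memory at container start. All numbers illustrative.
function coldStartSeconds(weightsGiB: number, loadGiBPerSec: number): number {
  return weightsGiB / loadGiBPerSec;
}

// e.g. a 14 GiB checkpoint at 2 GiB/s effective load bandwidth:
const coldStart = coldStartSeconds(14, 2); // 7 seconds before the first request can be served
```

At that cost per start, scale-to-zero means every burst of traffic eats multi-second first-request latency, which is the cost the snapshotting work discussed next aims to cut.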
The 2026-04-16 post names Cloudflare's in-progress mitigation:
"We're working on some big projects to be able to bring this to more customers, like customer-facing APIs and wrangler commands so that you can push your own containers, as well as faster cold starts through GPU snapshotting."
GPU snapshotting ≈ persisting the loaded-weights state of the GPU after `setup()` completes, so subsequent container starts can "restore" into a warm-memory state without re-loading from disk. Not shipped; no latency numbers.
Trade-offs¶
- Platform operator carries the GPU-pool economics. The customer builds the container; the platform reserves GPU capacity, handles cold-start cost, manages eviction on idle. Pricing model is platform-specific — Cloudflare hasn't disclosed BYO-model pricing beyond "dedicated instances for Enterprise customers" today.
- Platform operator carries the compatibility bar. The container format evolves (new CUDA, new Python, new inference runtimes); old customer containers can bitrot. The platform has to either keep old kernels indefinitely or force rebuilds.
- No provider-swap guarantees. Unlike catalog models where failover exploits same-model-across-providers, a BYO model has no equivalent outside the customer's own container. Reliability sits fully with the single container + the platform's scheduler.
- Observability asymmetry. Custom containers expose platform-controllable metrics (latency, error rate) but the model's internal behaviour (token probabilities, tool-call success) is opaque to the platform unless the customer emits telemetry.
- Security boundary. A customer-authored container runs arbitrary Python + CUDA code on platform GPUs. The platform has to sandbox network egress, filesystem access, and side-channel risks — same concerns as serverless containers but on shared GPU hardware where micro-architectural isolation is weaker.
Seen in¶
- sources/2026-04-16-cloudflare-ai-platform-an-inference-layer-designed-for-agents — canonical wiki instance. Workers AI exposes BYO-model via Replicate Cog containers. Current scope: Enterprise customers with dedicated instances + an external design-partner cohort. Roadmap: customer-facing push APIs, a `wrangler` CLI, GPU-snapshotting-based cold-start acceleration. Strategic frame: the Replicate team joined Cloudflare's AI Platform team ("we don't even consider ourselves separate teams anymore"), bringing their open-source Cog format in as the Workers AI BYO substrate.
Contrast with sibling patterns¶
- Managed-model-file upload (platform accepts `.safetensors` or `.gguf` and handles serving): sidesteps the CUDA / Python-version complexity by controlling the serving runtime, at the cost of locking out custom inference code, custom tokenisers, or exotic model families.
- Dedicated inference endpoint (platform reserves GPUs for a specific customer model, exposed as a private URL): the legacy "Enterprise only" shape that this pattern productises more broadly.
- patterns/managed-sidecar / patterns/pluggable-component-architecture — these share the platform-absorbs-substrate-work framing; BYO-via-container specialises it to the GPU-serving case.
Related¶
- patterns/ai-gateway-provider-abstraction — how BYO models appear inside the gateway's catalog.
- patterns/unified-inference-binding — the call surface BYO models plug into.
- patterns/managed-sidecar — the generic "platform absorbs substrate" pattern family.
- patterns/pluggable-component-architecture — adjacent extensibility pattern.
- concepts/unified-model-catalog — the catalog property BYO models participate in.
- concepts/container-ephemerality — the generic substrate property BYO models inherit.
- concepts/cold-start — the core operational concern BYO containers have to solve for.
- systems/replicate-cog — the canonical container format.
- systems/workers-ai — the canonical BYO target platform.