
Replicate Cog

Definition

Cog (cog.run, github.com/replicate/cog) is Replicate's open-source ML-model containerisation format. A model author writes two files:

  • cog.yaml — declarative build spec: Python version, system (OS + CUDA) dependencies, a Python requirements file, and the entry point predict.py:Predictor.
  • predict.py — a Python class subclassing cog.BasePredictor with a setup() method (loads weights into memory once, at container warm-up) and a predict(...) method (runs a single prediction; typed inputs via cog.Input, typed output via the return annotation).

Running cog build produces a Docker image containing the model, weights, and inference code — ready to serve via HTTP against the Cog-defined predict signature.

From the 2026-04-16 Cloudflare post's minimal example:

# cog.yaml
build:
  python_version: "3.13"
  python_requirements: requirements.txt
predict: "predict.py:Predictor"

# predict.py
from cog import BasePredictor, Path, Input
import torch

class Predictor(BasePredictor):
    def setup(self):
        """Load the model into memory to make running multiple predictions efficient"""
        self.net = torch.load("weights.pth")

    def predict(self,
            image: Path = Input(description="Image to enlarge"),
            scale: float = Input(description="Factor to scale image by", default=1.5)
    ) -> Path:
        """Run a single prediction on the model"""
        # ... pre-processing ...
        output = self.net(input)
        # ... post-processing ...
        return output
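A sketch of the local loop this enables (the upscaler image tag and the input filenames here are illustrative, not from the post; the commands and the /predictions HTTP interface are standard Cog):

```shell
# Build a Docker image from cog.yaml + predict.py in this directory
cog build -t upscaler

# One-off local prediction through the Cog CLI, no HTTP client needed
cog predict -i image=@input.jpg -i scale=2.0

# Or run the container and hit the same predict signature over HTTP
docker run -d -p 5000:5000 upscaler
curl -s http://localhost:5000/predictions -X POST \
  -H "Content-Type: application/json" \
  -d '{"input": {"image": "https://example.com/photo.jpg", "scale": 2.0}}'
```

The JSON keys under "input" mirror the predict(...) signature, which is the sense in which the image is "ready to serve via HTTP against the Cog-defined predict signature."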

Why it matters — Cog as the Workers AI BYO-model substrate

The 2026-04-16 Cloudflare post announces Cog as the packaging format for bringing a custom or fine-tuned model to Workers AI. Post framing:

  • "Cog is designed to be quite simple: all you need to do is write down dependencies in a cog.yaml file, and your inference code in a Python file. Cog abstracts away all the hard things about packaging ML models, such as CUDA dependencies, Python versions, weight loading, etc."
  • "Then, you can run cog build to build your container image, and push your Cog container to Workers AI. We will deploy and serve the model for you, which you then access through your usual Workers AI APIs."

The workflow Cloudflare is productising:

  1. Customer writes cog.yaml + predict.py.
  2. cog build produces a container.
  3. Push the container to Workers AI ("customer-facing APIs and wrangler commands" are on the roadmap, not shipped yet).
  4. Cloudflare deploys and serves on its GPU fleet.
  5. The model appears on the customer's AI Gateway catalog alongside @cf/… models and third-party providers, callable through the same env.AI.run(...) binding.
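If the binding parity in step 5 holds, invocation should look like today's Workers AI REST calls. A hedged sketch, assuming the existing endpoint shape; the account ID, API token, model slug, and request body for a pushed Cog container are all placeholders, since Cloudflare has not published the naming scheme:

```shell
# Same Workers AI run endpoint used for @cf/… catalog models today.
# "@cf/ACCOUNT/MODEL" is hypothetical; the slug format for pushed Cog
# containers has not been announced. Body assumed to mirror the model's
# Cog input schema.
curl -s "https://api.cloudflare.com/client/v4/accounts/$ACCOUNT_ID/ai/run/@cf/ACCOUNT/MODEL" \
  -H "Authorization: Bearer $API_TOKEN" \
  -d '{"image": "https://example.com/photo.jpg", "scale": 2.0}'
```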

Currently Enterprise + design-partner access only: "The overwhelming majority of our traffic comes from dedicated instances for Enterprise customers who are running custom models on our platform." Post: "We've been testing this internally with Cloudflare teams and some external customers who are guiding our vision. If you're interested in being a design partner with us, please reach out! Soon, anyone will be able to package their model and use it through Workers AI."

Strategic context — Replicate is now the AI Platform team

The 2026-04-16 post notes: "The Replicate team has officially joined our AI Platform team, so much so that we don't even consider ourselves separate teams anymore. We've been hard at work on integrations between Replicate and Cloudflare, which include bringing all the Replicate models onto AI Gateway and replatforming the hosted models onto Cloudflare infrastructure. Soon, you'll be able to access the models you loved on Replicate through AI Gateway, and host the models you deployed on Replicate on Workers AI as well."

Cog is not a new technology — it has been Replicate's open-source packaging format for years — but this post repositions it as the Workers AI BYO-model substrate, which explains the catalog's expansion from text-LLM-dominated to multimodal (image, video, speech) in the same announcement.

Roadmap

  • Customer-facing APIs to push Cog containers (not shipped).
  • wrangler CLI commands to push from a dev workstation (not shipped).
  • Faster cold starts via GPU snapshotting — the post names GPU snapshotting as a work-in-progress technique: restoring a container whose weights are already resident in GPU memory should beat the load-weights-then-predict path a freshly scheduled Cog container must otherwise take. (Not shipped; no cold-start latency numbers disclosed.)
