PATTERN Cited by 1 source

Model registry and object store as hybrid glue

Problem

In a hybrid ML platform, compute (training) and serving (inference) run on different substrates. You need a way for training jobs on the compute side to produce artifacts that serving deployments on the serving side consume, without coupling the two stacks at the runtime or API level.

Pattern

Use four primitives as the entire cross-stack integration surface:

  1. Object store (S3) — the substrate for the model binaries themselves. Training writes to a well-known prefix; serving reads from it. The object store is cheap, durable, and universally accessible.
  2. Model registry — a service that tracks artifact lineage (what model, what version, what training job produced it, where in S3, what metadata). Serving looks up "latest production" in the registry and pulls the S3 object.
  3. Container registry (ECR) — Docker images flow from CI/CD into both platforms. The same image can run on compute and on serving (see patterns/cross-platform-base-image).
  4. Event bus (EventBridge + SQS) — job-state events from the compute side propagate to consumers that need them, without the consumers polling the compute platform's API.
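
The serving-side half of this surface can be sketched as a registry lookup followed by an S3 pull. The registry record shape used here (`model_name`, `stage`, `version`, `artifact_uri`) is a hypothetical illustration, not Lyft's actual schema; any registry that maps (model, stage) to an S3 URI fits the pattern.

```python
"""Serving-side artifact pull: registry lookup -> S3 download.

The record fields below are an assumed example shape; the point is
that serving touches only the registry and S3, never the training
platform's API.
"""
from urllib.parse import urlparse


def parse_s3_uri(uri: str) -> tuple[str, str]:
    """Split s3://bucket/prefix/key into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


def resolve_current(records: list[dict], model_name: str) -> dict:
    """Pick the highest production version of a model from registry records."""
    candidates = [
        r for r in records
        if r["model_name"] == model_name and r["stage"] == "production"
    ]
    if not candidates:
        raise LookupError(f"no production version of {model_name}")
    return max(candidates, key=lambda r: r["version"])


def pull_artifact(record: dict, dest_path: str) -> None:
    """Download the binary the registry points at (needs boto3 + AWS creds)."""
    import boto3

    bucket, key = parse_s3_uri(record["artifact_uri"])
    boto3.client("s3").download_file(bucket, key, dest_path)
```

Note what is absent: no call into SageMaker or any compute-side service. The training job's only obligation is to write the binary to the well-known prefix and register it.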

That's the entire cross-stack surface. No RPC, no shared database (except job metadata), no runtime coupling.

Why this is enough

Decoupled stacks only need to exchange:

  • Artifacts (models, images) → object store + container registry.
  • Metadata about artifacts (lineage, versions, status) → model registry.
  • Events about state changes (job finished, model approved) → event bus.

Notably not in the pattern:

  • Direct API calls between stacks.
  • Shared services (databases, queues, caches) other than the listed primitives.
  • Synchronous integration — everything is async / pull-based.

Lyft / LyftLearn 2.0

Canonical wiki instance. "Integration happens through the Model Registry and S3. Training jobs in SageMaker generate model binaries and save them to S3. The Model Registry tracks these artifacts, and model serving services pull them for deployment. Docker images flow from CI/CD through ECR to both platforms. The LyftLearn database maintains job metadata and model configurations across both stacks." (Source: sources/2025-11-18-lyft-lyftlearn-evolution-rethinking-ml-platform-architecture)

Plus EventBridge + SQS for job-state events (replacing the old K8s-era background watchers). The LyftLearn database is the one shared relational store, scoped tightly to job and model metadata.

Trade-offs

  • + Narrow surface — reasoning about cross-stack coupling is bounded to four primitives.
  • + Naturally async — compute bursts don't backpressure serving; serving outages don't stop training.
  • + Works across provider / substrate boundaries — same pattern works whether compute is SageMaker, Databricks, or on-prem Kubernetes.
  • − Eventual consistency at the glue layer — new model artifact isn't immediately available to serving; serving has to poll or consume an event.
  • − Model registry becomes load-bearing — if the registry is wrong about which artifact is current, serving deploys the wrong model.
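
The first trade-off implies a reconciliation loop on the serving side. A minimal sketch of one tick of that loop, with the registry and deploy calls injected as hypothetical callables (none of these names come from the source):

```python
"""One reconciliation tick for the serving side.

Because the glue layer is eventually consistent, serving periodically
compares the version it is running against the registry's current
production version and redeploys on any mismatch.
"""
from typing import Callable, Optional


def reconcile_once(
    get_registry_version: Callable[[], int],
    get_deployed_version: Callable[[], Optional[int]],
    deploy: Callable[[int], None],
) -> bool:
    """Redeploy if registry and serving disagree; return True if we deployed."""
    current = get_registry_version()
    if get_deployed_version() != current:
        deploy(current)  # pull the artifact from S3 and swap it in
        return True
    return False
```

Run this on a timer or trigger it from the event bus; the decision logic is the same either way. The exact-match comparison (rather than ">") means serving follows the registry even through rollbacks — which is also exactly why the registry is load-bearing.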

Seen in

  • Lyft / LyftLearn 2.0 (canonical instance, above).