
PATTERN

Decoupled compute and serving stacks

Problem

An ML platform runs two fundamentally different workload classes — compute (training, batch inference, hyperparameter optimisation, interactive notebooks) and serving (real-time inference behind latency-sensitive endpoints). Running them on one substrate forces a compromise: either pay for idle capacity to keep compute fast, or suffer serving cold-starts that blow p99 budgets.

Pattern

Split the platform into two purpose-built stacks, each optimised for its own workload shape:

  • Compute stack on a managed serverless substrate (e.g. SageMaker) — on-demand provisioning, scale-to-zero economics, per-job billing, no idle cost. Cold start is the price; mitigate it with warm pools and lazy image loading for the most latency-sensitive jobs (see the sketch after this list).
  • Serving stack on a long-lived container platform (e.g. EKS / Kubernetes) — warm pods, horizontal pod autoscalers (HPAs), pod disruption budgets (PDBs), service meshes, fine-grained ops control, per-team prediction services.
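
On the compute side, per-job billing and warm pools fall directly out of the job API. A minimal sketch with boto3, assuming an ECR image, an execution role, and S3 paths already exist (all names below are illustrative, not from the source):

```python
import boto3

sm = boto3.client("sagemaker")

# One training job == one on-demand provisioning event, billed per job.
# No cluster exists before this call; nothing sits idle after it finishes.
sm.create_training_job(
    TrainingJobName="churn-model-2025-11-18-01",           # illustrative name
    AlgorithmSpecification={
        "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/train:latest",
        "TrainingInputMode": "File",
    },
    RoleArn="arn:aws:iam::123456789012:role/ml-training",  # illustrative role
    OutputDataConfig={"S3OutputPath": "s3://ml-train-artifacts/churn-model/"},
    ResourceConfig={
        "InstanceType": "ml.m5.xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 50,
        # SageMaker managed warm pool: keep the instance alive briefly after
        # the job so the next job skips the cold start.
        "KeepAlivePeriodInSeconds": 600,
    },
    StoppingCondition={"MaxRuntimeInSeconds": 3600},
)
```

KeepAlivePeriodInSeconds is SageMaker's managed warm-pool knob: it trades a small, bounded idle window for faster repeat starts, which is the mitigation the bullet above refers to.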

The stacks are fully decoupled at the runtime level. They integrate only through a narrow set of cross-stack seams — see patterns/model-registry-and-object-store-as-hybrid-glue for the glue.

When it's the right shape

This pattern is appropriate when:

  • The compute workloads are bursty (idle most of the time, with occasional large bursts of training / batch jobs) so that serverless on-demand economics meaningfully beat always-on clusters.
  • The serving workloads are sustained and latency-sensitive so that warm K8s primitives are worth the idle capacity cost.
  • The team organisation makes per-team serving services (each with custom prediction handlers) natural — Kubernetes's per-service primitives fit.

Lyft / LyftLearn 2.0

Canonical wiki instance. LyftLearn 2.0 split into LyftLearn Compute on SageMaker and LyftLearn Serving on EKS:

  • Compute — SageMaker Manager Service orchestrates training, batch inference, HPO, and JupyterLab notebooks via the AWS SDK. EventBridge + SQS carry job-state events (replacing background watchers; see the consumer sketch after this list). On-demand provisioning eliminates idle K8s capacity cost.
  • Serving — EKS clusters host dozens of team-owned model-serving services (pricing, fraud, dispatch, ETA, etc.), each with custom prediction handlers. A Model Registry Service coordinates deployments across services.
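
The EventBridge + SQS seam replaces polling watchers with push-based job-state delivery. A sketch of a consumer, assuming an EventBridge rule already forwards SageMaker training-job state changes to a queue (the queue URL and the metadata-store write are hypothetical, not Lyft's actual code):

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/ml-job-events"  # hypothetical

def handle(detail: dict) -> None:
    # Stand-in for writing status into the platform's job-metadata store.
    print(detail["TrainingJobName"], "->", detail["TrainingJobStatus"])

while True:
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
    )
    for msg in resp.get("Messages", []):
        event = json.loads(msg["Body"])  # EventBridge envelope, forwarded verbatim
        if event.get("detail-type") == "SageMaker Training Job State Change":
            handle(event["detail"])
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```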

Integration happens only through artifact flow: S3 for model binaries, Model Registry for lineage, ECR for Docker images that flow to both stacks, and the LyftLearn database for job + model metadata. "Each LyftLearn product operates independently while maintaining seamless end-to-end ML workflows." (Source: sources/2025-11-18-lyft-lyftlearn-evolution-rethinking-ml-platform-architecture)
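
A hedged sketch of that artifact seam: the compute stack writes a model to S3, and a small promotion step copies it to the serving stack's bucket and records lineage in the registry. The bucket layout and registry endpoint are assumptions for illustration, not Lyft's actual API:

```python
import boto3
import requests

s3 = boto3.client("s3")

def promote(job_name: str, version: str) -> None:
    # The compute stack wrote the artifact here (assumed layout).
    src = {"Bucket": "ml-train-artifacts", "Key": f"{job_name}/output/model.tar.gz"}
    dst_key = f"models/{job_name}/{version}/model.tar.gz"

    # The only runtime coupling between the two stacks is this object flow.
    s3.copy_object(CopySource=src, Bucket="ml-serving-models", Key=dst_key)

    # Record lineage so serving services can resolve the current version.
    # Hypothetical endpoint; the Model Registry Service API is not public.
    requests.post(
        "https://model-registry.internal/api/v1/models",
        json={
            "name": job_name,
            "version": version,
            "artifact": f"s3://ml-serving-models/{dst_key}",
            "source_job": job_name,
        },
        timeout=10,
    ).raise_for_status()
```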

Trade-offs

  • + Each stack optimised for its workload; no compromise on idle cost vs. warm-pod latency.
  • + Independent scaling, deployment, and operation of the two stacks.
  • − Two operational surfaces to maintain instead of one.
  • − Model-registry + object-store glue is load-bearing; if that seam breaks, end-to-end workflows stop.
  • − Cross-stack networking can surface issues that single-stack architectures never hit (e.g. cross-cluster Spark connectivity at Lyft).

Seen in

  • Lyft / LyftLearn 2.0 (detailed above)
