

LyftLearn

LyftLearn is Lyft's internal, end-to-end ML platform: the substrate on which hundreds of Lyft engineers train, evaluate, register, and serve the models that power pricing, fraud, dispatch, ETA, and other business-critical use cases. Originally built on Kubernetes as a single unified platform, it was re-architected in the 2024–2025 window into a hybrid, two-halves design: LyftLearn Compute on SageMaker and LyftLearn Serving on EKS, integrated via the Model Registry and S3.

Architecture (LyftLearn 2.0)

Two purpose-built stacks, fully decoupled at the runtime level, glued together by model artifacts and an event bus:

  • LyftLearn Compute — SageMaker. Training, batch processing, HPO, JupyterLab notebooks. Orchestrated by the in-house SageMaker Manager Service via the AWS SDK; state events flow through EventBridge + SQS.
  • LyftLearn Serving — EKS / K8s. Dozens of team-owned model-serving services (pricing, fraud, dispatch, ETA, ...). Model Registry Service coordinates deployments. Documented in detail in Lyft's 2023 post on this serving tier.
  • Integration — model binaries flow through S3; the Model Registry tracks artifact lineage; Docker images flow through ECR into both platforms. The LyftLearn database holds job metadata and model configs across stacks.
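The integration layer above can be sketched as a registry record that ties a trained artifact to both stacks. This is a hypothetical, in-memory illustration: the class names, fields, and URIs are assumptions for exposition, not Lyft's actual (non-public) schema.

```python
from dataclasses import dataclass, field

@dataclass
class ModelRecord:
    """One registry entry linking the two stacks (illustrative fields)."""
    model_id: str
    version: int
    s3_artifact_uri: str    # binary produced by LyftLearn Compute (SageMaker)
    ecr_image_uri: str      # Docker image shared by both platforms via ECR
    training_job_arn: str   # lineage back to the SageMaker training job
    metadata: dict = field(default_factory=dict)

class ModelRegistry:
    """In-memory stand-in for the Model Registry Service."""

    def __init__(self) -> None:
        self._records: dict[tuple[str, int], ModelRecord] = {}

    def register(self, record: ModelRecord) -> None:
        self._records[(record.model_id, record.version)] = record

    def latest(self, model_id: str) -> ModelRecord:
        versions = [v for (m, v) in self._records if m == model_id]
        return self._records[(model_id, max(versions))]

# A serving pod on EKS would resolve which artifact to pull from S3:
registry = ModelRegistry()
registry.register(ModelRecord(
    model_id="eta-v2", version=3,
    s3_artifact_uri="s3://lyftlearn-models/eta-v2/3/model.tar.gz",
    ecr_image_uri="1234.dkr.ecr.us-east-1.amazonaws.com/lyftlearn:py310",
    training_job_arn="arn:aws:sagemaker:us-east-1:1234:training-job/eta-v2-3",
))
print(registry.latest("eta-v2").s3_artifact_uri)
```

The point of the pattern: neither stack calls the other directly; serving only ever reads immutable S3 artifacts that the registry resolves by (model, version).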

What changed in the 2.0 re-architecture

The 1.0 design ran everything on Kubernetes (training, batch, notebooks, serving) with a large idle capacity overhead to keep startup fast. The 2.0 split moved the compute half onto a managed serverless substrate (SageMaker) to capture on-demand economics while keeping the stateful, latency-sensitive serving half on K8s. See concepts/hybrid-ml-platform-architecture for why this split makes sense for ML workloads specifically, and patterns/decoupled-compute-and-serving-stacks for the general pattern.
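The on-demand economics can be made concrete: in 2.0, compute capacity exists only for the lifetime of a job request, rather than as idle warm pods. A hedged sketch of the request a manager service might build, loosely following the shape of SageMaker's CreateTrainingJob API; all field values are illustrative.

```python
def build_training_job_request(job_name: str, image_uri: str,
                               instance_type: str) -> dict:
    """Assemble an on-demand training job request (illustrative values).

    Instances are provisioned per-job and released when the job ends,
    which is the cost model the 2.0 split was designed to capture.
    """
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,     # same ECR image as the K8s side
            "TrainingInputMode": "File",
        },
        "ResourceConfig": {
            "InstanceType": instance_type,  # chosen per-job, not pre-warmed
            "InstanceCount": 1,
            "VolumeSizeInGB": 50,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
        "OutputDataConfig": {"S3OutputPath": "s3://lyftlearn-models/output/"},
    }

req = build_training_job_request(
    "eta-train-001",
    "1234.dkr.ecr.us-east-1.amazonaws.com/lyftlearn:py310",
    "ml.m5.xlarge",
)
# a manager service would then submit it, e.g. via boto3:
# sagemaker_client.create_training_job(**req)
```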

Zero-code-change constraint

The defining engineering constraint of the migration was that no ML code could change: training scripts, preprocessing, and inference code all had to keep working identically. This pushed all migration complexity into the platform, via the cross-platform base image and the runtime-fetched credentials and config patterns, and made environmental parity the primary migration metric (Source: sources/2025-11-18-lyft-lyftlearn-evolution-rethinking-ml-platform-architecture).
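The runtime-fetched config pattern can be sketched as a thin resolution layer that user code never branches on. A minimal sketch, assuming the base image exposes the same config keys on both substrates; `SM_MODEL_DIR` and `SM_CHANNEL_TRAINING` are real environment variables that SageMaker training containers inject, while the Kubernetes-side variable names and default paths here are illustrative assumptions.

```python
import os

def resolve_runtime_config() -> dict:
    """Return identical config keys regardless of substrate.

    On SageMaker, the platform injects SM_* environment variables; on
    Kubernetes, the cross-platform base image sets equivalents
    (hypothetical names here), so the user's ML code is unchanged.
    """
    if "SM_MODEL_DIR" in os.environ:  # running on SageMaker
        return {
            "model_dir": os.environ["SM_MODEL_DIR"],
            "data_dir": os.environ.get(
                "SM_CHANNEL_TRAINING", "/opt/ml/input/data/training"
            ),
        }
    # running on Kubernetes (illustrative 1.0-era defaults)
    return {
        "model_dir": os.environ.get("MODEL_DIR", "/mnt/models"),
        "data_dir": os.environ.get("DATA_DIR", "/mnt/data"),
    }

cfg = resolve_runtime_config()
# training scripts only ever see cfg["model_dir"] / cfg["data_dir"],
# which is what makes environmental parity measurable as a metric
```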
