LyftLearn¶
LyftLearn is Lyft's internal, end-to-end ML platform: the substrate on which hundreds of Lyft engineers train, evaluate, register, and serve the models that power pricing, fraud, dispatch, ETA, and other business-critical use cases. Originally built on Kubernetes as a single unified platform, it was re-architected in 2024–2025 into a hybrid, two-part design — LyftLearn Compute on SageMaker and LyftLearn Serving on EKS — integrated via the Model Registry and S3.
Architecture (LyftLearn 2.0)¶
Two purpose-built stacks, fully decoupled at the runtime level, glued together by model artifacts and an event bus:
- LyftLearn Compute — SageMaker. Training, batch processing, HPO, JupyterLab notebooks. Orchestrated by the in-house SageMaker Manager Service via AWS SDK; state events via EventBridge + SQS.
- LyftLearn Serving — EKS / K8s. Dozens of team-owned model-serving services (pricing, fraud, dispatch, ETA, ...). Model Registry Service coordinates deployments. Documented in detail in Lyft's 2023 post on this serving tier.
- Integration — model binaries flow through S3; the Model Registry tracks artifact lineage; Docker images flow through ECR into both platforms. The LyftLearn database holds job metadata and model configs across stacks.
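The integration contract above can be sketched as a plain data flow: the Compute side launches a training job whose artifact lands in S3, and the registry records lineage back to the job and image. A minimal sketch follows; the record schema, the `build_training_job_request` helper, and all names (bucket, role ARN, image URI) are illustrative assumptions, not Lyft's actual implementation — only the request shape mirrors the real SageMaker `CreateTrainingJob` API.

```python
from dataclasses import dataclass


@dataclass
class ModelRecord:
    """Hypothetical Model Registry entry tracking artifact lineage
    (illustrative fields, not Lyft's actual schema)."""
    model_name: str
    version: int
    s3_artifact_uri: str    # model binary written by the Compute side
    ecr_image_uri: str      # Docker image shared by both platforms
    training_job_name: str  # lineage back to the SageMaker job


def build_training_job_request(model_name: str, version: int,
                               image: str, bucket: str) -> dict:
    """Shape of the CreateTrainingJob request a manager service could
    issue via the AWS SDK (boto3 `sagemaker.create_training_job`)."""
    job_name = f"{model_name}-v{version}"
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {"TrainingImage": image,
                                   "TrainingInputMode": "File"},
        # Artifacts land in S3, where the Serving (EKS) side picks them up.
        "OutputDataConfig": {"S3OutputPath": f"s3://{bucket}/{job_name}/"},
        "ResourceConfig": {"InstanceType": "ml.m5.xlarge",
                           "InstanceCount": 1, "VolumeSizeInGB": 50},
        "StoppingCondition": {"MaxRuntimeInSeconds": 86400},
        "RoleArn": "arn:aws:iam::123456789012:role/lyftlearn-compute",  # placeholder
    }


# Compose the request, then register the resulting artifact with lineage.
req = build_training_job_request(
    "eta-model", 7,
    "123456789012.dkr.ecr.us-east-1.amazonaws.com/eta:latest",
    "lyftlearn-artifacts")
record = ModelRecord(
    model_name="eta-model",
    version=7,
    s3_artifact_uri=req["OutputDataConfig"]["S3OutputPath"] + "model.tar.gz",
    ecr_image_uri=req["AlgorithmSpecification"]["TrainingImage"],
    training_job_name=req["TrainingJobName"])
```

The point of the sketch is the decoupling: Serving never talks to SageMaker directly — it only needs the registry record and the S3 URI.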
What changed in the 2.0 re-architecture¶
The 1.0 design ran everything on Kubernetes (training, batch, notebooks, serving) with a large idle capacity overhead to keep startup fast. The 2.0 split moved the compute half onto a managed serverless substrate (SageMaker) to capture on-demand economics while keeping the stateful, latency-sensitive serving half on K8s. See concepts/hybrid-ml-platform-architecture for why this split makes sense for ML workloads specifically, and patterns/decoupled-compute-and-serving-stacks for the general pattern.
Zero-code-change constraint¶
The defining engineering constraint of the migration was that no ML code could change — training scripts, preprocessing, and inference code all had to keep working identically. This pushed all migration complexity into the platform via the cross-platform base image and runtime-fetched credentials and config patterns, and made environmental parity the primary migration metric (Source: sources/2025-11-18-lyft-lyftlearn-evolution-rethinking-ml-platform-architecture).
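One way to honor a zero-code-change constraint is to resolve all platform differences at container startup: the same image detects where it is running and fetches config accordingly, so user code never branches. The sketch below is a hypothetical helper under assumed conventions (SageMaker's `/opt/ml` mount and the `SM_TRAINING_ENV` variable set by the SageMaker training toolkit; the `KUBERNETES_SERVICE_HOST` variable Kubernetes injects into pods); it is not Lyft's actual implementation.

```python
import os


def detect_platform() -> str:
    """Detect the runtime substrate from standard environment markers.
    Assumption: SageMaker training containers mount /opt/ml/input and the
    training toolkit sets SM_TRAINING_ENV; Kubernetes injects
    KUBERNETES_SERVICE_HOST into every pod."""
    if os.environ.get("SM_TRAINING_ENV") or os.path.isdir("/opt/ml/input"):
        return "sagemaker"
    if os.environ.get("KUBERNETES_SERVICE_HOST"):
        return "kubernetes"
    return "local"


def load_runtime_config() -> dict:
    """Resolve config location for the detected platform at startup, so
    ML code downstream reads one uniform config regardless of substrate
    (hypothetical helper, illustrative paths)."""
    platform = detect_platform()
    if platform == "sagemaker":
        # SageMaker convention: job config is mounted under /opt/ml.
        return {"platform": platform, "config_path": "/opt/ml/input/config"}
    if platform == "kubernetes":
        # K8s convention: config arrives via mounted Secret/ConfigMap volumes.
        return {"platform": platform, "config_path": "/etc/lyftlearn/config"}
    return {"platform": platform, "config_path": "./config"}
```

Because the detection lives in the base image's entrypoint rather than in user code, the same training script runs unmodified on both halves of the platform — which is exactly what the parity metric measures.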
Seen in¶
- sources/2025-11-18-lyft-lyftlearn-evolution-rethinking-ml-platform-architecture — the 2.0 re-architecture post (canonical entry for this page).