title: Lyft — LyftLearn Evolution: Rethinking ML Platform Architecture type: source created: 2026-04-22 updated: 2026-04-22 tier: 2 company: lyft published: 2025-11-18 url: https://eng.lyft.com/lyftlearn-evolution-rethinking-ml-platform-architecture-547de6c950e1?source=rss----25cd379abb8---4 raw: raw/lyft/2025-11-18-lyftlearn-evolution-rethinking-ml-platform-architecture-fe6c6d4a.md tags: [ml-platform, lyft, lyftlearn, sagemaker, kubernetes, eks, hybrid-architecture, migration, serverless, spark, jupyterlab, soci, warm-pools, container-entrypoint, compatibility-layer, cross-cluster-networking, confidant, statsd, s3, ecr, eventbridge, sqs, base-image, docker, gpu, deep-learning, model-serving, model-registry] systems: [lyftlearn, lyftlearn-serving, lyftlearn-compute, aws-sagemaker-ai, aws-sagemaker-studio, kubernetes, aws-eks, apache-spark, amazon-ecr, amazon-eventbridge, aws-sqs, aws-s3, amazon-soci, jupyterlab, confidant, statsd] concepts: [hybrid-ml-platform-architecture, zero-code-change-migration, serverless-compute, cold-start, environmental-parity, runtime-environment-detection, cross-cluster-networking, container-entrypoint-compat-layer, lazy-container-image-loading, warm-pool-instances] patterns: [zero-code-change-platform-migration, cross-platform-base-image, runtime-fetched-credentials-and-config, warm-pool-zero-create-path, decoupled-compute-and-serving-stacks, model-registry-and-object-store-as-hybrid-glue] related: [companies/lyft, systems/lyftlearn, systems/lyftlearn-serving, systems/lyftlearn-compute, systems/aws-sagemaker-ai, systems/aws-sagemaker-studio, systems/kubernetes, systems/aws-eks, systems/apache-spark, systems/amazon-eventbridge, systems/aws-sqs, systems/aws-s3, systems/amazon-ecr, systems/amazon-soci, systems/jupyterlab, systems/confidant, systems/statsd, concepts/hybrid-ml-platform-architecture, concepts/zero-code-change-migration, concepts/environmental-parity, concepts/container-entrypoint-compat-layer, 
concepts/cross-cluster-networking, concepts/lazy-container-image-loading, concepts/warm-pool-instances, concepts/serverless-compute, concepts/cold-start, patterns/zero-code-change-platform-migration, patterns/cross-platform-base-image, patterns/runtime-fetched-credentials-and-config, patterns/warm-pool-zero-create-path, patterns/decoupled-compute-and-serving-stacks, patterns/model-registry-and-object-store-as-hybrid-glue]
Lyft — LyftLearn Evolution: Rethinking ML Platform Architecture¶
Summary¶
Lyft's ML Platform team describes LyftLearn 2.0, a re-architecture of their internal ML platform that splits compute from serving: training, batch processing, hyperparameter optimisation and JupyterLab notebooks all move from the in-house Kubernetes-based LyftLearn to SageMaker (the new LyftLearn Compute), while real-time model serving stays on EKS/Kubernetes (the existing LyftLearn Serving). The hard constraint for the migration was zero ML-code changes for users (hundreds of engineers across dozens of teams). That constraint forced the platform team to build an extensive compatibility layer inside cross-platform Docker base images, replicating the Kubernetes environment (credentials, environment variables, metrics, hyperparameters) on top of SageMaker, and to work with AWS on networking changes so that interactive Spark in SageMaker Studio can talk back to executors on EKS. The post is a concrete case study in environmental parity as a migration strategy: keep the execution engine swappable while keeping user workflows identical.
Key takeaways¶
- Split-stack ML platform: SageMaker for compute, EKS/K8s for serving. LyftLearn 2.0's two halves — LyftLearn Compute on SageMaker (training, batch, HPO, notebooks) and LyftLearn Serving on EKS (real-time inference across dozens of team-owned services) — are fully decoupled and integrate only through the Model Registry and S3. This is the archetype of the decoupled compute and serving stacks pattern at ride-sharing scale (Source: sources/2025-11-18-lyft-lyftlearn-evolution-rethinking-ml-platform-architecture).
- Zero-code-change was the non-negotiable migration constraint. "Forcing hundreds of users across dozens of teams to rewrite their business-critical ML workflows was not an option" — the burden of compatibility lived entirely in the platform. No modifications to training logic, data preprocessing, or inference code were allowed. This re-framed a cloud migration into a systems-engineering environmental-parity problem, and is the canonical wiki instance of zero-code-change platform migration.
- Compatibility layer lives in the container entrypoint. Kubernetes primitives that SageMaker doesn't expose were re-synthesised at container startup inside Lyft's cross-platform base images: credentials pulled from Confidant and exposed in the exact form K8s injected via webhooks; additional environment variables fetched at runtime because SageMaker's API caps them; StatsD metrics redirected from absent sidecars to a direct connection to the metrics aggregation gateway; hyperparameters uploaded to S3 before each job and downloaded to SageMaker's standard input path, sidestepping SageMaker's API size limit that made ConfigMap-style direct passing infeasible. This is the runtime-fetched credentials and config pattern.
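The entrypoint flow above can be sketched as follows. This is a minimal illustration, not Lyft's implementation: the function names and the `metrics-gateway` host are hypothetical, the fetchers are injectable callables standing in for the real Confidant/S3 clients, and only the `/opt/ml/input/config/hyperparameters.json` path is SageMaker's actual standard input location.

```python
import json
import os

def materialise_environment(secret_fetcher, config_fetcher, env=None):
    """Recreate the env that K8s webhooks and sidecars used to provide.

    secret_fetcher / config_fetcher are callables returning dicts, so
    real Confidant / S3 clients can be swapped for fakes in tests.
    """
    env = dict(env or {})
    # Secrets a K8s mutating webhook previously injected as env vars.
    env.update(secret_fetcher())
    # Extra env vars fetched at runtime because SageMaker's job API
    # caps how much environment can be passed directly.
    env.update(config_fetcher())
    # No StatsD sidecar on SageMaker: point the client straight at the
    # metrics aggregation gateway instead of localhost.
    env.setdefault("STATSD_HOST", "metrics-gateway.internal")
    env.setdefault("STATSD_PORT", "8125")
    return env

def load_hyperparameters(path="/opt/ml/input/config/hyperparameters.json"):
    """Hyperparams are staged in S3 per job; SageMaker downloads them to
    this standard input path, sidestepping the API size limit."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)
```

The injectable fetchers are the key design choice: they keep the parity logic testable off-cloud while the real entrypoint wires in Confidant and S3.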
- One base image across three execution contexts. The same Docker image trains the model (SageMaker Jobs), runs in SageMaker Studio notebooks, and serves the model on K8s — detecting its execution environment at runtime and adapting env vars, users, permissions, and Spark configuration accordingly. This guarantees training-to-serving consistency and is the cross-platform base image pattern. Three variants shipped: LyftLearn (traditional ML), LyftLearn Distributed (adds Spark wrappers/executors/JARs), and LyftLearn DL (adds GPU + deep-learning libraries).
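One plausible detection scheme for the three contexts is sketched below; the post does not name the signals Lyft actually checks. `KUBERNETES_SERVICE_HOST` is injected by Kubernetes itself and `TRAINING_JOB_NAME`/`SM_TRAINING_ENV` come from SageMaker's training toolkit, but the Studio marker and all per-context values here are purely illustrative.

```python
import os

def detect_context(env=None):
    """Classify the runtime into one of the three execution contexts."""
    env = env if env is not None else os.environ
    if "KUBERNETES_SERVICE_HOST" in env:
        return "k8s-serving"
    if "SM_TRAINING_ENV" in env or "TRAINING_JOB_NAME" in env:
        return "sagemaker-job"
    if "SAGEMAKER_INTERNAL_IMAGE_URI" in env:  # illustrative Studio marker
        return "studio-notebook"
    return "unknown"

def adapt(context):
    """Per-context knobs the entrypoint would tune (users, permissions,
    Spark configuration); every value below is a placeholder."""
    return {
        "k8s-serving": {"user": "www-data", "spark": None},
        "sagemaker-job": {"user": "root", "spark": "batch"},
        "studio-notebook": {"user": "sagemaker-user", "spark": "client-mode"},
        "unknown": {"user": "root", "spark": None},
    }[context]
```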
- Kubernetes-like startup on serverless compute needed two tricks. K8s notebook/job startup was fast because "a significant percentage of cluster resources sit idle" (i.e. paid-for warm nodes); SageMaker is fully on-demand. Two mitigations: (1) SOCI (Seekable OCI) lazy loading for JupyterLab notebook images — fetch only filesystem layers needed immediately instead of pulling multi-GB images — cut notebook startup 40–50%; (2) SageMaker warm pools for latency-sensitive training/batch jobs (some models retrain every 15 minutes), keeping instances alive between runs. SOCI "wasn't available" for training/batch jobs at the time, so warm pools filled that gap — this is the warm-pool zero-create path reused from cold-start literature.
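On the API surface, SageMaker warm pools are enabled via the real `ResourceConfig.KeepAlivePeriodInSeconds` field on `CreateTrainingJob`; a subsequent job with a matching configuration reuses the kept-alive instance instead of provisioning from scratch. The sketch below builds such a request dict; the job name, image, role, instance type, and S3 path are placeholders, not values from the post.

```python
def training_job_request(job_name, image_uri, role_arn,
                         keep_alive_seconds=1200):
    """Assemble a CreateTrainingJob request with a warm pool enabled."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "ResourceConfig": {
            "InstanceType": "ml.m5.xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 50,
            # Keep the instance alive between runs so a 15-minute retrain
            # cadence hits the warm-pool zero-create path.
            "KeepAlivePeriodInSeconds": keep_alive_seconds,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
        "OutputDataConfig": {"S3OutputPath": "s3://bucket/models/"},
    }

# e.g. boto3.client("sagemaker").create_training_job(**training_job_request(...))
```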
- Cross-cluster Spark networking was a migration blocker. Interactive Spark in JupyterLab notebooks previously ran driver + executors inside the same K8s cluster. In the new architecture the driver runs in a SageMaker Studio notebook while executors remain on EKS — breaking Spark's bidirectional client-mode assumption (driver → EKS API to request executor pods; executors → driver's SageMaker ENI inbound). "This issue was a fundamental blocker that could jeopardize the entire migration." Resolved by AWS introducing networking changes to the Studio Domains in Lyft's account to allow inbound traffic from the EKS cluster — yielding identical interactive Spark UX at unchanged performance.
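The two legs of the client-mode assumption map onto standard Spark-on-Kubernetes settings, sketched below. All config keys are real Spark configuration options; the API-server URL, ENI IP, namespace, and ports are placeholders, and none of this works until the Studio-side ENI accepts inbound executor traffic (the AWS-side networking change the post describes).

```python
def cross_cluster_spark_conf(eks_api_server, driver_eni_ip, image):
    """Client-mode conf for a Studio-hosted driver with EKS executors."""
    return {
        # Outbound leg: the driver asks the EKS API server for executor pods.
        "spark.master": f"k8s://{eks_api_server}",
        "spark.submit.deployMode": "client",
        "spark.kubernetes.container.image": image,
        "spark.kubernetes.namespace": "lyftlearn",  # placeholder namespace
        # Inbound leg: executors connect back to the driver's Studio ENI
        # on fixed ports so the traffic can be explicitly allowed.
        "spark.driver.host": driver_eni_ip,
        "spark.driver.bindAddress": "0.0.0.0",
        "spark.driver.port": "7078",
        "spark.blockManager.port": "7079",
    }
```

Pinning `spark.driver.port` and `spark.blockManager.port` (rather than letting Spark pick ephemeral ports) is what makes the inbound leg something a firewall rule can describe.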
- Rollout was repository-by-repository, running both infrastructures in parallel. The migration was "nearly invisible" to users; under the hood, the platform team eliminated idle cluster waste (moving to on-demand provisioning), saw infrastructure-related incidents become rare, and freed the team from managing low-level infra so they could build platform capabilities instead.
- Integration glue: Model Registry + S3, plus ECR, EventBridge, SQS. Training jobs in SageMaker write model binaries to S3; the Model Registry tracks them; serving services on K8s pull them for deployment. Docker images flow CI/CD → ECR → both platforms. The LyftLearn database holds job metadata and model configs across both stacks. EventBridge and SQS provide event-driven state management, replacing the old "background watchers." This Model Registry + object-store glue is what lets the two stacks stay decoupled but cooperate end-to-end.
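The event-driven replacement for polling watchers can be sketched as an EventBridge rule matching SageMaker job state changes and an SQS consumer updating job metadata. The `source` and `detail-type` values are EventBridge's real schema for SageMaker training-job events; the handler, and the `registry` dict standing in for the LyftLearn DB, are illustrative.

```python
import json

TERMINAL_STATUSES = ["Completed", "Failed", "Stopped"]

def state_change_rule_pattern(statuses=TERMINAL_STATUSES):
    """EventBridge rule pattern for terminal SageMaker training-job states."""
    return json.dumps({
        "source": ["aws.sagemaker"],
        "detail-type": ["SageMaker Training Job State Change"],
        "detail": {"TrainingJobStatus": statuses},
    })

def handle_event(event, registry):
    """Consume one SQS-delivered event and record the job's final status;
    `registry` stands in for the LyftLearn DB row update."""
    detail = event["detail"]
    registry[detail["TrainingJobName"]] = detail["TrainingJobStatus"]
    return registry
```

In this shape the platform reacts to state transitions pushed by EventBridge into SQS rather than polling the SageMaker API, which is what lets the old background watchers be retired.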
Systems and primitives extracted¶
- systems/lyftlearn — Lyft's internal ML platform; the 2.0 evolution is the subject of this post.
- systems/lyftlearn-compute — the new compute half on SageMaker; training + batch + HPO + JupyterLab notebooks.
- systems/lyftlearn-serving — the unchanged serving half on EKS; dozens of team-owned services fronting pricing / fraud / dispatch / ETA / etc. (detailed in Lyft's 2023 post referenced inline).
- systems/aws-sagemaker-ai — the target compute substrate; specific features used: SageMaker Jobs, SageMaker Studio notebooks, warm pools, managed-service SDK integration.
- systems/aws-eks / systems/kubernetes — the retained serving substrate.
- systems/apache-spark — the distributed compute workload most affected by the hybrid architecture (interactive notebooks are the hard case).
- systems/amazon-soci — Seekable OCI for lazy container image loading (40–50% notebook startup reduction).
- systems/aws-s3 — hyperparameter staging channel; model-binary integration point between training and serving.
- systems/amazon-ecr — image registry feeding both platforms.
- systems/amazon-eventbridge + systems/aws-sqs — event-driven state management for SageMaker job lifecycle.
- systems/jupyterlab — interactive notebook surface; the critical cross-cluster Spark use case.
- systems/confidant — Lyft's in-house secret management system; previously auto-injected by K8s webhooks, now fetched at container entrypoint on SageMaker.
- systems/statsd — the in-process metrics protocol previously routed to K8s sidecars, now reconfigured to connect directly to Lyft's metrics aggregation gateway.
Concepts surfaced¶
- concepts/hybrid-ml-platform-architecture — splitting ML compute from ML serving onto different substrates tuned to each workload's access pattern.
- concepts/zero-code-change-migration — the principle that the platform layer must absorb 100% of the migration's complexity, because forcing users to rewrite is organisationally untenable.
- concepts/environmental-parity — the migration target is not "running on the new platform" but "environment indistinguishable from the old platform from user code's point of view."
- concepts/container-entrypoint-compat-layer — the implementation technique for parity: synthesise missing platform primitives at container startup.
- concepts/cross-cluster-networking — bidirectional connectivity between managed-service driver (SageMaker Studio) and self-managed executor pods (EKS), requiring explicit provider-side networking changes.
- concepts/lazy-container-image-loading — the SOCI technique for pulling only the filesystem layers needed at startup rather than the whole image.
- concepts/warm-pool-instances — keeping instances alive between runs to eliminate cold-start on a serverless substrate.
- concepts/serverless-compute — the on-demand provisioning trade-off (no idle cost, slower cold start) that created this problem.
- concepts/cold-start — the general adversary for latency-sensitive workloads that retrain every 15 minutes.
- concepts/runtime-environment-detection — the base-image mechanism that lets a single Docker image adapt to three execution contexts.
Patterns¶
- patterns/zero-code-change-platform-migration — canonical wiki instance; a platform-team commitment that trades platform complexity for user-facing invisibility.
- patterns/cross-platform-base-image — one Docker image, many execution contexts, runtime-detected adaptation.
- patterns/runtime-fetched-credentials-and-config — fetch secrets/env/hyperparams at container entrypoint rather than through the target platform's API, when the target platform's API can't match the source platform's mechanism (size limits, no sidecar, no webhook).
- patterns/warm-pool-zero-create-path — reused from prior ingest; the production instance Lyft hits is "models retraining every 15 minutes."
- patterns/decoupled-compute-and-serving-stacks — separate purpose-built stacks integrated only through model artifacts + object store + event bus.
- patterns/model-registry-and-object-store-as-hybrid-glue — model registry + S3 as the only cross-stack API between hybrid ML compute and serving.
Operational numbers¶
- JupyterLab notebook startup: −40% to −50% after adopting SOCI lazy-loading of filesystem layers (previously pulled multi-GB images in full).
- 15-minute retrain cadence for a subset of latency-sensitive models — the workload class that made SOCI-level reduction insufficient and motivated SageMaker warm pools.
- Interactive Spark performance: unchanged after cross-cluster networking fix.
- Compute cost: reduced (no disclosed percentage) by eliminating idle K8s cluster capacity and moving to on-demand SageMaker provisioning.
- Infra incidents: "rare occurrences" (qualitative post-migration claim).
Architecture summary¶
+---------------------+        +---------------------+
| LyftLearn Compute   |        | LyftLearn Serving   |
| (SageMaker)         |        | (EKS/K8s)           |
|                     |        |                     |
| SageMaker Manager   |        | Model Registry Svc  |
|  → training / batch |        |  → N per-team model |
|    / HPO / notebooks|        |    serving services |
| ← EventBridge / SQS |        |    (pricing, fraud, |
|   state events      |        |    dispatch, ETA)   |
+----------+----------+        +----------+----------+
           |                              |
    model  | artifacts                    | reads artifacts
           v                              ^
     +-------------------------------------+
     | S3 (artifacts) + Model Registry     |
     | ECR (images, both stacks)           |
     | LyftLearn DB (job + model metadata  |
     |   across both stacks)               |
     +-------------------------------------+
Compatibility layer (inside cross-platform base image):
entrypoint → fetch Confidant credentials → materialise env vars
→ reconfigure StatsD for a direct gateway connection → pull
hyperparams from S3 → detect runtime context (SageMaker Job vs.
Studio notebook vs. K8s serving) → adapt Spark configuration.
Caveats¶
- Serving-stack details are cross-referenced, not re-documented. The serving half (multi-team model-serving services on EKS + Model Registry) is covered by Lyft's 2023 "Powering Millions of Real-Time Decisions with LyftLearn Serving" post, linked inline; this post does not re-derive it.
- No disclosed compute cost number. The post says compute cost dropped but gives no percentage or absolute dollar figure, unlike the 40–50% notebook startup number.
- "SOCI wasn't available" for jobs at the time — specific to the window in which this migration ran. Warm pools were the stopgap. Subsequent SageMaker SOCI support for training jobs (if it lands) could alter the trade-off.
- AWS made Studio Domain networking changes in Lyft's account to enable cross-cluster Spark — the post is candid this was an account-specific partnership intervention, not purely a customer-configurable feature at the time.
- Migration dates not disclosed. No timeline given for how long the parallel-infrastructure rollout lasted; no per-team migration duration.
- Author byline not surfaced in the raw scrape. The post is under the Lyft Engineering blog; specific author attribution is not in the captured markdown.
Source¶
- Original: https://eng.lyft.com/lyftlearn-evolution-rethinking-ml-platform-architecture-547de6c950e1?source=rss----25cd379abb8---4
- Raw markdown:
raw/lyft/2025-11-18-lyftlearn-evolution-rethinking-ml-platform-architecture-fe6c6d4a.md
Related¶
- companies/lyft
- systems/lyftlearn
- systems/lyftlearn-compute
- systems/lyftlearn-serving
- systems/aws-sagemaker-ai
- systems/kubernetes
- systems/apache-spark
- systems/amazon-soci
- concepts/hybrid-ml-platform-architecture
- concepts/zero-code-change-migration
- concepts/environmental-parity
- concepts/container-entrypoint-compat-layer
- concepts/cross-cluster-networking
- patterns/zero-code-change-platform-migration
- patterns/cross-platform-base-image
- patterns/runtime-fetched-credentials-and-config
- patterns/warm-pool-zero-create-path
- patterns/decoupled-compute-and-serving-stacks
- patterns/model-registry-and-object-store-as-hybrid-glue