LyftLearn Compute¶
LyftLearn Compute is the compute half of LyftLearn 2.0 — the SageMaker-backed stack that runs training, batch processing, Hyperparameter Optimization (HPO), and JupyterLab notebooks. The other half — LyftLearn Serving — remains on Kubernetes.
Architecture¶
- SageMaker Manager Service — an in-house orchestrator service that translates LyftLearn workflow APIs into AWS SDK calls against SageMaker Jobs, SageMaker Studio notebooks, and HPO.
- Event-driven state — EventBridge + SQS replace the pre-migration "background watchers" that polled job state on the K8s stack.
- Cross-platform base images — LyftLearn, LyftLearn Distributed (Spark), LyftLearn DL (GPU / deep-learning); the same image trains a model on SageMaker and serves it on K8s, detecting its runtime context at entrypoint. See patterns/cross-platform-base-image.
- Compatibility layer in the container entrypoint: fetches credentials from Confidant at startup, pulls extra env vars (SageMaker's API caps env-var count), reroutes StatsD to the metrics aggregation gateway (no sidecar), and stages hyperparameters through S3 because SageMaker's API cap on direct parameter passing is too small for Lyft's use cases. See patterns/runtime-fetched-credentials-and-config.
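To make the "translates workflow APIs into AWS SDK calls" step concrete, here is a minimal sketch of what such a translation might look like. The function name, the platform-side parameters, and all concrete values are invented for illustration; only the request shape follows SageMaker's real CreateTrainingJob schema.

```python
# Hypothetical sketch: mapping a platform-level training request onto the
# request shape that SageMaker's CreateTrainingJob API expects.
def build_training_job_request(job_name: str, image_uri: str, role_arn: str,
                               instance_type: str, s3_input: str,
                               s3_output: str) -> dict:
    """Translate platform parameters into a CreateTrainingJob request dict."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,      # cross-platform base image
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "InputDataConfig": [{
            "ChannelName": "training",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_input,
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": s3_output},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": 1,
            "VolumeSizeInGB": 50,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 86400},
    }

# A real orchestrator would then submit the request, e.g.:
#   boto3.client("sagemaker").create_training_job(**build_training_job_request(...))
```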
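The event-driven state path can be sketched as an SQS consumer that maps SageMaker state-change events onto platform job states, instead of polling. The `detail` field names follow the real "SageMaker Training Job State Change" EventBridge event; the status mapping itself is an illustrative assumption.

```python
import json

# Hypothetical sketch: EventBridge forwards SageMaker job state-change events
# to an SQS queue; a consumer maps each event onto a platform-level job state.
STATUS_MAP = {
    "InProgress": "RUNNING",
    "Completed": "SUCCEEDED",
    "Failed": "FAILED",
    "Stopped": "CANCELLED",
}

def handle_state_change(sqs_body: str) -> tuple:
    """Return (job_name, platform_status) for one SQS message body."""
    detail = json.loads(sqs_body)["detail"]
    return (detail["TrainingJobName"],
            STATUS_MAP.get(detail["TrainingJobStatus"], "UNKNOWN"))
```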
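The entrypoint's compatibility layer might look roughly like the following sketch. SageMaker really does inject `TRAINING_JOB_NAME` and mount config under `/opt/ml/input/config/`; the extra-env file name and the exact detection logic are invented for illustration, and the S3 fetch is represented here as a plain file read.

```python
import json
import os
import pathlib

def detect_runtime() -> str:
    """Decide whether this container is training on SageMaker or serving on K8s."""
    if (os.environ.get("TRAINING_JOB_NAME")
            or pathlib.Path("/opt/ml/input/config").exists()):
        return "sagemaker"
    return "kubernetes"

def load_extra_env(path: str = "/opt/ml/input/data/config/extra_env.json") -> None:
    """SageMaker caps the env-var count per job, so the platform stages the
    overflow as a JSON file (via S3) that the entrypoint merges at startup."""
    p = pathlib.Path(path)
    if p.exists():
        os.environ.update(json.loads(p.read_text()))

def load_hyperparameters(path: str = "/opt/ml/input/config/hyperparameters.json") -> dict:
    """Hyperparameters too large for direct API passing are staged through S3;
    the entrypoint reads them back from the mounted config file."""
    p = pathlib.Path(path)
    return json.loads(p.read_text()) if p.exists() else {}
```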
Startup latency — two optimisations against SageMaker cold start¶
SageMaker provisions instances on demand, so startup is slower by default than on the K8s stack, which kept warm idle nodes:
- JupyterLab notebooks: SOCI (Seekable OCI) lazy-loaded filesystem layers — 40–50% startup reduction versus pulling full multi-GB images.
- Training / batch jobs: SOCI was not yet available for these workloads at migration time, so Lyft first optimised Docker image sizes, then adopted SageMaker warm pools for the most latency-sensitive workloads, specifically models that retrain every 15 minutes. See patterns/warm-pool-zero-create-path.
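Warm-pool reuse is requested through the job's `ResourceConfig`: `KeepAlivePeriodInSeconds` (capped at 3600 by SageMaker) keeps the provisioned instances alive after the job finishes, and a follow-up job with a matching configuration skips provisioning. A minimal sketch, with illustrative values:

```python
def warm_pool_resource_config(instance_type: str, count: int,
                              keep_alive_seconds: int) -> dict:
    """ResourceConfig fragment requesting warm-pool retention after the job."""
    return {
        "InstanceType": instance_type,
        "InstanceCount": count,
        "VolumeSizeInGB": 50,
        # Instances stay alive this long after the job ends; a matching
        # follow-up job reuses them instead of provisioning from scratch.
        "KeepAlivePeriodInSeconds": keep_alive_seconds,
    }

# For a model retraining every 15 minutes (900s), a keep-alive longer than
# the gap between runs means every run after the first hits the warm pool:
cfg = warm_pool_resource_config("ml.m5.xlarge", 1, 3500)
```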
Cross-cluster Spark¶
Interactive Spark in SageMaker Studio notebooks was a migration blocker: Spark client mode requires bidirectional communication between driver (now in SageMaker Studio) and executors (kept on EKS). Default SageMaker Studio networking blocked the required inbound connections. AWS introduced Studio-Domain networking changes in Lyft's account to enable inbound traffic from the EKS cluster, after which Spark performance and interactive UX were unchanged.
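The driver-side configuration this setup implies can be sketched as follows. These are standard Spark properties for client mode (`spark.driver.host`, `spark.driver.bindAddress`, and fixed callback ports); the specific host and port values are invented, and this is not Lyft's actual configuration.

```python
# Hypothetical sketch: the driver runs inside SageMaker Studio, executors on
# EKS, so the driver must advertise an address reachable from the EKS pods
# and pin the ports executors call back on (so inbound rules can match them).
def cross_cluster_spark_conf(driver_host: str) -> dict:
    return {
        "spark.submit.deployMode": "client",    # driver stays in the notebook
        "spark.driver.host": driver_host,       # address reachable from EKS
        "spark.driver.bindAddress": "0.0.0.0",  # listen on all interfaces
        "spark.driver.port": "7078",            # fixed driver RPC port
        "spark.blockManager.port": "7079",      # fixed block-manager port
    }
```

In practice these would be applied via `SparkSession.builder.config(k, v)` before creating the session.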
Seen in¶
Related¶
- systems/lyftlearn
- systems/lyftlearn-serving
- systems/aws-sagemaker-ai
- systems/apache-spark
- systems/amazon-soci
- concepts/hybrid-ml-platform-architecture
- concepts/cross-cluster-networking
- patterns/cross-platform-base-image
- patterns/runtime-fetched-credentials-and-config
- patterns/warm-pool-zero-create-path
- companies/lyft