
PATTERN

Cross-platform base image

Problem

A workload needs to run in multiple execution contexts — e.g. managed training jobs, interactive notebooks, and K8s serving pods — and you want one Docker image to run correctly across all of them. Shipping three separate images per team triples build, test, and deploy surface; forces training-vs-serving dependency drift; and makes training-to-serving reproducibility hard to guarantee.

Pattern

Build one base image that detects its execution environment at runtime and adapts environment variables, users, permissions, networking, and framework configuration accordingly (concepts/runtime-environment-detection). An entrypoint compatibility layer performs the detection and applies the per-context adaptation.

Structure

FROM base-runtime

# Core runtime: framework, libraries, OS packages — identical
# across execution contexts.
...

# Entrypoint script:
#   1. Detect context (SageMaker Job? Studio notebook? K8s pod?)
#   2. Fetch credentials / env / hyperparams / config appropriate
#      to the context.
#   3. Configure framework (Spark, metrics client, etc.) for the
#      context.
#   4. exec user command.
ENTRYPOINT ["/opt/entrypoint.sh"]
CMD []
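The four entrypoint steps above can be sketched in bash. This is a minimal illustration, not any platform's actual script: `SM_TRAINING_ENV` and `KUBERNETES_SERVICE_HOST` are standard markers for SageMaker training jobs and K8s pods respectively, but the Studio marker and `CONFIG_SOURCE` are labeled assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical /opt/entrypoint.sh — a sketch of the compatibility layer.
set -euo pipefail

detect_context() {
  if [ -n "${SM_TRAINING_ENV:-}" ]; then
    echo "sagemaker-job"          # set by SageMaker training containers
  elif [ -n "${SAGEMAKER_INTERNAL_IMAGE_URI:-}" ]; then
    echo "sagemaker-studio"       # ASSUMED Studio marker — verify on your platform
  elif [ -n "${KUBERNETES_SERVICE_HOST:-}" ]; then
    echo "k8s"                    # service env var injected into every K8s pod
  else
    echo "unknown"
  fi
}

configure_for_context() {
  # Per-context adaptation: env vars, credentials, framework config.
  # CONFIG_SOURCE is illustrative, not a real platform variable.
  case "$1" in
    sagemaker-job)
      # SageMaker mounts hyperparameters under /opt/ml/input/config.
      export CONFIG_SOURCE="/opt/ml/input/config" ;;
    sagemaker-studio)
      export CONFIG_SOURCE="notebook" ;;
    k8s)
      # e.g. read mounted ConfigMaps / service-account credentials here.
      export CONFIG_SOURCE="/etc/config" ;;
    *)
      export CONFIG_SOURCE="local" ;;
  esac
}

configure_for_context "$(detect_context)"
exec "$@"   # step 4: hand off to the user command (CMD or docker run args)
```

Ordering matters: the job-specific markers are checked before the generic K8s one, since a SageMaker job may itself run on K8s-backed infrastructure.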

Variant images per workload class, not per context

You may still maintain separate base images per workload class (traditional ML, distributed ML with Spark, GPU deep-learning), because those differ in heavyweight dependencies. But each of those images is itself cross-platform across execution contexts.
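For illustration, each workload-class image might extend one shared base that already carries the entrypoint compatibility layer; only the heavyweight dependencies differ. All image, package, and path names below are hypothetical:

```dockerfile
# Hypothetical distributed-ML variant. The cross-platform entrypoint and
# core runtime come from the shared base; this layer adds only the
# workload-class-specific Spark dependencies.
FROM ml-base:latest

RUN pip install pyspark==3.5.1
COPY spark-wrappers/ /opt/spark-wrappers/
COPY jars/ /opt/spark/jars/

# Entrypoint unchanged — context detection still happens at runtime.
ENTRYPOINT ["/opt/entrypoint.sh"]
CMD []
```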

Lyft / LyftLearn 2.0

Canonical wiki instance. LyftLearn ships three cross-platform base images:

  • LyftLearn image — traditional ML workloads.
  • LyftLearn Distributed image — adds Spark ecosystem integration (custom wrappers, executor configs, JAR deps).
  • LyftLearn DL image — adds GPU + deep-learning libraries.

Each runs correctly in three distinct execution contexts — SageMaker Jobs, SageMaker Studio notebooks, and K8s model-serving — by detecting context at runtime and adapting. The same image trains a model on SageMaker and serves it on K8s; this guarantees model-dependency parity end-to-end (Source: sources/2025-11-18-lyft-lyftlearn-evolution-rethinking-ml-platform-architecture).

The Spark-compatible image was the hardest: it had to preserve compatibility with Lyft's existing Spark infrastructure (custom wrappers, executor configs, JAR dependencies) while still running correctly in all three execution contexts.

Trade-offs

  • + Single source of truth for the runtime; eliminates training-vs-serving drift.
  • + One test surface — image correctness validated once across all contexts.
  • − Entrypoint complexity grows with every context the image supports.
  • − Image size can bloat if contexts need different deps; tension with concepts/lazy-container-image-loading / SOCI / image-size optimisation.
