
PATTERN

Zero-code-change platform migration

Problem

A platform team needs to move hundreds of user workloads from one execution substrate to another (K8s → managed serverless, VMs → containers, on-prem → cloud, etc.). Asking every user team to port their code is organisationally untenable — too many teams, too much business-critical logic, too much productivity loss during the rewrite window. But the two substrates don't expose the same primitives: credential injection, env-var handling, sidecars, hyperparameter passing, and networking all differ.

Pattern

The platform absorbs 100% of the migration's complexity. No user code changes. The platform rebuilds, at the container or runtime layer, every primitive the source substrate provided so that user code sees an environment indistinguishable from what it ran on before (concepts/environmental-parity).

Implementation layers:

  1. Cross-platform base image — one Docker image that works on the old and the new substrate, detecting its context at runtime (concepts/runtime-environment-detection).
  2. Entrypoint compatibility layer — fetches / materialises the primitives the new substrate doesn't provide (see patterns/runtime-fetched-credentials-and-config).
  3. Narrow integration seams — cross-stack glue (artifact registries, object storage, event bus) rather than direct cross-platform coupling.
  4. Provider-side partnership for gaps the customer can't close — managed service networking changes, API lifts, feature backports.
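Layer 1 depends on the image knowing where it is running. A minimal sketch of runtime environment detection, checking environment markers each substrate is known to set (the `SM_*` prefix and `KUBERNETES_SERVICE_HOST` are real conventions of SageMaker and Kubernetes respectively; the function name and return values are illustrative, not any platform's actual contract):

```python
import os

def detect_substrate() -> str:
    """Guess the execution substrate from environment markers."""
    # SageMaker training containers export SM_* variables
    # (e.g. SM_MODEL_DIR) into the process environment.
    if any(k.startswith("SM_") for k in os.environ):
        return "sagemaker"
    # Kubernetes injects KUBERNETES_SERVICE_HOST into every pod.
    if "KUBERNETES_SERVICE_HOST" in os.environ:
        return "kubernetes"
    return "local"
```

The rest of the entrypoint branches on this value to decide which primitives it must fetch or materialise itself.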

Rollout

  • Build the compatibility layer first; validate with a small set of test workloads.
  • Cut over repository by repository, running both infrastructures in parallel so any team can fall back.
  • Users see little visible change; the migration surfaces only as infra-level effects (reduced idle cost, fewer infra incidents, tool-chain updates).
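The repository-by-repository cutover can be as simple as a per-repo routing table that pins each workload to a substrate and lets any team flip back with a one-line change. A sketch under assumed names (the repo names and `"old"`/`"new"` labels are hypothetical):

```python
# Hypothetical routing table consulted during the parallel-run window.
# Rolling a team back is a one-line revert, not a redeploy of the platform.
CUTOVER: dict[str, str] = {
    "pricing-models": "new",   # migrated
    "eta-training": "new",     # migrated
    "fraud-batch": "old",      # fell back after an infra incident
}

def target_substrate(repo: str) -> str:
    # Unlisted repositories stay on the old substrate by default,
    # so forgetting an entry fails safe.
    return CUTOVER.get(repo, "old")
```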

Lyft / LyftLearn 2.0

Canonical wiki instance. Lyft's ML Platform team moved training / batch / HPO / JupyterLab notebooks from Kubernetes to SageMaker under the hard constraint that no ML code could change — no modifications to training logic, preprocessing, or inference. The compatibility layer lived inside LyftLearn's cross-platform base images; the entrypoint script fetched Confidant credentials, extra env vars, and hyperparameters, and rerouted StatsD metrics to the gateway; the rollout proceeded repository by repository with both stacks running in parallel. "For our users, the migration was nearly invisible." (Source: sources/2025-11-18-lyft-lyftlearn-evolution-rethinking-ml-platform-architecture)
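One concrete compatibility-layer step from this class of migration can be sketched as follows: the new substrate delivers hyperparameters as a file, so the entrypoint re-exposes them as env vars before handing off to unmodified user code. The default path is SageMaker's real convention for `hyperparameters.json`; the function itself is an illustrative sketch, not Lyft's actual entrypoint, and credential fetching or StatsD rerouting would be analogous steps:

```python
import json
import os
from pathlib import Path

# Where SageMaker writes hyperparameters inside a training container.
SAGEMAKER_HP_FILE = Path("/opt/ml/input/config/hyperparameters.json")

def materialise_hyperparameters(hp_file: Path = SAGEMAKER_HP_FILE) -> None:
    """Re-expose file-based hyperparameters as environment variables
    so unmodified training code keeps reading os.environ, exactly as
    it did on the old substrate."""
    if not hp_file.exists():
        return  # not on the new substrate; nothing to materialise
    for key, value in json.loads(hp_file.read_text()).items():
        # setdefault: never clobber a value injected some other way.
        os.environ.setdefault(key.upper(), str(value))
```

Because the function is a no-op when the file is absent, the same image boots cleanly on the old substrate too — which is what makes the base image cross-platform.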

Trade-offs

  • + Organisational tractability — the only way large migrations finish at scale.
  • + Risk localisation — infra bugs and user bugs don't conflate.
  • − Platform-layer complexity — the platform eats everything the new substrate doesn't natively do.
  • − Longer migration project — the platform team builds the compatibility layer before users can move, rather than pushing work in parallel to user teams.

Seen in
