
PATTERN Cited by 1 source

Managed Services over Custom ML Platform

Intent

When a team's in-house ML platform (often Scala + Spark, in use for several years) starts to hurt — framework lock-in, custom code duplicating managed-service functionality, slow scale-up, coupled preprocessing + training — replace it with a thin Python orchestration layer over managed services (Step Functions + Lambda + SageMaker + Databricks), accepting higher serving cost in exchange for framework flexibility, isolation, and reduced maintenance.

Context

A typical timeline:

  1. Phase 1 (early / prototype) — team uses Python + scikit-learn on a single machine or small cluster.
  2. Phase 2 (scale-out) — team migrates to Scala + Spark to process more data, run more models, serve more traffic. A custom platform emerges.
  3. Phase 3 (maturity) — the Scala/Spark platform now carries technical debt: Scala coupling blocks state-of-the-art libraries (most ML is Python); much of the custom code has been superseded by managed services; memory and latency profiles are poor; the monolith couples steps that should be decoupled.
  4. Phase 4 (this pattern) — migrate to managed services via a thin wrapper library; reduce in-house code surface; accept cost increase.

Pain-point catalogue that motivates the migration

From Zalando's 2021 Payments retrospective (sources/2021-02-15-zalando-a-machine-learning-pipeline-with-real-time-inference), four named pain points are canonical for this phase:

  1. Language/framework coupling blocks ecosystem. "It's highly coupled to Scala and Spark which makes using state of the art libraries (mostly Python) difficult."
  2. Custom code that managed services now offer. "It contains custom tailored code for functionalities which nowadays can be replaced by managed services. This adds an additional layer of complexity, making it difficult to maintain and to onboard new team members."
  3. Resource profile & scalability problems. "It uses a lot of memory, suffers from latency spikes, new instances start rather slowly which affects scalability."
  4. Monolithic coupling of pipeline stages. "It has a monolithic design, meaning that feature preprocessing and model training are highly coupled. There is no pipeline with clear steps and everything runs on the same cluster during training."

Solution

A thin Python orchestration library (Zalando calls theirs systems/zflow) that wraps Step Functions, Lambda, SageMaker, and Databricks.

A single zflow workflow expresses the ML pipeline as a sequence: preprocessing → training → batch predictions → performance report → endpoint deployment. Each step is a managed-service invocation, not custom in-house code, and each model gets its own SageMaker endpoint (per concepts/model-per-endpoint-isolation-tradeoff).
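The real zflow API is internal to Zalando and not published, so the following is only an illustrative sketch of what such a thin wrapper could look like: a workflow object that registers named steps, each tagged with the managed service that would run it, and chains them in order. All names (`Workflow`, `Step`, the step functions) are hypothetical.

```python
# Hypothetical sketch of a zflow-style thin wrapper. In production each step
# would submit a job to its managed service (Databricks, SageMaker) and wait
# for completion; here steps are plain callables so the sketch is runnable.
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Step:
    name: str
    service: str  # which managed service runs this step
    run: Callable[[Any], Any]


@dataclass
class Workflow:
    steps: list = field(default_factory=list)

    def step(self, name: str, service: str):
        """Decorator registering a function as the next pipeline step."""
        def register(fn):
            self.steps.append(Step(name, service, fn))
            return fn
        return register

    def execute(self, payload):
        # Chain the steps: each step's output feeds the next step's input.
        for s in self.steps:
            payload = s.run(payload)
        return payload


wf = Workflow()


@wf.step("preprocessing", service="databricks")
def preprocess(data):
    return {"features": data}


@wf.step("training", service="sagemaker-training")
def train(features):
    return {"model": "xgboost", "trained_on": features}


@wf.step("deploy_endpoint", service="sagemaker-endpoint")
def deploy(model):
    return {"endpoint": f"payments-{model['model']}"}


result = wf.execute("raw-payments-data")
```

The point of the sketch is the shape, not the API: consuming teams write only the step bodies, while the wrapper owns sequencing and the managed-service plumbing.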

Consequences

Pros (benefits Zalando names):

  • Framework plurality — each model can use its own stack (XGBoost / PyTorch / TensorFlow).
  • Per-model isolation — one model's traffic spike doesn't affect others.
  • Faster scale-up — ~50% reduction in instance scale-up time vs legacy.
  • Clear pipeline stages — preprocessing, training, inference are decoupled into distinct workflow steps.
  • Reduced maintenance — less in-house code; onboarding new team members is easier because the stack is the managed-service stack.
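The per-model isolation above follows from the deployment layout: each model gets its own SageMaker endpoint with its own container and instance type. A minimal sketch of that layout, building request payloads in the (simplified) shape of SageMaker's CreateEndpointConfig API; the model names, image tags, and instance choices are placeholders, not real Zalando resources.

```python
# Model-per-endpoint layout: one endpoint config per model, so each model
# chooses its own framework container and instance type, and one model's
# traffic spike cannot starve another. Simplified from the real API shape.
def endpoint_config(model_name: str, image: str, instance_type: str) -> dict:
    return {
        "EndpointConfigName": f"{model_name}-config",
        "ProductionVariants": [{
            "VariantName": "AllTraffic",
            "ModelName": model_name,
            "InitialInstanceCount": 1,
            "InstanceType": instance_type,
        }],
        # Recorded here for illustration; in the real API the container image
        # is attached to the Model resource, not the endpoint config.
        "ContainerImage": image,
    }


# Each model independently picks its stack (framework plurality).
configs = [
    endpoint_config("fraud-xgboost", "xgboost:1.5", "ml.m5.large"),
    endpoint_config("risk-pytorch", "pytorch:1.10", "ml.m5.4xlarge"),
]
```

The contrast with the legacy shared-instance model is that scaling decisions (instance type, count) are made per model rather than once for the whole cluster.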

Cons:

  • Cost. A projected serving-cost increase of up to ~200% vs the legacy shared-instance model. Must be explicitly accepted.
  • Vendor lock-in. The pipeline's shape is now SageMaker- and Step-Functions-shaped; portability across clouds is reduced.
  • Debugging complexity moves — no single cluster to SSH into; debugging a failed step means debugging Step Functions + SageMaker + Databricks, each with its own surface.
  • Requires a platform team capable of building the wrapper. zflow itself is a Python library that a dedicated ML Platform team builds and maintains; consuming teams depend on it.

Canonical instance

Zalando Payments × Zalando ML Platform, 2020–2021 (9-month collaboration):

  • Legacy: Scala + Spark monolith (2015–2020).
  • New: zflow workflow → Step Functions state machine invoking Databricks preprocessing, SageMaker training jobs, SageMaker batch transform, Databricks PDF report, and SageMaker endpoint with inference pipeline model (scikit-learn preprocessing + XGBoost / PyTorch / TF main model).
  • Load-tested at 200–1000 RPS on m5.large / m5.4xlarge / m5.12xlarge with p99 < 200 ms.
  • Cost: projected +200%; accepted.
  • Scale-up time: ~50% faster than legacy.
  • Source: sources/2021-02-15-zalando-a-machine-learning-pipeline-with-real-time-inference.
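The state machine behind the workflow above can be sketched in Amazon States Language (here as a Python dict). The SageMaker service-integration ARNs are real Step Functions patterns; how Zalando invokes Databricks is not spelled out in the source, so those steps are shown going through `lambda:invoke`, one common way to call a non-native service — treat that as an assumption.

```python
# ASL sketch of the pipeline: preprocessing → training → batch predictions →
# performance report → endpoint deployment. Databricks steps are assumed to
# run behind Lambda; SageMaker steps use native .sync integrations.
import json

state_machine = {
    "StartAt": "Preprocessing",
    "States": {
        "Preprocessing": {  # Databricks job, triggered via Lambda (assumed)
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Next": "Training",
        },
        "Training": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
            "Next": "BatchPredictions",
        },
        "BatchPredictions": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sagemaker:createTransformJob.sync",
            "Next": "PerformanceReport",
        },
        "PerformanceReport": {  # Databricks PDF report, via Lambda (assumed)
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Next": "DeployEndpoint",
        },
        "DeployEndpoint": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "End": True,
        },
    },
}

definition_json = json.dumps(state_machine)
```

Each `Task` state maps to one managed-service invocation, which is exactly the decoupling of pipeline stages the pattern is after.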