
ZALANDO 2021-02-15


Zalando — A Machine Learning Pipeline with Real-Time Inference

Summary

Zalando Payments + Zalando ML Platform teams describe the replacement of a custom Scala + Spark monolith (in use since 2015, migrated from an even earlier Python + scikit-learn setup) with a managed-service ML pipeline built on systems/zflow — an internal Python library wrapping systems/aws-step-functions, systems/aws-lambda, Amazon SageMaker, and systems/databricks Spark. The use case is deferred-payment risk scoring: customers who order without paying upfront, where accurate default-probability prediction decides who gets the convenience and who does not. The post is a retrospective on requirements, architecture, and load-test results after a 9-month collaboration between two teams.

The old system had four named pain points: (1) Scala/Spark coupling blocked use of state-of-the-art Python libraries; (2) in-house code duplicated managed-service functionality; (3) memory bloat, latency spikes, slow instance start; (4) monolithic training/preprocessing coupling on a single cluster. The new system resolves all four via a Step Functions workflow that orchestrates five stages — preprocessing (Databricks + scikit-learn batch transform on SageMaker), training (SageMaker training job), batch predictions (SageMaker batch transform), performance reporting (Databricks PDF report), and endpoint deployment — with each model served behind its own SageMaker endpoint as a SageMaker Inference Pipeline Model (scikit-learn preprocessing container + main-model container — systems/xgboost, PyTorch, or TensorFlow — co-located on one endpoint).

Load-test numbers (m5.large, m5.4xlarge, m5.12xlarge at rates of 200–1000 RPS) and the explicitly stated cost increase of up to 200% that Zalando accepted in exchange for per-model isolation, framework flexibility, and reliance on managed services make this post a useful concrete instance of the managed-services-over-custom-ML-platform migration trade-off.

Key takeaways

  1. Four pain points of a Scala + Spark ML monolith drive migration to managed services. Zalando's original 2015 Scala/Spark fraud-detection pipeline had: (i) tight Scala coupling blocking Python ecosystem use; (ii) custom code replaceable by managed services; (iii) high memory footprint + latency spikes + slow new-instance startup; (iv) monolithic feature-preprocessing + training on one cluster with no clear pipeline stages. "It uses a lot of memory, suffers from latency spikes, new instances start rather slowly which affects scalability" (Source: sources/2021-02-15-zalando-a-machine-learning-pipeline-with-real-time-inference).

  2. Production requirements are specific: p99.9 in milliseconds, hundreds-to-thousands RPS during sale events, multi-model from day one. "99.9% of responses must be returned under a threshold in the order of milliseconds"; "the busiest model must be able to handle hundreds of requests per second (RPS) on a regular basis. During sales events, the requests rate for a model may scale at a higher order of magnitude"; "several models, divided per assortment type, market, etc., must be available in the production service at any given time" (Source: sources/2021-02-15-zalando-a-machine-learning-pipeline-with-real-time-inference).

  3. systems/zflow is Zalando's internal Python wrapper around Step Functions + Lambda + SageMaker + Databricks. First explicit disclosure on the wiki. "It is essentially a Python library built on top of AWS Step Functions, AWS Lambdas, Amazon SageMaker, and Databricks Spark, that allows users to easily orchestrate and schedule ML workflows" — built by Zalando's ML Platform team; a single zflow workflow orchestrates training-data preprocessing, training, batch predictions, PDF report, and endpoint deployment (Source: sources/2021-02-15-zalando-a-machine-learning-pipeline-with-real-time-inference).

  4. Unified feature preprocessing via an inference pipeline model — scikit-learn container + main-model container in one endpoint. Named requirement: "The preprocessing applied to incoming requests in production must be identical to that applied to the training data. We want to avoid implementing this logic twice for both cases." Solution: a SageMaker Inference Pipeline Model with two containers — scikit-learn for JSON-feature extraction, then the main model (XGBoost/PyTorch/etc.) for prediction. One image pulls double duty during training (as a scikit-learn batch transform preprocessing step) and serving (Source: sources/2021-02-15-zalando-a-machine-learning-pipeline-with-real-time-inference).

  5. Separate SageMaker endpoint per model → projected 2× cost, accepted. "Based on our estimates the cost of serving our models will increase significantly after the migration. We anticipate the increase by up to 200%. The main reason behind it is cost efficiency of the legacy system, where all the models are served from one big instance (multiplied for scaling). In the new system every model gets a separate instance(s)." Three explicit justifications accepted: per-model tech stack flexibility, traffic isolation (one model's flood doesn't affect others), managed services over custom code (Source: sources/2021-02-15-zalando-a-machine-learning-pipeline-with-real-time-inference).

  6. Load-test envelope named per instance type and rate. All on m5 family: ml.m5.large @ 200 RPS → single instance, p99 < 80 ms; ml.m5.large @ 400 RPS → needs ≥ 4 instances for ~100% success, p99 < 50 ms; ml.m5.4xlarge or ml.m5.12xlarge @ 1000 RPS → ≥ 2 instances, p99 < 200 ms. 4-minute continuous-hit load test per configuration; varied instance type, count, and request rate (Source: sources/2021-02-15-zalando-a-machine-learning-pipeline-with-real-time-inference).

  7. Scale-up time drops ~50% vs legacy Scala/Spark system. "Adding an instance to a SageMaker endpoint with our current configuration reduces scale-up time by 50% over our old system. However, we wish to explore options for reducing this time further." Specific pain-point-3 resolution — SageMaker endpoints scale faster than Zalando's custom Scala setup, though still not as fast as the team would like (Source: sources/2021-02-15-zalando-a-machine-learning-pipeline-with-real-time-inference).

  8. 9-month cross-team collaboration (Payments + ML Platform) with a Statement of Work as scoping instrument. "The entire collaboration lasted 9 months"; weekly replanning + daily standups; Kanban with user stories broken into tasks; friction points named verbatim (training-program interruptions, firefighting duties, domain-vs-tooling knowledge asymmetry). ML Platform is framed as an internal consulting organisation that "offers the services of data scientists and software engineers to accelerate onboarding to the platform" (Source: sources/2021-02-15-zalando-a-machine-learning-pipeline-with-real-time-inference).

Systems extracted

  • systems/zflow — internal Python orchestration library wrapping the services below.
  • systems/aws-step-functions — workflow orchestration.
  • systems/aws-lambda — glue steps inside workflows.
  • systems/databricks — Spark preprocessing and PDF reporting jobs.
  • systems/xgboost — main-model framework (alongside PyTorch and TensorFlow).

Concepts extracted

  • concepts/inference-pipeline-model — multi-container endpoint where request preprocessing and model inference are chained in-process, colocating scikit-learn feature extraction with the main-model container on the same instance. First wiki canonical page.
  • concepts/unified-feature-preprocessing-training-vs-serving — preprocessing logic is written once as a scikit-learn container; training uses it as a batch-transform step, serving uses it as the first stage of an inference pipeline model. Avoids the canonical duplicated-preprocessing bug. First wiki canonical page.
  • concepts/model-per-endpoint-isolation-tradeoff — each model gets its own SageMaker endpoint instance(s); costs up to 2× more than a shared-instance legacy system but buys per-model framework flexibility, per-model scaling, per-model traffic isolation. First wiki canonical page.

Patterns extracted

  • patterns/managed-services-over-custom-ml-platform — replace an in-house Scala+Spark ML platform with a thin orchestration layer over managed services (Step Functions + Lambda + SageMaker + Databricks); accept higher serving cost for framework freedom, isolation, faster scale-up, and reduced maintenance. First wiki canonical page.
  • patterns/unified-feature-extraction-training-serving — implement feature extraction in a single container (scikit-learn here) and use it both as a training-time batch-transform preprocessor and as the first stage of a serving-time inference pipeline. First wiki canonical page.
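
The unified-extraction pattern above can be sketched in plain Python. Everything here is illustrative (class names, feature fields, and the stand-in scorer are not from the post); it only shows the shape of "one transform, two call sites":

```python
class FeaturePreprocessor:
    """Single source of truth for feature extraction, used at
    training time (batch) and at serving time (per request)."""

    def transform(self, record: dict) -> list:
        # Illustrative JSON-to-feature-vector logic.
        return [
            float(record.get("order_amount", 0.0)),
            float(record.get("num_prior_orders", 0)),
            1.0 if record.get("market") == "DE" else 0.0,
        ]

    def transform_batch(self, records: list) -> list:
        # The training-time batch step reuses the same per-record logic,
        # so train-time and serve-time features cannot drift apart.
        return [self.transform(r) for r in records]


def model(features: list) -> float:
    # Stand-in scorer in place of the XGBoost/PyTorch container.
    return sum(features)


class InferencePipeline:
    """Chains preprocessing and prediction in-process, mirroring a
    two-container SageMaker inference pipeline model."""

    def __init__(self, preprocessor: FeaturePreprocessor, scorer):
        self.preprocessor = preprocessor
        self.scorer = scorer

    def predict(self, record: dict) -> float:
        return self.scorer(self.preprocessor.transform(record))


pre = FeaturePreprocessor()
pipeline = InferencePipeline(pre, model)
request = {"order_amount": 99.9, "num_prior_orders": 3, "market": "DE"}

# By construction, serving-time features equal training-time features.
assert pre.transform_batch([request])[0] == pre.transform(request)
score = pipeline.predict(request)
```

In the real system the two call sites are a SageMaker batch transform job and the first container of the endpoint's pipeline model; the invariant is the same.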

Operational numbers

Latency + throughput targets (requirements):

  • p99.9 response latency under a threshold on the order of milliseconds for deployed service responses.
  • Hundreds of RPS as the regular rate for the busiest model.
  • One order of magnitude higher during sales events.
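
Checking a p99.9 target against collected latency samples can be sketched with a nearest-rank percentile; a minimal sketch (the sample values and the 100 ms threshold are illustrative, not from the post):

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile: smallest value such that at least
    q percent of samples are <= it."""
    ordered = sorted(samples)
    rank = math.ceil(q / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

# Hypothetical latency samples in milliseconds (1,000 observations,
# with a slow tail of 80 ms responses).
latencies = [12, 15, 14, 13, 80, 16, 14, 13, 12, 15] * 100
p999 = percentile(latencies, 99.9)

slo_ms = 100  # illustrative threshold; the post only says "order of milliseconds"
meets_slo = p999 < slo_ms
```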

Load-test measured results (4-minute continuous, m5 family):

Instance                         Count  Rate      Success  p99
ml.m5.large                      1      200 RPS   high     < 80 ms
ml.m5.large                      ≥ 4    400 RPS   ~100%    < 50 ms
ml.m5.4xlarge or ml.m5.12xlarge  ≥ 2    1000 RPS  kept     < 200 ms
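
One way to read the table: the 400 RPS row implies roughly 100 RPS sustained per ml.m5.large instance. A minimal capacity helper under that inferred (not stated) per-instance rate:

```python
import math

def instances_needed(target_rps: float, per_instance_rps: float,
                     headroom: float = 1.0) -> int:
    """Ceiling division with an optional headroom multiplier
    for sale-event bursts."""
    return math.ceil(target_rps * headroom / per_instance_rps)

# Inferred from the load test: >= 4 ml.m5.large at 400 RPS -> ~100 RPS each.
M5_LARGE_RPS = 100

instances_needed(400, M5_LARGE_RPS)        # 4, matching the load-test row
instances_needed(400, M5_LARGE_RPS, 10.0)  # order-of-magnitude sale spike
```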

Cost / timing:

  • Up to 200% cost increase projected and accepted as migration tax.
  • ~50% reduction in scale-up time when adding an instance to a SageMaker endpoint, vs the legacy system.
  • 9 months cross-team collaboration length.

History:

  • Pre-2015: Python + scikit-learn.
  • 2015: migrated to Scala + Spark for scale.
  • 2020: started exploring ML Platform tooling.
  • 2021-02-15: this post (retrospective).

Architecture

Legacy (2015–2020, pre-migration)

Scala + Spark monolithic application; one big instance serving all models; feature preprocessing + model training coupled on the same cluster; custom code for functionality later available as managed services.

New (2020+, post-migration)

Step Functions workflow (authored via zflow):
  ┌─────────────────────────────────────────────────────┐
  │ 1. Training data preprocessing                       │
  │    - Databricks cluster                              │
  │    - scikit-learn batch transform job on SageMaker   │
  │ 2. Training a model                                  │
  │    - SageMaker training job                          │
  │ 3. Generating predictions                            │
  │    - SageMaker batch transform job                   │
  │ 4. Performance report                                │
  │    - Databricks job producing PDF                    │
  │ 5. Endpoint deployment                               │
  │    - SageMaker real-time endpoint                    │
  │    - Inference pipeline model:                       │
  │       [scikit-learn container] → [XGBoost/PyTorch]   │
  └─────────────────────────────────────────────────────┘

Each deployed model gets its own endpoint (its own instance(s)).
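
The five-stage chain above maps naturally onto an Amazon States Language definition. A sketch built as a Python dict — the post does not show what zflow generates, so the state names and the Lambda-for-Databricks routing are assumptions; the SageMaker resource ARNs are Step Functions' documented service integrations:

```python
import json

# Databricks stages are assumed to be invoked via Lambda tasks;
# SageMaker stages use Step Functions' native service integrations.
STAGES = [
    ("PreprocessTrainingData", "arn:aws:states:::lambda:invoke"),
    ("TrainModel", "arn:aws:states:::sagemaker:createTrainingJob.sync"),
    ("BatchPredictions", "arn:aws:states:::sagemaker:createTransformJob.sync"),
    ("PerformanceReport", "arn:aws:states:::lambda:invoke"),
    ("DeployEndpoint", "arn:aws:states:::sagemaker:createEndpoint"),
]

def build_state_machine(stages):
    """Chain Task states in order; the last state ends the execution."""
    states = {}
    for (name, resource), nxt in zip(stages, stages[1:] + [(None, None)]):
        state = {"Type": "Task", "Resource": resource}
        if nxt[0] is None:
            state["End"] = True
        else:
            state["Next"] = nxt[0]
        states[name] = state
    return {"StartAt": stages[0][0], "States": states}

definition = json.dumps(build_state_machine(STAGES), indent=2)
```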

Caveats

  • Cost figure is a projection, not a measurement. "Based on our estimates" — the ~200% increase is pre-migration anticipation, not a post-cutover retrospective.
  • Latency numbers are load-test, not production. No production p99 numbers are reported; load tests are synthetic continuous hits for 4 minutes.
  • No accuracy / business-impact numbers. The post is infrastructure-focused: the original fraud-detection decision quality (precision / recall / tagging capability) is mentioned as a comparison requirement ("we must be able to compare the performance between the new and the old version of a model") but no numbers are shared.
  • zflow is named but not detailed. The library is referred to via an external LinkedIn article on zflow — internals (ASL generation, caching, step-reuse, model registry semantics) are not explained in this post.
  • No pricing details. The 200% cost increase is directional; per-endpoint hourly numbers, aggregate fleet cost, and idle-capacity strategy (auto-scale-to-zero, serverless inference, multi-model endpoints) are not mentioned — this post predates SageMaker Multi-Model Endpoints becoming common.
  • The inference pipeline model's second container is framework-plural. The post names XGBoost as the example but explicitly lists PyTorch and TensorFlow as also in scope; no per-framework numbers are given.
  • No PII / regulatory framing. Deferred-payment fraud risk has obvious regulatory surface (GDPR, PSD2 for Europe) but the post doesn't touch compliance aspects.

Source

sources/2021-02-15-zalando-a-machine-learning-pipeline-with-real-time-inference
