PATTERN Cited by 1 source

ML Platform as Internal Consulting Team

Intent

Split ML ownership between two groups: over a hundred product teams that own their own ML work (data scientists and ML engineers embedded in business domains), and a handful of central teams that build and operate the shared tools and also run an internal consulting arm (pair programming, trainings, architectural advice) to help product teams apply best practice. Research is a separate central team, not dispersed into product teams.

Context

Works well at large orgs (50+ product teams doing ML) where:

Every product team owns a distinct business domain — pushing a single centralised ML team to serve all of them does not scale.
ML work is heterogeneous enough (recommender systems, demand forecasting, size recommendation, fraud detection, etc.) that per-domain expertise matters.
But the tooling stack benefits from being shared — letting each product team build its own Step Functions + SageMaker wrapper would duplicate hundreds of person-months.
Consulting + training has non-linear returns — one central ML consultant pair-programming with a product team for two weeks raises that team's capability durably.

Zalando's organisational decomposition (2022)

From the canonical source (sources/2022-04-18-zalando-zalandos-machine-learning-platform):

Product teams (≥ 100) — each has its own software engineers and applied scientists. Own their business domain's ML work end-to-end.
Datalab + HPC team (central) — operates Datalab (hosted JupyterHub + R Studio + pre-wired data access) and the GPU HPC cluster.
zflow + pipeline monitoring (central, two teams) — build + maintain systems/zflow and the Backstage-based ML portal. Shared across all product teams.
ML consultants (central) — pair programming, architectural advice, and training with product teams. Explicitly not a build-for-product-teams arm; a capability-builder arm.
Research team (central) — explores state-of-the-art in algorithmics, deep learning, and "other branches of AI."
Data science community (cross-cutting, not a team per se) — platforms for "best practices from internal teams, academia, and the rest of the industry through expert talks, workshops, reading groups, and an annual internal conference."
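The decomposition above can be written down as a small data model, which makes the headcount and the split by function easy to check. This is only an illustrative sketch: the `Unit` class, its field names, and the `kind` labels are hypothetical, and the team counts are taken from the list above (the ML consultants are assumed to form one team, and the community is counted as zero teams because it is not a team per se).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One central unit from the list above (hypothetical model, not a Zalando artifact)."""
    name: str
    kind: str   # "tools", "consulting", "research", or "community"
    teams: int  # number of teams named in the post; 0 for non-team groupings


# Team counts follow the list above; the consultants are assumed to be one team.
CENTRAL_UNITS = [
    Unit("Datalab + HPC", "tools", 1),
    Unit("zflow + pipeline monitoring", "tools", 2),
    Unit("ML consultants", "consulting", 1),
    Unit("Research", "research", 1),
    Unit("Data science community", "community", 0),
]


def central_team_count(units):
    """Total number of central teams across all units."""
    return sum(u.teams for u in units)


def functions_covered(units):
    """Distinct functions the central units provide, sorted alphabetically."""
    return sorted({u.kind for u in units})


print(central_team_count(CENTRAL_UNITS))
print(functions_covered(CENTRAL_UNITS))
```

Under these assumptions the model yields five central teams spread over four functions (tools, consulting, research, community), serving 100 or more product teams.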

Canonical disclosure

Verbatim from the 2022 post:

"Our experts are assisted by a few central teams which operate and develop some of the aforementioned tools. For example, a dedicated team provides support and improvements to our JupyterHub installation and the HPC cluster. Two teams actively develop zflow and monitoring tools for pipelines. Another group consisting of ML consultants works closely with product teams, offering trainings, architectural advice, and pair programming. A separate research team actively explores and disseminates the state-of-the-art in algorithmics, deep learning, and other branches of AI."

Consequences

Pros:

Product teams own domain expertise. A size-recommendation team's applied scientists know their users in ways a central team cannot.
Tooling consistency. All product teams use the same zflow + SageMaker + Backstage portal stack, so cross-team debugging, onboarding, and knowledge transfer all benefit.
Consulting is a force multiplier. One consultant team raises the ML-engineering altitude of many product teams.
Research is shielded from delivery pressure by being a separate team.
Community events scale knowledge transfer. The annual internal conference + reading groups let patterns travel across 100+ product teams without centralised coordination.

Cons:

Platform-team headcount is real. Zalando names at least five central teams (Datalab/HPC + 2× zflow/monitoring + ML consultants + research) plus community events.
Coordination overhead. A product team that wants a new zflow feature must coordinate with the zflow team; a backlog can form.
Knowledge gap between product teams and platform teams. Consultants help, but the platform team still risks being operationally far from the product team's actual needs.
Research-to-product gap is a separate, hard, recurring problem — a research team by itself does not solve the gap; it needs consultants or research-team-driven blog posts / workshops to close it.

Contrast with alternative org shapes

Centralised ML team (all ML under one roof) — scales to maybe 20–30 engineers; collapses under 100+-product-team load.
Fully decentralised (no central tooling) — each product team reinvents zflow-equivalent. Massive duplication.
Platform team with no consulting arm — tooling is built but adoption suffers; product teams struggle to apply it well without dedicated help.
Platform team with consulting arm but no research team — org falls behind state-of-the-art over time.

Zalando's four-way split (tools + consulting + research + community) is a richer decomposition than most public ML platform descriptions, and is explicitly motivated by the scale (100+ product teams).

Other references

The broader SRE parallel: Zalando's SRE program runs the same distributed-ownership + central-support model: product teams own their services, while a unified DF SRE team provides centralized SRE primitives + advice. See patterns/unified-sre-team-over-federated.

companies/zalando — org context.
systems/zflow — the flagship shared tool.
systems/datalab-zalando · systems/zalando-hpc-cluster · systems/zalando-ml-portal-backstage — other shared tools.
patterns/managed-services-over-custom-ml-platform — the technical pattern that justifies investing in a shared Python wrapper.
patterns/python-dsl-wrapping-cloudformation — what the platform team actually builds.