Skip to content

PATTERN Cited by 1 source

Foundational platform + domain libraries

Build a minimal, reusable foundational platform. Let each team ship domain-specific libraries on top of it. This is the organisational shape that makes an internal platform viable across a long tail of diverse use cases without either forcing uniformity or drowning the platform team in per-team customisation.

Problem

A central platform team wants to serve N consumer teams, each with a different use case. Two failure modes:

  1. One-size-fits-all — the platform tries to encode every team's workflow. It either becomes too rigid (teams work around it) or too general (no one uses its abstractions). See the "outliers outside the systems maintained by our engineering teams, incurring unsustainable operational overhead" failure mode.
  2. Every team rolls its own — no shared substrate. Each team builds a bespoke path from prototype to production; there's nowhere to pay down operational debt.

Solution

Split the platform into two layers with a published contract between them:

Foundational layer (platform team):

  • Human-friendly API (the "SDK").
  • Integrations to production systems (data, compute, orchestrator, serving) via a published extension mechanism.
  • Canonical deployment paths (e.g. precompute-via-cache, real-time-via-decorator).

Domain-library layer (consumer teams):

  • Team-specific abstractions on top of the foundation (e.g. KYC agent orchestration, ranking-specific retrievers, explainer flows).
  • Configuration management.
  • Domain-specific glue.

The extension mechanism is the seam. It lets the platform team swap integrations without breaking user code, and lets a team build a new deployment path without forking the platform.

Canonical industrial instance

Netflix's MLP team describes Metaflow in exactly these terms:

"We provide a robust foundational layer with integrations to our company-wide data, compute, and orchestration platform, as well as various paths to deploy applications to production smoothly. On top of this, teams have built their own domain-specific libraries to support their specific use cases and needs." (Source: sources/2024-07-22-netflix-supporting-diverse-ml-systems-at-netflix)

Scale: "hundreds of Metaflow projects deployed internally" — each with its own domain library layer on top of the same Metaflow foundation. The Content Decision Making system "is managed by a relatively small team of engineers and data scientists autonomously" because the team owns its domain library; the platform team doesn't have to know about content decisions.

Trade-offs

  • Fragmentation risk: if the extension mechanism is weak, teams diverge silently. Netflix's extension mechanism is explicit and public but "subject to change, and hence not a part of Metaflow's stable API yet" — a deliberate trade-off for speed.
  • Migration risk for domain libraries when the foundation changes. Mitigated by keeping the foundation narrow.
  • Initial investment is higher than just shipping a feature- per-team platform. Pays off at the 10th team, not the first.

When not to use it

  • When there are only a handful of ML use cases and they're very similar — full-stack is fine.
  • When no one on the platform team is willing to maintain the extension contract — you just get fragmentation without the upside.

Seen in

Last updated · 550 distilled / 1,221 read