Metaflow

Metaflow is an open-source, human-friendly framework for building and managing data, ML, and AI applications, originally developed at Netflix and released publicly at metaflow.org. Inside Netflix, the same framework underpins hundreds of production ML projects via a rich set of internal integrations that bolt onto Netflix's company-wide data, compute, and orchestration platforms (Source: sources/2024-07-22-netflix-supporting-diverse-ml-systems-at-netflix).

Design posture

Netflix's MLP team frames Metaflow as a foundational layer plus integrations, intentionally leaving "team-specific domain libraries" for product teams to build on top. "While human-friendly APIs are delightful, it is really the integrations to our production systems that give Metaflow its superpowers. Without these integrations, projects would be stuck at the prototyping stage, or they would have to be maintained as outliers outside the systems maintained by our engineering teams, incurring unsustainable operational overhead." Canonical wiki instance of patterns/foundational-platform-plus-domain-libraries (Source: sources/2024-07-22-netflix-supporting-diverse-ml-systems-at-netflix).

Stack layers (per the 2024-07-22 post)

Layer                   | Open-source Metaflow target                   | Netflix-internal target
Data                    | S3 / local files                              | Fast Data on Iceberg
Compute                 | AWS Batch, Kubernetes                         | @titus on systems/netflix-titus
Dependencies            | @conda, @pypi                                 | @conda, @pypi, plus portable environments via metaflow-nflx-extensions
Orchestration           | AWS Step Functions, Argo Workflows, Airflow   | Maestro
Deployment (precompute) | external KV (ElastiCache, DynamoDB)           | metaflow.Cache + metaflow.Hosting → see systems/netflix-metaflow-cache
Deployment (realtime)   | N/A in OSS                                    | Metaflow Hosting
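The precompute deployment row is a batch-scores-into-a-key-value-store pattern: an offline flow writes model outputs to a KV store, and the request path reduces to a lookup. A minimal pure-Python sketch of that pattern (the dict stands in for ElastiCache/DynamoDB/metaflow.Cache, and the function names are invented for illustration, not Metaflow APIs):

```python
# Schematic precompute-then-serve pattern: a batch job writes model
# outputs to a key-value store; the serving path is a plain lookup.
# The dict below stands in for an external KV store.

def batch_precompute(member_ids, score_fn, kv_store):
    """Run the expensive model once per member and persist the results."""
    for member_id in member_ids:
        kv_store[f"recs:{member_id}"] = score_fn(member_id)

def serve(member_id, kv_store, default=None):
    """Request path: a constant-time lookup, no model inference."""
    return kv_store.get(f"recs:{member_id}", default)

store = {}
batch_precompute([1, 2, 3], lambda m: [m * 10, m * 11], store)
print(serve(2, store))   # precomputed scores for member 2
print(serve(99, store))  # cold key falls back to the default
```

The trade is the usual one: precompute pays storage and staleness to make serving a cheap lookup, while the realtime row (Metaflow Hosting) keeps a live model behind an endpoint instead.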

Extension mechanism

"These integrations are implemented through Metaflow's extension mechanism which is publicly available but subject to change, and hence not a part of Metaflow's stable API yet." Template: github.com/Netflix/metaflow-extensions-template. See concepts/metaflow-extension-mechanism. Netflix's own extensions package is github.com/Netflix/metaflow-nflx-extensions, which is where the portable execution environments feature originated before @pypi was added to open-source Metaflow.
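As rough orientation, the public template organizes an extension as a `metaflow_extensions` namespace package that Metaflow discovers at import time; the layout below follows the template's conventions and, like the mechanism itself, is subject to change:

```
my-extensions-repo/
└── metaflow_extensions/      # shared namespace package discovered by Metaflow
    └── myorg/                # organization-specific sub-namespace (name is yours)
        ├── plugins/          # custom step decorators (e.g. something like @titus)
        ├── toplevel/         # symbols surfaced via `import metaflow`
        └── config/           # overrides of Metaflow configuration defaults
```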

Representative API primitives cited in the post

  • @titus — run step on Titus (internal compute backend).
  • @conda / @pypi — declarative Python dependency management per step.
  • metaflow environment command — CLI for building/fetching portable environments by name; used in the Explainer flow higher-order training pattern (see patterns/dynamic-environment-composition).
  • foreach construct — "horizontal scaling" primitive used to shard the Content Knowledge Graph's ~1-billion-pair entity resolution across many Metaflow tasks.
  • metaflow.Table — Iceberg/Hive metadata + partition + Parquet-file resolution, with a write path recently added.
  • metaflow.MetaflowDataFrame — in-process Parquet reader over the Metaflow high-throughput S3 client + Arrow.
  • metaflow.Cache — precomputed predictions key-value interface (paired with metaflow.Hosting).
  • metaflow.Hosting — decorator-driven REST endpoints with auto-scaling and scale-to-zero.
  • Event triggering — flows register as producers/consumers of events so that Metaflow flows integrate cleanly with surrounding ETL and team-owned downstream flows. See concepts/event-triggering-orchestration.
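The foreach fan-out behind the ~1-billion-pair entity resolution boils down to computing shard boundaries once, then letting each task process one slice. A schematic pure-Python sketch of that sharding arithmetic (make_shards is invented here for illustration; the Metaflow wiring it would feed is described in the docstring):

```python
def make_shards(n_items, n_tasks):
    """Split n_items work units into n_tasks near-equal [start, end) ranges.

    In a Metaflow flow, this list would be stored on self and fanned out
    with self.next(self.process_shard, foreach="shards"); each parallel
    task would then read its assigned range from self.input.
    """
    base, extra = divmod(n_items, n_tasks)
    shards, start = [], 0
    for i in range(n_tasks):
        size = base + (1 if i < extra else 0)  # spread the remainder evenly
        shards.append((start, start + size))
        start += size
    return shards

# e.g. ~1e9 candidate title pairs spread across 1,000 tasks
shards = make_shards(1_000_000_000, 1000)
print(shards[0], shards[-1])
```

Each task then only needs the data for its own range, which is what pairing foreach with the Fast Data layer's partition-aware reads buys at this scale.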

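The producer/consumer idea behind event triggering can be shown schematically with an in-memory bus; this illustrates the pattern only (EventBus and its methods are invented for this sketch, not Metaflow or Maestro APIs):

```python
# Schematic producer/consumer event triggering: an in-memory bus stands
# in for the real event system; registered "flows" run when an upstream
# producer publishes a matching event.
from collections import defaultdict

class EventBus:
    def __init__(self):
        self._consumers = defaultdict(list)

    def on(self, event_name, flow):
        """Register a downstream flow as a consumer of an event."""
        self._consumers[event_name].append(flow)

    def publish(self, event_name, payload):
        """An upstream ETL/flow produces an event; consumers run in turn."""
        return [flow(payload) for flow in self._consumers[event_name]]

bus = EventBus()
bus.on("table.updated", lambda p: f"retrain on {p['table']}")
bus.on("table.updated", lambda p: f"refresh dashboard for {p['table']}")
results = bus.publish("table.updated", {"table": "plays_daily"})
print(results)
```

The point of the real mechanism is the same decoupling: the producing ETL does not need to know which team-owned downstream flows consume its events.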
Scale disclosed

  • "Hundreds of Metaflow projects deployed internally" at Netflix.
  • Individual example workloads named in the post:
      • ~1 billion title pairs processed via foreach + Fast Data (Content Knowledge Graph entity resolution).
      • 260M+ subscribers across 190+ countries served by the Content Decision Making flow graph (orchestrated by Maestro).

No fleet sizes, compute costs, p99 serving latencies, or head-count figures are given.
