Metaflow¶
Metaflow is an open-source, human-friendly framework for building and managing data, ML, and AI applications, originally developed at Netflix and released publicly at metaflow.org. Inside Netflix, the same framework underpins hundreds of production ML projects through internal integrations that connect it to Netflix's company-wide data, compute, and orchestration platforms (Source: sources/2024-07-22-netflix-supporting-diverse-ml-systems-at-netflix).
Design posture¶
Netflix's MLP team frames Metaflow as a foundational layer plus integrations, intentionally leaving "team-specific domain libraries" to the product teams that build on top of it. "While human-friendly APIs are delightful, it is really the integrations to our production systems that give Metaflow its superpowers. Without these integrations, projects would be stuck at the prototyping stage, or they would have to be maintained as outliers outside the systems maintained by our engineering teams, incurring unsustainable operational overhead." This is the canonical wiki instance of patterns/foundational-platform-plus-domain-libraries (Source: sources/2024-07-22-netflix-supporting-diverse-ml-systems-at-netflix).
Stack layers (per the 2024-07-22 post)¶
| Layer | Open-source Metaflow target | Netflix-internal target |
|---|---|---|
| Data | S3 / local files | Fast Data on Iceberg |
| Compute | AWS Batch, Kubernetes | @titus on systems/netflix-titus |
| Dependencies | @conda, @pypi | @conda, @pypi, plus portable environments via metaflow-nflx-extensions |
| Orchestration | AWS Step Functions, Argo Workflows, Airflow | Maestro |
| Deployment — precompute | External KV (ElastiCache, DynamoDB) | metaflow.Cache + metaflow.Hosting → see systems/netflix-metaflow-cache |
| Deployment — realtime | N/A in OSS | Metaflow Hosting |
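
The "Deployment, precompute" row describes a batch job that writes predictions into a key-value store and a serving path that only looks them up. The sketch below illustrates that shape in plain Python. It does not use `metaflow.Cache` or any Netflix API: `PredictionCache`, `precompute`, `serve`, and `toy_model` are hypothetical names, and an in-memory dict stands in for an external store such as ElastiCache or DynamoDB.

```python
from typing import Callable, Dict, Iterable, Optional


class PredictionCache:
    """Hypothetical in-memory stand-in for an external key-value store."""

    def __init__(self) -> None:
        self._store: Dict[str, float] = {}

    def put_many(self, items: Dict[str, float]) -> None:
        # Bulk write, as a batch precompute job would do at the end of a run.
        self._store.update(items)

    def get(self, key: str) -> Optional[float]:
        # Point lookup used on the serving path; None means "not precomputed".
        return self._store.get(key)


def precompute(
    keys: Iterable[str],
    model: Callable[[str], float],
    cache: PredictionCache,
) -> None:
    """Batch side: score every key once and store the results."""
    cache.put_many({key: model(key) for key in keys})


def serve(key: str, cache: PredictionCache) -> Optional[float]:
    """Serving side: no model call, only a cache read."""
    return cache.get(key)


def toy_model(title_id: str) -> float:
    """Hypothetical model: the score depends only on the key length."""
    return len(title_id) / 10.0


cache = PredictionCache()
precompute(["t1", "title42"], toy_model, cache)
```

The serving path never invokes the model, so a key that was not part of the batch run returns `None` rather than a fresh prediction. That gap is what the realtime row of the table addresses.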
Extension mechanism¶
"These integrations are implemented through Metaflow's extension
mechanism which is publicly available but subject to change, and
hence not a part of Metaflow's stable API yet." Template:
github.com/Netflix/metaflow-extensions-template.
See concepts/metaflow-extension-mechanism. Netflix's own
extensions package is
github.com/Netflix/metaflow-nflx-extensions,
which is where the portable execution environments feature
originated before @pypi was added to open-source Metaflow.
Representative API primitives cited in the post¶
- `@titus`: run a step on Titus (internal compute backend).
- `@conda` / `@pypi`: declarative Python dependency management per step.
- `metaflow environment` command: CLI for building/fetching portable environments by name; used in the Explainer flow higher-order training pattern (see patterns/dynamic-environment-composition).
- `foreach` construct: "horizontal scaling" primitive used to shard the Content Knowledge Graph's ~1-billion-pair entity resolution across many Metaflow tasks.
- `metaflow.Table`: Iceberg/Hive metadata + partition + Parquet-file resolution, with a write path recently added.
- `metaflow.MetaflowDataFrame`: in-process Parquet reader over the Metaflow high-throughput S3 client + Arrow.
- `metaflow.Cache`: precomputed predictions key-value interface (paired with `metaflow.Hosting`).
- `metaflow.Hosting`: decorator-driven REST endpoints with auto-scaling and scale-to-zero.
- Event triggering: flows register as producers/consumers of events so that Metaflow flows integrate cleanly with surrounding ETL and team-owned downstream flows. See concepts/event-triggering-orchestration.
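
The `foreach` primitive above follows a split, fan-out, join shape. In open-source Metaflow this is written as `self.next(self.some_step, foreach="attribute_name")`, with a later join step that receives the branch results as `inputs`. The sketch below reproduces only that shape in plain Python so it runs without Metaflow installed. The function names, the toy matching rule, and the sample titles are all hypothetical, and the shards run sequentially here rather than as separate tasks.

```python
from itertools import combinations, islice
from typing import Iterable, Iterator, List, Tuple

Pair = Tuple[str, str]


def shard(pairs: Iterable[Pair], shard_size: int) -> Iterator[List[Pair]]:
    """Split a stream of pairs into consecutive chunks of at most shard_size."""
    it = iter(pairs)
    while True:
        chunk = list(islice(it, shard_size))
        if not chunk:
            return
        yield chunk


def resolve_shard(chunk: List[Pair]) -> List[Pair]:
    """Per-shard work (hypothetical rule): keep case-insensitive title matches."""
    return [(a, b) for a, b in chunk if a.lower() == b.lower()]


def join(results: Iterable[List[Pair]]) -> List[Pair]:
    """Merge per-shard outputs, as a join step would after the fan-out."""
    merged: List[Pair] = []
    for result in results:
        merged.extend(result)
    return merged


# Hypothetical sample data: 5 titles give 10 candidate pairs.
titles = ["Dark", "dark", "Lupin", "LUPIN", "Ozark"]
shards = list(shard(combinations(titles, 2), shard_size=4))

# Each shard would be one task in a real fan-out; here they run in a loop.
matches = join(resolve_shard(s) for s in shards)
```

Because each shard is processed independently, the per-shard function needs no shared state, which is what allows the work to be spread across many tasks.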
Scale disclosed¶
- "Hundreds of Metaflow projects deployed internally" at Netflix.
- Individual example workloads named in the post:
  - ~1 billion title pairs processed via `foreach` + Fast Data (Content Knowledge Graph entity resolution).
  - 260M+ subscribers across 190+ countries served by the Content Decision Making flow graph (orchestrated by Maestro).
No fleet sizes, compute costs, p99 serving latencies, or head-count figures are given.
Seen in¶
Related¶
- companies/netflix
- systems/netflix-titus · systems/netflix-maestro · systems/netflix-metaflow-fast-data · systems/netflix-metaflow-hosting · systems/netflix-metaflow-cache · systems/netflix-amber
- systems/apache-iceberg · systems/apache-arrow · systems/apache-spark · systems/aws-step-functions · systems/argo-workflows
- concepts/foundational-ml-platform · concepts/portable-execution-environment · concepts/metaflow-extension-mechanism · concepts/event-triggering-orchestration
- patterns/foundational-platform-plus-domain-libraries · patterns/dynamic-environment-composition · patterns/precompute-then-api-serve · patterns/async-queue-feature-on-demand