NETFLIX 2024-07-22

Netflix — Supporting Diverse ML Systems at Netflix

Summary

Netflix's Machine Learning Platform (MLP) team describes how Metaflow — the open-source ML framework they started — is integrated with Netflix's internal production stack to support hundreds of diverse Metaflow projects running in production. The authors frame the design as a foundational layer with integrations (data, compute, orchestration, deployment) plus team-specific domain libraries on top, so that practitioners can pick the path from prototype to production that fits their use case. Five layers are covered, with one named example project each: Data (the Fast Data library on top of Iceberg on S3, used by Netflix's Content Knowledge Graph entity-resolution pipeline processing ~1 billion title pairs); Compute (Titus replaces the AWS Batch / Kubernetes backends of open-source Metaflow, with portable execution environments used by the "Explainer flow" that dynamically composes another model's environment at runtime); Orchestration (Maestro replaces Step Functions / Argo / Airflow, with event-triggering powering the business-critical Content Decision Making system covering >260M subscribers across 190 countries); Deployment — Cache (metaflow.Cache + metaflow.Hosting combined with a Streamlit app for interactive content-performance visualisation); Deployment — Hosting (Metaflow Hosting is decorator-driven REST with scale-to-zero, tightly integrated with Netflix's microservice infra, used via Amber — the media feature store — to compute and cache media features on demand via asynchronous Hosting queues). The integrations are implemented through Metaflow's extension mechanism (a public but not-yet-stable API at github.com/Netflix/metaflow-extensions-template). Future work: a better versioning layer for artifact/model management, and making Metaflow models more integrable into non-Python environments (e.g. JVM edge services).

Key takeaways

  1. Central thesis: one foundational ML platform + many domain libraries, not one shape for all projects. "Given the very diverse set of ML and AI use cases we support — today we have hundreds of Metaflow projects deployed internally — we don't expect all projects to follow the same path from prototype to production. Instead, we provide a robust foundational layer with integrations to our company-wide data, compute, and orchestration platform, as well as various paths to deploy applications to production smoothly. On top of this, teams have built their own domain-specific libraries to support their specific use cases and needs." Canonical instance of patterns/foundational-platform-plus-domain-libraries (Source: sources/2024-07-22-netflix-supporting-diverse-ml-systems-at-netflix).

  2. Integrations are the superpower, not the ergonomic API. "While human-friendly APIs are delightful, it is really the integrations to our production systems that give Metaflow its superpowers. Without these integrations, projects would be stuck at the prototyping stage, or they would have to be maintained as outliers outside the systems maintained by our engineering teams, incurring unsustainable operational overhead." The load-bearing lesson for internal ML platform strategy.

  3. Extension mechanism is how integrations are injected. The integrations ship via the metaflow-extensions-template public template (but "subject to change, and hence not a part of Metaflow's stable API yet") — concepts/metaflow-extension-mechanism.

  4. Data — Fast Data on top of Iceberg. "Our main data lake is hosted on S3, organized as Apache Iceberg tables. For ETL and other heavy lifting of data, we mainly rely on Apache Spark. In addition to Spark, we want to support last-mile data processing in Python, addressing use cases such as feature transformations, batch inference, and training. Occasionally, these use cases involve terabytes of data, so we have to pay attention to performance." Two interfaces: metaflow.Table parses Iceberg (or legacy Hive) metadata, resolves partitions + Parquet files, supports writes too (recently added); metaflow.MetaflowDataFrame downloads Parquet using Metaflow's high-throughput S3 client "directly to the process' memory, which often outperforms reading of local files." Arrow in-memory representation + zero-copy to Pandas / Polars / internal C++ libraries. Dependency discipline: instead of pinning PyArrow, the library relies only on the stable Arrow C data interface ("in the style of nanoarrow"), producing a "hermetically sealed library with no external dependencies." concepts/last-mile-data-processing.
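The "directly to the process' memory" idea can be sketched with a stdlib-only stand-in — a thread pool in place of Metaflow's high-throughput S3 client, and a fake fetch in place of a ranged S3 GET. `fetch_shard`, `load_partition`, and the keys are all hypothetical names, not Fast Data's actual API:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for fetching one Parquet shard; a real implementation would
# issue an S3 GET and return the raw object bytes.
def fetch_shard(key: str) -> bytes:
    return f"parquet-bytes-for-{key}".encode()

def load_partition(keys: list, workers: int = 16) -> dict:
    """Download all shards of a partition concurrently, straight into memory."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(keys, pool.map(fetch_shard, keys)))

shards = load_partition([f"s3://lake/table/part-{i}.parquet" for i in range(4)])
```

In the real library the downloaded buffers would be exposed through the Arrow C data interface, so Pandas/Polars get zero-copy views rather than deserialized copies.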

  5. Data example: Content Knowledge Graph entity resolution at ~1 billion pairs. "A key challenge in creating a knowledge graph is entity resolution. There may be many different representations of slightly different or conflicting information about a title which must be resolved. This is typically done through a pairwise matching procedure for each entity which becomes non-trivial to do at scale." Uses Fast Data + Metaflow's foreach construct to load "approximately a billion pairs" and shard matching across many parallel tasks. Each task reads via Table+MetaflowDataFrame, matches with Pandas, writes a shard; table is committed on completion.
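The foreach fan-out over ~1 billion pairs reduces to shard assignment: split the pair index space into contiguous chunks, one per parallel task. A hedged stdlib sketch — `make_shards` and the task count are illustrative assumptions, not Netflix's pipeline code:

```python
# Split [0, num_pairs) into contiguous shards, one per foreach task.
# Each task would then read its shard via Table + MetaflowDataFrame,
# run pairwise matching, and write one output shard.
def make_shards(num_pairs: int, num_tasks: int) -> list:
    size = -(-num_pairs // num_tasks)  # ceiling division
    return [range(i, min(i + size, num_pairs)) for i in range(0, num_pairs, size)]

shards = make_shards(1_000_000_000, 5_000)  # ~200k pairs per task
assert sum(len(s) for s in shards) == 1_000_000_000  # every pair covered once
```

Committing the output table only after all shards complete gives the all-or-nothing semantics the post describes.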

  6. Compute — Titus replaces open-source AWS Batch / Kubernetes backend. "Whereas open-source users of Metaflow rely on AWS Batch or Kubernetes as the compute backend, we rely on our centralized compute-platform, Titus. Under the hood, Titus is powered by Kubernetes, but it provides a thick layer of enhancements over off-the-shelf Kubernetes, to make it more observable, secure, scalable, and cost-efficient." Targeting @titus delivers these "battle-hardened features out of the box, with no in-depth technical knowledge or engineering required from the ML engineers or data scientist end." systems/netflix-titus.

  7. Dependency management — @conda + @pypi + portable environments. "Metaflow provides support for dependency management out of the box. Originally, we supported only @conda, but based on our work on Portable Execution Environments, open-source Metaflow gained support for @pypi a few months ago as well." The key property: developers never have to build or manage Docker images themselves. concepts/portable-execution-environment.

  8. Compute example: Explainer flow composes environments at runtime. "Stakeholders like to understand why models produce a certain output and why their behavior changes over time... there are several ways to provide explainability to models but one way is to train an explainer model based on each trained model." The higher-order flow takes "a full execution environment of another training system as an input and produce a model based on it". A build_environment step uses the metaflow environment command to build a combined environment containing both the input model's dependencies and the explainer model's dependencies; the environment is given a unique name keyed on run ID + model type; the downstream train_explainer step operates inside that environment to both access the input model and train the explainer. patterns/dynamic-environment-composition. Unlike vanilla @conda/@pypi, portable environments let users "fetch those environments directly at execution time as opposed to at deploy time."
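A hedged sketch of the composition step: merge two dependency sets and derive a unique environment name from run ID + model type. `compose_environment` and the naming scheme are assumptions for illustration, not Metaflow's actual API:

```python
import hashlib

def compose_environment(run_id: str, model_type: str,
                        model_deps: dict, explainer_deps: dict):
    """Merge the input model's dependencies with the explainer's, and derive
    a deterministic, unique environment name keyed on run ID + model type."""
    merged = {**model_deps, **explainer_deps}  # explainer pins win on conflict
    digest = hashlib.sha256(repr(sorted(merged.items())).encode()).hexdigest()[:8]
    return f"explainer-{model_type}-{run_id}-{digest}", merged

name, env = compose_environment("run42", "ranker",
                                {"torch": "2.1"}, {"shap": "0.45"})
```

Because the name is deterministic in its inputs, a downstream step can fetch the exact same environment at execution time rather than at deploy time.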

  9. Orchestration — Maestro replaces AWS Step Functions / Argo / Airflow. "If data is the fuel of ML and the compute layer is the muscle, then the nerves must be the orchestration layer." Open-source Metaflow supports Step Functions (supported for years), Argo Workflows (Kubernetes-native), and Airflow. Netflix internally runs on Maestro (scalability + HA + usability characteristics covered in a separate 2024 Netflix post on Maestro). Event-triggering is the named feature that often goes overlooked: "it allows a team to integrate their Metaflow flows to surrounding systems upstream (e.g. ETL workflows), as well as downstream (e.g. flows managed by other teams), using a protocol shared by the whole organization." concepts/event-triggering-orchestration.
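The event-triggering protocol can be illustrated with a minimal in-process pub/sub stand-in: an upstream ETL publishes a named event, and any flow registered on that event is started. `EventBus` and the event names are illustrative; Maestro's actual interface is not described in the post:

```python
from collections import defaultdict

class EventBus:
    """Toy stand-in for an organization-wide event protocol."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def on(self, event: str, start_flow):
        # A downstream team registers a flow against an event name.
        self.subscribers[event].append(start_flow)

    def publish(self, event: str, payload: dict):
        # An upstream system (e.g. an ETL workflow) emits the event,
        # triggering every subscribed flow.
        for start_flow in self.subscribers[event]:
            start_flow(payload)

started = []
bus = EventBus()
bus.on("etl.table_ready", lambda p: started.append(("TrainingFlow", p["table"])))
bus.publish("etl.table_ready", {"table": "content_features"})
```

The shared event names are what decouple teams: neither side needs to know the other's deployment details, only the protocol.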

  10. Orchestration example: Content Decision Making at 260M+ / 190 countries. "We support a massive scale of over 260M subscribers spanning over 190 countries representing hugely diverse cultures and tastes, all of whom we want to delight with our content slate. Reflecting the breadth and depth of the challenge, the systems and models focusing on the question have grown to be very sophisticated." The high-level diagram mixes gray boxes (partner-team integrations), green boxes (ETL pipelines), and blue boxes (Metaflow flows) — "these boxes encapsulate hundreds of advanced models and intricate business logic, handling massive amounts of data daily." Operational claim: "Despite its complexity, the system is managed by a relatively small team of engineers and data scientists autonomously."

  11. Deployment — precomputed + cached predictions (Cache). "Not all API-based deployments require real-time evaluation... we have a number of business-critical applications where some or all predictions can be precomputed, guaranteeing the lowest possible latency and operationally simple high availability at the global scale." Officially supported pattern: scheduled Metaflow job aggregates → writes to metaflow.Cache key-value store → API served by metaflow.Hosting. Netflix's internal cache; open-source equivalents would be "Amazon ElastiCache or DynamoDB." concepts/precomputed-predictions-api.
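A stdlib sketch of the officially supported pattern, with a plain dict standing in for metaflow.Cache (or ElastiCache/DynamoDB in the open-source equivalents the post names) and plain functions for the scheduled job and the Hosting API. All names and the placeholder model are hypothetical:

```python
cache = {}  # stand-in for metaflow.Cache key-value store

def predict(member_id: str) -> float:
    return float(len(member_id))  # placeholder model, not a real predictor

def nightly_batch_job(members: list) -> None:
    """Scheduled Metaflow job: precompute every prediction once."""
    for member_id in members:
        cache[f"score:{member_id}"] = predict(member_id)

def serve(member_id: str) -> float:
    """Hosting API path: a pure cache lookup, no model call at request time."""
    return cache[f"score:{member_id}"]

nightly_batch_job(["alice", "bob"])
```

The serving path never touches the model, which is what buys the "lowest possible latency and operationally simple high availability" the post claims.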

  12. Cache example: Content performance visualisation via Streamlit. "A daily scheduled Metaflow job computes aggregate quantities of interest in parallel. The job writes a large volume of results to an online key-value store using metaflow.Cache. A Streamlit app houses the visualization software and data aggregation logic. Users can dynamically change parameters of the visualization application and in real-time a message is sent to a simple Metaflow hosting service which looks up values in the cache, performs computation, and returns the results as a JSON blob to the Streamlit application." patterns/precompute-then-api-serve.

  13. Deployment — real-time hosting (Metaflow Hosting). "Metaflow Hosting is specifically geared towards hosting artifacts or models produced in Metaflow. This provides an easy to use interface on top of Netflix's existing microservice infrastructure, allowing data scientists to quickly move their work from experimentation to a production grade web service that can be consumed over a HTTP REST API with minimal overhead." Listed benefits: "Simple decorator syntax to create RESTFull endpoints. The back-end auto-scales the number of instances used to back your service based on traffic. The back-end will scale-to-zero if no requests are made to it after a specified amount of time thereby saving cost particularly if your service requires GPUs to effectively produce a response. Request logging, alerts, monitoring and tracing hooks to Netflix infrastructure." Compared explicitly to AWS SageMaker Model Hosting, "but tightly integrated with our microservice infrastructure." systems/netflix-metaflow-hosting.
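The "simple decorator syntax" idea can be sketched with a hypothetical endpoint registry; this is an illustration of the decorator-to-REST shape, not Metaflow Hosting's real API:

```python
ENDPOINTS = {}

def endpoint(path: str):
    """Hypothetical decorator: register a function as a REST handler."""
    def register(fn):
        ENDPOINTS[path] = fn
        return fn
    return register

@endpoint("/predict")
def predict(payload: dict) -> dict:
    # Placeholder scoring logic standing in for a hosted model artifact.
    return {"score": len(payload.get("title", ""))}

def handle(path: str, payload: dict) -> dict:
    """Toy dispatcher standing in for the HTTP layer."""
    return ENDPOINTS[path](payload)
```

In the real system the platform owns everything outside the decorated function: auto-scaling, scale-to-zero, logging, alerts, and tracing hooks.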

  14. Hosting example: Amber — feature store with on-demand feature compute via Hosting queues. "While Amber is a feature store, precomputing and storing all media features in advance would be infeasible. Instead, we compute and cache features in an on-demand basis... When a service requests a feature from Amber, it computes the feature dependency graph and then sends one or more asynchronous requests to Metaflow Hosting, which places the requests in a queue, eventually triggering feature computations when compute resources become available. Metaflow Hosting caches the response, so Amber can fetch it after a while. We could have built a dedicated microservice just for this use case, but thanks to the flexibility of Metaflow Hosting, we were able to ship the feature faster with no additional operational burden." patterns/async-queue-feature-on-demand. concepts/on-demand-feature-compute.
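A synchronous stdlib sketch of the queue-then-cache request path: a feature request is enqueued and returns immediately, a worker computes it later, and the caller polls the response cache. Names and the stand-in compute are assumptions:

```python
import queue

requests = queue.Queue()  # stand-in for Metaflow Hosting's async queue
response_cache = {}       # stand-in for Hosting's response cache

def request_feature(feature_id: str) -> None:
    requests.put(feature_id)  # returns immediately; no result yet

def worker_drain() -> None:
    """Runs when compute resources become available."""
    while not requests.empty():
        fid = requests.get()
        response_cache[fid] = f"feature-value-for-{fid}"  # stand-in compute

def fetch(feature_id: str):
    return response_cache.get(feature_id)  # Amber polls until present

request_feature("shot-embeddings:title-123")
assert fetch("shot-embeddings:title-123") is None  # not computed yet
worker_drain()
```

The caller-side contract is eventual: Amber "can fetch it after a while," trading request latency for not having to precompute every media feature.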

  15. Future work — versioning layer + non-Python environment integration. "We have plans to work on improvements in the versioning layer, which wasn't covered by this article, by giving more options for artifact and model management... Metaflow Hosting models are currently not well integrated into model logging facilities — we plan on working on improving this... Additionally we want to supply more ways Metaflow artifacts and models can be integrated into non-Metaflow environments and applications, e.g. JVM based edge service, so that Python-based data scientists can contribute to non-Python engineering systems easily."

Architecture at a glance

                              ┌──────────────────────────┐
                              │  Team-specific domain    │
                              │  libraries (per project) │
                              └───────────┬──────────────┘
                                          │ built on top of
                    ┌────────────────────────────────────────┐
                    │  Metaflow extension mechanism          │
                    │  (metaflow-extensions-template)        │
                    └─────────┬──────────────┬──────────┬────┘
                              │              │          │
              ┌───────────────┼──────────────┼──────────┼───────────────┐
              │               │              │          │               │
      ┌───────▼────────┐ ┌────▼────────┐ ┌───▼──────┐ ┌─▼────────┐ ┌────▼─────────┐
      │ Data           │ │ Compute     │ │ Orchestr.│ │ Deploy — │ │ Deploy —     │
      │ Fast Data      │ │ @titus      │ │ Maestro  │ │ Cache    │ │ Hosting      │
      │ Table +        │ │ @conda /    │ │ (OSS:    │ │ scheduled│ │ decorator →  │
      │ MetaflowDataFr │ │ @pypi /     │ │ Step Fns │ │ Metaflow │ │ REST         │
      │ ame            │ │ portable    │ │ / Argo / │ │ → Cache  │ │ auto-scale / │
      │ on Iceberg/S3  │ │ envs        │ │ Airflow) │ │ → API    │ │ scale-to-0   │
      │                │ │             │ │ event-   │ │          │ │              │
      │                │ │             │ │ triggered│ │          │ │              │
      └───────┬────────┘ └─────┬───────┘ └────┬─────┘ └─────┬────┘ └──────┬───────┘
              │                │              │             │             │
        Content KG       Explainer flow   Content       Content perf.   Amber
        entity           (env of another  decision      viz (Streamlit)  feature
        resolution       model as input)  making        precompute +     store
        (~1B pairs)                        (260M+ /      cache path      on-demand
                                           190 countries)                compute

Operational numbers

Layer                  | Named data point                                     | Notes
-----------------------|------------------------------------------------------|------------------------------------------------------
Program                | "Hundreds of Metaflow projects deployed internally"  | Approximate program size
Data example           | ~1 billion title pairs                               | Content Knowledge Graph entity-resolution workload
Data example           | Terabytes of data                                    | Typical last-mile data-processing job size
Orchestration example  | 260M+ subscribers / 190+ countries                   | Content Decision Making surface
Orchestration example  | "Hundreds of advanced models"                        | Content Decision Making flow-graph encapsulation
Hosting                | Scale-to-zero                                        | Explicit cost-control benefit for GPU-backed services

Caveats

  • Architecture-overview voice. No fleet sizes, cluster counts, QPS/latency numbers, cost figures, team head-count splits, or named hardware disclosed. Outcomes are framed in programmatic terms ("hundreds of projects", "260M subscribers") rather than measured before/after.
  • No disclosed numbers for individual systems. Titus internal details, Maestro HA properties, Metaflow Hosting auto-scaling thresholds, Cache size/latency, Amber fleet size — none are quantified in this post.
  • Extension mechanism is not stable API. The template at github.com/Netflix/metaflow-extensions-template is "publicly available but subject to change, and hence not a part of Metaflow's stable API yet." Implementers should expect churn.
  • Explainer flow specifics are abstract. "Without going into the details of how this is done exactly, suffice to say that Netflix trains a lot of models, so we need to train a lot of explainers too." The runtime-fetched environment mechanism is described but no numbers on environment count, build time, cache-hit rate, or environment size are given.
  • Metaflow Hosting details deferred to older talk. "Although details have evolved a lot, this old talk still gives a good overview of the service" — the referenced YouTube talk is pre-2024; the current architecture may have diverged.
  • Amber mechanics deferred to prior post. "Amber, our feature store for media" links to Netflix's prior 2023 Scaling Media Machine Learning at Netflix post; the on-demand-compute-via-Hosting-queue pattern is named here but the dependency-graph resolution mechanism and cache-TTL policy are not.
  • Cross-pattern overlap with open-source ecosystem left implicit. The post notes open-source equivalents (AWS Batch, Kubernetes, Step Functions, Argo Workflows, Airflow, ElastiCache, DynamoDB, SageMaker Hosting) but does not benchmark them against the Netflix internals or explain the specific deltas that motivated each replacement.
  • Future work is directional, not scoped. Versioning-layer improvements, model-logging integration, and JVM-bridge work are named as plans without dates or design sketches.
