Metaflow Fast Data (Netflix)

Fast Data is the Netflix-internal Metaflow library for high-throughput Python access to Netflix's data warehouse. It is the data-layer integration used by Netflix Metaflow projects for last-mile data processing — feature transformations, batch inference, training — in cases where Spark is overkill but the working set is still "occasionally… terabytes of data" (Source: sources/2024-07-22-netflix-supporting-diverse-ml-systems-at-netflix).

Two interfaces

```
┌───────────────────────────────────────────────────────────────┐
│  Fast Data                                                    │
│                                                               │
│   metaflow.Table                                              │
│   ─────────────────                                           │
│   • Parses Iceberg (or legacy Hive) table metadata            │
│   • Resolves partitions + Parquet file list                   │
│   • Read path + (recently added) write path                   │
│                                                               │
│                       │  (list of Parquet files)              │
│                       ▼                                       │
│   metaflow.MetaflowDataFrame                                  │
│   ──────────────────────────                                  │
│   • Downloads Parquet via Metaflow's high-throughput S3       │
│     client, directly into process memory                      │
│   • Uses Arrow as the in-memory representation                │
│   • Zero-copy handoff to Pandas / Polars / internal C++       │
│     libraries                                                 │
└───────────────────────────────────────────────────────────────┘
```

"The Table object is responsible for interacting with the Netflix data warehouse which includes parsing Iceberg (or legacy Hive) table metadata, resolving partitions and Parquet files for reading. Recently, we added support for the write path, so tables can be updated as well using the library."

"Once we have discovered the Parquet files to be processed, MetaflowDataFrame takes over: it downloads data using Metaflow's high-throughput S3 client directly to the process' memory, which often outperforms reading of local files."

Dependency-graph discipline: stable Arrow C ABI

Most ML/data libraries in the Python ecosystem take a hard dependency on a specific PyArrow version, which makes it easy to hit unresolvable dependency graphs when Fast Data's C++ extensions and user code disagree on PyArrow. Netflix sidesteps this:

"(Py)Arrow is a dependency of many ML and data libraries, so we don't want our custom C++ extensions to depend on a specific version of Arrow, which could easily lead to unresolvable dependency graphs. Instead, in the style of nanoarrow, our Fast Data library only relies on the stable Arrow C data interface, producing a hermetically sealed library with no external dependencies."

References: nanoarrow, Arrow C data interface. This is a concrete instance of the principle "build against the ABI, not the release" for ML platform code that has to compose with many user environments.
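What makes this work is that the Arrow C data interface is just two small C structs whose layout is frozen by the spec, so producer and consumer never need to agree on a (Py)Arrow release. A stdlib-only sketch mirroring the `ArrowSchema` struct via `ctypes` (the field layout follows the published spec; the `title_id` column is an illustrative value, not from the source):

```python
import ctypes


class ArrowSchema(ctypes.Structure):
    """Mirror of ArrowSchema from the Arrow C data interface.

    The field layout is frozen by the spec, so any producer/consumer
    pair that targets this ABI can exchange schemas without sharing a
    (Py)Arrow version -- the property nanoarrow and Fast Data exploit.
    """


# Assigned after class creation because the struct references itself.
ArrowSchema._fields_ = [
    ("format", ctypes.c_char_p),    # type encoding, e.g. b"i" = int32
    ("name", ctypes.c_char_p),
    ("metadata", ctypes.c_char_p),
    ("flags", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowSchema))),
    ("dictionary", ctypes.POINTER(ArrowSchema)),
    ("release", ctypes.c_void_p),   # void (*release)(struct ArrowSchema*)
    ("private_data", ctypes.c_void_p),
]

# An int32 column named "title_id" (illustrative, not from the source).
schema = ArrowSchema(format=b"i", name=b"title_id", n_children=0)
```

A companion `ArrowArray` struct carries the buffers themselves; together they are the entire ABI surface a "hermetically sealed" extension has to target.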

Example workload: Content Knowledge Graph

Netflix's Content Knowledge Graph — relationships between titles, actors, and other film/series attributes — relies on Fast Data + Metaflow's foreach to resolve "approximately a billion pairs" for entity resolution. Each Metaflow task loads a shard via Table + MetaflowDataFrame, matches with Pandas, and writes an output shard; the output table is committed once all tasks finish.

"A key challenge in creating a knowledge graph is entity resolution. There may be many different representations of slightly different or conflicting information about a title which must be resolved. This is typically done through a pairwise matching procedure for each entity which becomes non-trivial to do at scale."
