Metaflow Fast Data (Netflix)¶
Fast Data is the Netflix-internal Metaflow library for high-throughput Python access to Netflix's data warehouse. It is the data-layer integration used by Netflix Metaflow projects for last-mile data processing — feature transformations, batch inference, training — in cases where Spark is overkill but the working set is still "occasionally… terabytes of data" (Source: sources/2024-07-22-netflix-supporting-diverse-ml-systems-at-netflix).
Two interfaces¶
┌─────────────────────────────────────────────────────────┐
│ Fast Data                                               │
│                                                         │
│ metaflow.Table                                          │
│ ─────────────────                                       │
│ • Parses Iceberg (or legacy Hive) table metadata        │
│ • Resolves partitions + Parquet file list               │
│ • Read path + (recently added) write path               │
│                                                         │
│         │ (list of Parquet files)                       │
│         ▼                                               │
│ metaflow.MetaflowDataFrame                              │
│ ──────────────────────────                              │
│ • Downloads Parquet via Metaflow's high-throughput S3   │
│   client, directly into process memory                  │
│ • Uses Arrow as in-memory representation                │
│ • Zero-copy handoff to Pandas / Polars / internal C++   │
│   libraries                                             │
└─────────────────────────────────────────────────────────┘
Component pages: [Iceberg](<./apache-iceberg.md>), [Arrow](<./apache-arrow.md>).
> "The Table object is responsible for interacting with the Netflix data warehouse which includes parsing Iceberg (or legacy Hive) table metadata, resolving partitions and Parquet files for reading. Recently, we added support for the write path, so tables can be updated as well using the library."

> "Once we have discovered the Parquet files to be processed, MetaflowDataFrame takes over: it downloads data using Metaflow's high-throughput S3 client directly to the process' memory, which often outperforms reading of local files."
Dependency-graph discipline: stable Arrow C ABI¶
Most ML/data libraries in the Python ecosystem take a hard dependency on a specific PyArrow version, which makes it easy to hit unresolvable dependency graphs when Fast Data's C++ extensions and user code disagree on PyArrow. Netflix sidesteps this:
> "(Py)Arrow is a dependency of many ML and data libraries, so we don't want our custom C++ extensions to depend on a specific version of Arrow, which could easily lead to unresolvable dependency graphs. Instead, in the style of nanoarrow, our Fast Data library only relies on the stable Arrow C data interface, producing a hermetically sealed library with no external dependencies."
References: nanoarrow, Arrow C data interface. This is a concrete instance of the principle "build against the ABI, not the release" for ML platform code that has to compose with many user environments.
Example workload: Content Knowledge Graph¶
Netflix's Content Knowledge Graph — relationships between titles, actors, and other film/series attributes — relies on Fast Data + Metaflow's foreach to resolve "approximately a billion pairs" for entity resolution. Each Metaflow task loads a shard via Table + MetaflowDataFrame, matches with Pandas, and writes an output shard; the output table is committed once all tasks finish.
> "A key challenge in creating a knowledge graph is entity resolution. There may be many different representations of slightly different or conflicting information about a title which must be resolved. This is typically done through a pairwise matching procedure for each entity which becomes non-trivial to do at scale."
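The shard-per-task pattern can be sketched framework-free: fan candidate pairs out into shards, match each shard independently, then combine the results in a join step. The titles and the normalization rule below are made up for illustration; in the real pipeline each shard is loaded via Table + MetaflowDataFrame and matched with Pandas:

```python
from itertools import combinations

def normalize(title: str) -> str:
    # Hypothetical matching rule: lowercase, keep alphanumerics only.
    return "".join(ch for ch in title.lower() if ch.isalnum())

def match_shard(pairs):
    # One "task": pairwise matching over its shard of candidate pairs.
    return [(a, b) for a, b in pairs if normalize(a) == normalize(b)]

records = ["The Crown", "the crown!", "Dark", "DARK ", "Lupin"]
pairs = list(combinations(records, 2))      # candidate pairs for resolution
shards = [pairs[i::3] for i in range(3)]    # foreach-style fan-out: 3 shards

# Join step: each shard's output is combined once all "tasks" finish.
matches = [m for shard in shards for m in match_shard(shard)]
print(matches)  # the two duplicate-title pairs
```

In Metaflow proper, `shards` would drive `self.next(self.resolve, foreach="shards")`, each `resolve` task would handle one shard, and the write-path commit happens after the join — which is how the pairwise procedure stays tractable at a billion pairs.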