

Last-mile data processing

Last-mile data processing is the class of Python data work that happens after the heavy ETL has already landed data in the warehouse — feature transformations, batch inference, and training-data prep — where Spark is overkill but the working set can still "occasionally… [be] terabytes of data" (Source: sources/2024-07-22-netflix-supporting-diverse-ml-systems-at-netflix).

Why it's a distinct layer

ETL and last-mile processing have different shapes:

Axis       | ETL (Spark)                  | Last mile (Python)
-----------|------------------------------|-----------------------------------------------------
Runtime    | JVM, distributed             | Python process, often single-node or foreach-sharded
API style  | Declarative SQL / DataFrame  | Pandas / Polars / in-memory numpy
Scheduling | Batch DAGs                   | Inside an ML flow step
Coupling   | Loose (output table)         | Tight (lives next to model code)

Last-mile code lives inside the training / inference flow, so the performance bottleneck is "how fast can I get the bytes out of the warehouse into this Python process?" — not "how do I shuffle 10 TB across a cluster?"
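That bottleneck shape can be sketched with stdlib-only code: a high-throughput client overlaps its shard reads instead of issuing them serially, so wall-clock time stops scaling with shard count. `fetch_shard` below is a hypothetical stand-in for one warehouse read, not any real API:

```python
import concurrent.futures
import time

def fetch_shard(shard_id: int) -> bytes:
    """Hypothetical stand-in for pulling one Parquet shard out of the warehouse."""
    time.sleep(0.05)  # simulate per-request network latency
    return bytes([shard_id]) * 1024  # fake payload

shard_ids = list(range(8))

# Serial: wall time grows as n * latency.
t0 = time.perf_counter()
serial = [fetch_shard(s) for s in shard_ids]
serial_s = time.perf_counter() - t0

# Parallel: the waits overlap, as a high-throughput S3 client's would.
t0 = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    parallel = list(pool.map(fetch_shard, shard_ids))
parallel_s = time.perf_counter() - t0

assert serial == parallel  # same bytes either way; only wall-clock time differs
```

The same data arrives in both cases; the parallel path just keeps the Python process fed instead of idle.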

How Metaflow serves the last mile

Netflix's Fast Data library sits exactly in this niche:

  • metaflow.Table — Iceberg/Hive metadata parsing, partition and Parquet file resolution.
  • metaflow.MetaflowDataFrame — pulls Parquet files via the Metaflow high-throughput S3 client directly into process memory ("often outperforms reading of local files").
  • Arrow as the in-memory representation, with zero-copy handoff to Pandas / Polars / internal C++ libraries.
  • A nanoarrow-style, ABI-only dependency on Arrow, so the library doesn't collide with user PyArrow versions.
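The zero-copy handoff has a stdlib miniature in Python's buffer protocol: a memoryview aliases the producer's memory rather than duplicating it, which is the property Arrow's in-memory format gives the consumers above. This is illustrative only, not the Fast Data library's code:

```python
# One buffer owned by the "reader" side...
payload = bytearray(b"\x00" * 8)

# ...handed to a "consumer" without copying: the view aliases the same memory.
view = memoryview(payload)

payload[0] = 0xFF        # mutate through the owner
assert view[0] == 0xFF   # visible through the view: no copy was made

copied = bytes(payload)  # an explicit copy, by contrast, is detached
payload[0] = 0x00
assert copied[0] == 0xFF # the copy kept the old value
```

Arrow applies the same idea at table scale: Pandas, Polars, and C++ consumers read the same Arrow buffers instead of each materializing their own copy.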

The foreach primitive horizontally scales a last-mile step across many tasks — each task reads its own shard via Table + MetaflowDataFrame. The Content Knowledge Graph entity-resolution flow processes ~1 billion title pairs this way.
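The sharding arithmetic behind that fan-out can be sketched without the Metaflow runtime: split the partition list across N tasks so each task touches only its own slice. The partition names, the 1,000-pairs-per-partition figure, and the plain `task` function are all made up for illustration; a real flow would declare the fan-out with `self.next(..., foreach=...)`:

```python
# Hypothetical Iceberg partitions to process in a last-mile step.
partitions = [f"dt=2024-07-{day:02d}" for day in range(1, 31)]
num_tasks = 4

def task(shard: list[str]) -> int:
    """Stand-in for one foreach task: 'read' its shard and return a pair count."""
    return sum(1_000 for _ in shard)  # pretend each partition holds 1,000 title pairs

# Round-robin split: every partition lands in exactly one shard.
shards = [partitions[i::num_tasks] for i in range(num_tasks)]
assert sum(len(s) for s in shards) == len(partitions)

# In Metaflow these would run as parallel tasks, each reading its shard
# via Table + MetaflowDataFrame.
counts = [task(s) for s in shards]
total = sum(counts)  # 30 partitions x 1,000 pairs = 30,000
```

Because no task needs another task's shard, the step scales horizontally with no shuffle at all.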

Scope boundary

If you need distributed compute — cross-shard joins, terabyte shuffles — you leave the last mile and return to Spark. The discipline is keeping last-mile code out of Spark when a single foreach-sharded Python step can handle the workload.
