CONCEPT
# Last-mile data processing
Last-mile data processing is the class of Python data work that happens after the heavy ETL has already landed data in the warehouse — feature transformations, batch inference, and training-data prep — where Spark is overkill but the working set can still "occasionally… [be] terabytes of data" (Source: sources/2024-07-22-netflix-supporting-diverse-ml-systems-at-netflix).
## Why it's a distinct layer
ETL and last-mile processing have different shapes:
| Axis | ETL (Spark) | Last mile (Python) |
|---|---|---|
| Runtime | JVM, distributed | Python process, often single-node or foreach-sharded |
| API style | Declarative SQL / DataFrame | Pandas / Polars / in-memory numpy |
| Scheduling | Batch DAGs | Inside an ML flow step |
| Coupling | Loose (output table) | Tight (lives next to model code) |
Last-mile code lives inside the training / inference flow, so the performance bottleneck is "how fast can I get the bytes out of the warehouse into this Python process?" — not "how do I shuffle 10 TB across a cluster?"
## How Metaflow serves the last mile
Netflix's Fast Data library sits exactly in this niche:
- `metaflow.Table` — Iceberg/Hive metadata parsing and partition-to-Parquet-file resolution.
- `metaflow.MetaflowDataFrame` — pulls Parquet files via the Metaflow high-throughput S3 client directly into process memory ("often outperforms reading of local files").
- Arrow as the in-memory representation, with zero-copy handoff to Pandas / Polars / internal C++ libraries.
- A nanoarrow-style, ABI-only dependency on Arrow so the library doesn't collide with the user's PyArrow version.
The `foreach` primitive horizontally scales a last-mile step across
many tasks — each task reads its own shard via `Table` +
`MetaflowDataFrame`. The Content Knowledge Graph entity-resolution
flow processes ~1 billion title pairs this way.
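The sharding pattern is: fan one function out over a list of shards, one task per element, with no cross-shard communication. A stdlib analogue of that fan-out shape (the shard names and `process_shard` body are hypothetical; real Metaflow tasks run as separate processes or containers, not threads):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shard list; in Metaflow this would be the foreach value.
shards = ["part-0", "part-1", "part-2"]

def process_shard(shard):
    # Stand-in for: resolve this shard's Parquet files and transform
    # them entirely inside one Python process.
    return f"processed {shard}"

# Each worker handles exactly one shard; there is no shuffle and no
# cross-shard join, which is what keeps the step in last-mile territory.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(process_shard, shards))
```

The moment `process_shard` would need data from another shard, this pattern stops fitting — that is the scope boundary described next.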
## Scope boundary
If you need distributed compute — cross-shard joins, terabyte
shuffles — you leave the last mile and return to Spark. The
discipline is keeping last-mile code out of Spark when a single
foreach-sharded Python step can handle the workload.