

Last-mile data processing

Last-mile data processing is the class of Python data work that happens after the heavy ETL has already landed data in the warehouse — feature transformations, batch inference, and training-data prep — where Spark is overkill but the working set can still "occasionally… [be] terabytes of data" (Source: sources/2024-07-22-netflix-supporting-diverse-ml-systems-at-netflix).

Why it's a distinct layer

ETL and last-mile processing have different shapes:

Axis       | ETL (Spark)                  | Last mile (Python)
-----------|------------------------------|-----------------------------------------------------
Runtime    | JVM, distributed             | Python process, often single-node or foreach-sharded
API style  | Declarative SQL / DataFrame  | Pandas / Polars / in-memory numpy
Scheduling | Batch DAGs                   | Inside an ML flow step
Coupling   | Loose (output table)         | Tight (lives next to model code)

Last-mile code lives inside the training / inference flow, so the performance bottleneck is "how fast can I get the bytes out of the warehouse into this Python process?" — not "how do I shuffle 10 TB across a cluster?"
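That bottleneck shape can be sketched with stdlib-only code: a high-throughput client overlaps its shard reads instead of issuing them serially, so wall-clock time stops scaling with shard count. `fetch_shard` below is a hypothetical stand-in for one warehouse read, not any real API:

```python
import concurrent.futures
import time

def fetch_shard(shard_id: int) -> bytes:
    """Hypothetical stand-in for pulling one Parquet shard out of the warehouse."""
    time.sleep(0.05)  # simulate per-request network latency
    return bytes([shard_id]) * 1024  # fake payload

shard_ids = list(range(8))

# Serial: wall time grows as n * latency.
t0 = time.perf_counter()
serial = [fetch_shard(s) for s in shard_ids]
serial_s = time.perf_counter() - t0

# Parallel: the waits overlap, as a high-throughput S3 client's would.
t0 = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    parallel = list(pool.map(fetch_shard, shard_ids))
parallel_s = time.perf_counter() - t0

assert serial == parallel  # same bytes either way; only wall-clock time differs
```

The same data arrives in both cases; the parallel path just keeps the Python process fed instead of idle.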

How Metaflow serves the last mile

Netflix's Fast Data library sits exactly in this niche:

  • metaflow.Table — Iceberg/Hive metadata parsing, partition and Parquet file resolution.
  • metaflow.MetaflowDataFrame — pulls Parquet files via the Metaflow high-throughput S3 client directly into process memory ("often outperforms reading of local files").
  • Arrow as the in-memory representation, with zero-copy handoff to Pandas / Polars / internal C++ libraries.
  • A nanoarrow-style, ABI-only dependency on Arrow, so the library doesn't collide with user PyArrow versions.
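The zero-copy handoff has a stdlib miniature in Python's buffer protocol: a memoryview aliases the producer's memory rather than duplicating it, which is the property Arrow's in-memory format gives the consumers above. This is illustrative only, not the Fast Data library's code:

```python
# One buffer owned by the "reader" side...
payload = bytearray(b"\x00" * 8)

# ...handed to a "consumer" without copying: the view aliases the same memory.
view = memoryview(payload)

payload[0] = 0xFF        # mutate through the owner
assert view[0] == 0xFF   # visible through the view: no copy was made

copied = bytes(payload)  # an explicit copy, by contrast, is detached
payload[0] = 0x00
assert copied[0] == 0xFF # the copy kept the old value
```

Arrow applies the same idea at table scale: Pandas, Polars, and C++ consumers read the same Arrow buffers instead of each materializing their own copy.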

The foreach primitive horizontally scales a last-mile step across many tasks — each task reads its own shard via Table + MetaflowDataFrame. The Content Knowledge Graph entity-resolution flow processes ~1 billion title pairs this way.
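The sharding arithmetic behind that fan-out can be sketched without the Metaflow runtime: split the partition list across N tasks so each task touches only its own slice. The partition names, the 1,000-pairs-per-partition figure, and the plain `task` function are all made up for illustration; a real flow would declare the fan-out with `self.next(..., foreach=...)`:

```python
# Hypothetical Iceberg partitions to process in a last-mile step.
partitions = [f"dt=2024-07-{day:02d}" for day in range(1, 31)]
num_tasks = 4

def task(shard: list[str]) -> int:
    """Stand-in for one foreach task: 'read' its shard and return a pair count."""
    return sum(1_000 for _ in shard)  # pretend each partition holds 1,000 title pairs

# Round-robin split: every partition lands in exactly one shard.
shards = [partitions[i::num_tasks] for i in range(num_tasks)]
assert sum(len(s) for s in shards) == len(partitions)

# In Metaflow these would run as parallel tasks, each reading its shard
# via Table + MetaflowDataFrame.
counts = [task(s) for s in shards]
total = sum(counts)  # 30 partitions x 1,000 pairs = 30,000
```

Because no task needs another task's shard, the step scales horizontally with no shuffle at all.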

Scope boundary

If you need distributed compute — cross-shard joins, terabyte shuffles — you leave the last mile and return to Spark. The discipline is keeping last-mile code out of Spark when a single foreach-sharded Python step can handle the workload.
