
CONCEPT

Stream-based file processing

Definition

Stream-based file processing is the discipline of processing files one record at a time (or one chunk at a time) rather than loading the full file into memory. It bounds per-job memory consumption to the record size, not the file size — enabling one worker process to handle files orders of magnitude larger than its RAM.

The classical anti-pattern it displaces: "load the CSV into a dataframe, iterate, write the result." That works fine for MB-scale files, falls over at GB scale, and crashes outright at tens of GB.
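The contrast can be sketched in a few lines of stdlib Python (a minimal example, not Maple's code — the column names and the derived `total` field are invented for illustration):

```python
import csv
import io

def process_streaming(src, dst):
    """Process a CSV one row at a time: memory stays O(row), not O(file)."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=["id", "total"])
    writer.writeheader()
    for row in reader:  # exactly one record in memory at a time
        writer.writerow({"id": row["id"],
                         "total": int(row["qty"]) * int(row["price"])})

# in-memory files stand in for large on-disk CSVs
src = io.StringIO("id,qty,price\n1,2,5\n2,3,4\n")
dst = io.StringIO()
process_streaming(src, dst)
```

The dataframe version materialises every row before the first one is processed; the iterator version emits output as input arrives, so peak memory never tracks file size.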

Why it shows up at LLM-batch scale

The Instacart Maple post (sources/2025-08-27-instacart-simplifying-large-scale-llm-processing-with-maple|2025-08-27) explicitly names stream-based processing as a scale-forced fix:

"As our internal clients sent larger and larger input files, we hit storage, memory, and processing limitations. … We adopted stream-based processing to minimize memory consumption when handling large files."

With LLM batch APIs capped at 50K prompts / 200 MB per batch, a 10M-prompt job is hundreds of GB of intermediate state — orders of magnitude beyond any reasonable per-process memory budget.

Three specific Maple optimisations compose into stream processing:

  • Data on S3, not in a database — avoids loading through a query boundary that forces full materialisation.
  • Parquet instead of CSV — columnar format with per-column compression (~25×) + random-access support means reads can be selective (only the columns needed) and incremental (one row-group at a time).
  • orjson instead of Python stdlib json — faster + more memory-efficient JSON parser. Per-record cost matters when the per-record count is in the tens of millions.
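The orjson swap can be sketched as a drop-in parser behind a streaming record iterator (a hypothetical JSONL results file; the fallback to stdlib `json` keeps the snippet runnable when orjson isn't installed):

```python
import io

try:
    import orjson            # faster, more memory-efficient C parser
    loads = orjson.loads
except ImportError:          # graceful fallback to the stdlib
    import json
    loads = json.loads

def iter_records(lines):
    """Parse one JSONL record at a time — never the whole file."""
    for line in lines:
        if line.strip():
            yield loads(line)

# hypothetical batch-output file: one JSON object per line
raw = io.StringIO('{"id": 1, "ok": true}\n{"id": 2, "ok": false}\n')
ids = [rec["id"] for rec in iter_records(raw)]
```

At tens of millions of records, shaving microseconds off each `loads` call compounds into meaningful wall-clock and allocation savings.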

Generalisation

Stream processing applies anywhere a pipeline's input grows beyond what a single process can fit in RAM:

  • ETL pipelines — process records one at a time; emit to output sink as you go.
  • Log processing — scan sequentially, aggregate on the fly.
  • Large-file uploads — multipart/chunked bodies, each chunk acknowledged before the next is read.
  • Database migrations — cursor-based row-by-row read, not SELECT *.
  • CSV/Parquet ETL — use library primitives that return iterators / chunked readers, not full-file loaders.

The generalisation is: memory consumption should be O(record size), not O(file size). Once a pipeline crosses into O(file size) memory territory, it carries a latent ceiling — eventual input growth will crash it.
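The O(record size) invariant falls out naturally from generator composition (a generic sketch — the 4-byte chunk size is absurdly small, chosen only to make the chunking visible):

```python
import io

def read_chunks(fileobj, chunk_size):
    """Yield fixed-size chunks; peak memory is O(chunk_size), not O(file)."""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            return
        yield chunk

def etl(source, sink, transform, chunk_size=4):
    """Stream source -> transform -> sink, emitting output as you go."""
    for chunk in read_chunks(source, chunk_size):
        sink.write(transform(chunk))

src = io.StringIO("hello world")
dst = io.StringIO()
etl(src, dst, str.upper)
```

The same shape covers the list above: swap `read_chunks` for a CSV row iterator, a log-line reader, or a database cursor, and the memory bound holds.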

Implementation knobs

  • Chunk size: larger chunks amortise per-chunk overhead but raise peak RSS. Typical sweet spot is 1-10 MB at the columnar-file layer.
  • Backpressure: when the downstream is slower than the reader, throttle the reader — otherwise the in-flight chunk queue grows without bound. See concepts/backpressure.
  • Parallelism: chunks can usually be processed concurrently; need to preserve order only at merge time.
  • Fault tolerance: chunk-granular state lets jobs resume from the last successfully-processed chunk on crash — composes with concepts/durable-execution substrates like Temporal.
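The backpressure knob in particular is often just a bounded queue between reader and processor (a minimal sketch; the thread layout and `max_in_flight` default are illustrative assumptions, not from the source):

```python
import queue
import threading

def run_pipeline(chunks, process, max_in_flight=4):
    """Reader thread feeds a bounded queue; q.put() blocks when the
    consumer falls behind, which throttles the reader (backpressure)."""
    q = queue.Queue(maxsize=max_in_flight)  # bound = the backpressure knob
    SENTINEL = object()

    def reader():
        for chunk in chunks:
            q.put(chunk)          # blocks once max_in_flight chunks queued
        q.put(SENTINEL)

    threading.Thread(target=reader, daemon=True).start()
    results = []
    while (item := q.get()) is not SENTINEL:
        results.append(process(item))
    return results

out = run_pipeline(range(10), lambda c: c * 2, max_in_flight=2)
```

Raising `max_in_flight` trades peak memory for reader/processor overlap — the same trade-off as the chunk-size knob, one level up.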

Seen in
