
PATTERN

CSV-in / Parquet-intermediate / output-merge

Intent

Accept user-friendly file formats (CSV or Parquet) at the service boundary, but immediately convert to a columnar intermediate format (Parquet) for all internal processing — batch splits, per-batch outputs, merge staging. The output file mirrors the input format so callers see a clean CSV-in / CSV-out abstraction and never observe the Parquet intermediates.

Why

Three distinct shape-of-data concerns get three different formats:

| Stage | Format | Why |
| --- | --- | --- |
| Caller I/O | CSV or Parquet | What users have + what they want back |
| Intermediate batches | Parquet | ~25× size reduction vs CSV + random-access reads + per-column compression |
| Downstream merge | Parquet | Non-linear access to per-batch result files for efficient join-by-task-ID |

The pattern composes with stream-based processing — Parquet row-groups are the natural chunk boundary, and columnar reads let the service pull only the columns needed at each stage.

Mechanics (from Maple)

Maple's realisation (sources/2025-08-27-instacart-simplifying-large-scale-llm-processing-with-maple|2025-08-27):

"Implemented in Python, Maple uses PyArrow to efficiently process input files. Large CSV files are split into smaller Parquet batch files, and stored on S3 to avoid costly database usage. Parquet is an efficient file format for data table storage, where out-of-the- box compression reduces file sizes up to 25x compared to CSV. It also allows non-linear access into the file, making data access extremely fast."

The storage shape:

Input (CSV or Parquet on S3)
  │ split into batches respecting provider's 50K/200MB cap
Per-batch Parquet files on S3
  │ encode into provider-specific batch format
Uploaded to provider
  │ poll for completion
Per-batch result Parquet files on S3
  │ merge across all batches, join on task ID
Final output file (CSV or Parquet, matching input format)

Inputs and outputs live on S3 throughout — Maple never loads the full job into memory, which is a precondition for 10M+ prompt jobs.

Why not all-CSV or all-Parquet

  • All CSV internally: a row-oriented format wastes space (files up to ~25× larger), forces a full-file scan for column-selective reads, and puts slow decompression and text parsing on the hot path.
  • All Parquet externally: many caller teams have CSV tooling already; forcing Parquet at the boundary is gratuitous friction. Pay the CSV→Parquet conversion cost once, at the boundary.

The asymmetry (CSV-accepted, Parquet-internal, output-format-mirrors-input) is the practical optimum.

Storage choice: S3, not a database

Maple's explicit design argument:

"Inputs and outputs are stored in S3, avoiding costly database operations. This approach is not only cheaper but also allows handling large datasets."

A database-backed intermediate store would:

  • Cost more per byte at 10M+ prompt scale.
  • Force full materialisation at query boundaries.
  • Complicate the memory-discipline story.

S3 is cheap, highly durable, and Parquet-friendly (S3 Select / Parquet-aware clients can push predicates down).

Generalisation

The pattern is reusable anywhere a batch pipeline needs to:

  • Accept a user-friendly input format at the boundary.
  • Process in a columnar format internally for memory + compression wins.
  • Return the result in the input's format.

Sibling pattern: patterns/metadata-plus-chunk-storage-stack (Sprites / Fly.io) uses the same split-storage stance at the block level; this pattern is its row-store-ETL analog.

Caveats

  • Parquet write-amplification — every intermediate step emits a new Parquet file; lots of small Parquet files can hurt read performance (the "small-file problem" familiar from Hive / Spark workloads). Mitigate by sizing batches to produce reasonable-size files.
  • Parquet columnar reads help read-heavy workloads; a merge step that joins N result files by task ID and reads every column doesn't benefit from columnar layout as much as a column-selective query would.
  • CSV parsing at the ingest boundary is the one place per-record overhead bites; using an efficient CSV parser (PyArrow CSV reader, not Python csv stdlib) matters.

Seen in
