
CONCEPT Cited by 2 sources

Small file problem on object storage

The small file problem is the pathology in which a streaming-to-lakehouse pipeline produces many small object-store files (Parquet / ORC / Avro) instead of fewer, well-sized ones. The canonical symptom is that downstream query performance, metadata consistency, and operational cost all degrade roughly linearly with file count at a fixed data volume.

Why it hurts

For a target table of fixed total byte size, splitting the data across N small files vs fewer large files multiplies several O(N) costs:

  • Listing cost — both the Iceberg manifest scan and the underlying object-store listing pay per-file overhead. Spark/Trino/Snowflake query planners read every manifest entry to compute the scan set; on tables with millions of small files, planning time can dominate query latency.
  • Parquet open + footer amortisation — each Parquet file carries a footer with column offsets and statistics. For small files, per-file open cost is dominated by the footer round-trip, so effective column-read throughput is suppressed: the footer overhead is amortised over very little column data.
  • Iceberg metadata bloat — every file is a manifest entry, every commit produces a snapshot. Snapshot-based query engines retain manifest history for time travel; small-file-heavy tables produce manifest chains that are expensive to traverse.
  • Compaction burn — the downstream fix is a recurring compaction job that merges small files into larger ones; this costs compute, IO, and operator attention (job scheduling, failure recovery, snapshot coordination).
  • Per-request object-store cost — S3, GCS, and ADLS all charge per PUT and per GET. Small files amplify the number of PUTs during write and GETs during scan, both billed.
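The O(N) framing above can be made concrete with a toy scan-cost model. The request latency, footer size, and throughput figures below are illustrative assumptions, not measured numbers for any particular engine or store:

```python
def scan_cost_seconds(total_bytes, num_files,
                      per_request_latency=0.03,  # assumed ~30 ms per object GET
                      footer_bytes=64 * 1024,    # assumed Parquet footer size
                      throughput=500e6):         # assumed 500 MB/s aggregate read
    """Toy model: every file pays a fixed open cost (request round-trip
    plus footer read) before any column data streams."""
    per_file_overhead = per_request_latency + footer_bytes / throughput
    return num_files * per_file_overhead + total_bytes / throughput

one_tb = 1e12
few_large = scan_cost_seconds(one_tb, num_files=2_000)       # ~512 MB files
many_small = scan_cost_seconds(one_tb, num_files=1_000_000)  # ~1 MB files
print(f"{few_large:.0f}s vs {many_small:.0f}s")
```

Under these assumptions the same terabyte costs over an order of magnitude more to scan as one million 1 MB files than as two thousand 512 MB files, even though the data-transfer term is identical.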

Root cause in streaming sinks

In a streaming sink, the flush trigger shape determines small-file risk. Timer-driven flushing — the Kafka Connect-era default — writes one object per flush interval regardless of data volume. On quiet or bursty streams, this produces many tiny files.
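A toy simulation of that failure mode; the interval traffic and record sizes are hypothetical:

```python
def timer_driven_flush(records_per_interval, record_bytes=1_000):
    """Write one object per flush interval regardless of data volume
    (the Kafka Connect-era default described above)."""
    return [n * record_bytes for n in records_per_interval]  # one file per tick

# A bursty stream: mostly quiet ticks, periodic spikes.
traffic = [0, 1, 2, 0, 50_000] * 20
files = timer_driven_flush(traffic)
tiny = sum(1 for size in files if size < 1_000_000)  # files under 1 MB
print(f"{len(files)} files written, {tiny} under 1 MB")
```

With this traffic shape, 80 of the 100 written objects are under 1 MB (many of them empty), which is exactly the file population that later compaction has to clean up.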

concepts/data-driven-flushing (Redpanda Connect Iceberg output, 2026-03-05 launch) is canonicalised on the wiki as the mitigation pattern: flush only when data is present, letting batch size track workload.
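A minimal sketch of the data-driven pattern, using illustrative thresholds rather than Redpanda Connect's actual defaults:

```python
def data_driven_flush(records_per_interval, record_bytes=1_000,
                      target_bytes=128 * 1024 * 1024):
    """Buffer across intervals; emit a file only when data is present
    and the buffer reaches the target size (or the stream ends)."""
    files, buffered = [], 0
    for n in records_per_interval:
        buffered += n * record_bytes
        if buffered >= target_bytes:
            files.append(buffered)
            buffered = 0
    if buffered:  # final flush happens only because data exists
        files.append(buffered)
    return files

# A hypothetical bursty stream: mostly quiet ticks, periodic spikes.
traffic = [0, 1, 2, 0, 50_000] * 20
files = data_driven_flush(traffic)
print(f"{len(files)} files, largest {max(files):,} bytes")
```

The same stream that would yield one file per tick under a timer now yields 7 files, none empty and most near the target size: batch size tracks workload, as the pattern prescribes.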

The Redpanda Iceberg output canonicalisation

The Iceberg output for Redpanda Connect names the small file problem verbatim, using it as a foil (Source: sources/2026-03-05-redpanda-introducing-iceberg-output-for-redpanda-connect):

"Redpanda Connect uses data-driven flushing. It only executes a flush operation when there is actual data to move, preventing the 'small file problem' on object storage and ensuring you aren't wasting compute cycles on empty operations."

The launch post puts the term in scare quotes, treating it as an industry-canonical name within the streaming-to-lakehouse community.

Related factors that govern file-count pressure:

  • Iceberg snapshot cadence — every commit creates a snapshot; commit frequency and file-flush frequency are coupled.
  • Partition granularity — finer partition spec (e.g. hourly vs daily) amplifies file count proportionally.
  • Compaction policy — some lakehouse runtimes (e.g. AWS S3 Tables, Databricks Auto-optimize) ship managed compaction; self-managed Iceberg deployments need recurring compaction jobs.
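For self-managed deployments, the recurring compaction job reduces to grouping small data files into target-sized outputs. A greedy packing sketch of that planning step, with a hypothetical target size and file list (real rewrite jobs also coordinate snapshots and partition boundaries):

```python
def plan_compaction(file_sizes, target_bytes=512 * 1024 * 1024):
    """Greedily pack data files into rewrite groups of at most
    target_bytes each -- the planning shape of a rewrite job."""
    groups, current, current_bytes = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        if current and current_bytes + size > target_bytes:
            groups.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        groups.append(current)
    return groups

# 10,000 hypothetical 1 MB files collapse into ~512 MB rewrite groups.
small_files = [1_000_000] * 10_000
groups = plan_compaction(small_files)
print(f"{len(small_files)} inputs -> {len(groups)} compacted files")
```

Here 10,000 inputs collapse into 19 outputs, shrinking manifest entries and per-file open costs by the same factor; the trade is the compute and IO spent rewriting the data.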
