
PATTERN Cited by 1 source

Low partition cardinality for large files

Keep the Iceberg partition spec coarse (day or hour granularity, low-cardinality fields) so that the flush threshold, not the partition count, controls file size. High partition cardinality multiplies the minimum file count per flush cycle, which negates the benefit of raising the flush interval.

Pattern shape

Each partition produces at minimum one file per flush cycle. For a time-based partition spec:

  • Partition spec = (hour(timestamp)) → 24 partitions/day → 24 files/day minimum
  • Partition spec = (day(timestamp)) → 1 partition/day → 1 file/day minimum
  • Partition spec = (minute(timestamp)) → 1,440 partitions/day → 1,440 files/day minimum
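The arithmetic above can be sketched in a few lines of Python. This is an illustration only: the function name and the `extra_cardinality` parameter (the number of distinct values of any additional non-time partition field) are hypothetical and are not part of Iceberg or Redpanda.

```python
# Partitions created per day by each time transform.
PARTITIONS_PER_DAY = {"day": 1, "hour": 24, "minute": 24 * 60}


def min_files_per_day(transform: str, extra_cardinality: int = 1) -> int:
    """Lower bound on files written per day.

    Every distinct partition that receives data needs at least one file,
    so the bound is the number of time partitions multiplied by the
    cardinality of any extra partition field (hypothetical parameter).
    """
    return PARTITIONS_PER_DAY[transform] * extra_cardinality


for transform in ("day", "hour", "minute"):
    print(transform, min_files_per_day(transform))
```

Adding a second partition field multiplies the bound: under this model, an hourly spec combined with a field of 50 distinct values gives `min_files_per_day("hour", 50)`, or 1,200 files per day at minimum.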

For streaming workloads optimizing file size with bridge queries:

  • Use (day(redpanda.timestamp)) as partition spec
  • Let datalake_translator_flush_bytes (size threshold) be the dominant file-size control
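To see when the size threshold actually becomes the dominant control, a rough model helps. The sketch below assumes data is spread evenly across partitions and that a file never exceeds either its partition's daily volume or the flush threshold. The workload numbers (10 GiB/day, 512 MiB threshold) and the function name are made up for illustration and are not Redpanda defaults.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def file_size_ceiling(daily_bytes: int, partitions_per_day: int, flush_bytes: int):
    """Return (ceiling_bytes, limiting_factor) for files written in one day.

    Simplified model (assumption): data is spread evenly across partitions,
    and a file holds at most one partition's data and at most flush_bytes.
    """
    per_partition = daily_bytes / partitions_per_day
    if per_partition >= flush_bytes:
        return flush_bytes, "flush threshold"
    return per_partition, "partition count"


# Hypothetical workload: 10 GiB/day with a 512 MiB flush threshold.
for name, partitions in (("day", 1), ("hour", 24), ("minute", 1440)):
    ceiling, limit = file_size_ceiling(10 * GIB, partitions, 512 * MIB)
    print(f"{name}: <= {ceiling / MIB:.0f} MiB per file, limited by {limit}")
```

With these numbers, only the day-granularity spec leaves the flush threshold in control. At hour granularity each partition receives less than 512 MiB per day, so the partition count caps file size regardless of how high the threshold is set.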

When to apply

  • Streaming-to-Iceberg pipelines using bridge queries or delayed flush cadence
  • Workloads where query patterns use time-range filters that coarse partitions still serve
  • Any Iceberg table experiencing small-file-problem symptoms from partition explosion

Seen in
