
PATTERN Cited by 1 source

Flush cadence for file layout, not freshness

Set the streaming-to-Iceberg flush interval to produce optimally-sized Parquet files (32–100+ MB) rather than to meet the analytics freshness SLA, because a bridge-query-capable engine covers the freshness gap by reading un-flushed records directly from the topic at query time.

Pattern shape

┌─────────────────────────────────────────────┐
│  Traditional: flush every 30s for freshness │
│  → 500 KB files, constant compaction        │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│  With bridge query: flush every 1–6 hours   │
│  → 32–64 MB files, little or no compaction  │
│  → Topic covers the freshness gap           │
└─────────────────────────────────────────────┘

When to apply

  • You have a streaming-to-lakehouse pipeline landing data in Iceberg.
  • Your analytics freshness requirement is sub-minute but your optimal file size demands multi-hour flush intervals.
  • Your query engine supports bridge queries or equivalent transparent two-tier reads.
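The two-tier read in the last bullet can be illustrated with a minimal in-memory sketch. It is not any engine's actual implementation: `Record` and `bridge_read` are hypothetical names, and the sketch assumes a single partition where the flushed rows are a prefix of the topic log. A real engine would more likely take the last translated offset from table metadata than recompute it from the rows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    offset: int
    value: str


def bridge_read(flushed: list[Record], topic_log: list[Record]) -> list[Record]:
    """Combine flushed rows with topic records that have not been flushed yet.

    Assumption: offsets increase monotonically and `flushed` holds a prefix
    of `topic_log`, so the highest flushed offset marks the boundary.
    """
    last_flushed = max((r.offset for r in flushed), default=-1)
    # Only the tail past the boundary comes from the topic, so nothing is read twice.
    tail = [r for r in topic_log if r.offset > last_flushed]
    return sorted(flushed + tail, key=lambda r: r.offset)


# Hypothetical data: five records in the topic, the first three already flushed.
topic_log = [Record(i, f"order-{i}") for i in range(5)]
flushed = topic_log[:3]
result = bridge_read(flushed, topic_log)
```

Here `result` contains all five records exactly once, even though only three have landed in the table. That is why the flush interval can be chosen for file size alone.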

Configuration (Redpanda)

  1. Raise lag target (here to 1 hour): rpk topic alter-config orders --set redpanda.iceberg.target.lag.ms=3600000
  2. (Optional) Raise flush threshold (here to 64 MiB): rpk cluster config set datalake_translator_flush_bytes 67108864
  3. Rule of thumb: throughput × lag ≈ target file size (for example, bytes per second × seconds of lag ≈ bytes per file)
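The rule of thumb in step 3 can be turned into a small calculator. This is a sketch under stated assumptions: throughput is steady, it is measured for whatever stream feeds a single file, and sizes are counted before Parquet compression. The helper names and the 16 KiB/s workload are hypothetical.

```python
MiB = 1024 * 1024


def flush_interval_ms(throughput_bytes_per_s: float, target_file_bytes: int) -> int:
    """Lag in milliseconds needed to accumulate one target-sized file."""
    if throughput_bytes_per_s <= 0:
        raise ValueError("throughput must be positive")
    return round(target_file_bytes / throughput_bytes_per_s * 1000)


def expected_file_bytes(throughput_bytes_per_s: float, lag_ms: int) -> float:
    """Approximate bytes accumulated per flush: throughput × lag."""
    return throughput_bytes_per_s * lag_ms / 1000


# Hypothetical workload: 16 KiB/s sustained, aiming for 64 MiB files.
throughput = 16 * 1024
lag_ms = flush_interval_ms(throughput, 64 * MiB)          # 4_096_000 ms, about 68 minutes
one_hour_file = expected_file_bytes(throughput, 3_600_000)  # 58_982_400 bytes, 56.25 MiB
```

At this throughput the one-hour lag from step 1 yields files of roughly 56 MiB, so the 64 MiB byte threshold from step 2 would not trigger first. At higher throughput the byte threshold becomes the limiting factor.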

Benefits

  • Fewer S3 GETs per query (10 × 100 MB vs 1000 × 1 MB, same bytes scanned)
  • Better compression ratios from larger column-encoding windows
  • Less Iceberg catalog metadata bloat
  • Effective predicate pushdown from substantial row groups
  • Compaction service can run rarely or not at all
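The file-count arithmetic behind the first benefit can be checked with a few lines. The sketch assumes every file touched by a query costs at least one GET. Real Parquet readers typically issue several ranged requests per file (footer plus column chunks), so treat the counts as a lower bound. `files_scanned` is a hypothetical helper.

```python
MB = 1_000_000


def files_scanned(scan_bytes: int, file_bytes: int) -> int:
    """Number of files needed to hold scan_bytes (integer ceiling division)."""
    return -(-scan_bytes // file_bytes)


scan = 1000 * MB                               # same bytes scanned in both layouts
small_files = files_scanned(scan, 1 * MB)      # 1000 files
large_files = files_scanned(scan, 100 * MB)    # 10 files
reduction = small_files // large_files         # 100x fewer objects to open
```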

Seen in
