Data-driven flushing

Data-driven flushing is the sink-connector design where output to the downstream store happens only when there is actual data to flush — not on a fixed-interval timer. The connector's wake-up schedule is driven by the arrival of records, not by a heartbeat.

This inverts the timer-driven flushing default common to Kafka Connect-era sink connectors, where a scheduled flush fires every N seconds regardless of whether any records have been produced. Under a timer default, a quiet stream still pays the cost of per-flush object storage writes (even if the batch is empty or trivially small) and of per-flush commit RTT to the lakehouse catalog.

Two pathologies timer-driven flushing creates

Small file problem on object storage

When the timer interval is short relative to the time records take to accumulate into a well-sized batch, a quiet or bursty stream produces many tiny object-store files instead of fewer well-sized ones (canonicalised on the wiki as concepts/small-file-problem-on-object-storage). Downstream query performance degrades because:

  • Object-store listing cost is O(N files).
  • Iceberg metadata bloat — every data file adds a manifest entry, and every commit creates a new snapshot.
  • Parquet scan amortisation fails — per-file open + footer read dominates the actual column-read time.
  • Compaction jobs burn compute recovering scan performance.
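The scan-amortisation point can be made concrete with back-of-envelope arithmetic. All numbers below are assumptions chosen for illustration (per-file overhead, throughput, dataset size are hypothetical, not from any source); the shape of the result is what matters: for a fixed dataset size, per-file open + footer-read overhead grows linearly with file count and eventually dominates the column read.

```python
# Hypothetical numbers, for illustration only: how per-file overhead
# dominates scan time as a fixed-size dataset is split into more files.
PER_FILE_OVERHEAD_MS = 50    # assumed: open + Parquet footer read, per file
READ_THROUGHPUT_MB_S = 200   # assumed: sustained column-read throughput
TOTAL_MB = 1024              # assumed: fixed dataset size

def scan_time_ms(n_files: int) -> float:
    """Total scan time: linear per-file overhead plus the actual column read."""
    read_ms = TOTAL_MB / READ_THROUGHPUT_MB_S * 1000
    return n_files * PER_FILE_OVERHEAD_MS + read_ms

for n in (8, 256, 8192):
    t = scan_time_ms(n)
    overhead = n * PER_FILE_OVERHEAD_MS / t
    print(f"{n:>5} files: {t/1000:7.1f} s scan, {overhead:5.1%} overhead")
```

Same bytes read in every row; only the file count changes. At 8 files the overhead is a rounding error; at 8,192 it is nearly the whole scan — which is exactly the performance the compaction jobs then burn compute to claw back.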

Idle-workload compute waste

Timer-driven connectors wake every interval, acquire resources, do a round-trip to the catalog, and then discover the batch is empty. Under high-density per-tenant deployment shapes (the Redpanda launch post cites 0.1 vCPU per pipeline), this wake-up cost is a material cloud-budget axis when multiplied across N idle pipelines.
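The multiplication is worth writing out. Only the 0.1 vCPU per pipeline figure comes from the launch post; the pipeline count, flush interval, and catalog round-trip time below are hypothetical inputs to show how empty-flush cost scales with fleet size.

```python
# Hypothetical back-of-envelope: aggregate cost of empty timer-driven flushes.
PIPELINES = 1000        # assumed: idle per-tenant pipelines in the fleet
FLUSH_INTERVAL_S = 60   # assumed: timer fires every minute regardless
CATALOG_RTT_S = 0.2     # assumed: one commit round-trip to the catalog

flushes_per_day = 86_400 // FLUSH_INTERVAL_S * PIPELINES
busy_seconds = flushes_per_day * CATALOG_RTT_S
print(f"{flushes_per_day:,} empty flushes/day, "
      f"{busy_seconds / 3600:.0f} hours/day spent on catalog round-trips")
```

With these inputs, a fully idle fleet still performs over a million catalog round-trips a day. Data-driven flushing drops that term to zero because the code path is never entered without a record.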

The Redpanda Iceberg output canonicalisation

The Iceberg output connector for Redpanda Connect (launch post 2026-03-05) canonicalises the pattern verbatim (Source: sources/2026-03-05-redpanda-introducing-iceberg-output-for-redpanda-connect):

"Stop paying for quiet data sources and achieve greater resource density. Unlike legacy connectors that heartbeat on a fixed timer regardless of activity, Redpanda Connect uses data-driven flushing. It only executes a flush operation when there is actual data to move, preventing the 'small file problem' on object storage and ensuring you aren't wasting compute cycles on empty operations."

Two wins are named:

  1. Per-tenant cost efficiency — quiet data sources don't incur per-flush object-storage and catalog-commit costs.
  2. Small-file mitigation — flush batches grow naturally to the size actual record arrival dictates, rather than being cut off at whatever a timer interval happens to contain.

Trigger-shape taxonomy (wiki scope)

Flush triggers in streaming sink connectors fall along a spectrum:

  • Pure timer-driven — flush every T seconds regardless.
  • Hybrid timer + count/bytes — flush when either T expires OR count ≥ N OR bytes ≥ B. This is the Kafka-Connect default shape and the Snowpipe Streaming batch-tuning shape (concepts/snowpipe-streaming-channel).
  • Data-driven with ceiling — flush happens only on record arrival, but a max-wait ceiling bounds latency for slow streams. Likely the Redpanda Iceberg output shape, though the launch post doesn't disclose whether a latency ceiling exists.
  • Pure event-driven / exactly-once commit — flush on every transactional boundary; strictly count-driven with bounded per-commit size.

The Redpanda Iceberg output launch post doesn't disclose which trigger-shape the connector actually uses; only that the shape is not pure-timer. Follow-up canonicalisation awaits a mechanism disclosure.

Open questions

  • Max-wait ceiling — is there a per-table latency bound that force-flushes a batch regardless of size? Without one, a slow trickle of records could sit buffered indefinitely, never reaching a size threshold.
  • Commit cadence tuning — Iceberg snapshot commit overhead is real; the launch post references "commit tuning" as a docs-level config axis but doesn't walk through it.
  • Interaction with multi-table routing — the Iceberg output supports Bloblang-interpolated multi-table routing. Does each logical table have its own flush trigger, or is there a pipeline-level trigger with per-table batching inside?

Relationship to upstream broker batching

Broker-side batching (concepts/effective-batch-size, concepts/batching-latency-tradeoff) governs producer → broker. Data-driven flushing is the analogous trigger shape for broker (or Redpanda Connect) → lakehouse object storage. Both are instances of the general principle: batch-boundary triggers should track workload, not wall-clock, to avoid both idle waste and small-batch amplification.
