
PATTERN Cited by 4 sources

Streaming broker as lakehouse Bronze sink

Problem. Most organisations running a Kafka-class streaming broker for operational data also run a lakehouse with a Medallion Architecture for analytics. The Bronze tier wants a raw, as-ingested, audit-trail-preserving copy of the same records the broker is already carrying — so there is an implicit ETL pipeline (topic → Bronze Iceberg table) that every team ends up building.

Before lakehouse-native brokers, this pipeline took two canonical shapes, both with significant operational cost:

  1. Custom Airflow + Python jobs reading from the broker, converting to Parquet, writing to S3, updating the Iceberg catalog. "Required specialized talent to write, test, and maintain them, which is error-prone and time-consuming."
  2. Managed connectors (Kafka Connect, Redpanda Connect, Debezium) running in a separate cluster doing the same job. "Introduce a middleman to the architecture, requiring you to configure and maintain a separate set of clusters for data integration." (Source: sources/2025-01-21-redpanda-implementing-the-medallion-architecture-with-redpanda)
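
The first shape reduces to a consume/convert/upload/commit loop. A minimal Python sketch of that loop, with stand-ins for the Parquet writer, object store, and catalog client — `Record`, `to_columnar`, `run_batch`, and the `bronze.events` table name are all hypothetical illustrations, not from the sources:

```python
import json
from dataclasses import dataclass

# Hypothetical record envelope as it comes off the broker.
@dataclass
class Record:
    topic: str
    partition: int
    offset: int
    timestamp_ms: int
    value: bytes

def to_columnar(batch):
    """Stand-in for the Parquet conversion step (pyarrow in a real job)."""
    return {
        "offset": [r.offset for r in batch],
        "partition": [r.partition for r in batch],
        "timestamp_ms": [r.timestamp_ms for r in batch],
        "value": [r.value.decode() for r in batch],
    }

def run_batch(batch, object_store, catalog, table="bronze.events"):
    """One iteration of the topic -> Bronze loop: convert, upload, commit."""
    columns = to_columnar(batch)
    key = f"{table}/part-{batch[0].offset}-{batch[-1].offset}.parquet"
    object_store[key] = json.dumps(columns)   # stand-in for the S3 put
    catalog.setdefault(table, []).append(key) # stand-in for the Iceberg catalog commit
    return key
```

Every step of this loop — batching, file naming, upload retries, catalog commits — is code somebody has to own, which is exactly the maintenance burden the quoted sources describe.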

Pattern. Make the streaming broker itself double as the Bronze tier of the lakehouse. The broker natively:

  1. Keeps the log segments for topic semantics (consumer groups, replay, retention).
  2. Writes the same data as columnar Parquet files to object storage.
  3. Registers the resulting tables with the external Iceberg REST catalog the lakehouse uses.

Downstream Iceberg-compatible engines (ClickHouse, Snowflake, Databricks, Trino, Flink, Spark) query the tables as if the broker were a first-class lakehouse citizen — without any intermediate pipeline.

Canonical instance: Redpanda Iceberg topics, which realise this pattern as a broker feature. The Iceberg topic concept is the primitive.
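
As a sketch of what config-only onboarding looks like in the canonical instance (command and property names as commonly documented for Redpanda's Iceberg topics; treat as illustrative and verify against your broker version):

```shell
# Cluster side: enable Iceberg and point the broker at the lakehouse's
# external REST catalog (illustrative property names).
rpk cluster config set iceberg_enabled true
rpk cluster config set iceberg_catalog_type rest

# Topic side: flip an existing topic into an Iceberg table -- no producer
# or consumer redeploy required.
rpk topic alter-config orders --set redpanda.iceberg.mode=key_value
```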

When to use

  • Streaming broker produces data the analytics team wants in the lakehouse Bronze tier anyway.
  • Existing ETL pipeline (Airflow / Connect) is operational overhead that doesn't add value beyond data movement.
  • Metadata preservation (offset / partition / timestamp) matters to downstream consumers.
  • The broker team and the lakehouse team want to decouple: the broker team ships topic changes without coordinating with an integration team.
  • Bronze-tier latency tolerance is in seconds-to-minutes (matching the broker's Iceberg-commit cadence), not sub-second.

When NOT to use

  • You need sub-second end-to-end latency from produce to Bronze query — the Iceberg commit interval is a latency floor.
  • Your broker is Apache Kafka (no native Iceberg topic) and you aren't willing to swap — the pattern is currently Redpanda-specific among Kafka-API implementations.
  • You need schema transformations on the way to Bronze — the pattern assumes Bronze is raw-as-ingested; transforms belong on the Bronze→Silver transition.
  • Cost-per-byte is tight and the duplicate-storage overlap (log segments + object-store Parquet during retention) is unacceptable.
  • Your Iceberg catalog is not one of the supported REST-catalog choices (Unity, Polaris, Glue) — pattern depends on external catalog interop.

Trade-offs

Wins:

  • Eliminates an integration tier — no Airflow DAG, no Kafka Connect cluster, no Python jobs to maintain.
  • Preserves Kafka-layer metadata in Bronze — offset, partition, timestamp as columns, not lost on ingestion.
  • Config-only onboarding — existing topics become Iceberg tables without redeploying producers or consumers.
  • Operational surface shrinks — one fewer cluster to patch, monitor, scale, and pay for.

Costs:

  • Vendor lock-in at the broker layer (Redpanda-specific feature as of 2025-01).
  • Duplicate storage during retention window — data in log segments and Parquet files simultaneously.
  • Commit-cadence latency floor — Bronze rows not queryable via Iceberg until the next snapshot commit.
  • Compaction / GC ownership unclear — periodic Iceberg housekeeping still has to happen somewhere.
  • Schema-evolution complexity — Kafka client serializers (Avro / Protobuf via a schema registry) and Iceberg's schema evolution need to stay compatible; integration shape depends on the broker's handling.
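
The duplicate-storage cost is easy to bound: during the retention window each byte lives in the replicated log segments and in the object-store Parquet files at once. A back-of-envelope sketch, with all numbers purely illustrative:

```python
# Illustrative numbers only -- substitute your own topic's figures.
ingest_mb_per_s = 50       # topic throughput
retention_days = 7         # log-segment retention window
replication_factor = 3     # broker-side copies of each log segment
parquet_ratio = 0.4        # assumed Parquet size relative to raw bytes

seconds = retention_days * 24 * 3600
raw_gb = ingest_mb_per_s * seconds / 1024

log_gb = raw_gb * replication_factor   # replicated log segments on the broker
parquet_gb = raw_gb * parquet_ratio    # single durable copy in object storage
overlap_gb = log_gb + parquet_gb       # total footprint during the window

print(f"log: {log_gb:.0f} GiB, parquet: {parquet_gb:.0f} GiB, "
      f"total during retention: {overlap_gb:.0f} GiB")
```

Once records age past retention, the log-segment copy disappears and only the Parquet copy remains, so the overlap is bounded by the retention window, not the table's lifetime.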

Composition

This pattern is the upstream half of a complete broker-native lakehouse flow. The downstream half — Bronze→Silver→Gold transitions — can either be:

  • Batch ELT (canonical): scheduled dbt DAG on Airflow running Iceberg SQL transformations per tier. Simple, operator-familiar, latency-insensitive.
  • Streaming ETL (optional): see the companion pattern patterns/stream-processor-for-real-time-medallion-transitions — Flink/Spark jobs reading the Bronze Iceberg table, writing the Silver Iceberg table, then Gold, via Iceberg's Flink sink connector. Near-real-time at the cost of more operational complexity.

A telemetry-to-lakehouse pipeline is the sibling pattern when the upstream is telemetry rather than transactional business data; the broker-native Iceberg-sink move applies equally well there.

Seen in

  • sources/2026-03-05-redpanda-introducing-iceberg-output-for-redpanda-connect — sink-connector-altitude variant. Redpanda Connect's 2026-03-05 Iceberg output launch instantiates the Bronze-sink pattern at the sink-connector altitude (stateless K8s container writing to REST-catalog-registered Iceberg tables) rather than the broker-native altitude of Iceberg Topics. The two realisations are complementary — explicitly canonicalised as patterns/sink-connector-as-complement-to-broker-native-integration. Adds three new compositional properties to the Bronze-sink pattern catalogue: (1) data-driven flushing for small-file-problem mitigation; (2) registry-less schema evolution from raw JSON; (3) multi-table routing from one pipeline.

  • sources/2026-01-06-redpanda-build-a-real-time-lakehouse-architecture-with-redpanda-and-databricks — joint Redpanda-Databricks instantiation. Pattern framed at tech-talk-recap altitude with Unity Catalog as the specific downstream governance endpoint. The Bronze-sink move is sloganised as "the stream is the table" and "streaming data is analytics-ready by default". Alternatives-displaced verbatim: "Before Iceberg Topics, making streaming data available in a lakehouse typically required significant manual effort. Teams either built custom ETL jobs using frameworks like Spark or deployed heavyweight connector architectures to move data from streaming platforms into analytical systems. To add insult to injury, these pipelines were often brittle and operationally expensive." Four verbatim wins enumerated: lower infrastructure costs + faster time-to-insight + fewer human hours on pipeline maintenance + more free time for data products + AI applications. Medallion tiers are implicit — post only walks Bronze, doesn't enumerate Silver / Gold. No mechanism depth beyond prior sources.

  • sources/2025-01-21-redpanda-implementing-the-medallion-architecture-with-redpanda — canonical wiki source. Redpanda positions Iceberg topics explicitly as the mechanism that makes the streaming broker serve as the Bronze layer of a Medallion-architected lakehouse without an external ETL system. Pedagogy altitude; no production numbers.
  • sources/2025-06-24-redpanda-why-streaming-is-the-backbone-for-ai-native-data-platforms — Bronze-sink pattern at vision / backbone altitude. Frames the broker-as-Bronze move as one structural property of a streaming-backbone data platform: "streams can be joined and processed in real time so data lands in any form to maximize its queryability, without expensive batch jobs that constantly reprocess the entire dataset." Names both Iceberg Topics (open-format route) and Snowpipe Streaming (proprietary-format route) as the two concrete landing mechanisms. Thought-leadership altitude; no numbers.