REDPANDA 2025-01-21

Redpanda — Implementing the Medallion Architecture with Redpanda

Summary

Redpanda (2025-01-21) publishes a pedagogy-altitude explainer on the Medallion Architecture — Databricks' three-tier Bronze / Silver / Gold data-storage pattern for data lakes and lakehouses — and positions Redpanda's topic-level Apache Iceberg integration as the mechanism that lets a streaming broker serve as the Bronze layer directly, with no intermediate ETL hop. The post walks the three layers (raw ingest → cleansed / enriched → analytics-ready / aggregated), canonicalises open file formats (Parquet, ORC) + open table formats (Iceberg) as the dual-layer storage substrate, then argues that Redpanda's Iceberg topics collapse the producer → Bronze data-movement step into a configuration flip: "writing their row-oriented data into Redpanda topics, and then Redpanda will take care of everything that happens after that — projecting row-oriented records into a column-oriented file format, reliably writing them to the object storage, and making required updates to the Iceberg catalog." Second-order claim: when the Bronze layer is a streaming broker, the Silver / Gold transitions can happen via stream processing (Flink's Iceberg sink connector) rather than scheduled batch dbt + Airflow runs, shrinking end-to-end latency to near-real-time.
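
The row-to-column projection quoted above can be illustrated with a minimal Python sketch. It is a toy model under stated assumptions, not Redpanda's implementation: `Record` and `to_columns` are hypothetical names, and the sketch only pivots records in memory, whereas Iceberg topics write columnar files to object storage and update the Iceberg catalog. The column set mirrors the fields this page mentions (key, value, offset, partition, timestamp).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    """Hypothetical stand-in for one row-oriented record in a topic partition."""
    partition: int
    offset: int
    timestamp_ms: int
    key: Optional[bytes]
    value: bytes


def to_columns(records: list[Record]) -> dict[str, list]:
    """Pivot row-oriented records into one list per column (columnar layout)."""
    columns: dict[str, list] = {
        "partition": [], "offset": [], "timestamp_ms": [], "key": [], "value": [],
    }
    for rec in records:
        for name in columns:
            columns[name].append(getattr(rec, name))
    return columns


# Two illustrative records; the payloads are made up for this example.
batch = [
    Record(0, 41, 1737417600000, b"sensor-1", b'{"temperature": 21.5}'),
    Record(0, 42, 1737417601000, b"sensor-2", b'{"temperature": 19.0}'),
]
table = to_columns(batch)
print(table["offset"])  # [41, 42]
```

Keeping offset, partition, and timestamp as their own columns is what lets a downstream query filter or join on broker-level metadata without parsing the payload.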

Key takeaways

  1. The Medallion Architecture is a three-tier data-quality pattern, not a product: Bronze holds raw, as-is ingested data (audit trail + lineage); Silver holds cleansed, deduplicated, enriched data (structured for intermediate analytics); Gold holds aggregated, business-ready data (BI dashboards, ML training). Originated at Databricks; adopted by Microsoft's OneLake in Microsoft Fabric (launched 2023). Load-bearing claim verbatim: "As data progresses through these layers, it undergoes increasing refinement, resulting in higher business value at each stage." (Source: post body — concepts/medallion-architecture)

  2. Open file formats + open table formats form a two-layer storage substrate. File formats like Parquet and ORC give columnar compression + query efficiency at the object level; table formats like Apache Iceberg add the metadata layer (schema evolution, ACID transactions, snapshot versioning, partitioning) that file formats alone can't provide. Canonical framing: "storing only data in open file formats wouldn't be sufficient. We need a metadata layer on top of it to provide transactional guarantees, schema evolution, and many more. This is where table formats come into the scene." (Source: post body — concepts/open-file-format, concepts/open-table-format)

  3. Redpanda's Iceberg topics make a streaming broker double as the Bronze layer without an external ETL tool. Verbatim: "Redpanda's newly launched feature, topic-level integration with Iceberg, enables creating Iceberg tables from your topics — storing row-oriented data in a Redpanda topic inside object storage in a columnar format, supporting the Iceberg table format. Tables can be created with or without a schema and registered with your Iceberg table catalog. Once registered, you can query them using popular Iceberg-compatible query engines like ClickHouse, Snowflake, Databricks, Dremio, and many more." (Source: post body — systems/redpanda-iceberg-topics, patterns/streaming-broker-as-lakehouse-bronze-sink)

  4. The provenance/metadata-preservation property is load-bearing. Redpanda's Iceberg topics "incorporate technical metadata from the upstream Redpanda infrastructure, including offset, partition, and timestamp" — so downstream analytics can join on Kafka-layer metadata, not just payload. Without this, the Bronze layer loses the audit trail the streaming layer already had. (Source: post body)

  5. The ETL/connector alternative is explicitly rejected on operational grounds. Redpanda lists the two standard ways to move data from a streaming broker to a lakehouse — custom data-engineering jobs in Python on Airflow, or managed connectors like Redpanda Connect / Kafka Connect — and enumerates the operational costs of each: the first "required specialized talent to write, test, and maintain them, which is error-prone and time-consuming" plus a performance cost from the read-then-write hop; the second "introduce a middleman to the architecture, requiring you to configure and maintain a separate set of clusters for data integration" with no config-based onboarding path for existing topics. (Source: post body)

  6. Bronze→Silver transformations are canonically a dbt-on-scheduled-orchestrator workload, but can be shifted to streaming ETL. The post walks a worked bronze_sensor_data → silver_sensor_data transformation in dbt SQL that filters out inactive sensors, temperature outliers, and null humidity readings and adds a derived temperature_status column, scheduled via Airflow. It then canonicalises the stream-processing alternative: "we can do these transformations in real time when the data lands in the Bronze layer… by bringing in stream processing to this architecture… Apache Iceberg provides a Flink connector that supports both reading from Iceberg tables, performing transformations, and writing the processed data back into new or existing Iceberg tables. This speeds up the data transformation between layers by shifting ELT pipelines to streaming ETL." (Source: post body — patterns/stream-processor-for-real-time-medallion-transitions)

  7. Schema enforcement via the schema registry is optional. Iceberg topics can apply a registered schema (giving the Bronze Iceberg table a typed structure) or fall back to a default layout (record key + value + offset + partition + timestamp). Framed as operator choice: "it is not mandatory to enforce schemas on the data in the Bronze layer". Medallion's audit-trail property is preserved either way, but typed Bronze tables pay off downstream at Silver/Gold. (Source: post body — concepts/schema-registry)
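
The Bronze→Silver filtering described in takeaway 6 can be sketched in plain Python. This is an illustration only: the post expresses the step in dbt SQL, which is not reproduced here, and the column names other than temperature_status, the "active"/"inactive" status values, and all numeric thresholds below are assumptions chosen for the example.

```python
from typing import Optional, TypedDict


class BronzeRow(TypedDict):
    """Hypothetical shape of one row in the Bronze sensor table."""
    sensor_id: str
    status: str                   # assumed values: "active" / "inactive"
    temperature: Optional[float]  # assumed unit: degrees Celsius
    humidity: Optional[float]


# Assumed bounds for illustration; the post's actual thresholds are not shown here.
TEMP_MIN_C, TEMP_MAX_C = -40.0, 85.0
HIGH_TEMP_C = 30.0


def to_silver(rows: list[BronzeRow]) -> list[dict]:
    """Drop inactive sensors, temperature outliers, and null humidity,
    then add a derived temperature_status column."""
    silver = []
    for row in rows:
        if row["status"] != "active":
            continue
        temp = row["temperature"]
        if temp is None or not (TEMP_MIN_C <= temp <= TEMP_MAX_C):
            continue
        if row["humidity"] is None:
            continue
        status = "high" if temp > HIGH_TEMP_C else "normal"
        silver.append({**row, "temperature_status": status})
    return silver


bronze: list[BronzeRow] = [
    {"sensor_id": "s1", "status": "active", "temperature": 22.0, "humidity": 45.0},
    {"sensor_id": "s2", "status": "inactive", "temperature": 21.0, "humidity": 50.0},
    {"sensor_id": "s3", "status": "active", "temperature": 400.0, "humidity": 40.0},
    {"sensor_id": "s4", "status": "active", "temperature": 35.0, "humidity": None},
    {"sensor_id": "s5", "status": "active", "temperature": 35.5, "humidity": 30.0},
]
silver = to_silver(bronze)
print([(r["sensor_id"], r["temperature_status"]) for r in silver])
# [('s1', 'normal'), ('s5', 'high')]
```

The same predicate could run per record in a stream processor instead of per batch on a schedule, which is the shift from batch ELT to streaming ETL that the post argues for.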

Systems named

Concepts extracted

Patterns extracted

Operational / architectural numbers

None stated — this is a pedagogy-altitude explainer. No throughput, latency, or cost numbers. Architecture-density is moderate (~50%) with diagrams inlined (not extracted here).

Caveats / quality notes

  • Vendor-marketing frame. The post is explicitly positioning Redpanda's Iceberg topics feature; the Medallion walkthrough is framing to motivate the feature pitch. Not a retrospective, not a benchmark, not a production incident. Technical substance is accurate, but the narrative is slanted in Redpanda's favour.
  • Tier-3 source. Redpanda is a streaming-platform vendor (tier-3 per this wiki); stricter content filter applies. Borderline-case test passes on vocabulary-canonicalisation grounds (Medallion Architecture as a first-class wiki concept was a gap — referenced implicitly across several telemetry/lakehouse pages but never defined) + streaming-broker-as-Bronze-sink grounds (novel architectural move Iceberg-topics enables). Fails on production-numbers grounds (zero) and on ingest-altitude grounds (pedagogy, not internals).
  • R1 engine referenced but deferred. Post mentions Redpanda's "upcoming" R1 multi-modal engine as the broader vision the Iceberg integration is a piece of; R1 substrate details absent.
  • Catalog / REST-catalog detail thin. Post names "external Iceberg REST catalogs like Databricks Unity and Snowflake Polaris" as supported; doesn't walk the catalog protocol, partition specification, snapshot metadata, or compaction handling. Each of these is a substantive operational surface that the post glosses.
  • Compaction / GC handling unaddressed. Iceberg's operational burden — periodic compaction of small deltas, superseded-snapshot GC — sits with someone; the post doesn't say whether Redpanda owns it, the REST catalog owns it, or the customer owns it. This is the load-bearing externalisation-cost question Iceberg's own wiki entry canonicalises; the post elides it.
  • Streaming-ETL-via-Flink claim unquantified. "Speeds up the data transformation between layers" is asserted without a latency number, a throughput number, or a comparison point against the batch-dbt baseline. Reader has to take the claim on framing.
  • Schema-enforcement failure modes unexplored. "Not mandatory to enforce schemas" is stated but the consequences of schema-less Bronze (heterogeneous row shapes, downstream casting pain, schema-evolution ambiguity) are not walked.
  • Cost framing absent. Iceberg topics write to object storage while the Redpanda broker still retains the same records in the topic log, so the data is stored twice for the duration of the retention window. The post doesn't flag this duplicate-storage cost.
  • Authorship. Related Redpanda posts are commonly attributed to Dunith Dhanushka, Redpanda's pedagogy/devrel voice, though the 2025-01-21 post byline is company-level. Not a database-internals engineer; framing is architecture-pedagogy altitude, not mechanism altitude.

Source
