

Medallion Architecture

The Medallion Architecture is a three-tier pattern for organising data inside a data lakehouse, with tiers ordered by progressive data quality. Each tier is named for a "colour" of refinement:

  • Bronze — raw, unprocessed data as-ingested from source. Preserves original form, structure, errors, duplicates, and inconsistencies. Provides the audit trail and the reprocessing substrate — you can always rebuild later layers from Bronze.
  • Silver — cleansed, deduplicated, enriched data. Standardised formats, null-handling, joins against reference / metadata sources. Intermediate-analytics-ready; bridge between raw and insight.
  • Gold — aggregated, business-ready data. Optimised for BI dashboards, reporting, and ML training. High usability, low granularity.
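The three-layer progression can be sketched on toy in-memory records (a minimal illustration; the field names and cleansing rules are invented, not a real lakehouse API):

```python
# Toy sketch of the three Medallion layers on in-memory records.
from collections import defaultdict

# Bronze: raw events exactly as ingested -- duplicates and bad rows preserved.
bronze = [
    {"order_id": "A1", "amount": "10.00", "region": "eu"},
    {"order_id": "A1", "amount": "10.00", "region": "eu"},   # duplicate kept
    {"order_id": "A2", "amount": None,    "region": "US"},   # bad record kept
    {"order_id": "A3", "amount": "5.50",  "region": "us"},
]

# Silver: deduplicate, drop unusable rows, standardise formats.
seen = set()
silver = []
for row in bronze:
    if row["order_id"] in seen or row["amount"] is None:
        continue
    seen.add(row["order_id"])
    silver.append({
        "order_id": row["order_id"],
        "amount": float(row["amount"]),      # standardise type
        "region": row["region"].upper(),     # standardise format
    })

# Gold: business-ready aggregate, low granularity.
gold = defaultdict(float)
for row in silver:
    gold[row["region"]] += row["amount"]

print(dict(gold))  # {'EU': 10.0, 'US': 5.5}
```

Note that Bronze keeps the duplicate and the null row: the cleansing decisions live entirely in the Bronze→Silver step, so a change of mind there is a re-run, not a re-ingest.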

Coined and popularised by Databricks; adopted by Microsoft's OneLake (the central data store in Microsoft Fabric, launched 2023), an endorsement of the pattern at hyperscaler scale.

"As data progresses through these layers, it undergoes increasing refinement, resulting in higher business value at each stage." (Source: sources/2025-01-21-redpanda-implementing-the-medallion-architecture-with-redpanda)

Why it's a pattern, not a product

Medallion is agnostic to compute engine and table format — it prescribes layer semantics, not a specific implementation. Typical implementations in practice:

  • Table-format substrate: Iceberg, Delta Lake, or Apache Hudi — an open table format providing ACID + schema evolution + time travel.
  • File-format substrate: Parquet or ORC — columnar for query efficiency.
  • Layer-transition engine: dbt for SQL-authored ELT DAGs scheduled by Airflow, or Flink / Spark for streaming-ETL layer transitions.
  • Query engines over Gold tables: Databricks, Snowflake, Trino, Presto, ClickHouse, Dremio.

Properties

  • Lineage by layer — every Gold row is traceable to a Silver row which is traceable to a Bronze row. Reprocessing after a bug in the Silver SQL is "re-run from Bronze", not "re-ingest from source".
  • Separation of refinement cost from retention cost — Bronze is cheap, large, rarely queried; Gold is small, expensive to compute, heavily queried. Storage class + compute allocation can diverge per layer.
  • Schema evolution staged per layer — Bronze can accept heterogeneous raw payloads; Silver imposes the canonical schema; Gold imposes the BI-consumable schema. Changes at each boundary are independent.
  • Partial-failure isolation — a broken Silver transformation doesn't corrupt Bronze; a broken Gold aggregation doesn't corrupt Silver. Every layer is rebuildable from the previous one.
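The lineage and rebuildability properties can be made concrete with a toy sketch in which each derived row records the Bronze offsets that produced it (the `_src_offsets` field and the quarantine rule are hypothetical, for illustration only):

```python
# Sketch of per-layer lineage: every derived row records which upstream
# rows produced it, so fixing a bad Silver transform means "re-run from
# Bronze", never "re-ingest from source".
bronze = [
    {"_offset": 0, "sku": "a", "qty": "2"},
    {"_offset": 1, "sku": "a", "qty": "3"},
    {"_offset": 2, "sku": "b", "qty": "x"},  # unparseable; Silver omits it
]

def to_silver(bronze_rows):
    out = []
    for r in bronze_rows:
        try:
            out.append({"sku": r["sku"], "qty": int(r["qty"]),
                        "_src_offsets": [r["_offset"]]})
        except ValueError:
            pass  # row stays safely in Bronze; nothing is lost
    return out

def to_gold(silver_rows):
    agg = {}
    for r in silver_rows:
        g = agg.setdefault(r["sku"], {"total_qty": 0, "_src_offsets": []})
        g["total_qty"] += r["qty"]
        g["_src_offsets"] += r["_src_offsets"]
    return agg

gold = to_gold(to_silver(bronze))
# gold["a"] traces back to Bronze offsets [0, 1]; offset 2 never reached Gold,
# but it is still sitting in Bronze if the parsing rule is later fixed.
```

Because `to_silver` and `to_gold` are pure functions of the previous layer, a bug fix in either is just a replay, which is the partial-failure isolation property in miniature.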

Failure modes / anti-patterns

  • Skipping Bronze as raw — if ingestion does light cleansing "to save space", the audit trail is gone and reprocessing is impossible. Bronze must be the unaltered source of truth.
  • Writing Silver directly — bypassing Bronze because "we know what we want" trades away audit trail + schema-evolution capacity for marginal storage savings. Almost always wrong.
  • Conflating Silver and Gold — one large "analytics" table serving both intermediate queries and dashboards makes schema changes expensive and aggregation cost unbounded. Separate the layers.
  • Batch-only layer transitions when latency matters — the standard dbt + Airflow pattern schedules Bronze→Silver→Gold as batch; a bronze-landed row can take hours to reach Gold. If business wants near-real-time, the layer transitions need to be streaming (see patterns/stream-processor-for-real-time-medallion-transitions).
  • Skipping the table format — writing raw Parquet at each layer without Iceberg/Delta/Hudi gives up ACID + schema evolution + snapshotting. Near-impossible to operate safely at scale.

Composition with streaming substrates

A streaming broker naturally produces Bronze-shaped data — an append-only log of as-received events. The question is how that log becomes a queryable table. Two architectural moves:

  1. External ETL from broker to lakehouse — Kafka Connect / Redpanda Connect / Debezium reads the topic, converts, and writes to the Bronze Iceberg table. Introduces an integration cluster to operate.
  2. Broker-native Iceberg sink — the broker writes topic data directly as Iceberg rows in object storage, updating the Iceberg catalog transparently. Canonical instance: Redpanda Iceberg topics. This is the streaming-broker-as-lakehouse-Bronze-sink pattern.
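The Bronze-sink contract in move 1 can be sketched with the topic and Bronze table as in-memory stand-ins (a real pipeline would use a Kafka/Redpanda consumer and an Iceberg or Delta writer; the structures and `sink_batch` helper here are invented):

```python
# Minimal sketch of move 1: an external ETL process tailing a broker topic
# and appending every record verbatim to a Bronze table.
import json

topic = [  # append-only log of as-received events (already Bronze-shaped)
    b'{"user": "u1", "event": "click"}',
    b'{"user": "u2", "event": "view"}',
    b'not json at all',  # malformed payloads still land in Bronze
]

bronze_table = []

def sink_batch(records, table):
    """Append every record verbatim; parse lazily, never drop."""
    for offset, raw in enumerate(records, start=len(table)):
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            payload = None  # keep the raw bytes; fix-up happens in Silver
        table.append({"offset": offset, "raw": raw, "payload": payload})

sink_batch(topic, bronze_table)
print(len(bronze_table))  # 3 -- nothing is filtered on the way into Bronze
```

The key property either move must preserve: the Bronze table is a faithful copy of the log, malformed records included, so every later layer remains rebuildable from it.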

Seen in

  • sources/2025-01-21-redpanda-implementing-the-medallion-architecture-with-redpanda — canonical wiki source for Medallion Architecture as a concept. Walks the three-layer progression, names the Databricks origin + Microsoft OneLake adoption, positions Redpanda's Iceberg topics as the mechanism that lets a streaming broker serve as the Bronze layer natively, and argues that Flink's Iceberg sink connector can replace scheduled batch dbt + Airflow runs for Silver/Gold transitions when real-time latency matters. Pedagogy altitude, no production numbers.