

Medallion Architecture

The Medallion Architecture is a three-tier pattern for organising data inside a data lakehouse, with tiers ordered by progressive data quality. Each tier is named for a "colour" of refinement:

  • Bronze — raw, unprocessed data as-ingested from source. Preserves original form, structure, errors, duplicates, and inconsistencies. Provides the audit trail and the reprocessing substrate — you can always rebuild later layers from Bronze.
  • Silver — cleansed, deduplicated, enriched data. Standardised formats, null-handling, joins against reference / metadata sources. Intermediate-analytics-ready; bridge between raw and insight.
  • Gold — aggregated, business-ready data. Optimised for BI dashboards, reporting, and ML training. High usability, low granularity.
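The three-layer progression can be sketched on toy in-memory records (a minimal illustration; the field names and cleansing rules are invented, not a real lakehouse API):

```python
# Toy sketch of the three Medallion layers on in-memory records.
from collections import defaultdict

# Bronze: raw events exactly as ingested -- duplicates and bad rows preserved.
bronze = [
    {"order_id": "A1", "amount": "10.00", "region": "eu"},
    {"order_id": "A1", "amount": "10.00", "region": "eu"},   # duplicate kept
    {"order_id": "A2", "amount": None,    "region": "US"},   # bad record kept
    {"order_id": "A3", "amount": "5.50",  "region": "us"},
]

# Silver: deduplicate, drop unusable rows, standardise formats.
seen = set()
silver = []
for row in bronze:
    if row["order_id"] in seen or row["amount"] is None:
        continue
    seen.add(row["order_id"])
    silver.append({
        "order_id": row["order_id"],
        "amount": float(row["amount"]),      # standardise type
        "region": row["region"].upper(),     # standardise format
    })

# Gold: business-ready aggregate, low granularity.
gold = defaultdict(float)
for row in silver:
    gold[row["region"]] += row["amount"]

print(dict(gold))  # {'EU': 10.0, 'US': 5.5}
```

Note that Bronze keeps the duplicate and the null row: the cleansing decisions live entirely in the Bronze→Silver step, so a change of mind there is a re-run, not a re-ingest.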

Coined and popularised by Databricks; adopted by Microsoft's OneLake (the central data store in Microsoft Fabric, launched 2023), an endorsement of the pattern at hyperscaler scale.

"As data progresses through these layers, it undergoes increasing refinement, resulting in higher business value at each stage." (Source: sources/2025-01-21-redpanda-implementing-the-medallion-architecture-with-redpanda)

Why it's a pattern, not a product

Medallion is agnostic to compute engine and table format — it prescribes layer semantics, not a specific implementation. Typical implementations in practice:

  • Table-format substrate: Iceberg, Delta Lake, or Apache Hudi — an open table format providing ACID + schema evolution + time travel.
  • File-format substrate: Parquet or ORC — columnar for query efficiency.
  • Layer-transition engine: dbt for SQL-authored ELT DAGs scheduled by Airflow, or Flink / Spark for streaming-ETL layer transitions.
  • Query engines over Gold tables: Databricks, Snowflake, Trino, Presto, ClickHouse, Dremio.

Properties

  • Lineage by layer — every Gold row is traceable to a Silver row which is traceable to a Bronze row. Reprocessing after a bug in the Silver SQL is "re-run from Bronze", not "re-ingest from source".
  • Separation of refinement cost from retention cost — Bronze is cheap, large, rarely queried; Gold is small, expensive to compute, heavily queried. Storage class + compute allocation can diverge per layer.
  • Schema evolution staged per layer — Bronze can accept heterogeneous raw payloads; Silver imposes the canonical schema; Gold imposes the BI-consumable schema. Changes at each boundary are independent.
  • Partial-failure isolation — a broken Silver transformation doesn't corrupt Bronze; a broken Gold aggregation doesn't corrupt Silver. Every layer is rebuildable from the previous one.
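The lineage and rebuildability properties can be made concrete with a toy sketch in which each derived row records the Bronze offsets that produced it (the `_src_offsets` field and the quarantine rule are hypothetical, for illustration only):

```python
# Sketch of per-layer lineage: every derived row records which upstream
# rows produced it, so fixing a bad Silver transform means "re-run from
# Bronze", never "re-ingest from source".
bronze = [
    {"_offset": 0, "sku": "a", "qty": "2"},
    {"_offset": 1, "sku": "a", "qty": "3"},
    {"_offset": 2, "sku": "b", "qty": "x"},  # unparseable; Silver omits it
]

def to_silver(bronze_rows):
    out = []
    for r in bronze_rows:
        try:
            out.append({"sku": r["sku"], "qty": int(r["qty"]),
                        "_src_offsets": [r["_offset"]]})
        except ValueError:
            pass  # row stays safely in Bronze; nothing is lost
    return out

def to_gold(silver_rows):
    agg = {}
    for r in silver_rows:
        g = agg.setdefault(r["sku"], {"total_qty": 0, "_src_offsets": []})
        g["total_qty"] += r["qty"]
        g["_src_offsets"] += r["_src_offsets"]
    return agg

gold = to_gold(to_silver(bronze))
# gold["a"] traces back to Bronze offsets [0, 1]; offset 2 never reached Gold,
# but it is still sitting in Bronze if the parsing rule is later fixed.
```

Because `to_silver` and `to_gold` are pure functions of the previous layer, a bug fix in either is just a replay, which is the partial-failure isolation property in miniature.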

Failure modes / anti-patterns

  • Skipping Bronze as raw — if ingestion does light cleansing "to save space", the audit trail is gone and reprocessing is impossible. Bronze must be the unaltered source of truth.
  • Writing Silver directly — bypassing Bronze because "we know what we want" trades away audit trail + schema-evolution capacity for marginal storage savings. Almost always wrong.
  • Conflating Silver and Gold — one large "analytics" table serving both intermediate queries and dashboards makes schema changes expensive and aggregation cost unbounded. Separate the layers.
  • Batch-only layer transitions when latency matters — the standard dbt + Airflow pattern schedules Bronze→Silver→Gold as batch; a bronze-landed row can take hours to reach Gold. If business wants near-real-time, the layer transitions need to be streaming (see patterns/stream-processor-for-real-time-medallion-transitions).
  • Skipping the table format — writing raw Parquet at each layer without Iceberg/Delta/Hudi gives up ACID + schema evolution + snapshotting. Near-impossible to operate safely at scale.

Composition with streaming substrates

A streaming broker naturally produces Bronze-shaped data — an append-only log of as-received events. The question is how that log becomes a queryable table. Two architectural moves:

  1. External ETL from broker to lakehouse — Kafka Connect / Redpanda Connect / Debezium reads the topic, converts, and writes to the Bronze Iceberg table. Introduces an integration cluster to operate.
  2. Broker-native Iceberg sink — the broker writes topic data directly as Iceberg rows in object storage, updating the Iceberg catalog transparently. Canonical instance: Redpanda Iceberg topics. This is the streaming-broker-as-lakehouse-Bronze-sink pattern.
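The Bronze-sink contract in move 1 can be sketched with the topic and Bronze table as in-memory stand-ins (a real pipeline would use a Kafka/Redpanda consumer and an Iceberg or Delta writer; the structures and `sink_batch` helper here are invented):

```python
# Minimal sketch of move 1: an external ETL process tailing a broker topic
# and appending every record verbatim to a Bronze table.
import json

topic = [  # append-only log of as-received events (already Bronze-shaped)
    b'{"user": "u1", "event": "click"}',
    b'{"user": "u2", "event": "view"}',
    b'not json at all',  # malformed payloads still land in Bronze
]

bronze_table = []

def sink_batch(records, table):
    """Append every record verbatim; parse lazily, never drop."""
    for offset, raw in enumerate(records, start=len(table)):
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            payload = None  # keep the raw bytes; fix-up happens in Silver
        table.append({"offset": offset, "raw": raw, "payload": payload})

sink_batch(topic, bronze_table)
print(len(bronze_table))  # 3 -- nothing is filtered on the way into Bronze
```

The key property either move must preserve: the Bronze table is a faithful copy of the log, malformed records included, so every later layer remains rebuildable from it.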

Seen in

  • sources/2025-01-21-redpanda-implementing-the-medallion-architecture-with-redpanda — canonical wiki source for Medallion Architecture as a concept. Walks the three-layer progression, names the Databricks origin + Microsoft OneLake adoption, positions Redpanda's Iceberg topics as the mechanism that lets a streaming broker serve as the Bronze layer natively, and argues that Flink's Iceberg sink connector can replace scheduled batch dbt + Airflow runs for Silver/Gold transitions when real-time latency matters. Pedagogy altitude, no production numbers.