PATTERN
Stream processor for real-time medallion transitions¶
Problem. The canonical Medallion Architecture implementation runs the Bronze→Silver→Gold transitions as scheduled batch jobs — SQL transformations authored in dbt, scheduled by Apache Airflow (or Dagster, Prefect). Simple, operator-familiar, correct. But there's a structural cost: latency accumulates per hop.
Redpanda names it explicitly:
"these ELT/ETL tools do a great job transforming Iceberg tables between layers, they typically run as scheduled batch jobs, adding a noticeable delay to the insights derived at the Gold layer. From a business perspective, the only drawback of this 'multi-hop' architecture is the increased time to value, as it takes time for the data to traverse several layers." (Source: sources/2025-01-21-redpanda-implementing-the-medallion-architecture-with-redpanda)
A row that landed in Bronze at 10:00 is not in Gold until the next Silver job runs (say, hourly) and then the next Gold job (say, daily). End-to-end latency: up to roughly 25 hours in the worst case. For dashboards or features that need near-real-time freshness, that's a non-starter.
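The worst case is simple additive arithmetic: a row can land just after each tier's job has finished, so it waits nearly one full schedule interval per hop. A quick sanity check using the hypothetical hourly/daily schedules above:

```python
# Worst-case end-to-end latency of a multi-hop batch medallion pipeline:
# a row can arrive just after each tier's job has run, so it waits up to
# one full schedule interval per hop. Intervals are the hypothetical
# hourly-Silver / daily-Gold schedules from the text, not real numbers.
def worst_case_latency_hours(*hop_intervals_hours: float) -> float:
    """Sum of per-hop schedule intervals = worst-case end-to-end delay."""
    return sum(hop_intervals_hours)

silver_every = 1.0   # Bronze -> Silver runs hourly
gold_every = 24.0    # Silver -> Gold runs daily

print(worst_case_latency_hours(silver_every, gold_every))  # 25.0
```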
Pattern. Replace the scheduled batch transformations between medallion tiers with streaming jobs reading and writing open-table-format tables continuously.
Canonical implementation: Apache Flink with Iceberg's Flink connector. A Flink job reads the Bronze Iceberg table as a continuous stream (Iceberg incremental reads on new snapshots), applies transformations (filters, joins, enrichments, windowed aggregations), and writes to the Silver Iceberg table via Flink's Iceberg sink connector. A second Flink job does the same for Silver→Gold.
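The loop at the heart of this pattern (poll the upstream table for snapshots newer than the last one processed, transform the new rows, commit downstream, advance the checkpoint) can be sketched as a toy in-memory simulation. The classes below are illustrative stand-ins, not Iceberg or Flink APIs; they only mimic the incremental-read/commit cycle the connector implements:

```python
# Toy simulation of the Bronze -> Silver streaming hop: an "incremental
# read" polls the Bronze table for snapshots committed since the last one
# processed, transforms the new rows, and appends a Silver snapshot.
# All classes here are illustrative stand-ins, not real Iceberg/Flink APIs.
from dataclasses import dataclass, field

@dataclass
class Snapshot:
    snapshot_id: int
    rows: list

@dataclass
class Table:
    snapshots: list = field(default_factory=list)

    def append(self, rows):
        self.snapshots.append(Snapshot(len(self.snapshots) + 1, rows))

    def snapshots_after(self, snapshot_id):
        # Incremental read: only snapshots newer than the checkpointed id.
        return [s for s in self.snapshots if s.snapshot_id > snapshot_id]

def run_hop(bronze, silver, last_seen, transform):
    """One poll cycle of the streaming job; returns the new checkpoint."""
    for snap in bronze.snapshots_after(last_seen):
        silver.append([transform(r) for r in snap.rows])
        last_seen = snap.snapshot_id
    return last_seen

bronze, silver = Table(), Table()
clean = lambda r: {**r, "v": abs(r["v"])}

bronze.append([{"user": "a", "v": 1}, {"user": "b", "v": -2}])
checkpoint = run_hop(bronze, silver, 0, clean)       # processes snapshot 1
bronze.append([{"user": "c", "v": 3}])
checkpoint = run_hop(bronze, silver, checkpoint, clean)  # only snapshot 2
print(checkpoint, len(silver.snapshots))  # 2 2
```

Each poll cycle only touches data committed since the checkpoint, which is why latency collapses to the commit/poll interval instead of a job schedule.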
Alternatives at the same architectural role:
- Apache Spark Structured Streaming with Iceberg connectors.
- Streaming databases that natively support Iceberg source/sink (Materialize, RisingWave).
- Kafka-native stream-processing libraries (Kafka Streams, ksqlDB) when the tier transitions can happen on a Kafka topic before the Iceberg sink commits — pairs well with Iceberg topics at the Bronze tier.
When to use¶
- Gold-tier data powers real-time dashboards, alerts, or online ML inference where minute-scale (not day-scale) freshness matters.
- Bronze source is already a streaming substrate (Kafka / Redpanda Iceberg topic / CDC log) — the upstream is naturally continuous.
- The transformations are stateless or small-state (filter, project, enrich from reference data, simple aggregates). Heavy stateful operations (large windowed joins, complex deduplication) push Flink's operational cost up sharply.
- The team has stream-processing operational experience (or is willing to build it).
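The stateless-versus-small-state distinction in the bullets above comes down to how much the job must remember between events: a filter remembers nothing, while exact deduplication must remember every key it has ever seen, so its state grows without bound unless windowed or TTL'd. A minimal plain-Python illustration (no Flink involved):

```python
# Stateless op: needs no memory of previous events.
def keep_valid(event):
    return event.get("v") is not None

# Stateful op: exact deduplication must remember every key ever seen.
# This set is the state a stream processor would checkpoint; it grows
# with the number of distinct keys unless bounded by a TTL or a window.
seen = set()
def dedup(event):
    if event["id"] in seen:
        return False
    seen.add(event["id"])
    return True

events = [{"id": 1, "v": 5}, {"id": 1, "v": 5}, {"id": 2, "v": None}]
valid = [e for e in events if keep_valid(e)]   # drops the None-valued row
out = [e for e in valid if dedup(e)]           # drops the duplicate id
print(out, len(seen))  # [{'id': 1, 'v': 5}] 1
```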
When NOT to use¶
- Data freshness requirements are loose (hours-to-days acceptable). Scheduled batch dbt / Airflow is simpler and cheaper.
- Transformations rely on reference data that only updates daily. Batch alignment is natural; streaming is overkill.
- Team has dbt / SQL expertise but no Flink / Spark expertise. Streaming-ETL operational burden is non-trivial — checkpoint management, state-backend tuning, restart / replay semantics, watermark / event-time correctness.
- Correctness guarantees need ad-hoc reprocessing flexibility. Batch pipelines let you just re-run; streaming pipelines need careful replay design (Iceberg's snapshot versioning helps here but doesn't remove the burden).
- Transformations involve complex multi-source joins that aren't naturally expressible in streaming semantics. Batch SQL is more forgiving than streaming SQL for this.
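Iceberg's snapshot history is what makes streaming replay tractable at all: reprocessing means rewinding the job's checkpoint to an earlier snapshot id, re-reading everything committed after it, and rolling back the downstream tier to match. A toy sketch of the rewind with illustrative data (not real Iceberg APIs):

```python
# Replay sketch: a streaming job's checkpoint is "last snapshot id
# processed"; reprocessing = rewind the checkpoint and re-read every
# later snapshot. The downstream table must also be rolled back, which
# is the coordination batch pipelines get for free ("just re-run the
# job"). History contents are illustrative.
bronze_history = [
    (1, ["row-a", "row-b"]),   # (snapshot_id, rows committed in it)
    (2, ["row-c"]),
    (3, ["row-d", "row-e"]),
]

def rows_since(history, checkpoint_id):
    return [r for sid, rows in history if sid > checkpoint_id for r in rows]

# Normal operation: checkpoint at snapshot 3, nothing new to read.
assert rows_since(bronze_history, 3) == []

# Replay after fixing a transform bug introduced at snapshot 2:
# rewind the checkpoint to 1 and reprocess everything after it.
replayed = rows_since(bronze_history, 1)
print(replayed)  # ['row-c', 'row-d', 'row-e']
```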
Composition¶
This pattern sits downstream of patterns/streaming-broker-as-lakehouse-bronze-sink in a fully streaming-native medallion stack:
producers
│
▼
┌────────────────────────┐
│ Streaming broker with │ ← streaming-broker-as-lakehouse-bronze-sink
│ Iceberg topic (Bronze) │ pattern: broker writes Iceberg directly
└─────────┬──────────────┘
│ Iceberg incremental read
▼
┌──────────────┐
│ Flink job │ ← this pattern (Bronze → Silver transition)
└─────┬────────┘
│ Iceberg sink
▼
┌────────────────────────┐
│ Silver Iceberg table │
└─────────┬──────────────┘
│ Iceberg incremental read
▼
┌──────────────┐
│ Flink job │ ← this pattern (Silver → Gold transition)
└─────┬────────┘
│ Iceberg sink
▼
┌────────────────────────┐
│ Gold Iceberg table │ ← queried by BI / ML / dashboards
└────────────────────────┘
The result is end-to-end medallion latency on the order of the Iceberg commit interval (seconds-to-minutes), versus batch-ELT latency on the order of the slowest scheduled job (hours-to-days).
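As a back-of-envelope comparison of the two bounds, with hypothetical per-hop intervals (these numbers are illustrative, not from the source):

```python
# Rough freshness bound for a two-hop medallion stack: each hop adds at
# most one interval -- the Iceberg commit interval for streaming jobs,
# the schedule interval for batch jobs. Interval values are hypothetical.
def worst_case_seconds(per_hop_interval_s: float, hops: int = 2) -> float:
    return per_hop_interval_s * hops

streaming = worst_case_seconds(60.0)       # 1-minute commit interval per hop
batch = worst_case_seconds(24 * 3600.0)    # daily schedule per hop
print(streaming / 60, batch / 3600)  # 2.0 48.0 (minutes vs hours)
```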
Trade-offs¶
Wins:
- End-to-end latency collapses from hours/days to seconds/minutes.
- Compute is amortised continuously rather than in batch spikes at job-schedule times.
- Backpressure-aware — stream processors natively handle upstream rate variation; batch schedulers do not.
Costs:
- Operational complexity — Flink/Spark streaming jobs require checkpoint management, state-backend tuning, watermark reasoning, restart / replay design. Much higher operator skill floor than "dbt run + Airflow cron".
- Always-on cost — streaming jobs run continuously; batch jobs only during their scheduled window.
- Reprocessing friction — re-running a batch job is "just run it again"; re-running a streaming job requires coordinating Iceberg snapshot rollback / replay carefully.
- Schema evolution on running jobs — Flink jobs holding state across schema changes need careful migration; batch jobs just pick up the new schema on their next run.
- Hybrid architectures get complex — teams often end up with streaming for latency-sensitive Gold tables and batch for the rest, doubling the operational surface (see patterns/hybrid-batch-streaming-ingestion).
Seen in¶
- sources/2025-01-21-redpanda-implementing-the-medallion-architecture-with-redpanda — canonical wiki source naming the shift from scheduled batch ELT (dbt + Airflow) to streaming ETL (Flink + Iceberg Flink connector) as the Medallion-latency-reduction move. Frames it as composable with Redpanda's Iceberg topics at the Bronze layer for an end-to-end streaming-native medallion stack.
Related¶
- concepts/medallion-architecture — the architectural context.
- concepts/data-lakehouse · concepts/open-table-format — the substrate this pattern operates on.
- concepts/elt-vs-etl — the model this pattern shifts back from batch ELT toward streaming ETL (extract, transform, and load all happen continuously rather than on a schedule).
- systems/apache-flink · systems/apache-spark — the canonical stream processors.
- systems/apache-iceberg — the table format whose incremental reads + sink connectors make the pattern possible.
- systems/dbt · systems/apache-airflow — the batch alternative this pattern replaces.
- patterns/streaming-broker-as-lakehouse-bronze-sink — the upstream pattern this composes with.
- patterns/hybrid-batch-streaming-ingestion — the operational reality (mixed streaming + batch per tier / per workload).