PATTERN Cited by 1 source

Stream processor for real-time medallion transitions

Problem. The canonical Medallion Architecture implements Bronze→Silver→Gold transitions as scheduled batch jobs — SQL transformations authored in dbt, orchestrated by Apache Airflow (or Dagster, Prefect). Simple, operator-familiar, correct. But there's a structural cost: latency accumulates at every hop.

Redpanda names it explicitly:

"[While] these ELT/ETL tools do a great job transforming Iceberg tables between layers, they typically run as scheduled batch jobs, adding a noticeable delay to the insights derived at the Gold layer. From a business perspective, the only drawback of this 'multi-hop' architecture is the increased time to value, as it takes time for the data to traverse several layers." (Source: sources/2025-01-21-redpanda-implementing-the-medallion-architecture-with-redpanda)

A row that landed in Bronze at 10:00 is not in Gold until the next Silver job runs (say, hourly) and then the next Gold job (say, daily). End-to-end latency: up to 24+ hours. For dashboards or features that need near-real-time freshness, that's a non-starter.

Pattern. Replace the scheduled batch transformations between medallion tiers with streaming jobs reading and writing open-table-format tables continuously.

Canonical implementation: Apache Flink with Iceberg's Flink connector. A Flink job reads the Bronze Iceberg table as a continuous stream (Iceberg incremental reads on new snapshots), applies transformations (filters, joins, enrichments, windowed aggregations), and writes to the Silver Iceberg table via Flink's Iceberg sink connector. A second Flink job does the same for Silver→Gold.
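As a sketch, the Bronze→Silver hop can be expressed in Flink SQL. Table and column names here are hypothetical; the OPTIONS hint uses the Iceberg Flink connector's streaming-read options:

```sql
-- Continuous Bronze → Silver transformation (illustrative names).
-- The hint enables Iceberg's incremental streaming read, polling for
-- new snapshots every 30 seconds.
INSERT INTO lakehouse.silver_events
SELECT
  event_id,
  user_id,
  LOWER(event_type) AS event_type,
  event_ts
FROM lakehouse.bronze_events
  /*+ OPTIONS('streaming'='true', 'monitor-interval'='30s') */
WHERE event_id IS NOT NULL;
```

Once submitted, the job runs indefinitely: each new Bronze snapshot is picked up, transformed, and committed to Silver, rather than waiting for a scheduler tick.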

Alternatives at the same architectural role:

  • Apache Spark Structured Streaming with Iceberg connectors.
  • Streaming databases that natively support Iceberg source/sink (Materialize, RisingWave).
  • Kafka-native stream-processing libraries (Kafka Streams, ksqlDB) when the tier transitions can happen on a Kafka topic before the Iceberg sink commits — pairs well with Iceberg topics at the Bronze tier.

When to use

  • Gold-tier data powers real-time dashboards, alerts, or online ML inference where minute-scale (not day-scale) freshness matters.
  • Bronze source is already a streaming substrate (Kafka / Redpanda Iceberg topic / CDC log) — the upstream is naturally continuous.
  • The transformations are stateless or small-state (filter, project, enrich from reference data, simple aggregates). Heavy stateful operations (large windowed joins, complex deduplication) push Flink's operational cost up sharply.
  • The team has stream-processing operational experience (or is willing to build it).

When NOT to use

  • Data freshness requirements are loose (hours-to-days acceptable). Scheduled batch dbt / Airflow is simpler and cheaper.
  • Transformations rely on reference data that only updates daily. Batch alignment is natural; streaming is overkill.
  • Team has dbt / SQL expertise but no Flink / Spark expertise. Streaming-ETL operational burden is non-trivial — checkpoint management, state-backend tuning, restart / replay semantics, watermark / event-time correctness.
  • Correctness guarantees need ad-hoc reprocessing flexibility. Batch pipelines let you just re-run; streaming pipelines need careful replay design (Iceberg's snapshot versioning helps here but doesn't remove the burden).
  • Transformations involve complex multi-source joins that aren't naturally expressible with streaming semantics. Batch SQL is more forgiving than streaming SQL here.
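On the replay point: Iceberg's snapshot versioning supplies the raw mechanism, but coordinating the rollback and the restart is still on the operator. A hedged sketch, with hypothetical table names and snapshot ids:

```sql
-- 1. Roll the Silver table back to a known-good snapshot
--    (Iceberg stored procedure, callable e.g. from Spark SQL).
CALL lakehouse.system.rollback_to_snapshot('db.silver_events', 4358109269773732072);

-- 2. Restart the Bronze → Silver streaming job pinned to the matching
--    Bronze snapshot, via the Iceberg Flink source's read option.
INSERT INTO lakehouse.silver_events
SELECT event_id, user_id, event_type, event_ts
FROM lakehouse.bronze_events
  /*+ OPTIONS('streaming'='true', 'start-snapshot-id'='8885588747946323176') */
WHERE event_id IS NOT NULL;
```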

Composition

This pattern sits downstream of patterns/streaming-broker-as-lakehouse-bronze-sink in a fully streaming-native medallion stack:

producers
┌────────────────────────┐
│ Streaming broker with  │   ← streaming-broker-as-lakehouse-bronze-sink
│ Iceberg topic (Bronze) │     pattern: broker writes Iceberg directly
└─────────┬──────────────┘
          │ Iceberg incremental read
    ┌──────────────┐
    │ Flink job    │        ← this pattern (Bronze → Silver transition)
    └─────┬────────┘
          │ Iceberg sink
┌────────────────────────┐
│  Silver Iceberg table  │
└─────────┬──────────────┘
          │ Iceberg incremental read
    ┌──────────────┐
    │ Flink job    │        ← this pattern (Silver → Gold transition)
    └─────┬────────┘
          │ Iceberg sink
┌────────────────────────┐
│   Gold Iceberg table   │   ← queried by BI / ML / dashboards
└────────────────────────┘
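The Silver→Gold job in the diagram is typically a small-state aggregation. A Flink SQL sketch with hypothetical names, assuming `event_ts` is declared as the Silver table's watermarked event-time attribute:

```sql
-- Illustrative Silver → Gold tumbling-window aggregate.
INSERT INTO lakehouse.gold_event_counts
SELECT
  window_start,
  event_type,
  COUNT(*) AS event_count
FROM TABLE(
  TUMBLE(TABLE lakehouse.silver_events, DESCRIPTOR(event_ts), INTERVAL '1' MINUTE))
GROUP BY window_start, window_end, event_type;
```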

The result is end-to-end medallion latency on the order of the Iceberg commit interval (seconds-to-minutes), versus batch-ELT latency on the order of the slowest scheduled job (hours-to-days).
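Because the Iceberg sink commits a new table snapshot on each successful Flink checkpoint, the commit interval (and therefore Gold-layer freshness) is bounded below by the checkpoint interval. A sketch of the relevant Flink SQL session setting, with an illustrative value:

```sql
-- The Iceberg Flink sink commits a snapshot per successful checkpoint,
-- so the checkpoint interval sets the floor on table freshness.
SET 'execution.checkpointing.interval' = '1 min';
```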

Trade-offs

Wins:

  • End-to-end latency collapses from hours/days to seconds/minutes.
  • Compute amortised continuously rather than in batch spikes at job-schedule times.
  • Backpressure-aware — stream processors natively handle upstream rate variation; batch schedulers do not.

Costs:

  • Operational complexity — Flink/Spark streaming jobs require checkpoint management, state-backend tuning, watermark reasoning, restart / replay design. Much higher operator skill floor than "dbt run + Airflow cron".
  • Always-on compute cost — streaming jobs run continuously; batch jobs consume resources only during their scheduled window.
  • Reprocessing friction — re-running a batch job is "just run it again"; re-running a streaming job requires carefully coordinating Iceberg snapshot rollback / replay.
  • Schema evolution on running jobs — Flink jobs holding state across schema changes need careful migration; batch jobs simply pick up the new schema on their next run.
  • Hybrid architectures get complex — teams often end up with streaming for latency-sensitive Gold tables and batch for the rest, doubling the operational surface (see patterns/hybrid-batch-streaming-ingestion).

Seen in
