PATTERN Cited by 1 source

Stream processor for real-time medallion transitions

Problem. The canonical Medallion Architecture implements Bronze→Silver→Gold transitions as scheduled batch jobs — SQL transformations authored in dbt, orchestrated by Apache Airflow (or Dagster, Prefect). Simple, operator-familiar, correct. But there's a structural cost: latency accumulates at every hop.

Redpanda names it explicitly:

"[While] these ELT/ETL tools do a great job transforming Iceberg tables between layers, they typically run as scheduled batch jobs, adding a noticeable delay to the insights derived at the Gold layer. From a business perspective, the only drawback of this 'multi-hop' architecture is the increased time to value, as it takes time for the data to traverse several layers." (Source: sources/2025-01-21-redpanda-implementing-the-medallion-architecture-with-redpanda)

A row that landed in Bronze at 10:00 is not in Gold until the next Silver job runs (say, hourly) and then the next Gold job (say, daily). End-to-end latency: up to 24+ hours. For dashboards or features that need near-real-time freshness, that's a non-starter.

Pattern. Replace the scheduled batch transformations between medallion tiers with streaming jobs reading and writing open-table-format tables continuously.

Canonical implementation: Apache Flink with Iceberg's Flink connector. A Flink job reads the Bronze Iceberg table as a continuous stream (Iceberg incremental reads on new snapshots), applies transformations (filters, joins, enrichments, windowed aggregations), and writes to the Silver Iceberg table via Flink's Iceberg sink connector. A second Flink job does the same for Silver→Gold.
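As a sketch, the Bronze→Silver hop can be expressed in Flink SQL. Table and column names here are hypothetical; the OPTIONS hint uses the Iceberg Flink connector's streaming-read options:

```sql
-- Continuous Bronze → Silver transformation (illustrative names).
-- The hint enables Iceberg's incremental streaming read, polling for
-- new snapshots every 30 seconds.
INSERT INTO lakehouse.silver_events
SELECT
  event_id,
  user_id,
  LOWER(event_type) AS event_type,
  event_ts
FROM lakehouse.bronze_events
  /*+ OPTIONS('streaming'='true', 'monitor-interval'='30s') */
WHERE event_id IS NOT NULL;
```

Once submitted, the job runs indefinitely: each new Bronze snapshot is picked up, transformed, and committed to Silver, rather than waiting for a scheduler tick.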

Alternatives at the same architectural role:

  • Apache Spark Structured Streaming with Iceberg connectors.
  • Streaming databases that natively support Iceberg source/sink (Materialize, RisingWave).
  • Kafka-native stream-processing libraries (Kafka Streams, ksqlDB) when the tier transitions can happen on a Kafka topic before the Iceberg sink commits — pairs well with Iceberg topics at the Bronze tier.

When to use

  • Gold-tier data powers real-time dashboards, alerts, or online ML inference where minute-scale (not day-scale) freshness matters.
  • Bronze source is already a streaming substrate (Kafka / Redpanda Iceberg topic / CDC log) — the upstream is naturally continuous.
  • The transformations are stateless or small-state (filter, project, enrich from reference data, simple aggregates). Heavy stateful operations (large windowed joins, complex deduplication) push Flink's operational cost up sharply.
  • The team has stream-processing operational experience (or is willing to build it).

When NOT to use

  • Data freshness requirements are loose (hours-to-days acceptable). Scheduled batch dbt / Airflow is simpler and cheaper.
  • Transformations rely on reference data that only updates daily. Batch alignment is natural; streaming is overkill.
  • Team has dbt / SQL expertise but no Flink / Spark expertise. Streaming-ETL operational burden is non-trivial — checkpoint management, state-backend tuning, restart / replay semantics, watermark / event-time correctness.
  • Correctness guarantees need ad-hoc reprocessing flexibility. Batch pipelines let you just re-run; streaming pipelines need careful replay design (Iceberg's snapshot versioning helps here but doesn't remove the burden).
  • Transformations involve complex multi-source joins that aren't naturally expressible with streaming semantics. Batch SQL is more forgiving than streaming SQL here.
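On the replay point: Iceberg's snapshot versioning supplies the raw mechanism, but coordinating the rollback and the restart is still on the operator. A hedged sketch, with hypothetical table names and snapshot ids:

```sql
-- 1. Roll the Silver table back to a known-good snapshot
--    (Iceberg stored procedure, callable e.g. from Spark SQL).
CALL lakehouse.system.rollback_to_snapshot('db.silver_events', 4358109269773732072);

-- 2. Restart the Bronze → Silver streaming job pinned to the matching
--    Bronze snapshot, via the Iceberg Flink source's read option.
INSERT INTO lakehouse.silver_events
SELECT event_id, user_id, event_type, event_ts
FROM lakehouse.bronze_events
  /*+ OPTIONS('streaming'='true', 'start-snapshot-id'='8885588747946323176') */
WHERE event_id IS NOT NULL;
```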

Composition

This pattern sits downstream of patterns/streaming-broker-as-lakehouse-bronze-sink in a fully streaming-native medallion stack:

producers
┌────────────────────────┐
│ Streaming broker with  │   ← streaming-broker-as-lakehouse-bronze-sink
│ Iceberg topic (Bronze) │     pattern: broker writes Iceberg directly
└─────────┬──────────────┘
          │ Iceberg incremental read
    ┌──────────────┐
    │ Flink job    │        ← this pattern (Bronze → Silver transition)
    └─────┬────────┘
          │ Iceberg sink
┌────────────────────────┐
│  Silver Iceberg table  │
└─────────┬──────────────┘
          │ Iceberg incremental read
    ┌──────────────┐
    │ Flink job    │        ← this pattern (Silver → Gold transition)
    └─────┬────────┘
          │ Iceberg sink
┌────────────────────────┐
│   Gold Iceberg table   │   ← queried by BI / ML / dashboards
└────────────────────────┘
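The Silver→Gold job in the diagram is typically a small-state aggregation. A Flink SQL sketch with hypothetical names, assuming `event_ts` is declared as the Silver table's watermarked event-time attribute:

```sql
-- Illustrative Silver → Gold tumbling-window aggregate.
INSERT INTO lakehouse.gold_event_counts
SELECT
  window_start,
  event_type,
  COUNT(*) AS event_count
FROM TABLE(
  TUMBLE(TABLE lakehouse.silver_events, DESCRIPTOR(event_ts), INTERVAL '1' MINUTE))
GROUP BY window_start, window_end, event_type;
```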

The result is end-to-end medallion latency on the order of the Iceberg commit interval (seconds-to-minutes), versus batch-ELT latency on the order of the slowest scheduled job (hours-to-days).
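Because the Iceberg sink commits a new table snapshot on each successful Flink checkpoint, the commit interval (and therefore Gold-layer freshness) is bounded below by the checkpoint interval. A sketch of the relevant Flink SQL session setting, with an illustrative value:

```sql
-- The Iceberg Flink sink commits a snapshot per successful checkpoint,
-- so the checkpoint interval sets the floor on table freshness.
SET 'execution.checkpointing.interval' = '1 min';
```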

Trade-offs

Wins:

  • End-to-end latency collapses from hours/days to seconds/minutes.
  • Compute amortised continuously rather than in batch spikes at job-schedule times.
  • Backpressure-aware — stream processors natively handle upstream rate variation; batch schedulers do not.

Costs:

  • Operational complexity — Flink/Spark streaming jobs require checkpoint management, state-backend tuning, watermark reasoning, restart / replay design. Much higher operator skill floor than "dbt run + Airflow cron".
  • Always-on compute cost — streaming jobs run continuously; batch jobs consume resources only during their scheduled window.
  • Reprocessing friction — re-running a batch job is "just run it again"; re-running a streaming job requires carefully coordinating Iceberg snapshot rollback / replay.
  • Schema evolution on running jobs — Flink jobs holding state across schema changes need careful migration; batch jobs simply pick up the new schema on their next run.
  • Hybrid architectures get complex — teams often end up with streaming for latency-sensitive Gold tables and batch for the rest, doubling the operational surface (see patterns/hybrid-batch-streaming-ingestion).

Seen in
