CONCEPT Cited by 1 source
Continuous-computation convergence (batch + streaming)¶
Definition¶
Continuous-computation convergence is the observation that the decades-long distinction between batch and streaming data processing collapses once:
- Data warehouses and lakehouses agree on an open table format (settling on Apache Iceberg in 2024), and
- Query engines absorb the temporal complexity (late-arriving events, system-time vs event-time, windowing, watermarking) that was historically forced on the operator.
The result: you use the lakehouse for backfill and the low-latency stream for tailing iterators after you are caught up, and the batch/streaming choice becomes a query-engine internal concern rather than an architectural decision the application has to make.
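The backfill-then-tail pattern can be sketched as a single iterator that drains historical records from the lakehouse and then switches to the live stream. A minimal sketch — `continuous_read`, the `(offset, record)` pairs, and the handoff logic are hypothetical stand-ins, not any specific engine's API:

```python
from typing import Iterable, Iterator, Tuple

def continuous_read(
    lakehouse_scan: Iterable[Tuple[int, str]],  # (offset, record) pairs from the table
    stream_tail: Iterable[Tuple[int, str]],     # (offset, record) pairs from the broker
) -> Iterator[str]:
    """Yield every record exactly once: historical first, then live.

    The application sees one continuous iterator; whether a record came
    from the lakehouse (batch) or the broker (stream) is an internal detail.
    """
    last_offset = -1
    for offset, record in lakehouse_scan:       # backfill phase
        last_offset = offset
        yield record
    for offset, record in stream_tail:          # caught up: switch to tailing
        if offset <= last_offset:               # skip overlap with the backfill
            continue
        yield record

# Example: history covers offsets 0-2, the stream replays from offset 2 onward.
history = [(0, "a"), (1, "b"), (2, "c")]
live = [(2, "c"), (3, "d"), (4, "e")]
print(list(continuous_read(history, live)))     # ['a', 'b', 'c', 'd', 'e']
```

The deduplication on the offset boundary is what makes the batch/stream seam invisible to the application.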
Canonical statement on the wiki¶
Alex Gallego (Redpanda founder) walks the 20-year batch → streaming → convergence trajectory and attributes the inflection to the Iceberg handshake (Source: Gallego 2025-04-03):
"No one cares about batch and streaming differences today. For many years, we debated the mental model that streaming was a superset of batch once you added a cron-like timer callback. Last year, from customer pressure, the largest data warehouse products settled on Iceberg as an open-source data format. With that handshake, the world began thinking about data as a continuous computation. Use the lakehouse for backfill and the low latency stream for tailing iterators after you are caught up, but fundamentally push the complexity to the compute engines."
Why batch persisted for so long¶
Gallego's framing: batch was a historical artefact driven by the absence of two things —
- Mental tools for incremental computation — specifically differential dataflow and DBSP, which give developers tractable reasoning about the output-delta of a continuous computation given an input-delta.
- Intuitive industrial streaming implementations that let engineers reason about continuous dataflow at the same comfort level as Unix pipes (his framing: "Batch was the mental extension of the Unix pipe. Simple stages that one could reason about separately and debug. A simple mental model that scaled…slowly.").
Once brokers like Redpanda and query engines like Flink / Databricks / Snowflake closed both gaps, the convergence followed naturally.
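The output-delta-from-input-delta idea can be illustrated at toy scale with an incremental per-key SUM — a sketch of the differential-dataflow / DBSP mental model, not either system's actual API:

```python
from typing import Dict, List, Tuple

def apply_delta(state: Dict[str, int], delta: List[Tuple[str, int]]) -> Dict[str, int]:
    """Incremental per-key SUM: fold an input-delta into the running state
    and return only the keys whose output changed.

    The point of incremental computation: cost is proportional to the
    size of the delta, never to the size of the accumulated history.
    """
    changed = {}
    for key, value in delta:
        state[key] = state.get(key, 0) + value
        changed[key] = state[key]
    return changed

totals: Dict[str, int] = {}
apply_delta(totals, [("eu", 10), ("us", 7)])    # initial batch of events
out = apply_delta(totals, [("eu", -3)])         # late retraction arrives
print(out)     # {'eu': 7}  -- only the affected key is touched
print(totals)  # {'eu': 7, 'us': 7}
```

Negative values act as retractions, which is how late corrections flow through a continuous computation without a full recompute.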
The Iceberg handshake specifically¶
The specific 2024 event Gallego cites: the major warehouse vendors (Databricks, Snowflake, BigQuery) converging on Iceberg as the open-source data format. Concrete mechanism on the streaming side: Redpanda's Iceberg topics — "elevated the mental model of what used to be a bag of bytes to one with schemas — and evolutions — that could be projected and consumed by warehouse technologies like BigQuery, Databricks, and Snowflake." See the 2025-01-21 Medallion post for the full mechanism.
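The "bag of bytes to schema" elevation can be sketched as projecting a raw broker payload into a typed row. This is a toy illustration, not Redpanda's implementation: real Iceberg topics use a registered schema and handle evolution (added or renamed columns), while this sketch hard-codes one schema version:

```python
import json
from dataclasses import dataclass

@dataclass
class ClickEvent:
    """Hypothetical topic schema, version 1."""
    user_id: str
    url: str
    ts_ms: int

def project(raw: bytes) -> ClickEvent:
    """Turn a broker payload (a 'bag of bytes') into a typed row that a
    warehouse-side table scan could consume directly."""
    obj = json.loads(raw)
    return ClickEvent(user_id=obj["user_id"], url=obj["url"], ts_ms=obj["ts_ms"])

row = project(b'{"user_id": "u1", "url": "/home", "ts_ms": 1700000000000}')
print(row.user_id, row.url)   # u1 /home
```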
Complexity does not disappear — it relocates¶
Gallego's honest acknowledgment that the temporal-world problems haven't gone away:
"the world is not nicely delineated hourly event-horizons where you simply accumulate the costs for your accounting systems and et-voilà. The world is filled with complexities of time, from late arriving events to the difference between system times and event times, to byzantine systems or attackers that will try to temper the authenticity of messages with manually curated timestamps to reorder sequencing."
The claim is that these complexities are pushed into the compute engines — not eliminated — so the application developer and the infrastructure operator see a unified interface.
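What "pushed into the compute engines" looks like concretely: event-time windowing with a watermark and bounded lateness. A minimal sketch of the standard technique (window size, lateness bound, and class names are illustrative, not any engine's API):

```python
from collections import defaultdict

WINDOW_MS = 60_000              # 1-minute tumbling windows
ALLOWED_LATENESS_MS = 30_000    # late events tolerated for 30s past window end

def window_start(event_ts: int) -> int:
    return event_ts - event_ts % WINDOW_MS

class TumblingCounter:
    """Event-time tumbling-window counts with a watermark.

    Events are bucketed by *event* time regardless of arrival (system) time;
    a window is finalised only when the watermark passes its end plus the
    allowed lateness, so moderately late events are still counted.
    """
    def __init__(self):
        self.windows = defaultdict(int)   # open windows: start -> count
        self.watermark = 0                # tracks observed arrival time
        self.emitted = {}                 # finalised windows

    def on_event(self, event_ts: int, arrival_ts: int):
        self.watermark = max(self.watermark, arrival_ts)
        start = window_start(event_ts)
        if start + WINDOW_MS + ALLOWED_LATENESS_MS <= self.watermark:
            return                        # too late: dropped (or a side output)
        self.windows[start] += 1
        self._emit_closed()

    def _emit_closed(self):
        for start in list(self.windows):
            if start + WINDOW_MS + ALLOWED_LATENESS_MS <= self.watermark:
                self.emitted[start] = self.windows.pop(start)

op = TumblingCounter()
op.on_event(event_ts=10_000, arrival_ts=12_000)   # on time, window [0, 60s)
op.on_event(event_ts=50_000, arrival_ts=85_000)   # late but within lateness: counted
op.on_event(event_ts=70_000, arrival_ts=100_000)  # advances watermark, closes [0, 60s)
op.on_event(event_ts=5_000, arrival_ts=101_000)   # beyond lateness: dropped
print(op.emitted)   # {0: 2}
```

The application just reads finalised windows; which events were late, and by how much, stays inside the operator.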
Consequences for wiki-adjacent patterns¶
- patterns/streaming-broker-as-lakehouse-bronze-sink — the concrete realisation: the streaming broker itself serves as the lakehouse Bronze tier, no external ETL.
- concepts/medallion-architecture — the Bronze/Silver/Gold framing works identically whether transitions are batch-dbt or streaming-Flink.
- concepts/elt-vs-etl — convergence means ELT-in-the-warehouse and streaming-ETL are the same story from the application's perspective.
Seen in¶
- sources/2025-04-03-redpanda-autonomy-is-the-future-of-infrastructure — Gallego's 20-year-arc framing canonicalising the batch/streaming convergence as the prerequisite for agent-era infrastructure (code-in-control needs unified historical + real-time access via a single tool).