Redpanda — Build a real-time lakehouse architecture with Redpanda and Databricks¶
Summary¶
Tech-talk recap post (unsigned Redpanda author; ~1,100 words) summarising a joint Redpanda/Databricks session "From Stream to Table: Building a real-time lakehouse architecture with Redpanda and Databricks" with speakers Matt Schumpert (Redpanda) and Jason Reed (Databricks, formerly on Netflix's data team). The post walks the historical arc that produced today's open lakehouse architecture — Apache Hadoop-era data lakes → governance sprawl → Apache Iceberg (Netflix-originated) → early file-based catalogs → Iceberg REST catalog standardisation → broker-native Iceberg Topics → integration with Databricks Unity Catalog — and reframes Redpanda's broker-native Iceberg primitive and Databricks' governance catalog as the two halves of a single real-time lakehouse substrate. Two load-bearing verbatim slogans canonicalise wiki-already-covered primitives at joint-vendor altitude: Schumpert — "The goal of this partnership is to remove the artificial line between real-time data and analytical data" and Redpanda unsigned — "the stream is the table" + "Streaming data is analytics-ready by default". Reed supplies the Netflix-as-origin-of-Iceberg disclosure with the architectural framing "Iceberg provides a foundation that looks and behaves like a warehouse table, while remaining open and cloud-native" and the integration claim "The data shows up already structured, already governed, and already queryable." Borderline Tier-3 include: architecture content ~50% of the body, zero production numbers, heavy marketing-link density (tech-talk promotion, Iceberg Topics use-case page, Unity Catalog product page, documentation cross-links). Passes scope on historical-framing + Netflix-origin-disclosure + joint-vendor-framing grounds rather than new vocabulary — every primitive named is already canonicalised on the wiki (Iceberg, REST catalog, Iceberg topic, Unity Catalog, streaming-broker-as-Bronze-sink pattern, broker-native catalog registration pattern all pre-exist).
Zero net-new concepts / patterns / systems.
Key takeaways¶
- Historical arc of the open lakehouse (Hadoop-era problem → Iceberg solution). Verbatim pedagogy: "In the Apache Hadoop® era, data lakes had major advantages over traditional data warehouses. They enabled schema-on-read, flexible ELT workflows, and support for multi-structured data, all while significantly lowering costs with cloud object storage." followed by the governance-sprawl problem: "as adoption grew, so did complexity. Sprawl became a serious challenge, and multiple teams operating on the same datasets introduced issues around governance and reliability." Netflix named verbatim as the source: "these challenges ultimately led to the creation of Apache Iceberg—initially developed internally at Netflix and later open-sourced as an Apache project." Extends systems/apache-iceberg with the Netflix-origin disclosure at Databricks-speaker altitude (prior wiki coverage had the 2017 open-source date but not the Netflix provenance named by a former-Netflix-employee now at Databricks).
- Iceberg REST Catalog centralises governance at the catalog layer. Verbatim: "Early Iceberg catalogs were often implemented as collections of files stored directly in object storage. But with many users, workloads, and vendors creating and managing Iceberg tables across shared object storage, metadata sprawl and governance gaps were becoming dire." The REST catalog protocol is then framed as the "single control plane" with three verbatim responsibilities: "Managing permissions and access control. Coordinating concurrent reads and writes. Dynamically granting engines access to data at runtime." Extends concepts/iceberg-catalog-rest-sync with the governance-framing altitude (the prior canonical source at 2025-04-07 framed REST catalog sync as the transport mechanism; this post frames it as the governance endpoint from the catalog-consumer side). Interoperability emphasised verbatim: "Different platforms, written in different languages and running in different environments, can exchange metadata and enforce governance by speaking the same protocol."
- "The stream is the table" (canonical slogan for concepts/iceberg-topic + [[systems/redpanda-iceberg-topics|Redpanda Iceberg Topics]]). Full section header: "Redpanda Iceberg Topics: the stream is the table." Function verbatim: "Redpanda's Iceberg Topics allow you to store topic data in the cloud in the Iceberg open table format, so you can query real-time data while it's still streaming. This grants you instant analytics on the freshest data without the complexities of traditional ETL processes." Zero-ETL alternatives-displaced claim verbatim: "Before Iceberg Topics, making streaming data available in a lakehouse typically required significant manual effort. Teams either built custom ETL jobs using frameworks like Spark or deployed heavyweight connector architectures to move data from streaming platforms into analytical systems. To add insult to injury, these pipelines were often brittle and operationally expensive." Four verbatim wins from teams adopting the shape: "Lower infrastructure costs. Faster time-to-insight. Fewer human hours spent on pipeline maintenance. More free time to build valuable data products and AI applications."
- Unity Catalog as governance hub + Redpanda integrates via Iceberg REST API. Four verbatim governance-hub responsibilities: "Fine-grained access control. Consistent security across workloads. Metadata management and lineage. Easy discovery for downstream users." Reed's framing verbatim: "The data shows up already structured, already governed, and already queryable." Integration-mechanism disclosure verbatim: "Redpanda integrates directly with Unity Catalog using the Iceberg REST API. Through this integration, Redpanda registers Iceberg tables, manages schema updates, deletes tables when necessary, and handles the full lifecycle of the data." This is the canonical patterns/broker-native-iceberg-catalog-registration pattern's Unity-Catalog-specific instance — Redpanda owns table creation, snapshot registration, schema evolution, and table deletion against Unity, not just against Glue/Polaris/BigLake.
- Three-system labour division: Redpanda = real-time performance and reliability; Iceberg = open transactional table format optimised for analytics; Unity Catalog = governance + optimisation + federation + lifecycle. Verbatim: "Redpanda delivers real-time performance and reliability at scale. Iceberg provides an open, transactional table format optimized for analytics. Unity Catalog adds governance, optimization, federation, and lifecycle management across the entire system." The "dual citizenship" framing of the data verbatim: "the data becomes accessible to both worlds: The Apache Kafka® ecosystem continues to consume data in real time. The Iceberg ecosystem gains access to analytics-ready tables that can be queried by any Iceberg-compatible engine connected to Unity Catalog."
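The "dual citizenship" framing in the last takeaway can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: `ToyIcebergTopic` and its methods (`produce`, `consume`, `commit_snapshot`, `scan`) are hypothetical names invented for illustration, and nothing here reflects how Redpanda actually writes data files or Iceberg snapshots. It shows a single append-only log serving both an offset-based stream read and a snapshot-based table read.

```python
from dataclasses import dataclass, field


@dataclass
class ToyIcebergTopic:
    """Toy model (hypothetical): one append-only log read as a stream and as a table."""

    log: list = field(default_factory=list)        # ordered records (stream view)
    snapshots: list = field(default_factory=list)  # committed end offsets (table view)

    def produce(self, record: dict) -> int:
        """Append a record and return its offset."""
        self.log.append(record)
        return len(self.log) - 1

    def consume(self, offset: int) -> list:
        """Stream view: every record at or after `offset`, committed or not."""
        return self.log[offset:]

    def commit_snapshot(self) -> int:
        """Publish everything produced so far as a new table snapshot."""
        self.snapshots.append(len(self.log))
        return len(self.snapshots) - 1

    def scan(self, snapshot_id: int = -1) -> list:
        """Table view: rows visible in a snapshot (latest by default)."""
        if not self.snapshots:
            return []
        return self.log[: self.snapshots[snapshot_id]]


topic = ToyIcebergTopic()
topic.produce({"user": "a", "amount": 10})
topic.produce({"user": "b", "amount": 5})
topic.commit_snapshot()
topic.produce({"user": "a", "amount": 7})

stream_records = topic.consume(0)  # stream readers see all three records
table_rows = topic.scan()          # table readers see the two committed rows
total = sum(row["amount"] for row in table_rows)
```

In this toy model the only difference between the two views is the snapshot boundary: stream consumers see a record as soon as it is produced, while table readers see it after the next commit. The post itself gives no figure for how often commits happen.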
Systems named¶
- systems/redpanda — broker, Kafka-API-compatible.
- systems/redpanda-iceberg-topics — broker-native Iceberg table primitive; the "stream is the table" feature.
- systems/databricks — analytics / AI compute layer.
- systems/unity-catalog — Databricks' governed catalog; named in-post as the centralised governed location for the table view of streaming data.
- systems/apache-iceberg — open table format; Netflix origin disclosed in-post by Jason Reed.
- systems/apache-parquet — implicit columnar substrate.
Concepts extracted¶
- concepts/data-lakehouse — the post's organising architectural class; walks the historical convergence of lake + warehouse into the lakehouse.
- concepts/open-table-format — Iceberg as the standardised substrate that makes the convergence possible.
- concepts/iceberg-topic — the "stream is the table" primitive.
- concepts/iceberg-catalog-rest-sync — REST catalog protocol and its governance / interoperability properties.
- concepts/medallion-architecture — implicit downstream organisation pattern (post focuses on Bronze sink; doesn't enumerate Silver / Gold tiers explicitly).
Patterns extracted¶
- patterns/streaming-broker-as-lakehouse-bronze-sink — the "stream is the table" slogan is the architectural consequence of this pattern at its cleanest framing.
- patterns/broker-native-iceberg-catalog-registration — "Redpanda registers Iceberg tables, manages schema updates, deletes tables when necessary, and handles the full lifecycle" is the verbatim description of this pattern against Unity Catalog.
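The lifecycle operations quoted in the registration pattern can be lined up against the table routes of the open Iceberg REST catalog specification. This is a minimal sketch, assuming the spec's standard `/v1/{prefix}/namespaces/{namespace}/tables` layout and a single-level namespace. The post does not say which routes Redpanda calls against Unity Catalog, and the helper `lifecycle_routes` plus the `main`, `streaming`, and `orders` values are hypothetical. The code only builds (method, path) pairs and sends no requests.

```python
from urllib.parse import quote


def lifecycle_routes(prefix: str, namespace: str, table: str) -> dict:
    """Map table-lifecycle steps to (HTTP method, path) pairs.

    Assumption: route layout follows the open Iceberg REST catalog spec;
    none of this is taken from the post.
    """
    # `prefix` is whatever the catalog's config endpoint hands back; used as-is.
    tables = f"/v1/{prefix.strip('/')}/namespaces/{quote(namespace, safe='')}/tables"
    one_table = f"{tables}/{quote(table, safe='')}"
    return {
        "create_table": ("POST", tables),       # register a new table
        "load_table": ("GET", one_table),       # read current table metadata
        "commit_update": ("POST", one_table),   # schema changes, new snapshots
        "drop_table": ("DELETE", one_table),    # delete the table
    }


routes = lifecycle_routes("main", "streaming", "orders")
for step, (method, path) in routes.items():
    print(f"{step:14s} {method:6s} {path}")
```

Authentication, credential vending, and any Unity-Catalog-specific base URL are left out on purpose, since the post does not describe them.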
Operational numbers¶
None disclosed. The post is purely qualitative. No throughput, no latency, no customer counts, no commit-interval target, no snapshot-expiry cadence, no fleet sizes.
Historical arc¶
Pre-Hadoop data-warehouse era → Hadoop + HDFS data-lake era (schema-on-read, ELT, cheap object storage; governance-weak) → Iceberg era (Netflix-originated, later open-sourced; table semantics atop immutable object storage) → file-based-catalog era (early Iceberg deployments embedded catalog in object storage; metadata sprawl and governance gaps) → REST catalog era (standardised HTTP API for catalog governance + interoperability across engines) → broker-native integration (Iceberg Topics make the stream = the table with zero ETL) → joint governance (Unity Catalog federates, optimises, and governs).
The arc collapses two prior wiki-separate historical tracks:
- The warehouse → lake → lakehouse track canonicalised on concepts/data-lakehouse.
- The Iceberg file-catalog → REST-catalog track canonicalised on concepts/iceberg-catalog-rest-sync + concepts/iceberg-file-based-catalog.
This post's framing is the first on the wiki that brackets the two tracks together as a single arc ending in the stream-is-the-table move.
Caveats¶
- Tech-talk recap voice, not architectural retrospective. Pitched to the Redpanda blog audience; full session deferred to video ("The full talk is free to watch, but if you're more of a skimmer, this post covers the key moments").
- Marketing-link density. Heavy cross-linking to Redpanda product pages (Iceberg Topics use-case page, documentation index) and Databricks product pages (Unity Catalog, the Databricks home). "Iceberg Topics are now generally available in Redpanda Cloud across AWS, GCP, and Azure, as well as in Self-Managed and BYOC deployments" is product-promotion voice, not architecture disclosure.
- Zero production numbers — no throughput, no latency, no commit-interval distribution, no customer fleet sizes, no before/after quantitative wins on the migration from the Spark ETL / connector pipelines the post says Iceberg Topics displace.
- Pedagogy-level historical arc. The Hadoop-era framing is textbook; the file-catalog-to-REST-catalog evolution is framed as a clean progression without walking through the 2021-2023 JDBC catalog / Hive Metastore / Nessie / intermediate shapes.
- Netflix-origin disclosure is at speaker attribution altitude only. Jason Reed's "having worked on the data team at Netflix" is the architectural-provenance claim; no Netflix engineering blog post is cited, no founding-team name is given, no internal pre-Iceberg system name is disclosed.
- REST catalog responsibilities list is generic. Permissions / concurrency / credential-delegation are named in the same bullets the 2025-04-07 GA post covered; this post re-frames them in governance-vendor voice but adds no mechanism depth.
- Unity Catalog responsibilities list is product-tour altitude. Fine-grained ACL / consistent security / metadata + lineage / discovery are named without mechanism (how does lineage get captured across a streaming-produced table? not disclosed).
- "Open standards" framing. Iceberg REST catalog is standard; Unity Catalog's federation / optimisation / lifecycle-management functions are Databricks-proprietary extensions. Post doesn't draw the line.
- "Dead letter tables" + "Partitioning, performance, and optimizing queries" + "Handling dirty data" all mentioned as topics the video covers but not unpacked in the blog post. These link to canonical wiki pages (patterns/dead-letter-queue-for-invalid-records, concepts/iceberg-file-based-catalog partitioning, systems/redpanda-iceberg-topics workload management) but the blog itself elides them.
- Alex Gallego's "the goal is to remove the artificial line" sibling framing is absent — it's Matt Schumpert delivering the equivalent slogan on behalf of Redpanda here.
- No comparison to Databricks' own Delta Lake format as an alternative open table format. Delta Sharing / Delta Live Tables / Delta Lake not engaged despite Databricks being a speaker — post is pure Iceberg-route framing.
- No byline (Redpanda default attribution).
Cross-source continuity¶
- Companion to sources/2025-04-07-redpanda-251-iceberg-topics-now-generally-available|2025-04-07 Iceberg Topics GA — that post canonicalises the mechanism (9 named GA-grade properties, OIDC+TLS REST catalog sync, DLQ, snapshot expiry, transactional writes); this post re-frames the same primitive ~9 months later at a joint-Databricks narrative altitude with the "stream is the table" slogan, Netflix-origin-of-Iceberg disclosure, and Unity-Catalog-specific integration framing. Purely additive at the slogan / historical-arc level.
- Companion to sources/2025-01-21-redpanda-implementing-the-medallion-architecture-with-redpanda|2025-01-21 Medallion architecture post — that post walked the Bronze-sink mechanism with Iceberg-topic configuration; this post frames the same substrate as "streaming data is analytics-ready by default" at the joint-vendor altitude. The Medallion architecture is implicit here; Bronze is the only tier walked.
- Companion to sources/2025-06-24-redpanda-why-streaming-is-the-backbone-for-ai-native-data-platforms|2025-06-24 streaming-backbone essay — that post named Iceberg + Apache Polaris as the open-format escape from warehouse lock-in and Snowpipe Streaming as the proprietary-format alternative. This post frames the same wiring with Unity Catalog in the governance-hub role and omits the Polaris / Snowpipe comparison entirely.
- Companion to sources/2025-11-06-redpanda-253-delivers-near-instant-disaster-recovery-and-more|2025-11-06 Redpanda 25.3 release preview — that post added Google BigLake as the fourth managed REST catalog Iceberg Topics integrates with (Unity / Polaris / Glue / BigLake). This post is newer by ~2 months (~2026-01-06 vs ~2025-11-06 as published) yet does not disclose the BigLake addition: the three-catalog matrix named here is Unity / Polaris / Glue. The post was either written before the 25.3 release or is intentionally Databricks-focused and omits BigLake.
- Complements Databricks-side wiki coverage. No prior Databricks-authored source on the wiki has engaged Redpanda directly; this post is the first ingest in which a Databricks speaker (Jason Reed, formerly Netflix data team) appears as a cited architectural voice on the Iceberg origin and the Redpanda integration.
Source¶
- Original: https://www.redpanda.com/blog/real-time-lakehouse-databricks-iceberg
- Raw markdown: raw/redpanda/2026-01-06-build-a-real-time-lakehouse-architecture-with-redpanda-and-d-d0f4cfd7.md
Related¶
- systems/redpanda-iceberg-topics — canonical Iceberg Topics system page; this post's "stream is the table" slogan is the cleanest articulation of the primitive.
- systems/unity-catalog — governance hub on the Databricks side of the integration.
- systems/databricks — analytics / AI compute layer.
- systems/apache-iceberg — open table format; Netflix origin disclosed here.
- concepts/iceberg-catalog-rest-sync — REST catalog protocol framed at governance-endpoint altitude.
- concepts/iceberg-topic — the "stream is the table" primitive.
- concepts/data-lakehouse — the post's organising architectural class.
- patterns/streaming-broker-as-lakehouse-bronze-sink — Bronze-sink architectural pattern this post's Redpanda+Unity framing instantiates.
- patterns/broker-native-iceberg-catalog-registration — Unity-Catalog-specific instance of the broker-owns-catalog pattern.
- companies/redpanda — company page.