
CONCEPT Cited by 4 sources

Iceberg topic

An Iceberg topic is a streaming-broker construct — introduced by Redpanda in 2024 — in which a single logical entity is both a Kafka-protocol topic and an Apache Iceberg table backed by the same data. Producers write records to the topic via the normal Kafka producer API; the broker transparently:

  1. Appends the records to the broker's local log segments (normal topic durability).
  2. Projects the row-oriented records into a columnar file format (Parquet).
  3. Writes the Parquet files to external object storage (S3 / GCS / ADLS).
  4. Updates the external Iceberg catalog (Databricks Unity, Snowflake Polaris, AWS Glue, etc.) with the new snapshot metadata.

Downstream open-table-format-aware engines — ClickHouse, Snowflake, Databricks, Dremio, Trino, Spark, Flink — can then query the topic data as an Iceberg table without any ETL job in between.

"Redpanda's newly launched feature, topic-level integration with Iceberg, enables creating Iceberg tables from your topics — storing row-oriented data in a Redpanda topic inside object storage in a columnar format, supporting the Iceberg table format." (Source: sources/2025-01-21-redpanda-implementing-the-medallion-architecture-with-redpanda)

Properties

  • Zero-ETL integration. The broker owns the integration: no Kafka Connect / Redpanda Connect / Debezium to configure, no Python-on-Airflow jobs to write. Topic→Iceberg is a configuration flip.
  • Metadata preservation. The Iceberg tables carry the Kafka-layer technical metadata (offset, partition, timestamp) as columns, preserving the audit trail the streaming layer already had.
  • Schema optional. Tables can be created with or without a schema. Without one, Redpanda uses a default (record key + value + timestamp + offset + partition). With one, the Iceberg table is fully typed.
  • External catalog registration. Tables get registered with a chosen Iceberg REST catalog (Databricks Unity, Snowflake Polaris); catalog metadata is the integration surface downstream engines use for discovery.
  • Row-to-columnar projection at write time. The broker is responsible for the Parquet conversion; clients continue to see row-oriented Kafka semantics.

Architectural role: streaming broker as lakehouse Bronze layer

Iceberg topics realise the streaming-broker-as-lakehouse-Bronze-sink pattern: the Bronze tier of a Medallion Architecture becomes a derived view of topics the business is already producing to. No separate Bronze-tier ingestion pipeline exists.

Configuration surface: the three modes

The producer-side schema-projection strategy is a per-topic configuration dial canonicalised as concepts/iceberg-topic-mode: value_schema_id_prefix (typed Iceberg table via Schema Registry wire-format producers), value_schema_latest (latest-schema projection), key_value (schema-less BYTES columns + Kafka metadata). Mode selection is orthogonal to the catalog-integration shape (Source: sources/2025-05-13-redpanda-getting-started-with-iceberg-topics-on-redpanda-byoc).
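
A hedged sketch of the configuration dial: `redpanda.iceberg.mode` is the per-topic property the BYOC guide names; the topic name here is invented for the example, and `rpk` flag syntax may vary by version.

```shell
# Create a topic that is also a typed Iceberg table (requires
# Schema Registry wire-format producers for this mode).
rpk topic create clicks -c redpanda.iceberg.mode=value_schema_id_prefix

# Or flip an existing topic to the schema-less projection
# (BYTES key/value columns plus Kafka metadata).
rpk topic alter-config clicks --set redpanda.iceberg.mode=key_value
```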

GA feature surface (Redpanda 25.1, 2025-04-07)

The 2025-04-07 Redpanda 25.1 release promoted Iceberg topics to GA across AWS, Azure, and GCP, disclosing nine named properties beyond the 2025-01 pedagogy framing. Four address table management:

  • Custom hierarchical bucketed partitioning — operator-controllable Iceberg partition transforms for query-side pruning.
  • Built-in dead-letter queues for invalid records (patterns/dead-letter-queue-for-invalid-records) — schema-invalid records route to a DLQ topic rather than dropping the batch.
  • Seamless Iceberg-spec-compliant schema evolution — add, rename, delete column operations match the Iceberg specification's evolution surface. Retires the pre-GA "schema-evolution path" caveat below for the Iceberg side of the loop (Kafka-serializer interaction is still operator domain).
  • Automatic snapshot expiry — the broker owns the metadata GC loop, bounding catalog metadata growth. Retires the pre-GA "compaction + GC ownership unclear" caveat for the snapshot-expiry half of the loop; small-file compaction ownership remains open.

Five address catalog integration:

  • Secure REST catalog sync — OIDC+TLS authentication to Iceberg REST catalogs (Snowflake Open Catalog / Apache Polaris, Databricks Unity, AWS Glue).
  • Transactional writes via Iceberg's commit-protocol serialisation for safe concurrent multi-writer access.
  • Automatic table discovery — Iceberg-configured topics auto-register as catalog tables; downstream engines see new tables without manual CREATE TABLE (patterns/broker-native-iceberg-catalog-registration).
  • Built-in object-store catalog fallback when no REST catalog is available.
  • Tunable workload management — operator knob for the snapshot-vs-live-topic lag ceiling, making the commit-cadence lag floor an explicit operational parameter.
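
The catalog-integration surface is, like the topic dial, a cluster-level configuration flip. The sketch below is illustrative only: `rpk cluster config set` is the generic mechanism, but the property names and values shown are assumptions standing in for whatever Redpanda's product documentation specifies.

```shell
# ILLUSTRATIVE ONLY: property names below are assumptions; consult
# Redpanda's documentation for the authoritative configuration set.

# Enable Iceberg materialisation and point the broker at an external
# REST catalog (the OIDC credentials would be configured alongside).
rpk cluster config set iceberg_enabled true
rpk cluster config set iceberg_catalog_type rest
rpk cluster config set iceberg_rest_catalog_endpoint https://catalog.example.com/api/catalog

# Fallback shape when no REST catalog is available: the broker keeps
# catalog metadata directly in object storage (the built-in fallback
# named at GA).
rpk cluster config set iceberg_catalog_type object_storage
```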

Alternatives displaced

Before Iceberg topics, moving streaming data into a lakehouse had two canonical shapes; the Iceberg-topic primitive subsumes both:

  1. Custom Airflow + Python jobs reading from Kafka, transforming, writing Parquet to S3, updating Iceberg catalog. Redpanda's explicit framing: "required specialized talent to write, test, and maintain them, which is error-prone and time-consuming."
  2. Managed connectors like Redpanda Connect or Kafka Connect with an Iceberg sink. Redpanda's framing: "these systems introduce a middleman to the architecture, requiring you to configure and maintain a separate set of clusters for data integration — extraction, transformation, and routing. Additionally, there's no option for a configuration-based approach to make existing topics available in Iceberg without deploying new code."

Costs / caveats

  • Duplicate storage during retention window. Data lives in both the broker's log segments and in object-storage Parquet files simultaneously until the broker's retention policy expires. Bronze durability on object storage is independent of topic retention, but the dual-write overlap is real.
  • Compaction ownership still unclear post-GA. Redpanda 25.1 internalises snapshot expiry as a broker-owned loop but does not explicitly name small-file compaction (Iceberg's other GC-adjacent loop) as broker-owned. Customers writing high-throughput Iceberg topics may still need an external Spark / Flink compaction job to merge small Parquet files for scan performance. See systems/apache-iceberg for the generic externalisation-cost framing.
  • Latency floor is now an explicit knob, not an open question. The 25.1 GA release discloses "tunable workload management" as the operator dial for the snapshot-vs-live-topic lag ceiling — trading end-to-end freshness for broker-CPU budget on Parquet projection + catalog commits. But the chosen value still sets a floor on the end-to-end latency any downstream Iceberg reader observes.
  • Schema evolution path (Iceberg-spec side resolved; Kafka-serializer side still open). 25.1 supports full Iceberg-spec evolution (adds / renames / deletes), so the Iceberg side of the loop is clean. How Iceberg-topic schema changes interact with Kafka-client serializers (Avro / JSON Schema / Protobuf via a schema registry) is a source of operational complexity still not walked in the pedagogy or GA launch posts.
  • Vendor-specific primitive. Iceberg topics are a Redpanda feature; Apache Kafka does not have an equivalent native primitive in the Kafka protocol as of 2025-04. Downstream portability via the Kafka wire protocol is preserved, but the integration shape is Redpanda-specific.
  • DLQ operational surface under-specified. Built-in DLQ is named at GA but envelope-schema shape, retention defaults, replay tooling, and monitoring recommendations are deferred to product documentation — open operational surface for customers.

Seen in

  • sources/2026-01-06-redpanda-build-a-real-time-lakehouse-architecture-with-redpanda-and-databricks — Canonical "the stream is the table" slogan at joint-Databricks altitude. The cleanest articulation of the Iceberg-topic primitive verbatim: "Redpanda's Iceberg Topics allow you to store topic data in the cloud in the Iceberg open table format, so you can query real-time data while it's still streaming. This grants you instant analytics on the freshest data without the complexities of traditional ETL processes." Matt Schumpert (Redpanda) frames the partnership goal verbatim: "The goal of this partnership is to remove the artificial line between real-time data and analytical data." Four verbatim operational wins: "Lower infrastructure costs. Faster time-to-insight. Fewer human hours spent on pipeline maintenance. More free time to build valuable data products and AI applications." No new mechanism; the primitive is framed at its cleanest slogan altitude.
  • sources/2025-05-13-redpanda-getting-started-with-iceberg-topics-on-redpanda-byoc — BYOC-beta extension of Iceberg Topics. Canonicalises the per-topic mode configuration surface (value_schema_id_prefix / value_schema_latest / key_value) and the file-based catalog option as a sibling to REST catalog sync. Composes with BYOC data ownership when the broker writes to a customer-owned bucket.
  • sources/2025-04-07-redpanda-251-iceberg-topics-now-generally-available — GA release disclosure (Redpanda 25.1, multi-cloud availability). Canonicalises the nine-property feature surface (four table-management, five catalog-integration) that distinguishes GA Iceberg topics from the pre-GA framing. Retires two caveats from the 2025-01-21 pedagogy ingest for the snapshot-expiry and Iceberg-spec-schema-evolution halves of the operational loop.
  • sources/2025-01-21-redpanda-implementing-the-medallion-architecture-with-redpanda — Pre-GA pedagogy launch for the Iceberg-topic primitive. Walks the architectural role (Bronze-tier sink), names the displaced alternatives (custom ETL jobs / Connect clusters), and frames the integration as a configuration flip rather than a code change. Pedagogy altitude; no latency / throughput / cost numbers.