
CONCEPT Cited by 4 sources

Iceberg topic

An Iceberg topic is a streaming-broker construct — introduced by Redpanda in 2024 — in which a single logical entity is both a Kafka-protocol topic and an Apache Iceberg table backed by the same data. Producers write records to the topic via the normal Kafka producer API; the broker transparently:

  1. Appends the records to the broker's local log segments (normal topic durability).
  2. Projects the row-oriented records into a columnar file format (Parquet).
  3. Writes the Parquet files to external object storage (S3 / GCS / ADLS).
  4. Updates the external Iceberg catalog (Databricks Unity, Snowflake Polaris, AWS Glue, etc.) with the new snapshot metadata.

Downstream open-table-format-aware engines — ClickHouse, Snowflake, Databricks, Dremio, Trino, Spark, Flink — can then query the topic data as an Iceberg table without any ETL job in between.

"Redpanda's newly launched feature, topic-level integration with Iceberg, enables creating Iceberg tables from your topics — storing row-oriented data in a Redpanda topic inside object storage in a columnar format, supporting the Iceberg table format." (Source: sources/2025-01-21-redpanda-implementing-the-medallion-architecture-with-redpanda)

Properties

  • Zero-ETL integration. The broker owns the integration: no Kafka Connect / Redpanda Connect / Debezium to configure, no Python-on-Airflow jobs to write. Topic→Iceberg is a configuration flip.
  • Metadata preservation. The Iceberg tables carry the Kafka-layer technical metadata (offset, partition, timestamp) as columns, preserving the audit trail the streaming layer already had.
  • Schema optional. Tables can be created with or without a schema. Without one, Redpanda uses a default (record key + value + timestamp + offset + partition). With one, the Iceberg table is fully typed.
  • External catalog registration. Tables get registered with a chosen Iceberg REST catalog (Databricks Unity, Snowflake Polaris); catalog metadata is the integration surface downstream engines use for discovery.
  • Row-to-columnar projection at write time. The broker is responsible for the Parquet conversion; clients continue to see row-oriented Kafka semantics.

Architectural role: streaming broker as lakehouse Bronze layer

Iceberg topics realise the streaming-broker-as-lakehouse-Bronze-sink pattern: the Bronze tier of a Medallion Architecture becomes a derived view of topics the business is already producing to. No separate Bronze-tier ingestion pipeline exists.

Configuration surface: the three modes

The producer-side schema-projection strategy is a per-topic configuration dial canonicalised as concepts/iceberg-topic-mode: value_schema_id_prefix (typed Iceberg table via Schema Registry wire-format producers), value_schema_latest (latest-schema projection), key_value (schema-less BYTES columns + Kafka metadata). Mode selection is orthogonal to the catalog-integration shape (Source: sources/2025-05-13-redpanda-getting-started-with-iceberg-topics-on-redpanda-byoc).
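
A hedged sketch of the configuration dial: `redpanda.iceberg.mode` is the per-topic property the BYOC guide names; the topic name here is invented for the example, and `rpk` flag syntax may vary by version.

```shell
# Create a topic that is also a typed Iceberg table (requires
# Schema Registry wire-format producers for this mode).
rpk topic create clicks -c redpanda.iceberg.mode=value_schema_id_prefix

# Or flip an existing topic to the schema-less projection
# (BYTES key/value columns plus Kafka metadata).
rpk topic alter-config clicks --set redpanda.iceberg.mode=key_value
```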

GA feature surface (Redpanda 25.1, 2025-04-07)

The 2025-04-07 Redpanda 25.1 release promoted Iceberg topics to GA across AWS, Azure, and GCP, disclosing nine named properties beyond the 2025-01 pedagogy framing. Four address table management:

  • Custom hierarchical bucketed partitioning — operator-controllable Iceberg partition transforms for query-side pruning.
  • Built-in dead-letter queues for invalid records (patterns/dead-letter-queue-for-invalid-records) — schema-invalid records route to a DLQ topic rather than dropping the batch.
  • Seamless Iceberg-spec-compliant schema evolution — add, rename, delete column operations match the Iceberg specification's evolution surface. Retires the pre-GA "schema-evolution path" caveat below for the Iceberg side of the loop (Kafka-serializer interaction is still operator domain).
  • Automatic snapshot expiry — the broker owns the metadata GC loop, bounding catalog metadata growth. Retires the pre-GA "compaction + GC ownership unclear" caveat for the snapshot-expiry half of the loop; small-file compaction ownership remains open.

Five address catalog integration:

  • Secure REST catalog sync — OIDC+TLS authentication to Iceberg REST catalogs (Snowflake Open Catalog / Apache Polaris, Databricks Unity, AWS Glue).
  • Transactional writes via Iceberg's commit-protocol serialisation for safe concurrent multi-writer access.
  • Automatic table discovery — Iceberg-configured topics auto-register as catalog tables; downstream engines see new tables without manual CREATE TABLE (patterns/broker-native-iceberg-catalog-registration).
  • Built-in object-store catalog fallback when no REST catalog is available.
  • Tunable workload management — operator knob for the snapshot-vs-live-topic lag ceiling, making the commit-cadence lag floor an explicit operational parameter.
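
The catalog-integration surface is, like the topic dial, a cluster-level configuration flip. The sketch below is illustrative only: `rpk cluster config set` is the generic mechanism, but the property names and values shown are assumptions standing in for whatever Redpanda's product documentation specifies.

```shell
# ILLUSTRATIVE ONLY: property names below are assumptions; consult
# Redpanda's documentation for the authoritative configuration set.

# Enable Iceberg materialisation and point the broker at an external
# REST catalog (the OIDC credentials would be configured alongside).
rpk cluster config set iceberg_enabled true
rpk cluster config set iceberg_catalog_type rest
rpk cluster config set iceberg_rest_catalog_endpoint https://catalog.example.com/api/catalog

# Fallback shape when no REST catalog is available: the broker keeps
# catalog metadata directly in object storage (the built-in fallback
# named at GA).
rpk cluster config set iceberg_catalog_type object_storage
```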

Alternatives displaced

Before Iceberg topics, moving streaming data into a lakehouse had two canonical shapes; the Iceberg-topic primitive subsumes both:

  1. Custom Airflow + Python jobs reading from Kafka, transforming, writing Parquet to S3, updating Iceberg catalog. Redpanda's explicit framing: "required specialized talent to write, test, and maintain them, which is error-prone and time-consuming."
  2. Managed connectors like Redpanda Connect or Kafka Connect with an Iceberg sink. Redpanda's framing: "these systems introduce a middleman to the architecture, requiring you to configure and maintain a separate set of clusters for data integration — extraction, transformation, and routing. Additionally, there's no option for a configuration-based approach to make existing topics available in Iceberg without deploying new code."

Costs / caveats

  • Duplicate storage during retention window. Data lives in both the broker's log segments and in object-storage Parquet files simultaneously until the broker's retention policy expires. Bronze durability on object storage is independent of topic retention, but the dual-write overlap is real.
  • Compaction ownership still unclear post-GA. Redpanda 25.1 internalises snapshot expiry as a broker-owned loop but does not explicitly name small-file compaction (Iceberg's other GC-adjacent loop) as broker-owned. Customers writing high-throughput Iceberg topics may still need an external Spark / Flink compaction job to merge small Parquet files for scan performance. See systems/apache-iceberg for the generic externalisation-cost framing.
  • Latency floor is now an explicit knob, not an open question. The 25.1 GA release discloses "tunable workload management" as the operator dial for the snapshot-vs-live-topic lag ceiling — trading end-to-end freshness for broker-CPU budget on Parquet projection + catalog commits. But the chosen value still sets a floor on the end-to-end latency any downstream Iceberg reader observes.
  • Schema evolution path (Iceberg-spec side resolved; Kafka-serializer side still open). 25.1 supports full Iceberg-spec evolution (adds / renames / deletes), so the Iceberg side of the loop is clean. How Iceberg-topic schema changes interact with Kafka-client serializers (Avro / JSON Schema / Protobuf via a schema registry) is a source of operational complexity still not walked in the pedagogy or GA launch posts.
  • Vendor-specific primitive. Iceberg topics are a Redpanda feature; Apache Kafka does not have an equivalent native primitive in the Kafka protocol as of 2025-04. Downstream portability via the Kafka wire protocol is preserved, but the integration shape is Redpanda-specific.
  • DLQ operational surface under-specified. Built-in DLQ is named at GA but envelope-schema shape, retention defaults, replay tooling, and monitoring recommendations are deferred to product documentation — open operational surface for customers.

Seen in

  • sources/2026-01-06-redpanda-build-a-real-time-lakehouse-architecture-with-redpanda-and-databricks — Canonical "the stream is the table" slogan at joint-Databricks altitude. The cleanest articulation of the Iceberg-topic primitive verbatim: "Redpanda's Iceberg Topics allow you to store topic data in the cloud in the Iceberg open table format, so you can query real-time data while it's still streaming. This grants you instant analytics on the freshest data without the complexities of traditional ETL processes." Matt Schumpert (Redpanda) frames the partnership goal verbatim: "The goal of this partnership is to remove the artificial line between real-time data and analytical data." Four verbatim operational wins: "Lower infrastructure costs. Faster time-to-insight. Fewer human hours spent on pipeline maintenance. More free time to build valuable data products and AI applications." No new mechanism; the primitive is framed at its cleanest slogan altitude.
  • sources/2025-05-13-redpanda-getting-started-with-iceberg-topics-on-redpanda-byoc — BYOC-beta extension of Iceberg Topics. Canonicalises the per-topic mode configuration surface (value_schema_id_prefix / value_schema_latest / key_value) and the file-based catalog option as a sibling to REST catalog sync. Composes with BYOC data ownership when the broker writes to a customer-owned bucket.
  • sources/2025-04-07-redpanda-251-iceberg-topics-now-generally-available — GA release disclosure (Redpanda 25.1, multi-cloud availability). Canonicalises the nine-property feature surface (four table-management, five catalog-integration) that distinguishes GA Iceberg topics from the pre-GA framing. Retires two caveats from the 2025-01-21 pedagogy ingest for the snapshot-expiry and Iceberg-spec-schema-evolution halves of the operational loop.
  • sources/2025-01-21-redpanda-implementing-the-medallion-architecture-with-redpanda — Pre-GA pedagogy launch for the Iceberg-topic primitive. Walks the architectural role (Bronze-tier sink), names the displaced alternatives (custom ETL jobs / Connect clusters), and frames the integration as a configuration flip rather than a code change. Pedagogy altitude; no latency / throughput / cost numbers.