
REDPANDA 2025-05-13


Redpanda — Getting started with Iceberg Topics on Redpanda BYOC

Redpanda (2025-05-13) publishes a BYOC-customer setup walkthrough for Iceberg Topics, five weeks after the 25.1 GA disclosure (sources/2025-04-07-redpanda-251-iceberg-topics-now-generally-available). Narrow scope: Iceberg Topics extended to BYOC as a beta, plus a GCS+BigQuery worked example. Secondary disclosure tucked at the bottom: Redpanda BYOC doubles supported partition density per tier in 25.1.

The post is tutorial-heavy but canonicalises three new primitives the prior Iceberg-Topics ingests elided: the Iceberg topic-mode configuration surface (value_schema_id_prefix, value_schema_latest, key_value), the file-based catalog alternative to REST catalog sync, and the BYOC data-ownership corollary (customer-controlled bucket → direct query from customer-controlled engines like BigQuery without going through the REST catalog).

Summary

Redpanda extends Iceberg Topics from Dedicated to BYOC as a beta in 2025-05. For BYOC customers, the data-ownership property compounds: since the data plane already runs in the customer's cloud account and writes to customer-controlled object storage, BYOC Iceberg Topics land Parquet files and Iceberg metadata directly in the customer's own GCS / S3 / ADLS bucket — framed verbatim as "full control of your Iceberg data with zero compromises." The post walks a five-step GCP/BigQuery setup (create topic with iceberg_enabled: true + redpanda.iceberg.mode, register a Protobuf schema in the Schema Registry, produce via Redpanda Connect, configure Tiered Storage to a GCS bucket, create a BigQuery EXTERNAL TABLE pointing at the Iceberg metadata JSON, query). Appendix discloses a 2× partition-density increase per BYOC tier (Tier 1: 1,000 → 2,000, Tier 5: 22,800 → 45,600) thanks to improved per-partition memory efficiency in 25.1 — an adjacent substrate improvement worth noting against the wiki's existing Tier-7 (BYOC 1.75–2 GB/sec) datum.

Key takeaways

1. Iceberg Topics on BYOC is beta

Iceberg Topics GA in 25.1 targeted Redpanda Cloud Dedicated (Redpanda-operated data plane) across AWS, Azure, and GCP (sources/2025-04-07-redpanda-251-iceberg-topics-now-generally-available). This post extends the feature to BYOC, where the data plane runs in the customer's VPC — tagged as beta, with three beta-scoped capabilities named verbatim:

"Self-service configuration of Iceberg settings at the cluster level via our rpk CLI or Cloud HTTP API. Direct integration with popular REST catalogs like Snowflake Open Catalog, or with Iceberg clients like Google BigQuery via a file-based catalog. Support for secure credential handling (e.g., Iceberg REST catalog secrets)."

The file-based catalog option is the novel disclosure — prior ingests (GA) canonicalised REST-catalog sync as the default integration surface and named the "built-in object-store catalog fallback" without detailing it. This post is where the file-based catalog becomes a first-class integration option for engines (like BigQuery) that read Iceberg directly from a metadata JSON pointer rather than through a catalog protocol. Canonicalised as concepts/iceberg-file-based-catalog.

2. BYOC data ownership compounds for Iceberg

Canonical BYOC property from systems/redpanda-byoc: the data plane runs inside the customer's own cloud account / VPC (Redpanda operates the control plane). When this topology is combined with Iceberg Topics, the Parquet files plus the Iceberg metadata files land in the customer's own bucket — not a Redpanda-managed shared bucket. Load-bearing framing verbatim:

"For BYOC customers who already control their own object storage buckets, this means full control of your Iceberg data with zero compromises."

Three practical consequences the post implies (partially explicit, partially inferable from the BYOC-canon on this wiki):

  • Direct query access. The customer's own analytics engine (BigQuery, Athena, Spark on EMR, Snowflake external-stage reads) can open the Iceberg metadata file directly from the customer's bucket with the customer's cloud IAM — no Redpanda-hosted catalog in the query path.
  • Data-residency compliance. For regulated workloads, the customer never has to trust Redpanda with storage-at-rest — the bucket is in the customer's project, under the customer's KMS keys, under the customer's auditing.
  • Bucket-lifecycle / tiering ownership. GCS Object Lifecycle rules, S3 Intelligent-Tiering, cross-region replication — all configured by the customer directly on their own bucket; the broker's Tiered Storage writes don't conflict with these since the bucket is customer-owned.
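To illustrate the bucket-lifecycle point (this wiki's illustration, not from the post), a customer might attach a GCS Object Lifecycle rule directly to the Iceberg bucket — e.g. tiering objects to Nearline after 30 days:

```json
{
  "rule": [
    {
      "action": { "type": "SetStorageClass", "storageClass": "NEARLINE" },
      "condition": { "age": 30 }
    }
  ]
}
```

Applied with `gsutil lifecycle set lifecycle.json gs://your-bucket-name`. Whether a given rule (especially deletion rules) is safe to combine with the broker's Tiered Storage retention is a deployment decision the post does not address.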

Canonicalised as concepts/byoc-data-ownership-for-iceberg (specialisation of the general Data Plane Atomicity BYOC property to the Iceberg data-output surface).

3. Iceberg topic-mode is the per-topic configuration dial

The redpanda.iceberg.mode topic-level configuration selects how the broker projects a Kafka record into an Iceberg row. Three modes named verbatim:

  • value_schema_id_prefix (used in this post's demo) — the record value must start with the Schema Registry wire-format prefix (magic byte + 4-byte schema ID). The broker resolves the schema from the registry, decodes the payload, and projects it into a typed Iceberg table whose columns match the registered schema. Requires a registered Protobuf / Avro / JSON Schema.
  • value_schema_latest — broker uses the latest-version schema of a registered subject, without requiring the producer to prefix each record with a schema ID.
  • key_value — broker writes the raw key + value + Kafka metadata (offset, partition, timestamp) without attempting to decode the payload into typed columns. Schema-less ingestion mode.

The verbatim framing:

"Check that your Redpanda topic is configured with iceberg_enabled set to true and select the right redpanda.iceberg.mode (e.g., value_schema_id_prefix, value_schema_latest, or key_value). This configuration instructs Redpanda to write the topic data in the Iceberg format to the configured Tiered Storage location."

Canonicalised as concepts/iceberg-topic-mode — the missing configuration-surface primitive that the GA-ingest wiki pages for systems/redpanda-iceberg-topics and concepts/iceberg-topic named obliquely ("tables can be created with or without a schema") but did not enumerate by name.
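As a sketch, the configuration surface above might be exercised via rpk like this (hedged: whether iceberg_enabled is cluster- or topic-scoped, and the exact flag spellings, are assumptions of this wiki — the property names follow the post, the syntax may differ by Redpanda version):

```shell
# Hedged sketch — not verbatim from the post.
rpk cluster config set iceberg_enabled true          # enable Iceberg support
rpk topic create sensor_data \
  -c redpanda.iceberg.mode=value_schema_id_prefix    # per-topic projection mode
```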

4. BigQuery via file-based catalog: the external-table pattern

The worked example uses BigQuery as the query engine without going through a REST catalog. The integration mechanism is BigQuery's CREATE EXTERNAL TABLE ... OPTIONS(format = 'ICEBERG', metadata_file_paths = [...]) primitive — the customer points BigQuery at a specific Iceberg metadata JSON file in GCS, and BigQuery walks the manifest tree to find the data files. Verbatim DDL template:

CREATE EXTERNAL TABLE YOUR_PROJECT_ID.YOUR_BIGQUERY_DATASET.YOUR_TABLE_NAME
WITH CONNECTION 'YOUR_FULL_CONNECTION_ID'
OPTIONS (
  format = 'ICEBERG',
  metadata_file_paths = ['gs://your-bucket-name/path/to/your/iceberg/table/metadata/vX.metadata.json']
);

Caveat flagged verbatim: "update the external table definition in BigQuery if the location of the latest metadata file changes or you want to query a newer snapshot of the table data" — with a file-based catalog, the reader sees a snapshot pointer rather than a live-updated table; querying the latest data requires re-running the CREATE EXTERNAL TABLE or (per the linked GCP docs) updating the metadata path. This is the structural trade-off of file-based catalog vs REST catalog: the file-based path has no auto-discovery of new snapshots.
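The refresh caveat implies a manual re-point pattern. A hedged sketch, following the same placeholder template (whether CREATE OR REPLACE is the post's recommended refresh path is not stated):

```sql
-- Hedged sketch: re-point the external table at a newer snapshot by
-- re-issuing the DDL with the latest metadata JSON (vY newer than vX).
CREATE OR REPLACE EXTERNAL TABLE YOUR_PROJECT_ID.YOUR_BIGQUERY_DATASET.YOUR_TABLE_NAME
WITH CONNECTION 'YOUR_FULL_CONNECTION_ID'
OPTIONS (
  format = 'ICEBERG',
  metadata_file_paths = ['gs://your-bucket-name/path/to/your/iceberg/table/metadata/vY.metadata.json']
);
```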

Canonicalised as patterns/external-table-over-iceberg-metadata-pointer — the BigQuery / Athena / Snowflake-external-stage shape where a query engine reads Iceberg via a metadata-JSON pointer rather than a catalog protocol.

5. Protobuf-first schema example

The demo's SensorData Protobuf schema nests a SensorMeasurements message inside SensorData, alongside factory_id, machine_id, sensor_id, reading_timestamp, readings (the nested field), error_code, and last_maintenance_timestamp. The demo pushes the Protobuf + Schema Registry + value_schema_id_prefix combination through Redpanda Connect's schema_registry_encode pipeline processor; the broker then decodes the registry-encoded payloads and writes them into a typed-column Iceberg table.

This composes with the 25.1 GA-ingest disclosure (sources/2025-04-07-redpanda-251-iceberg-topics-now-generally-available) that Protobuf schema normalization was added to the Schema Registry alongside Avro + JSON in 25.1 — making Protobuf-typed Iceberg tables a newly-unlocked combination. No Protobuf vs Avro vs JSON Schema trade-off discussion in the post.
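A hedged reconstruction of the demo schema from the field names given above (the field types and the three measurement field names are this wiki's assumptions — the post only says "3 float fields"):

```protobuf
// Hedged sketch — field names from the post, types assumed.
syntax = "proto3";

message SensorData {
  string factory_id = 1;
  string machine_id = 2;
  string sensor_id = 3;
  int64  reading_timestamp = 4;          // assumed epoch timestamp
  SensorMeasurements readings = 5;       // nested message
  string error_code = 6;
  int64  last_maintenance_timestamp = 7;

  message SensorMeasurements {
    float temperature = 1;               // hypothetical names: the post
    float pressure = 2;                  // discloses only that there are
    float vibration = 3;                 // three float fields
  }
}
```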

6. Redpanda Connect as the producer

The demo uses a Redpanda Connect pipeline (generator → processor → Redpanda output) to produce synthetic sensor data at 300 records/sec (count: 300, period: 1s) with reject_errored as the output-failure-handling envelope. After Schema Registry encoding, records land in the sensor_data topic, which is configured as an Iceberg topic. No throughput / latency numbers beyond the synthetic generator cadence.
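The pipeline shape can be sketched as Redpanda Connect YAML (hedged: the mapping body, broker/registry addresses, subject name, and the exact generator field names are assumptions — the post gives only the count: 300, period: 1s cadence and the reject_errored envelope):

```yaml
# Hedged sketch of the generator -> encode -> produce pipeline.
input:
  generate:
    count: 300                           # per the post's generator cadence
    interval: "1s"                       # the post spells this "period: 1s"
    mapping: |
      root.factory_id = "factory-1"      # hypothetical synthetic payload

pipeline:
  processors:
    - schema_registry_encode:            # wire-format-prefixes each record
        url: "${SCHEMA_REGISTRY_URL}"    # placeholder
        subject: sensor_data-value       # assumed subject name

output:
  reject_errored:                        # surface produce errors as failures
    kafka_franz:                         # Redpanda-compatible Kafka output
      seed_brokers: ["${REDPANDA_BROKERS}"]
      topic: sensor_data
```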

7. BYOC partition-density doubled in 25.1

Under "One more thing: Redpanda BYOC doubles partition density" the post discloses a second 25.1 improvement for BYOC customers: supported partition counts per cluster tier roughly doubled thanks to improved per-partition memory efficiency. Verbatim examples:

  • Tier 1: 1,000 → 2,000 partitions
  • Tier 5: 22,800 → 45,600 partitions

Plus the caveat "existing clusters may not yet support these partition counts if they haven't been upgraded to 25.1" — the change is 25.1-gated.

Three use-case framings (marketing-voice, load-bearing as capability statements):

  • "Scale further on smaller clusters, maximizing infrastructure efficiency and lowering cloud spend" — cost reduction via vertical density.
  • "Support more producers and consumers per topic, effortlessly" — a partition-count ceiling that used to force producer-side batching or consumer-group topology changes at density.
  • "Future-proof your architecture for rising data volumes" — capacity headroom for growth.

No disclosure of the underlying mechanism (whether it's per-partition kernel-thread overhead, memtable size reduction, offset-index compression, or something else). Canonicalised as concepts/broker-partition-density — the per-tier partition ceiling as an operator-facing substrate parameter.

Operational numbers

  • Demo producer rate: 300 records/sec via Redpanda Connect generator (count: 300, period: 1s).
  • Schema: Protobuf SensorData with 7 fields + nested SensorMeasurements (3 float fields).
  • Example partition density (25.1 BYOC):
      • Tier 1: 1,000 → 2,000 (2×)
      • Tier 5: 22,800 → 45,600 (2×)

No latency / throughput / cost / commit-cadence numbers disclosed. No customer case study.

Caveats

  • Tutorial altitude, not retrospective or benchmark. Vendor setup walkthrough with generic placeholders (YOUR_PROJECT_ID, YOUR_BIGQUERY_DATASET) and synthetic data. No production numbers, no customer name, no real workload volume.
  • File-based catalog mechanism underspecified. The post names the file-based catalog option but doesn't detail how BigQuery gets notified of new snapshots, what the snapshot-freshness trade-off is, or how to automate the CREATE EXTERNAL TABLE refresh. The linked BigQuery Iceberg external-tables doc carries the operational detail.
  • Partition-density mechanism unexplained. 2× improvement disclosed without the underlying per-partition memory / resource model. No before/after benchmark of any workload at the new density ceiling.
  • value_schema_id_prefix vs value_schema_latest vs key_value trade-offs elided. The post enumerates the three modes without walking when each is appropriate, failure modes (what if the Schema Registry is unreachable? what if a producer writes a record without a schema-ID prefix to an iceberg_enabled topic?), or performance characteristics.
  • DLQ and schema-evolution integration not walked in BYOC context. Four of the GA-disclosed Iceberg-Topics capabilities (DLQ, schema evolution, snapshot expiry, tunable workload management) aren't re-invoked in the BYOC setup walkthrough — the post defers to the GA post for those.
  • Protobuf-specific guidance thin. Avro and JSON Schema producer/consumer ergonomics not compared. Schema-evolution behaviour on Protobuf field additions / renames / deletes in an Iceberg-topic context not walked.
  • Object-store-catalog fallback framing muddled against file-based catalog. The GA post framed the broker's built-in object-store catalog as a fallback for when no REST catalog is available. This post frames the file-based catalog as a primary integration option for engines (BigQuery) that read Iceberg directly. Whether these are the same mechanism, or distinct, is not clarified by the post — the wiki's concepts/iceberg-file-based-catalog page notes this ambiguity and treats them as aspects of the same broker-owned metadata-in-object-store shape.
  • Transactional-write and concurrent-writer semantics not revisited for BYOC. The GA-ingest caveats about isolation level and conflict-resolution policy remain; the BYOC post doesn't clarify them.
  • Tier definitions opaque. Tier 1 and Tier 5 are named with partition counts but without CPU, RAM, network, or storage dimensions — reading-the-wiki would require separate Redpanda tier documentation.
