
REDPANDA 2026-03-05

Redpanda — Introducing Iceberg output for Redpanda Connect

Unsigned Redpanda launch post (~1,000 words, 2026-03-05) announcing the iceberg output connector for Redpanda Connect (shipped in Redpanda Connect v4.80.0, enterprise license). A declarative Apache Iceberg sink that writes streaming data directly to Iceberg tables from a YAML pipeline, using the Iceberg REST Catalog API. Positioned as the non-Kafka-source companion to the pre-existing broker-native Redpanda Iceberg Topics feature — different tool for different shapes.

Summary

The post walks a single motivating gap: Iceberg Topics gives zero-ETL from Kafka protocol → Iceberg table, but customers with non-Kafka sources (HTTP webhooks, Postgres CDC, GCP Pub/Sub) or who need in-stream transformations (PII stripping, flattening, type routing) needed an alternative. The iceberg output fills that gap, plugging into Redpanda Connect's "300+ inputs and processors" ecosystem. Three architectural properties are load-bearing:

  1. Registry-less, data-driven schema evolution — the connector senses new fields in raw JSON and auto-updates the Iceberg table metadata; no Schema Registry required; no manual DDL.
  2. Data-driven flushing (explicit inversion of timer-driven flushing) — flush only when data is present, avoiding the small-file problem on object storage and idle compute waste on quiet sources.
  3. Bloblang-interpolated multi-table routing from a single pipeline — the table and namespace fields support Bloblang interpolation (e.g. 'events_${!this.event_type}'), so one pipeline definition routes messages to N tables based on message content, displacing the "configuration hell" of per-table static mappings.
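
The in-stream transformation side of properties like these (PII stripping, flattening) can be sketched as a Redpanda Connect processor fragment. A hedged sketch only — the field names (ssn, user.address.city) and this exact mapping are hypothetical, not from the post:

```yaml
# Hypothetical processor fragment: strip PII and flatten a nested payload
# before records reach the iceberg output. Field names are illustrative.
pipeline:
  processors:
    - mapping: |
        root = this
        root.ssn = deleted()                 # drop a PII field outright
        root.city = this.user.address.city   # hoist a nested value
        root.user = deleted()                # discard the nested object
```

Assigning deleted() to a path is standard Bloblang for removing a field; the rest of the pipeline (input, output) is elided here.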

The connector speaks the Iceberg REST Catalog API and integrates with Apache Polaris™, AWS Glue, Databricks Unity Catalog, Snowflake Open Catalog, GCP BigLake, or any REST-speaking catalog.

Key takeaways

  1. Non-Kafka sources are the filled gap. Verbatim: "But maybe your data arrives from an HTTP webhook, a Postgres CDC stream, or a GCP Pub/Sub subscription. Maybe you need to normalize a payload, drop PII, or split a mixed event stream by type before anything hits the lakehouse. That's the gap this connector fills." The Iceberg output is explicitly positioned against Iceberg Topics' zero-ETL broker-to-table path, not as a replacement.

  2. Two-shape comparison table canonicalised verbatim — Iceberg Topics (in-broker, registry-driven, 1 topic → 1 table, zero extra components, Redpanda Cloud BYOC or Self-Managed EE) vs Iceberg output (stateless K8s sink, data-driven schema, multi-table routing, hundreds of non-Kafka sources, Redpanda Connect Enterprise tier). "Primary value: Zero-ETL convenience vs Integration flexibility."

  3. Registry-less schema evolution as a first-class feature. Verbatim: "The Iceberg output also uses schema evolution to sense new fields in an incoming JSON stream and automatically updates the Iceberg table metadata. No manual DDL, no registry required, and no ticket for the ops team every time an app update adds a column." Trade-off framing verbatim: "while other connectors can technically evolve a schema, doing so without a schema registry usually forces you into 'maintenance toil' (chaining brittle Kafka Connect SMTs) or leaves you with 'dirty data' (where all columns land as string data types). Redpanda Connect gives you the best of both worlds: the flexibility of raw JSON with the precision of a structured lakehouse." Canonicalised on the wiki as concepts/registry-less-schema-evolution, a fifth axis on concepts/schema-evolution.

  4. Data-driven flushing as the small-file-problem mitigation. Verbatim: "Unlike legacy connectors that heartbeat on a fixed timer regardless of activity, Redpanda Connect uses data-driven flushing. It only executes a flush operation when there is actual data to move, preventing the 'small file problem' on object storage and ensuring you aren't wasting compute cycles on empty operations." Canonicalised on the wiki as concepts/data-driven-flushing — the inversion of the timer-based flush common to Kafka Connect-era sinks. Related wiki substrate: concepts/small-file-problem-on-object-storage (new).

  5. Bloblang-interpolated multi-table routing from one pipeline. The table and namespace config fields are Bloblang-interpolated — a single pipeline routes messages across N tables based on message content. Worked example verbatim: table: 'events_${!this.event_type}'. Canonicalised on the wiki as patterns/bloblang-interpolated-multi-table-routing. Trade-off framed against "configuration hell" of traditional connectors that need rigid per-table mappings.

  6. Iceberg REST Catalog API is the integration surface. Lists Apache Polaris, AWS Glue, Databricks Unity Catalog, Snowflake Open Catalog, GCP BigLake as supported catalogs. Adds one catalog-specific worked example — Polaris with OAuth2 client-credentials — to the wiki's canonical Iceberg catalog REST sync substrate.

  7. OAuth2 token exchange + per-tenant REST catalog as the enterprise-isolation substrate. Verbatim: "Redpanda Connect fits into your existing OAuth2 token exchange and per-tenant REST catalog (like Polaris) workflows out of the box. And because Redpanda Connect is so lightweight (runs as low as 0.1 vCPU), you can deploy isolated, high-density pipelines for every tenant or department without blowing your cloud budget." 0.1 vCPU per-pipeline density is the operational-shape claim; no fleet numbers.

  8. Append-only only at launch; upserts on the roadmap. Verbatim: "This initial release focuses on high-speed append-only ingestion (with upserts on the roadmap)." A material scope limit for CDC workloads — Postgres CDC feeding UPDATE/DELETE operations cannot cleanly land through this connector in v4.80.0.
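
Given the append-only scope limit, a CDC stream could be pre-filtered to inserts before it reaches the sink. A hedged sketch, assuming Debezium-style envelopes where op is "c" (create), "u" (update), "d" (delete) — none of this is from the post, and the envelope shape is an assumption:

```yaml
# Hypothetical workaround for the v4.80.0 append-only limit: drop CDC
# updates and deletes before the iceberg output. Assumes Debezium-style
# envelopes (op: "c"/"u"/"d", payload under "after"); adjust to your
# CDC source's actual format.
pipeline:
  processors:
    - mapping: |
        root = if this.op == "c" { this.after } else { deleted() }
```

This sidesteps rather than solves the limit — updates and deletes are simply not landed until upserts ship.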

Systems extracted

Concepts extracted

  • concepts/iceberg-catalog-rest-sync — extended with the sink-connector-altitude instance (prior instances were broker-native).
  • concepts/schema-evolution — extended with the registry-less / data-driven axis.
  • concepts/registry-less-schema-evolution (new) — the property of evolving an Iceberg table's schema from raw JSON without a Schema Registry, framed as the "best of both worlds" between brittle SMT chains and dirty all-string tables.
  • concepts/data-driven-flushing (new) — flush-on-data-present rather than heartbeat-on-timer. Mitigates the small-file problem on object storage.
  • concepts/small-file-problem-on-object-storage (new) — the pathology that small, frequently-flushed files on object storage create: metadata bloat, read-amp during scan, per-file listing cost.
  • concepts/bloblang (new) — Redpanda Connect's declarative mapping language, previously referenced implicitly; canonicalised here as the mechanism behind multi-table routing and in-stream reshaping.
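
The flush-on-data-present inversion behind concepts/data-driven-flushing reduces to a one-line predicate. An illustrative sketch only — the post does not disclose Redpanda Connect's actual trigger shape (see Caveats), and the interval parameter here is hypothetical:

```python
def should_flush(buffered_records: int, seconds_since_flush: float,
                 max_interval: float = 60.0) -> bool:
    """Data-driven flushing sketch: a timer may wake the sink, but the
    flush itself is gated on records actually being buffered. A pure
    timer-driven sink would drop the first condition and emit empty
    commits and tiny files on quiet sources."""
    return buffered_records > 0 and seconds_since_flush >= max_interval
```

On a quiet source the timer fires but should_flush(0, 120.0) stays false, so no empty object is written; a busy source past the interval flushes normally.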

Patterns extracted

  • patterns/bloblang-interpolated-multi-table-routing (new) — a single pipeline routes messages across N tables via Bloblang-interpolated table and namespace config fields; worked example table: 'events_${!this.event_type}'.

Operational numbers

  • Redpanda Connect v4.80.0 — the release shipping the Iceberg output.
  • 0.1 vCPU per-pipeline lower bound cited for high-density per-tenant deployment.
  • No throughput numbers, no latency numbers, no fleet numbers, no case studies.

Caveats

  • Launch-post voice — "Today we're announcing…" opener, "Suffer no more with Redpanda Connect!" marketing register. Zero production incidents, zero customer case studies, zero quantitative disclosures beyond the 0.1 vCPU density datum.
  • Append-only at launch — upserts on roadmap. Material scope limit for CDC UPDATE/DELETE workloads in v4.80.0.
  • Schema-evolution mechanism depth not disclosed — the post says the connector "senses new fields in an incoming JSON stream and automatically updates the Iceberg table metadata", but doesn't disclose: how type inference is done from raw JSON (string vs number vs nested-object); what happens on type conflicts across records (coerce? quarantine? error?); whether column renames or type-widening are supported; whether deleted fields leave tombstone columns. The "best of both worlds" claim is asserted, not mechanism-shown.
  • Data-driven flushing mechanism depth not disclosed — the trigger shape (per-record? per-batch? watermark-based?), flush interval bounds, and interaction with Iceberg snapshot cadence are all elided.
  • No benchmark against the "legacy connectors" it foils (Kafka Connect Iceberg Sink, Tabular / Databricks Iceberg sinks).
  • No discussion of commit-tuning trade-offs — Iceberg snapshot commits on object storage have a per-commit overhead; commit frequency vs small-file tradeoff is a real operational axis and the post name-checks "commit tuning" only as a docs-reference in passing.
  • Enterprise-gated — requires Redpanda Connect Enterprise tier license; Apache 2.0 Redpanda Connect core users cannot access this connector. Contrast with the 2025-06-17 dynamic-plugins launch which was Apache 2.0.
  • Unsigned (Redpanda default attribution).
  • Partition spec expressions named as a configurable feature but not walked in the post — only the flat events_${!this.event_type} table-routing example is shown.
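
The commit-frequency vs small-file axis the post elides can be put in back-of-envelope numbers (all figures hypothetical, single writer, data always present):

```python
def files_per_day(flush_interval_s: float) -> int:
    """One data file per flush -- a simplifying assumption to show why
    flush cadence is an operational axis: 86,400 seconds in a day."""
    return int(86_400 / flush_interval_s)

# A 5-second cadence yields 17,280 files/day per writer; a 5-minute
# cadence yields 288 -- the same data, 60x fewer objects to list,
# open, and compact.
fast = files_per_day(5)
slow = files_per_day(300)
```

Each Iceberg commit also carries snapshot-metadata overhead on object storage, so lower cadence trades freshness for fewer commits and larger files — the tuning the post name-checks but does not walk.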

Cross-source continuity

No existing-claim contradictions — the post is strictly additive.

Example pipeline (verbatim from post)

input:
  redpanda:
    seed_brokers: ["${REDPANDA_BROKERS}"]
    topics: ["events"]
    consumer_group: "iceberg-sink"

pipeline:
  processors:
    - mapping: |
        root = this
        root.ingested_at = now()

output:
  iceberg:
    catalog:
      url: https://polaris.example.com/api/catalog
      warehouse: analytics
      auth:
        oauth2:
          client_id: "${CATALOG_CLIENT_ID}"
          client_secret: "${CATALOG_CLIENT_SECRET}"
    namespace: raw.events
    table: 'events_${!this.event_type}'
    storage:
      aws_s3:
        bucket: my-iceberg-data
        region: us-west-2

Note the namespace + table Bloblang interpolation (literal string raw.events for namespace, templated 'events_${!this.event_type}' for table). The in-pipeline mapping processor adds an ingested_at timestamp — exemplifying the in-stream-transformation value proposition before records land in Iceberg.
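
A defensive variant of the routing expression — not shown in the post, and assuming Bloblang's standard .or() catch-all — coalesces records lacking an event_type into a fallback table rather than failing the write:

```yaml
# Hypothetical variant: records missing event_type land in events_unknown
# instead of erroring. The .or() catch-all is standard Bloblang; this
# exact usage is illustrative, not from the post.
output:
  iceberg:
    namespace: raw.events
    table: 'events_${! this.event_type.or("unknown") }'
```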
