Skip to content

Bridge Queries in Redpanda SQL

Summary

This post introduces bridge queries — a Redpanda SQL feature that lets a single SQL statement transparently read from both an Iceberg table (historical, cold data) and the underlying Redpanda topic (live, hot data) via one virtual table. The architectural insight is that bridge queries decouple the Iceberg flush interval from the freshness requirement: the topic covers the freshness gap at query time, so the flush cadence can be optimized purely for Parquet file layout (large, well-compressed files) rather than for data recency.

Key takeaways

  1. Bridge queries unify two storage tiers in one SQL statement. The engine routes historical rows from Iceberg and recent rows from the topic; no UNION, no dual-write complexity, no client awareness of tiers. The query author sees a single table.

  2. The flush interval is no longer driven by freshness. Previously, to keep analytics dashboards fresh to ≤1 minute, you had to flush from topic to Iceberg every minute — producing tiny Parquet files. With bridge queries, the topic itself covers the recency gap, so you can flush every 1–6 hours for optimal file sizes.

  3. Larger Parquet files are dramatically more efficient. Benefits cited: better column-encoding compression ratios, fewer S3 GET requests per query, less Iceberg catalog metadata overhead, more effective predicate pushdown (row group statistics work better on substantial row groups), and far less compaction work.

  4. The stitching is deduplication-free. Because the Redpanda broker itself performed the Iceberg conversion, it has authoritative knowledge of what data is in which tier. The bridge is a simple concatenation, not a dedup join — which keeps it high-performance.

  5. Configuration is two knobs. redpanda.iceberg.target.lag.ms (per-topic, controls how long before data must reach Iceberg) and datalake_translator_flush_bytes (cluster-wide, size threshold for flushing a Parquet file). Rule of thumb: throughput × lag ≈ target file size.

  6. Partition cardinality must stay low. The default partition spec is (hour(redpanda.timestamp)). Higher cardinality (minute-level, high-cardinality field) multiplies file count and negates the large-flush benefit. The post recommends day(redpanda.timestamp) for most workloads.

  7. Quantitative example. Going from 500 KB files every 30 seconds to 32 MB files a few times per hour yields visible improvements in query latency and S3 cost.

  8. Availability note. Bridge queries require Redpanda SQL (GA for BYOC AWS in 2026-05-27). Self-managed Enterprise availability targeted for late 2026 / early 2027.

Systems and concepts extracted

Systems

Concepts

Patterns

Operational numbers

Metric Value
Default lag target 1 min (redpanda.iceberg.target.lag.ms)
Recommended lag target for bridge queries 1 hour (3,600,000 ms)
Default translator flush size 32 MiB
Optional raised flush size 64 MiB
Small file example 500 KB every 30 s
Large file example 32 MB every few hours
Throughput rule of thumb throughput × lag ≈ target file size
Default partition spec (hour(redpanda.timestamp))
Self-managed availability Late 2026 / early 2027

Caveats

  • Bridge queries are available only on Redpanda SQL (BYOC AWS, GA since 2026-05-27). Self-managed support is months out.
  • The post does not disclose bridge-query latency overhead vs pure-Iceberg queries.
  • No discussion of consistency semantics (read-after-write guarantees) for the bridge stitching.
  • The "deduplication-free" claim depends on Redpanda being the sole writer to both tiers; if external writers push to the Iceberg table, guarantees may differ.

Source

Last updated · 559 distilled / 1,651 read