CONCEPT Cited by 1 source

Two-tier stream + Iceberg query bridge

A two-tier stream + Iceberg query bridge is the architectural property by which a single SQL statement reads transparently across both tiers of a streaming substrate: the live streaming tier (broker log segments) and the cold historical tier (Parquet files in object storage, registered in an Iceberg catalog). The engine plans a unified read path across the two tiers. The consumer writes one query against one table; the engine decides which records to fetch from which tier. First wiki canonicalisation 2026-05-27.

Definition

A query engine satisfies the two-tier bridge property when:

  1. One logical table spans two physical tiers. The same table surface (in the SQL namespace) covers records currently in the live broker tier and records that have been projected to the cold Iceberg tier.
  2. One SQL statement is enough. The consumer doesn't write a UNION ALL between "SELECT … FROM live_topic" and "SELECT … FROM iceberg_table"; one query against the unified table works.
  3. The engine plans the read. The optimisation decision — which tier holds the records that satisfy the predicate, and how to fetch them — is engine-side, not consumer-side.
  4. Records arrive in the result set indistinguishably. A record that arrived three milliseconds ago and a record that arrived three years ago appear in the same result set without the consumer being aware of which tier supplied which row.
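
The four conditions above can be illustrated with a toy stand-in. The sketch below uses Python's built-in sqlite3 module, not Redpanda SQL; the table names (live_tier, cold_tier), the view name (events), and the helper query_events are all hypothetical. A view plays the role of the unified table surface so the consumer issues one statement and never writes the UNION ALL itself. In a real engine that decision belongs to the planner, not to a static view.

```python
import sqlite3

# In-memory stand-in for the two physical tiers (names are illustrative only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE live_tier (offset_id INTEGER PRIMARY KEY, payload TEXT);
    CREATE TABLE cold_tier (offset_id INTEGER PRIMARY KEY, payload TEXT);
    -- Hypothetical unified surface: one logical table over both tiers.
    CREATE VIEW events AS
        SELECT offset_id, payload FROM cold_tier
        UNION ALL
        SELECT offset_id, payload FROM live_tier;
""")

# Older records sit in the cold tier, the newest record in the live tier.
conn.executemany("INSERT INTO cold_tier VALUES (?, ?)", [(0, "a"), (1, "b")])
conn.executemany("INSERT INTO live_tier VALUES (?, ?)", [(2, "c")])


def query_events(min_offset):
    """One statement against one table; the caller never names a tier."""
    return conn.execute(
        "SELECT offset_id, payload FROM events "
        "WHERE offset_id >= ? ORDER BY offset_id",
        (min_offset,),
    ).fetchall()


print(query_events(1))
```

The rows returned for offsets 1 and 2 come from different tables, but nothing in the result set reveals which tier supplied which row.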

Verbatim disclosure (Redpanda 2026-05-27)

"The data you're querying might have arrived three years ago or three milliseconds ago. Either way: same table, same query, same endpoint, same result. If you're using Redpanda Iceberg Topics, which store your streaming data in both a live tier and a Parquet/Iceberg cold tier in S3 or GCS simultaneously, Redpanda SQL bridges the two tiers transparently. The engine figures out an optimized read path across both. (And you don't have to care.)"

(Source: sources/2026-05-27-redpanda-redpanda-sql-is-ga-the-query-engine-that-skips-the-pipeline)

Pre-condition: dual-tier write substrate

The two-tier query bridge depends on a substrate that simultaneously writes records to both tiers — the broker is the producer for both the live log and the Parquet/Iceberg snapshot sequence. The canonical wiki instance is the Iceberg topic (Source: sources/2025-04-07-redpanda-251-iceberg-topics-now-generally-available), where the broker:

  1. Appends records to local log segments (live tier).
  2. Projects rows to columnar Parquet files (cold tier).
  3. Writes Parquet to S3 / GCS.
  4. Registers Iceberg snapshots with the catalog.
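
The four broker-side steps above can be sketched as a small in-memory model. This is an assumption-laden toy, not Redpanda's implementation: the class ToyIcebergTopic, the flush_every threshold, the file-path format, and the snapshot dictionary layout are all hypothetical, and a plain dict of column lists stands in for a Parquet file.

```python
from dataclasses import dataclass, field


@dataclass
class ToyIcebergTopic:
    flush_every: int = 2                               # hypothetical flush threshold
    live_log: list = field(default_factory=list)       # step 1: live tier
    pending: list = field(default_factory=list)        # rows not yet projected
    object_store: dict = field(default_factory=dict)   # step 3: stand-in for S3 / GCS
    snapshots: list = field(default_factory=list)      # step 4: stand-in for the catalog

    def produce(self, record: dict) -> int:
        """Append to the live log, then flush to the cold tier when due."""
        offset = len(self.live_log)
        self.live_log.append(record)
        self.pending.append((offset, record))
        if len(self.pending) >= self.flush_every:
            self._flush()
        return offset

    def _flush(self) -> None:
        # Step 2: project row-oriented records into column lists.
        # Assumes every pending record shares the first record's keys.
        keys = list(self.pending[0][1].keys())
        columns = {k: [rec.get(k) for _, rec in self.pending] for k in keys}
        # Step 3: "write" the columnar file to the object store.
        path = f"data/file-{len(self.object_store):05d}.parquet"
        self.object_store[path] = columns
        # Step 4: register a snapshot that records the highest offset covered.
        self.snapshots.append({
            "snapshot_id": len(self.snapshots) + 1,
            "files": [path],
            "last_offset": self.pending[-1][0],
        })
        self.pending.clear()


topic = ToyIcebergTopic()
for i in range(3):
    topic.produce({"user": f"u{i}", "amount": i * 10})
```

After three produces the live log holds all three records, while the cold tier holds one file covering offsets 0 and 1. The gap between the two is the window that a cross-tier read has to cover from the live tier.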

Without this substrate property, a query bridge across two tiers requires consumer-side reconciliation (the Lambda architecture shape).

Comparison with the Lambda-architecture shape

Property | Two-tier query bridge (this concept) | Lambda architecture (consumer-side merge)
-------- | ------------------------------------ | -----------------------------------------
Read-side query | Single SQL statement | Two queries, consumer merges (or stream + batch jobs)
Engine plans the cross-tier read | Yes | No; the consumer / orchestrator does
Consistency between tiers | Engine-controlled (single planner) | Consumer-controlled
Schema co-evolution | One schema (substrate-managed) | Two schemas (must be kept aligned)
Operator footprint | One engine, one substrate | Two engines (e.g., Spark for batch, Flink for stream)

The Lambda architecture is the historically canonical answer to the problem of querying live and historical data together at scale. The two-tier query bridge is structurally simpler: the substrate carries the dual-tier shape on the write side, and the engine carries the dual-tier shape on the read side.
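
For contrast, the consumer-side merge that the Lambda shape requires might look like the sketch below. The function name lambda_style_read and the (offset, payload) row shape are hypothetical, and the rule that the batch view wins on overlapping offsets is one common convention assumed here, not something the source specifies.

```python
def lambda_style_read(batch_rows, stream_rows):
    """Merge results from two separate queries on the consumer side.

    Both inputs are iterables of (offset, payload) pairs. Where the two
    results overlap, the batch row is treated as authoritative (assumption).
    """
    merged = {}
    for offset, payload in stream_rows:
        merged[offset] = payload
    for offset, payload in batch_rows:
        merged[offset] = payload  # batch result overrides the stream result
    return sorted(merged.items())


# The consumer has to run both queries and reconcile the overlap at offset 1.
batch = [(0, "a"), (1, "b")]
stream = [(1, "b-live"), (2, "c")]
print(lambda_style_read(batch, stream))
```

Every consumer that needs a combined view carries this reconciliation logic itself, which is the work the two-tier bridge moves into the engine.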

Mechanism (not detailed in 2026-05-27 source)

The launch post states that "the engine figures out an optimized read path across both" without disclosing the routing primitive. Open questions:

  • Per-partition routing? The engine could decide per partition whether the live tier or the cold tier is authoritative based on snapshot freshness.
  • Per-offset-range routing? Records up to the last committed Iceberg snapshot read from cold; subsequent records read from live.
  • Per-timestamp routing? Time-range predicates could pin records before the snapshot horizon to cold and records after to live.
  • Predicate pushdown. Predicates that fully match Iceberg partition / column-statistic skipping might be pushed entirely to the cold tier.
  • Cross-tier transactionality. Isolation level for queries that span both tiers (read-committed across the snapshot horizon? Snapshot-isolation against an Iceberg snapshot ID? Eventually-consistent?) is not specified.

These mechanism details are absent from the 2026-05-27 GA launch post and may be disclosed in subsequent technical blogs or documentation.
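
To make one of the open questions concrete, here is a minimal sketch of what per-offset-range routing could look like. It is purely speculative: the source does not say that Redpanda SQL routes this way, and the names ReadPlan, plan_offset_range, and snapshot_last_offset are hypothetical. The sketch splits a requested half-open offset range at the snapshot horizon, sending everything up to the last committed snapshot to the cold tier and the remainder to the live tier.

```python
from typing import NamedTuple, Optional


class ReadPlan(NamedTuple):
    cold: Optional[range]   # offsets to read from the Iceberg tier, if any
    live: Optional[range]   # offsets to read from the broker log, if any


def plan_offset_range(start: int, end: int, snapshot_last_offset: int) -> ReadPlan:
    """Split the half-open range [start, end) at the snapshot horizon.

    snapshot_last_offset is the highest offset covered by the last committed
    snapshot (use -1 when no snapshot exists yet).
    """
    if start >= end:
        return ReadPlan(None, None)
    horizon = snapshot_last_offset + 1          # first offset only in the live tier
    cold_end = min(end, horizon)
    live_start = max(start, horizon)
    cold = range(start, cold_end) if start < cold_end else None
    live = range(live_start, end) if live_start < end else None
    return ReadPlan(cold, live)


# A request for offsets 0..9 with a snapshot covering offsets up to 5.
plan = plan_offset_range(0, 10, snapshot_last_offset=5)
print(plan)
```

Under this scheme each offset is read from exactly one tier, so no deduplication is needed. It says nothing about isolation if a new snapshot commits while the query is running, which is the cross-tier transactionality question left open above.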

Canonical wiki instance

  • systems/redpanda-sql (2026-05-27 GA): queries the Iceberg Topics substrate (live broker log + cold Parquet/Iceberg) through one SQL statement; the engine plans the unified read path.

Caveats

  • Substrate-dependent. The bridge property requires a dual-tier-write substrate (Iceberg Topics, or any equivalent that simultaneously writes records to a broker tier and a cold catalog-registered tier). Without that substrate, the bridge reduces to consumer-side reconciliation.
  • Mechanism not disclosed. The 2026-05-27 source describes the user-facing property (single statement, transparent routing) but not the engine-side routing mechanism, predicate pushdown rules, or cross-tier consistency guarantees.
  • Read-only framing. The bridge is described as a query property; INSERT / UPDATE / DELETE semantics across the two tiers are not addressed.
  • Single-substrate scope. The bridge as disclosed is Redpanda-Iceberg-Topics specific; whether the engine bridges arbitrary cold Iceberg tables (e.g. tables not produced by Iceberg Topics) is not clarified.