PATTERN Cited by 1 source
Transparent hot+cold tier query¶
Read live streaming records (hot tier) and historical Parquet/ Iceberg records (cold tier) in a single SQL statement against a single logical table, with the engine planning a unified read path across both tiers and the consumer unaware of which tier supplies which row. First wiki canonicalisation 2026-05-27 via Redpanda SQL over the Iceberg Topics substrate.
Pattern shape¶
SELECT * FROM orders
WHERE timestamp > NOW() - INTERVAL '7 days'
│
▼
┌──────────────────────────────┐
│ Query engine (Oxla MPP) │
│ plans unified read path │
└──────────────────────────────┘
│ │
hot-tier │ │ cold-tier
records │ │ records
▼ ▼
┌───────────────┐ ┌───────────────────┐
│ Live broker │ │ Iceberg snapshot │
│ log segments │ │ Parquet files in │
│ (last seconds │ │ S3 / GCS │
│ of records) │ │ (older records) │
└───────────────┘ └───────────────────┘
Result set: indistinguishable rows
A single result set; rows from both tiers; consumer can't tell (and doesn't need to) which tier any given row came from.
Pre-condition¶
The pattern requires a dual-tier-write substrate — records are simultaneously written to both the live tier and the cold tier as they arrive, by the same write path. The canonical wiki instance of this is the Iceberg topic (Source: sources/2025-04-07-redpanda-251-iceberg-topics-now-generally-available), where the broker writes records to both:
- Local log segments (hot tier).
- Parquet files in object storage, registered with an Iceberg catalog (cold tier).
Both writes happen in the broker's record-handling path; both tiers agree on the same record sequence. The cold tier lags the hot tier by the snapshot-commit cadence (operator-tunable).
Without the dual-tier-write substrate property, the pattern reduces to the Lambda architecture (consumer-side merge of independently- written streaming + batch tiers).
Verbatim disclosure (Redpanda 2026-05-27)¶
"The data you're querying might have arrived three years ago or three milliseconds ago. Either way: same table, same query, same endpoint, same result. If you're using Redpanda Iceberg Topics, which store your streaming data in both a live tier and a Parquet/Iceberg cold tier in S3 or GCS simultaneously, Redpanda SQL bridges the two tiers transparently. The engine figures out an optimized read path across both. (And you don't have to care.)"
(Source: sources/2026-05-27-redpanda-redpanda-sql-is-ga-the-query-engine-that-skips-the-pipeline)
Why this pattern matters¶
Three structural payoffs over the alternatives:
| Alternative | Cost |
|---|---|
| Lambda architecture (consumer-side merge) | Two queries, two engines, schema co-evolution tax, custom merge logic |
| Warehouse-only (cold-tier-only query) | Loses freshness — last few seconds/minutes of data invisible |
| Streaming-only (hot-tier-only query) | Loses history — can't query records past the broker retention horizon |
| Two separate SQL surfaces (one per tier) | Consumer chooses + composes; UNION ALL pollutes business-logic queries |
| Transparent hot+cold tier query (this pattern) | One query, one schema, one consumer mental model |
The fourth row's "two surfaces, consumer composes" is what most predefined-streaming-SQL stacks effectively do: the user has a streaming-SQL surface (ksqlDB, Flink) and a warehouse-SQL surface (Snowflake, BigQuery), and writes a UNION ALL when they need both. The pattern collapses that to one surface.
Mechanism (open question per 2026-05-27 disclosure)¶
The launch post repeats "the engine figures out an optimized read path across both" without disclosing the mechanism. Plausible implementations:
- Per-partition snapshot horizon. Engine maintains, per partition, the last offset committed to Iceberg. Records beyond that offset come from the live tier; records up to that offset come from the cold tier.
- Per-timestamp bisection. Time-range predicates bisect at the last snapshot's commit timestamp; older predicates push to cold, newer push to live.
- Predicate pushdown. Predicates fully matching Iceberg partition / column-statistic skipping push entirely to cold; predicates that don't may scan both tiers and dedupe.
- Hybrid plan. The optimiser decides per query whether the cold-tier scan is profitable (large analytic over months of history) or the live-tier scan is cheaper (point lookup of the last few seconds).
The 2026-05-27 launch doesn't disclose which implementation; this is a future technical-blog target.
Cross-tier consistency¶
Open question: what isolation level does a query see when records are mid-write to the live tier but not yet committed to the cold tier? Plausible options:
- Read-committed-cross-tier. Engine reads live-tier records up to the broker's high water mark, plus cold-tier records up to the last Iceberg snapshot. May see records in live that aren't yet in cold.
- Snapshot-isolation against an Iceberg snapshot ID. Query pinned to a specific snapshot; live-tier records past that snapshot are excluded. Sacrifices freshness but is reproducible.
- Eventually-consistent. Engine doesn't guarantee a unified consistency level; consumers tolerant.
Not disclosed in the 2026-05-27 launch; a real production SQL substrate has to commit to one.
Implementation considerations¶
- Snapshot lag affects the boundary. The cold-tier commit cadence (operator-tunable on Iceberg Topics) sets how recently records have moved from live-only to live+cold. Aggressive commit cadence = tighter boundary, more catalog churn. Looser commit cadence = wider live-only window, less metadata pressure.
- Dedup at the boundary. If the engine reads a record from both tiers (because the snapshot was committed but the live record hadn't been pruned), result-set dedup is required.
- Schema evolution alignment. The hot tier (broker schema registry) and the cold tier (Iceberg schema) must agree. Schema changes that aren't compatible across tiers break queries.
- Predicate-pushdown cost asymmetry. Iceberg supports rich metadata-skipping; broker logs don't. A query that scans millions of rows in the live tier is much more expensive than the same scan on cold-tier Parquet.
Canonical wiki instance¶
- systems/redpanda-sql over systems/redpanda-iceberg-topics (2026-05-27 GA) — the canonical wiki instance: Oxla MPP engine reads transparently across the Iceberg Topics dual-tier substrate, with the broker handling the simultaneous-write that makes the bridge feasible.
Adjacent patterns¶
- patterns/lambda-architecture-for-fresh-and-complete-sequences — the consumer-side-merge alternative. Two engines (batch + stream), custom merge logic. The pattern this one collapses by pushing the merge into the engine.
- patterns/in-vpc-query-engine-on-streaming-substrate — the deployment-side complement. This pattern is about read semantics; that pattern is about engine location.
- patterns/streaming-broker-as-lakehouse-bronze-sink — the upstream-substrate pattern that produces the cold tier as a byproduct of the broker's normal write path.
Seen in¶
- 2026-05-27 — sources/2026-05-27-redpanda-redpanda-sql-is-ga-the-query-engine-that-skips-the-pipeline — first wiki canonicalisation as a dedicated pattern.
Caveats¶
- Substrate-dependent. Requires a dual-tier-write substrate (Iceberg Topics or equivalent). Without it, reduces to Lambda.
- Mechanism not disclosed. 2026-05-27 source canonicalises the user-facing property (single statement, transparent routing) but not the engine-side routing mechanism, predicate pushdown rules, or cross-tier consistency.
- Read-only framing. DML / DDL semantics across the two tiers unaddressed.
- No quantitative latency claims. Whether the bridge adds query-planning overhead vs cold-tier-only or live-tier-only is not benchmarked.