CONCEPT Cited by 1 source

In-cluster streaming SQL¶

In-cluster streaming SQL is the architectural property where the analytical query engine is deployed inside the same cluster as the streaming broker and storage substrate, so SQL queries access streaming data and historical data in-place without crossing a network boundary or egressing the customer's cloud account / VPC. First wiki canonicalisation 2026-05-27.

Definition¶

A streaming-platform deployment satisfies in-cluster streaming SQL when:

The query engine runs on the same infrastructure as the streaming brokers (same VPC, same cloud account, same managed service unit).
Reads happen in place. The engine accesses the broker's live log segments and the cold-tier storage (object store / Iceberg) directly, not through an ingestion or replication pipeline.
No third-party compute service sits in the query path. The data is not sent to an external SaaS warehouse, an external cloud-side query service, or a separate vendor's compute.
The customer's data never egresses the VPC to be queried. Compliance / data-residency / cross-cloud cost guarantees on the storage substrate also hold for the analytical-compute substrate.
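
The four conditions read naturally as a checklist. A minimal sketch of that checklist in Python follows. The `Deployment` record and every field name in it are hypothetical, chosen for illustration, and do not come from any Redpanda API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    """Hypothetical description of a streaming-platform deployment."""
    engine_vpc: str
    broker_vpc: str
    engine_account: str
    broker_account: str
    reads_in_place: bool        # engine reads live log segments and cold tier directly
    third_party_compute: bool   # an external compute service sits in the query path
    data_egresses_vpc: bool     # data leaves the VPC in order to be queried


def unmet_conditions(d: Deployment) -> list[str]:
    """Return the conditions from the definition that the deployment fails."""
    unmet = []
    if d.engine_vpc != d.broker_vpc or d.engine_account != d.broker_account:
        unmet.append("engine runs on the same infrastructure as the brokers")
    if not d.reads_in_place:
        unmet.append("reads happen in place")
    if d.third_party_compute:
        unmet.append("no third-party compute in the query path")
    if d.data_egresses_vpc:
        unmet.append("data never egresses the VPC to be queried")
    return unmet


def satisfies_in_cluster_streaming_sql(d: Deployment) -> bool:
    """True only when all four conditions hold."""
    return not unmet_conditions(d)
```

Under these assumptions, a warehouse-with-ingestion deployment would fail all four checks, while an external federated engine reading only the cold tier in place would typically fail the first.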

Verbatim disclosure (Redpanda 2026-05-27)¶

"Redpanda SQL runs on the same infrastructure as your brokers, inside your VPC, and every query accesses data in-place, in both the hot (stream) and cold (Iceberg table) tiers. Nothing is sent to a third-party compute service, which is critical if you have compliance requirements (and just as important for strong cybersecurity hygiene), working within your existing infosec-approved environment."

(Source: sources/2026-05-27-redpanda-redpanda-sql-is-ga-the-query-engine-that-skips-the-pipeline)

Comparison with neighbouring shapes¶

Shape	Engine location	Data movement
In-cluster streaming SQL (this concept)	Same cluster + VPC as broker + storage	None — engine reads live + cold tiers in place
Warehouse-with-ingestion (Snowflake, BigQuery)	External SaaS warehouse	Records replicated via connector → warehouse storage
External federated engine (Trino on object store)	Separate compute cluster	Cold-tier reads in place; live-tier needs ingestion
In-broker materialised view (ksqlDB, Flink SQL)	Inside broker / co-located stream engine	None for predefined queries; ad-hoc not supported
Lambda-architecture batch + stream merge	Two engines (Spark + Flink)	None on read, but two pipelines on write

The in-cluster streaming SQL shape is structurally distinct from the other four. It combines the in-place read path of an in-broker engine with the ad-hoc querying of an external engine, and it does so within a single-cluster operational footprint. The trade-off: the engine and the broker share infrastructure (with potential for noisy-neighbor resource contention), and analytical compute cannot scale independently of the cluster shape.

Why this property matters¶

Three structural payoffs:

Compliance + data residency at the analytical-compute layer. If the storage substrate is BYOC (data resides in the customer's account) but the query engine is external SaaS, the compliance story has a gap: data must egress to be queried. In-cluster streaming SQL closes that gap. Verbatim from the 2026-05-27 Redpanda disclosure: "Regulated data that cannot egress to an external SaaS provider can now be queried directly within your VPC, without procuring a separate query engine or moving data across providers, regions, or network zones."
Latency floor reduction. Ingestion-pipeline latency (warehouse-side) is "a few seconds to minutes of arrival delay"; in-cluster SQL reads the live broker tier with no ingestion-side delay. The 2026-05-27 launch's wiki-canonical example: "SELECT * FROM orders WHERE status = 'failed' AND timestamp > NOW() - INTERVAL '30 minutes'. Results in seconds."
Operational footprint reduction. No separate query-engine cluster to procure, scale, govern, secure, or pay for. The vendor-positioning argument in the launch post: "One architecture. One operational model. One vendor."
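
Since the canonical instance is described as a Postgres-wire engine, the launch post's example query could plausibly be issued from any Postgres client. The sketch below assumes the `psycopg2` driver is installed and that `dsn` points at a Postgres-compatible endpoint inside the VPC. The function names, the DSN and the choice of driver are all illustrative assumptions, not something the source documents.

```python
def failed_orders_query(window_minutes: int = 30) -> str:
    """Build the launch post's example query for a given look-back window."""
    if window_minutes <= 0:
        raise ValueError("window_minutes must be positive")
    return (
        "SELECT * FROM orders "
        "WHERE status = 'failed' "
        f"AND timestamp > NOW() - INTERVAL '{int(window_minutes)} minutes'"
    )


def run_query(dsn: str, sql: str) -> list[tuple]:
    """Run `sql` over a Postgres-wire connection and return all rows.

    Assumption: psycopg2 is available and the endpoint accepts this driver.
    """
    import psycopg2  # imported lazily so the query builder works without the driver

    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute(sql)
            return cur.fetchall()
    finally:
        conn.close()
```

The point of the sketch is what is absent: there is no connector, staging table or load job between the producer writing to `orders` and the client reading from it.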

Pre-condition: a substrate that exposes a unified table abstraction¶

In-cluster streaming SQL requires a substrate that the query engine can read across both tiers without consumer-side reconciliation. The canonical wiki instance of this is the Iceberg topic: a single logical entity that is both a Kafka-protocol topic and an Iceberg table backed by the same data, written simultaneously to live broker logs and to Parquet/Iceberg files in object storage.

Without that substrate, in-cluster streaming SQL would require the engine to either (a) only query live records (losing historical view), (b) only query cold records (losing freshness), or (c) implement consumer-side reconciliation between two independently written tiers (the Lambda-architecture shape).
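
To make option (c) concrete, here is a rough sketch of the reconciliation a consumer would have to carry when the two tiers are written independently. The `Record` type and the rule of keeping the cold-tier copy on overlap are assumptions made for this illustration. A real implementation would also have to handle late data, schema drift and gaps between the tiers.

```python
from typing import Iterable, NamedTuple


class Record(NamedTuple):
    """Hypothetical record shape: a value addressed by partition and offset."""
    partition: int
    offset: int
    value: str


def reconcile(cold: Iterable[Record], live: Iterable[Record]) -> list[Record]:
    """Merge two independently written tiers into one ordered, duplicate-free view.

    Records that appear in both tiers are matched on (partition, offset);
    the cold-tier copy is kept, which is an arbitrary choice for this sketch.
    """
    merged: dict[tuple[int, int], Record] = {}
    for rec in live:
        merged[(rec.partition, rec.offset)] = rec
    for rec in cold:  # cold overwrites live where the tiers overlap
        merged[(rec.partition, rec.offset)] = rec
    return [merged[key] for key in sorted(merged)]
```

A unified table abstraction moves this logic out of every consumer and into the substrate, which is why the pre-condition matters.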

Canonical wiki instance¶

systems/redpanda-sql (2026-05-27 GA): Postgres-wire MPP query engine (Oxla) deployed inside the Redpanda BYOC cluster, querying live topics and the Iceberg cold tier in place via the Iceberg Topics substrate.

Seen in¶

2026-05-27 — sources/2026-05-27-redpanda-redpanda-sql-is-ga-the-query-engine-that-skips-the-pipeline — first wiki canonicalisation; Redpanda SQL GA materialises the property over the Iceberg-Topics substrate.
2025-10-28 — sources/2025-10-28-redpanda-introducing-the-agentic-data-plane — pre-disclosure of the architectural shape via Oxla acquisition (rpk oxla runs in-cluster).

Caveats¶

Shared-infrastructure noisy-neighbor risk. Analytical workloads can compete with streaming workloads for CPU / memory / I/O on the same nodes. The 2026-05-27 Redpanda post doesn't disclose the resource-isolation model.
Analytical autoscaling bounded by cluster shape. External warehouse engines can scale analytical compute independently of streaming throughput. In-cluster SQL is bounded by the BYOC cluster's compute envelope unless the operator scales the cluster.
Vendor lock-in axis. The convenience of one-vendor / one-cluster trades against the flexibility of separately-chosen best-of-breed components for streaming, storage, and analytics.