Skip to content

CONCEPT Cited by 1 source

Ad-hoc vs predefined streaming SQL

A structural axis in streaming-SQL substrate design: whether the consumer must declare queries up front so the engine can materialise their results (predefined), or can issue queries ad-hoc at the moment a question arises (ad-hoc). Both classes have legitimate workloads; the choice has structural consequences for which workloads the substrate can serve. First wiki canonicalisation 2026-05-27.

The two classes

Predefined streaming SQL

The consumer declares queries (typically as CREATE STREAM / CREATE TABLE / CREATE MATERIALIZED VIEW statements) and the engine maintains the results continuously as new records arrive. At read time, the consumer queries the maintained state — often looking up by key in a state store.

Wiki-canonical instances at the streaming altitude:

  • ksqlDB (Confluent / Apache Kafka ecosystem) — CREATE STREAM / CREATE TABLE declarations create persistent stream-processing topologies.
  • Flink SQL — Flink jobs declared in SQL syntax that compile to Flink streaming-execution graphs.
  • Materialize (the database) — incremental view maintenance over streaming inputs.
  • Metric view materialisation (Databricks Metric Views) — the warehouse- side analogue.

Ad-hoc streaming SQL

The consumer issues SQL queries at the moment a question arises; the engine plans and executes the query against the current state of the data. There is no pre-declaration; the substrate must serve arbitrary queries at low latency without prior knowledge of what they will be.

Wiki-canonical instances:

  • systems/redpanda-sql (2026-05-27 GA) — the canonical wiki ad-hoc streaming SQL substrate: connect a Postgres client, write a query, get a result, against live + Iceberg-history data.
  • (Adjacent classes on transactional / analytical substrates): Trino, Presto, BigQuery, Snowflake — all ad-hoc, but on warehouse- ingestion-pipeline substrates rather than directly on the streaming substrate.

Verbatim disclosure (Redpanda 2026-05-27)

"There are no materialized views to predefine, nor a proprietary storage tier that shields data from other tools. No streaming pipelines to build before the data arrives. ksqlDB is a handy tool, but it requires you to decide what questions you're going to ask before the events arrive, which requires a level of foresight that most data quality problems, incident postmortems, or agent-driven analytics work suggest we do not actually have."

"Redpanda SQL is fully ad hoc: connect a client, write a query, get results."

(Source: sources/2026-05-27-redpanda-redpanda-sql-is-ga-the-query-engine-that-skips-the-pipeline)

When each class fits

Workload property Predefined fits Ad-hoc fits
Query distribution Small set of repeated queries with high QPS Long tail of distinct queries with unpredictable arrival rate
Latency budget Microseconds (key lookup against pre-computed result) Milliseconds-to-seconds (compute query at request time)
Question known up-front Yes (canonical operational dashboards, pre-defined fraud rules) No (incident postmortems, agent-driven exploration)
Resource profile Continuous compute on the stream-processing topology Burst compute at query time
Schema-evolution sensitivity High (each schema change requires query / topology republish) Low (queries written against current schema)
Cost model Continuous (always-on stream processor) Per-query (consumption-based)

The Redpanda 2026-05-27 launch identifies three explicit workloads where predefined fails structurally: incident postmortems (you don't know in advance which incident will happen, so you can't predefine the question), data quality problems (the bug in producer code, by definition, wasn't anticipated), and agent-driven analytics (the agent generates queries during its reasoning loop). For these, the foresight tax of declaring queries up-front is structurally infinite — you can't predefine the question.

The foresight tax

The article's central argument against predefined-only substrates:

"…it requires you to decide what questions you're going to ask before the events arrive, which requires a level of foresight that most data quality problems, incident postmortems, or agent-driven analytics work suggest we do not actually have."

The foresight tax is the cost of querying questions you didn't predefine: in a predefined-only substrate, the cost is "can't answer". In a partially-predefined substrate (predefined for common queries + dump-to-warehouse for the rest), the cost is the ETL-pipeline lag plus the warehouse compute — what the launch post calls "Option B": dump topics to JSON, write a Python script, spin up Spark, do the work in 40 minutes that "a SQL query should do in four seconds."

Both classes co-exist legitimately

The framing isn't predefined vs ad-hoc as good vs bad — it's that both classes are legitimate, and the substrate choice depends on workload distribution. The 2026-05-27 launch acknowledges ksqlDB as "a handy tool"; the argument is structural fit, not quality.

A complete data platform plausibly serves both:

  • Predefined-streaming-SQL for the small set of repeated, low- latency, dashboard-driven queries (operational fraud rules, key lookups, real-time metric aggregation).
  • Ad-hoc streaming SQL for the long tail of incident investigation, data-quality debugging, hypothesis exploration, and agent-driven analytics.

The Redpanda Data Platform's GA position is that it ships ad-hoc (via Redpanda SQL) and lets the user combine with external predefined-streaming-SQL substrates as needed, rather than shipping both in-cluster.

Why agents tilt the workload mix toward ad-hoc

The rise of agent-driven query fan-out structurally tilts the streaming-SQL workload distribution toward ad-hoc: agents generate questions during reasoning, can't predefine. The launch post's positioning is that this tilt is large enough to motivate a dedicated ad-hoc substrate inside the streaming cluster, not handle agent fan-out via warehouse-side ingestion (whose lag breaks freshness) or predefined-streaming-SQL (whose foresight requirement breaks the agent's exploration loop).

Canonical wiki instance

  • systems/redpanda-sql (2026-05-27 GA) — first wiki canonicalisation of an explicitly-positioned-ad-hoc streaming- SQL substrate, with ksqlDB named as the predefined competitive foil.

Seen in

Caveats

  • Vendor-positioning framing. The dichotomy is sharper in marketing positioning than in production: real systems often combine predefined materialised views with ad-hoc query surfaces. Materialize, in particular, advertises both shapes.
  • Predefined ≠ stuck. Predefined queries can be republished; ksqlDB / Flink SQL queries can be added or modified. The "foresight tax" is real but not absolute.
  • Ad-hoc has its own costs. Each ad-hoc query pays compute at query time; for very-high-QPS dashboards over hot keys, the predefined-materialised-view path is cheaper.
  • Hybrid substrates exist. The Lambda-architecture two-engine shape (batch + stream) is one historical answer; modern HTAP databases collapse both into one engine. The ad-hoc-vs-predefined axis is orthogonal to where the engine lives in the storage / streaming hierarchy.
Last updated · 542 distilled / 1,571 read