
CONCEPT

Query size derived from the logical plan

Query size derived from the logical plan is the technique of estimating the resource cost of a query from its optimiser-internal representation, before execution, so that the estimate can feed routing, scheduling, and capacity decisions at the gateway layer rather than inside the execution engine.

Canonical instance (Source: sources/2026-05-06-databricks-rethinking-distributed-systems-for-serverless-performance):

The Databricks Serverless Gateway uses "estimated query size (derived from the logical plan)" as one of three routing signals, alongside current cluster utilisation and latency profile.

Why this is load-bearing

Gateway-level workload-aware routing needs a cost signal before the query runs. Three common alternatives fall short:

  • Post-hoc runtime measurement — only available after the query has already been placed on a cluster
  • Static heuristics on raw SQL text — fooled in both directions: SELECT * FROM big_table WHERE id = 1 reads as heavy (big table, SELECT *) but is a cheap single-row lookup, while SELECT AVG(col) FROM small_table is textually tiny yet forces a full scan of its table
  • User-supplied hints — unreliable, and requiring user tuning contradicts the stability-as-system-property design thesis

The logical plan sits between raw SQL and the physical plan:

  • It's produced by the parser + analyser stages, upstream of physical planning
  • It carries table/column metadata, join structure, predicates
  • It's stable enough for cost estimation but not yet bound to execution-layer details

This is precisely the abstraction level where routing decisions should be made — the optimiser has enough information to estimate work, but the physical plan + cluster binding haven't been committed yet.
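As a toy illustration of why this level suffices (plain dataclasses standing in for Spark's actual plan classes, which these are not), a logical plan is a tree that already names tables, predicates, and join structure, and can be walked without executing anything:

```python
from dataclasses import dataclass

# Toy logical-plan nodes -- illustrative only, not Spark's internals.
@dataclass
class Scan:
    table: str            # table name, resolved against catalog metadata

@dataclass
class Filter:
    predicate: str        # e.g. "id = 1"
    child: object

@dataclass
class Join:
    kind: str             # "equi", "cross", "range"
    left: object
    right: object

def tables_referenced(node) -> set[str]:
    """Collect every table the plan touches -- information a gateway
    can read from the tree before any execution-layer binding."""
    if isinstance(node, Scan):
        return {node.table}
    if isinstance(node, Filter):
        return tables_referenced(node.child)
    if isinstance(node, Join):
        return tables_referenced(node.left) | tables_referenced(node.right)
    return set()

plan = Join("equi", Filter("id = 1", Scan("orders")), Scan("customers"))
```

The same walk could extract predicates or join kinds; the point is that none of this is visible in an opaque driver process.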

Why Spark Connect makes this practical

Pre–Spark Connect, user application code ran inside the driver process, so the logical plan didn't exist at the gateway boundary — queries appeared to the infrastructure as opaque application processes. Spark Connect changes this: the gRPC protocol carries serialised logical plans from client to driver, which means the gateway intercepting the gRPC traffic can inspect the plan without executing anything. See systems/spark-connect for the architectural substrate.

Generalisable lesson: driver-client architectural splits that send query plans over the wire naturally enable gateway-level query-aware routing.
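A minimal sketch of that split, with JSON standing in for Spark Connect's protobuf-over-gRPC encoding (the function names and message shape are hypothetical):

```python
import json

def client_submit(plan: dict) -> bytes:
    """Client side: serialise the logical plan instead of running it in a
    local driver. (Spark Connect actually uses protobuf over gRPC; JSON
    stands in for it here.)"""
    return json.dumps(plan).encode()

def gateway_inspect(wire_bytes: bytes) -> dict:
    """Gateway side: decode and inspect the plan without executing it --
    the routing signal exists before any cluster is chosen."""
    plan = json.loads(wire_bytes)
    return {"op": plan["op"], "tables": plan.get("tables", [])}

msg = client_submit({"op": "Aggregate", "tables": ["events"], "group_by": ["user_id"]})
info = gateway_inspect(msg)
```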

What "size" estimation typically includes

A logical-plan-derived cost signal in a Spark-family system typically combines:

  1. Estimated input cardinality from table statistics (see concepts/cardinality-estimation)
  2. Selectivity estimates for predicates
  3. Expansion factors for joins (cross joins vs equi-joins vs range joins)
  4. Aggregation output cardinality (small for COUNT, large for GROUP BY on high-cardinality columns)
  5. Number of shuffle boundaries (proxy for stage count)

Not every logical-plan cost model uses all five — the Databricks post doesn't disclose which subset feeds the gateway's size estimate.

Known limitations

Logical-plan-derived cost is an estimate, not ground truth:

  • Cardinality estimation errors — skewed data, missing stats, misjudged selectivity on LIKE predicates (see concepts/like-predicate-cardinality-estimation-failure)
  • UDF opacity — user-defined functions are black boxes; the optimiser can't see through them
  • Dynamic parameters — prepared-statement bind values aren't known at plan time
  • Join-order sub-optimality — the initial logical plan may not be the optimal plan (see concepts/join-order-optimization)

For the routing use case, approximate correctness is sufficient — the Gateway continuously re-evaluates placement as conditions shift, so an initial mis-sized estimate is self-correcting.
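A sketch of a three-signal placement decision in that spirit (the scoring function and weights are assumptions for illustration, not the Gateway's actual policy):

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    name: str
    utilisation: float   # 0..1, current load
    latency_ms: float    # recent latency profile

def route(est_query_size: float, clusters: list[Cluster]) -> str:
    """Pick a cluster by combining estimated query size with live signals.
    Because utilisation and latency are re-read on every call, the same
    function re-evaluated later corrects an initially mis-sized estimate."""
    def score(c: Cluster) -> float:
        headroom = 1.0 - c.utilisation
        # big queries need headroom; small ones chase low latency
        return est_query_size / max(headroom, 0.01) + c.latency_ms
    return min(clusters, key=score).name
```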

