CONCEPT Cited by 1 source
Query size derived from the logical plan¶
Query size derived from the logical plan is the primitive of estimating the resource cost of a query from its optimiser-internal representation, before execution, so that the estimate can feed routing / scheduling / capacity decisions at the gateway layer rather than inside the execution engine.
Canonical instance (Source: sources/2026-05-06-databricks-rethinking-distributed-systems-for-serverless-performance):
The Databricks Serverless Gateway uses "estimated query size (derived from the logical plan)" as one of three routing signals, alongside current cluster utilisation and latency profile.
Why this is load-bearing¶
Gateway-level workload-aware routing needs a cost signal before the query runs. Three common alternatives fall short:
- Post-hoc runtime measurement — only available after the query has already been placed on a cluster
- Static heuristics on raw SQL text — fooled by
SELECT * FROM big_table WHERE id = 1(text-short but execution-short), orSELECT AVG(col) FROM small_table(text-short but execution-heavy due to full scan) - User-supplied hints — unreliable and requires user tuning, contradicting the stability-as-system-property design thesis
The logical plan sits between raw SQL and the physical plan:
- It's produced by the optimiser's parser + analyser
- It carries table/column metadata, join structure, predicates
- It's stable enough for cost estimation but not yet bound to execution-layer details
This is precisely the abstraction level where routing decisions should be made — the optimiser has enough information to estimate work, but the physical plan + cluster binding haven't been committed yet.
Why Spark Connect makes this practical¶
Pre–Spark Connect, user application code ran inside the driver process, so the logical plan didn't exist at the gateway boundary — queries appeared to the infrastructure as opaque application processes. Spark Connect changes this: the gRPC protocol carries serialised logical plans from client to driver, which means the gateway intercepting the gRPC traffic can inspect the plan without executing anything. See systems/spark-connect for the architectural substrate.
Generalisable lesson: driver-client architectural splits that send query plans over the wire naturally enable gateway-level query-aware routing.
What "size" estimation typically includes¶
A logical-plan-derived cost signal in a Spark-family system typically combines:
- Estimated input cardinality from table statistics (see concepts/cardinality-estimation)
- Selectivity estimates for predicates
- Expansion factors for joins (cross joins vs equi-joins vs range joins)
- Aggregation output cardinality (small for
COUNT, large forGROUP BYon high-cardinality columns) - Number of shuffle boundaries (proxy for stage count)
Not every logical-plan cost model uses all five — the Databricks post doesn't disclose which subset feeds the gateway's size estimate.
Known limitations¶
Logical-plan-derived cost is an estimate, not ground truth:
- Cardinality estimation errors — skewed data, missing stats,
selectivity-misjudgement on
LIKEpredicates (see concepts/like-predicate-cardinality-estimation-failure) - UDF opacity — user-defined functions are black boxes; the optimiser can't see through them
- Dynamic parameters — prepared-statement bind values aren't known at plan time
- Join-order sub-optimality — the initial logical plan may not be the optimal plan (see concepts/join-order-optimization)
For the routing use case, approximate correctness is sufficient — the Gateway continuously re-evaluates placement as conditions shift, so an initial mis-sized estimate is self-correcting.
Sibling concepts¶
- concepts/cardinality-estimation — the underlying statistical primitive
- concepts/query-planner — the optimiser that produces the logical plan
- concepts/abstract-syntax-tree — the structural representation a logical plan is built on
- concepts/join-order-optimization — a downstream concern the optimiser handles after routing
Seen in¶
- sources/2026-05-06-databricks-rethinking-distributed-systems-for-serverless-performance — First canonical wiki instance of logical-plan-derived query size as a gateway-routing signal. The systems/databricks-serverless-gateway uses it as one of three routing signals. The pattern composes with patterns/multi-signal-workload-aware-gateway-routing and is architecturally enabled by Spark Connect's gRPC plan-carrying protocol.
Related¶
- systems/databricks-serverless-gateway — consumer of this signal
- systems/spark-connect — the substrate making it available
- systems/apache-spark — the engine whose logical-plan representation is used
- patterns/multi-signal-workload-aware-gateway-routing — the pattern this signal feeds into