
CONCEPT

Query size derived from the logical plan

Query size derived from the logical plan is the technique of estimating the resource cost of a query from its optimiser-internal representation, before execution, so that the estimate can feed routing, scheduling, and capacity decisions at the gateway layer rather than inside the execution engine.

Canonical instance (Source: sources/2026-05-06-databricks-rethinking-distributed-systems-for-serverless-performance):

The Databricks Serverless Gateway uses "estimated query size (derived from the logical plan)" as one of three routing signals, alongside current cluster utilisation and latency profile.

Why this is load-bearing

Gateway-level workload-aware routing needs a cost signal before the query runs. Three common alternatives fall short:

  • Post-hoc runtime measurement — only available after the query has already been placed on a cluster
  • Static heuristics on raw SQL text — fooled in both directions: SELECT * FROM big_table WHERE id = 1 reads as heavy (big table, SELECT *) but is a cheap single-row lookup, while SELECT AVG(col) FROM small_table is textually tiny yet forces a full scan of its table
  • User-supplied hints — unreliable, and requiring user tuning contradicts the stability-as-system-property design thesis

The logical plan sits between raw SQL and the physical plan:

  • It's produced by the parser + analyser stages, upstream of physical planning
  • It carries table/column metadata, join structure, predicates
  • It's stable enough for cost estimation but not yet bound to execution-layer details

This is precisely the abstraction level where routing decisions should be made — the optimiser has enough information to estimate work, but the physical plan + cluster binding haven't been committed yet.
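As a toy illustration of why this level suffices (plain dataclasses standing in for Spark's actual plan classes, which these are not), a logical plan is a tree that already names tables, predicates, and join structure, and can be walked without executing anything:

```python
from dataclasses import dataclass

# Toy logical-plan nodes -- illustrative only, not Spark's internals.
@dataclass
class Scan:
    table: str            # table name, resolved against catalog metadata

@dataclass
class Filter:
    predicate: str        # e.g. "id = 1"
    child: object

@dataclass
class Join:
    kind: str             # "equi", "cross", "range"
    left: object
    right: object

def tables_referenced(node) -> set[str]:
    """Collect every table the plan touches -- information a gateway
    can read from the tree before any execution-layer binding."""
    if isinstance(node, Scan):
        return {node.table}
    if isinstance(node, Filter):
        return tables_referenced(node.child)
    if isinstance(node, Join):
        return tables_referenced(node.left) | tables_referenced(node.right)
    return set()

plan = Join("equi", Filter("id = 1", Scan("orders")), Scan("customers"))
```

The same walk could extract predicates or join kinds; the point is that none of this is visible in an opaque driver process.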

Why Spark Connect makes this practical

Pre–Spark Connect, user application code ran inside the driver process, so the logical plan didn't exist at the gateway boundary — queries appeared to the infrastructure as opaque application processes. Spark Connect changes this: the gRPC protocol carries serialised logical plans from client to driver, which means the gateway intercepting the gRPC traffic can inspect the plan without executing anything. See systems/spark-connect for the architectural substrate.

Generalisable lesson: driver-client architectural splits that send query plans over the wire naturally enable gateway-level query-aware routing.
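A minimal sketch of that split, with JSON standing in for Spark Connect's protobuf-over-gRPC encoding (the function names and message shape are hypothetical):

```python
import json

def client_submit(plan: dict) -> bytes:
    """Client side: serialise the logical plan instead of running it in a
    local driver. (Spark Connect actually uses protobuf over gRPC; JSON
    stands in for it here.)"""
    return json.dumps(plan).encode()

def gateway_inspect(wire_bytes: bytes) -> dict:
    """Gateway side: decode and inspect the plan without executing it --
    the routing signal exists before any cluster is chosen."""
    plan = json.loads(wire_bytes)
    return {"op": plan["op"], "tables": plan.get("tables", [])}

msg = client_submit({"op": "Aggregate", "tables": ["events"], "group_by": ["user_id"]})
info = gateway_inspect(msg)
```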

What "size" estimation typically includes

A logical-plan-derived cost signal in a Spark-family system typically combines:

  1. Estimated input cardinality from table statistics (see concepts/cardinality-estimation)
  2. Selectivity estimates for predicates
  3. Expansion factors for joins (cross joins vs equi-joins vs range joins)
  4. Aggregation output cardinality (small for COUNT, large for GROUP BY on high-cardinality columns)
  5. Number of shuffle boundaries (proxy for stage count)

Not every logical-plan cost model uses all five — the Databricks post doesn't disclose which subset feeds the gateway's size estimate.

Known limitations

Logical-plan-derived cost is an estimate, not ground truth:

  • Cardinality estimation errors — skewed data, missing stats, misjudged selectivity on LIKE predicates (see concepts/like-predicate-cardinality-estimation-failure)
  • UDF opacity — user-defined functions are black boxes; the optimiser can't see through them
  • Dynamic parameters — prepared-statement bind values aren't known at plan time
  • Join-order sub-optimality — the initial logical plan may not be the optimal plan (see concepts/join-order-optimization)

For the routing use case, approximate correctness is sufficient — the Gateway continuously re-evaluates placement as conditions shift, so an initial mis-sized estimate is self-correcting.
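A sketch of a three-signal placement decision in that spirit (the scoring function and weights are assumptions for illustration, not the Gateway's actual policy):

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    name: str
    utilisation: float   # 0..1, current load
    latency_ms: float    # recent latency profile

def route(est_query_size: float, clusters: list[Cluster]) -> str:
    """Pick a cluster by combining estimated query size with live signals.
    Because utilisation and latency are re-read on every call, the same
    function re-evaluated later corrects an initially mis-sized estimate."""
    def score(c: Cluster) -> float:
        headroom = 1.0 - c.utilisation
        # big queries need headroom; small ones chase low latency
        return est_query_size / max(headroom, 0.01) + c.latency_ms
    return min(clusters, key=score).name
```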

