CONCEPT Cited by 1 source
Optimizer statistics as skipping substrate¶
Definition¶
The architectural framing that table statistics are not just plan-quality input for a cost-based optimiser — they are the substrate that makes data skipping possible. Without per-file min/max/null-count metadata, a query engine cannot decide which files to skip; it has to read everything and apply the predicate at scan time. With those statistics in place, the engine prunes files at planning time, reducing I/O and compute proportionally to predicate selectivity.
The 2026-05-27 Databricks BI Serving Pointers source distinguishes two statistics planes that both serve this skipping role:
"It collects both Delta data-skipping statistics and query optimizer statistics during Photon writes, and back-fills stats for existing tables."
- Delta data-skipping statistics — per-file min/max/null counts on configured columns, embedded in the Delta transaction-log entries. Drive file pruning.
- Query optimizer statistics — table / column / partition cardinalities and value distributions. Drive plan choice: join order, broadcast vs shuffle, filter push-down ordering.
Both are "the substrate that makes skipping possible" in slightly different senses — the first skips files, the second skips bad plans.
The two roles¶
Query: SELECT … WHERE date BETWEEN '2026-05-01' AND '2026-05-07'
AND customer_tier = 'enterprise'
AND region = 'us-east'
┌─────────────────────────────────────────────────────────┐
│ ROLE 1: file pruning (data-skipping stats) │
│ │
│ For each file in the table: │
│ if file.date.max < '2026-05-01' → skip │
│ if file.date.min > '2026-05-07' → skip │
│ if file.region.distinct ⊄ {'us-east'} → skip │
│ │
│ Result: file count read drops from 10,000 → 50 │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ ROLE 2: plan choice (query-optimizer stats) │
│ │
│ How many distinct customers per region? → broadcast │
│ the customer-tier dimension (small enough) │
│ How many fact rows per customer? → use hash join │
│ (customer is the high-cardinality side) │
│ What's the post-filter cardinality? → don't bother │
│ with bloom filter (selectivity is high) │
│ │
│ Result: plan reads 50 files via broadcast-hash join │
│ instead of 50 files via shuffle-merge join │
└─────────────────────────────────────────────────────────┘
Why this matters specifically for BI¶
BI workloads are filter-heavy and repetitive. The same filter predicate (last 7 days, this region, this product category) runs thousands of times. Two consequences:
- File pruning compounds: a predicate that prunes 99% of files saves 99% of I/O on every invocation. For a daily dashboard widget, this is 99% I/O savings every day.
- Plan-choice quality compounds: a star-schema join hit by thousands of different filter predicates uses the same join shape. Getting the join order right once helps every query that hits that shape.
The Databricks source: "For BI workloads with repetitive filter patterns, the impact is especially significant — better statistics mean better data skipping and more efficient query plans." This is the architectural mechanism.
Inline collection on the write path¶
The Databricks source's specific operational claim:
"It collects both [stats classes] during Photon writes, and back-fills stats for existing tables."
Three implications:
- Stats are computed on the write path, not in a separate ANALYZE pass. Freshly-written data has fresh stats.
- The compute cost is amortised against the write itself, not as a separate scheduled compute event.
- Existing tables get back-filled so the benefit isn't gated on table re-creation.
The architectural shape: the substrate that knows about the data writes is also the substrate that maintains the stats. No separate pipeline; no drift between write activity and stats freshness.
When stats become the bottleneck¶
The skipping-substrate role surfaces a load-bearing failure mode: when stats are stale, skipping doesn't work, and BI queries revert to full-table scans. Symptoms:
- Filter predicates that should be selective are slow.
- Query plans don't show file pruning in the EXPLAIN output.
- Cost spikes proportional to data growth, not query volume.
This is why Predictive Optimization's automatic stats collection is "one of the highest-return, lowest-effort optimizations you can make": stale stats silently degrade BI performance, and keeping them fresh manually is operationally awkward.
Sibling concepts on the wiki¶
| Sibling | Domain | What gets skipped |
|---|---|---|
| concepts/partition-pruning | Partitioned tables | Whole partition directories |
| Bloom filters | Lookup queries | Whole files / blocks |
| concepts/clickhouse-data-part | ClickHouse | Whole data parts |
Skip indexes (ClickHouse MinMaxIdx) |
Per-block min/max | Block reads within a part |
| Index range pruning | B-tree / B+-tree | Tree branches |
The shared principle across all of these: metadata that summarises content lets the engine prune content reads. Statistics are the most general such metadata.
Relation to query optimization¶
The skipping-substrate framing reframes optimiser statistics: they are not just an input to the query optimiser; they are the substrate the optimiser needs to skip work. A "smart optimiser without stats" is a contradiction — without stats, it's guessing, not optimising.
This is why automatic stats collection is a precondition for substrate-owned optimisation: only when stats are reliably fresh can the substrate make optimisation decisions on the user's behalf. Without it, the substrate is forced into conservative choices that the user has to override per-query (hint syntax, manual rewrites).
Failure modes¶
- Stats backfill cost on legacy tables. When automatic stats is enabled on an existing fleet, the back-fill itself is a one-time large compute cost. Mitigation: rate-paced back-fill.
- Wrong-column statistics. If the substrate maintains stats on columns that aren't actually used as filter predicates, the cost is wasted. Mitigation: workload-driven stats column selection (the predictive part of Predictive Optimization).
- High-cardinality column stats are big. Statistics on
high-cardinality columns (
user_idover a 1B-user table) can be larger than the data they describe, defeating the skipping purpose. Mitigation: choose stat columns by predicate prevalence + cardinality skew.
Seen in¶
- sources/2026-05-27-databricks-bi-serving-pointers-maximizing-for-performance-and-tco — first wiki canonicalisation of the optimizer-stats-as-skipping-substrate framing. Distinguishes data-skipping vs query-optimizer statistics, names the inline- during-Photon-writes collection plus existing-table back-fill, and ties the BI-workload value to filter-heavy + repetitive query patterns. Reserved for future ingests: the predictive scheduler's column-selection algorithm, stats-storage cost geometry, the relationship to Z-ORDER / liquid clustering layout decisions, and the back-fill pacing.
Related¶
- systems/databricks-predictive-optimization — canonical instance of automatic stats collection.
- systems/delta-lake — the table format whose data-skipping stats live in the transaction log.
- systems/photon — the write-path engine that emits stats inline.
- concepts/automatic-table-optimization — the broader substrate-owned-maintenance shape.
- concepts/partition-pruning — sibling skipping mechanism on partitioned tables.