Databricks Predictive Optimization¶
Predictive Optimization is a Databricks-managed substrate, on by
default for Unity Catalog managed tables, that automatically
runs OPTIMIZE, VACUUM, and statistics collection on tables
that would benefit, so "you don't need to schedule these jobs
yourself". It collects both Delta data-skipping statistics
and query optimizer statistics during Photon
writes, and back-fills stats for existing tables. The disclosed
performance envelope: "In observed workloads, this delivered an
average 22% performance improvement".
For BI workloads with repetitive filter patterns, "the impact is
especially significant — better statistics mean better data
skipping and more efficient query plans". (Source:
sources/2026-05-27-databricks-bi-serving-pointers-maximizing-for-performance-and-tco)
What it does¶
Three classes of automatic maintenance, all on a workload-aware schedule:
| Operation | What it does | Why it matters for BI |
|---|---|---|
| `OPTIMIZE` | Compacts small files into larger ones; rewrites file layout to align with clustering keys. | Reduces metadata pressure; improves data skipping by clustering co-located rows. |
| `VACUUM` | Removes unreferenced files past the retention period. | Reduces storage cost and metadata-listing overhead. |
| Statistics collection | Computes and refreshes both Delta data-skipping statistics (per-file min/max/null counts on selected columns) and query-optimizer statistics (cardinalities, distributions). | Enables data skipping (read fewer files) and better join orders (cost-based optimisation with accurate stats). |
Verbatim from the source: "Predictive Optimization automatically runs OPTIMIZE, VACUUM, and statistics collection on tables that would benefit from these operations — so you don't need to schedule these jobs yourself."
The "that would benefit from these operations" clause is the predictive part: the system observes workload patterns and schedules maintenance only where the cost is justified by the expected query speedup.
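The benefit-gated idea can be illustrated with a toy cost model. This is only a sketch under loud assumptions: the source does not disclose the scheduler's decision criteria, so the `TableSignal` fields, the `should_maintain` rule, and every number below are hypothetical and do not describe Databricks' implementation.

```python
from dataclasses import dataclass


@dataclass
class TableSignal:
    """Hypothetical per-table signals; not a Databricks data structure."""
    name: str
    queries_per_day: int            # observed read volume
    est_saving_per_query_s: float   # assumed seconds saved per query after maintenance
    est_maintenance_cost_s: float   # assumed compute-seconds the maintenance run costs


def should_maintain(t: TableSignal) -> bool:
    """Return True when one day's expected saving exceeds the maintenance cost."""
    expected_saving_s = t.queries_per_day * t.est_saving_per_query_s
    return expected_saving_s > t.est_maintenance_cost_s


# Illustrative inputs only: a hot fact table and a rarely read staging table.
tables = [
    TableSignal("gold.sales_fact", 5000, 0.4, 900.0),
    TableSignal("staging.tmp_load", 2, 0.1, 120.0),
]
selected = [t.name for t in tables if should_maintain(t)]
print(selected)
```

In this toy model only the heavily queried table clears the bar, which mirrors the "tables that would benefit" wording without claiming anything about the real scoring function.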
The two statistics planes¶
The source explicitly names a distinction not previously canonicalised on the wiki:
"It collects both Delta data-skipping statistics and query optimizer statistics during Photon writes, and back-fills stats for existing tables."
Two distinct purposes:
- Data-skipping statistics — per-file min/max/null counts on configured columns, embedded in Delta transaction-log entries. Used during file-list construction to skip files whose min/max range cannot satisfy the query predicate. The query-time saving is I/O — fewer files read.
- Query-optimizer statistics — table / column / partition cardinalities and value distributions used by the cost-based optimiser to choose join order, broadcast vs shuffle, filter push-down ordering. The query-time saving is plan quality.
Both are maintained by Predictive Optimization without user intervention. See concepts/optimizer-statistics-as-skipping-substrate for the generalised principle.
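The two planes can be sketched side by side. The block below is a simplified illustration, not the Delta or Photon implementation: `FileStats`, `files_to_read`, and `join_strategy` are hypothetical names, and the row-count broadcast threshold is an assumed simplification chosen only to show that the two kinds of statistics feed different decisions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FileStats:
    """Per-file statistics for one column (data-skipping plane)."""
    path: str
    min_val: Optional[int]   # None when every value in the column is null
    max_val: Optional[int]
    null_count: int


def files_to_read(files: List[FileStats], lo: int, hi: int) -> List[str]:
    """Keep files whose [min, max] range can overlap the predicate lo <= x <= hi."""
    keep = []
    for f in files:
        if f.min_val is None or f.max_val is None:
            continue  # all-null column: no row can satisfy a range predicate
        if f.max_val < lo or f.min_val > hi:
            continue  # range cannot overlap the predicate, so skip the file
        keep.append(f.path)
    return keep


def join_strategy(left_rows: int, right_rows: int,
                  broadcast_limit_rows: int = 1_000_000) -> str:
    """Optimizer plane: pick a join strategy from estimated cardinalities.

    The row-count limit is an assumption made for this sketch.
    """
    smaller = min(left_rows, right_rows)
    return "broadcast" if smaller <= broadcast_limit_rows else "shuffle"


files = [
    FileStats("part-0", 1, 100, 0),
    FileStats("part-1", 101, 200, 3),
    FileStats("part-2", 201, 300, 0),
    FileStats("part-3", None, None, 50),
]
print(files_to_read(files, 150, 220))        # files the scan still has to open
print(join_strategy(2_000_000_000, 50_000))  # small dimension side
```

The first function saves I/O by never opening files that cannot match; the second changes the plan. Stale or missing statistics degrade each decision independently, which is why the page treats them as separate planes.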
Inline collection during Photon writes¶
The source makes a specific operational claim: "It collects both [stats classes] during Photon writes". This is significant because:
- Stats are computed on the write path, not as a separate background job — so freshly-written data has fresh stats.
- The compute cost of stats collection is amortised against the write itself, not as a separate scheduled compute event.
- For tables with high write churn, stats stay current without a follow-up pass.
For tables that pre-existed Predictive Optimization, the substrate back-fills stats — "and back-fills stats for existing tables" — so the benefit is not gated on table re-creation.
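A minimal sketch of the difference between write-path collection and back-fill, assuming a toy in-memory "file" made of Python dicts. The function names and the stats dictionary layout are hypothetical; the source does not describe how Photon computes statistics internally.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Row = Dict[str, Optional[int]]


def write_file(rows: Iterable[Row], column: str) -> Tuple[List[Row], dict]:
    """Toy write path: the stats come out of the same pass that materialises the data."""
    data = list(rows)
    lo = hi = None
    nulls = 0
    for row in data:
        v = row.get(column)
        if v is None:
            nulls += 1
            continue
        lo = v if lo is None or v < lo else lo
        hi = v if hi is None or v > hi else hi
    stats = {"min": lo, "max": hi, "null_count": nulls, "num_records": len(data)}
    return data, stats


def backfill(existing_files: Dict[str, List[Row]], column: str) -> Dict[str, dict]:
    """Toy back-fill: a separate pass over files that were written without stats."""
    return {name: write_file(rows, column)[1] for name, rows in existing_files.items()}


data, stats = write_file([{"amount": 5}, {"amount": None}, {"amount": 2}], "amount")
print(stats)

legacy = {"old-0": [{"amount": 9}, {"amount": 7}]}
print(backfill(legacy, "amount"))
```

The write path pays for stats once, while the rows are already in memory. The back-fill has to re-read existing data, which is why its pacing across many tables is a separate open question.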
Composition with managed tables¶
Predictive Optimization is a defining property of Unity Catalog managed tables — "Unity Catalog managed tables are the foundation for everything else in this stack. Unity Catalog manages all read, write, storage, and optimization responsibilities for managed tables. This unlocks automatic features you don't get with external tables: Predictive Optimization (covered below) is enabled by default."
The architectural shape: the substrate (UC) takes ownership of optimisation, so the user doesn't have to. This generalises at patterns/managed-table-as-default-storage-layer — choose managed tables by default; reserve external tables for the cases where you genuinely need customer-owned storage paths.
Composition with liquid clustering: CLUSTER BY AUTO¶
The source discloses a Predictive-Optimization-driven feature on Liquid Clustering that was not previously canonicalised on the wiki:
"If you're not sure which columns to choose, `CLUSTER BY AUTO` lets Predictive Optimization select keys based on observed query patterns."
This is a workload-aware automated decision: instead of the architect committing to clustering keys at table-creation time (which historically required workload prediction), the substrate observes query patterns over time and selects clustering keys automatically. This is consistent with the broader thesis: the substrate owns optimisation, the user owns intent.
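To make "select keys based on observed query patterns" concrete, here is a toy heuristic that ranks columns by how often they appear in query filters. It is an assumption-laden illustration: the source does not say how keys are chosen, and `pick_clustering_keys`, the `max_keys` default, and the sample query log are all hypothetical.

```python
from collections import Counter
from typing import List


def pick_clustering_keys(query_filters: List[List[str]], max_keys: int = 2) -> List[str]:
    """Rank columns by the number of observed queries that filter on them."""
    counts = Counter(col for filters in query_filters for col in set(filters))
    return [col for col, _ in counts.most_common(max_keys)]


# Hypothetical log: each entry lists the columns one query filtered on.
observed = [
    ["order_date", "region"],
    ["order_date"],
    ["order_date", "customer_id"],
    ["region"],
    ["region", "order_date"],
]
print(pick_clustering_keys(observed))
```

A real selector would presumably also weigh cardinality, data volume, and the cost of re-clustering. The point of the sketch is only that the decision moves from design time to observed behaviour.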
Performance envelope¶
The disclosed number: "average 22% performance improvement" in observed workloads. Caveats from the source itself:
- "Observed workloads" — corpus / methodology not disclosed in this post (a linked separate post documents the figure).
- "Average" — distribution shape unknown; some workloads see more, some less.
- "For BI workloads with repetitive filter patterns, the impact is especially significant" — implying BI is upper-tail of the distribution, but with no specific BI-only number.
The qualitative claim: better statistics → better data skipping + better query plans → less data scanned and better join orders → faster queries and lower compute cost.
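Because the source does not define "performance improvement", the 22% figure admits at least two readings, and they give different runtimes. The arithmetic below is a worked illustration with a hypothetical 100-second baseline, not a measurement.

```python
def runtime_if_time_reduced(baseline_s: float, pct: float) -> float:
    """Reading 1: query time drops by pct percent."""
    return baseline_s * (1 - pct / 100)


def runtime_if_throughput_gain(baseline_s: float, pct: float) -> float:
    """Reading 2: throughput rises by pct percent, so time is divided by (1 + pct/100)."""
    return baseline_s / (1 + pct / 100)


baseline = 100.0  # hypothetical baseline runtime in seconds
a = runtime_if_time_reduced(baseline, 22)
b = runtime_if_throughput_gain(baseline, 22)
print(round(a, 2), round(b, 2))
```

Under the first reading the query takes 78 seconds; under the second, about 82. Anyone quoting the number in a TCO estimate should state which reading they assume.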
Why this matters for BI specifically¶
The source's argument: BI queries are repetitive and filter-heavy, so the leverage of fresh stats compounds in three ways:
- The same filter predicates run thousands of times — every data-skipping decision compounds across query volume.
- Star-schema joins have a small number of join shapes — the optimizer-statistics improvements compound across all queries that hit the same shape.
- New data lands continuously — without auto-stats-collection, stats drift and the optimiser falls back to default heuristics that are wrong for filter-heavy workloads.
"For BI workloads with repetitive filter patterns, the impact is especially significant — better statistics mean better data skipping and more efficient query plans."
The source's recommendation: "Enable Predictive Optimization at the catalog level and let it run. Using Predictive Optimization is one of the highest-return, lowest-effort optimizations you can make."
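Catalog-level enablement can be sketched as a small helper that builds the SQL statement. The `ALTER CATALOG ... ENABLE PREDICTIVE OPTIMIZATION` form follows Databricks' documented syntax as far as this page's editors are aware; the helper name, the identifier check, and the catalog name `analytics` are assumptions made for the example.

```python
import re

# Conservative identifier check (an assumption of this sketch, not a Databricks rule).
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def enable_predictive_optimization_sql(catalog: str) -> str:
    """Build the statement that turns Predictive Optimization on for one catalog."""
    if not _IDENT.match(catalog):
        raise ValueError(f"unexpected catalog name: {catalog!r}")
    return f"ALTER CATALOG {catalog} ENABLE PREDICTIVE OPTIMIZATION"


stmt = enable_predictive_optimization_sql("analytics")  # placeholder catalog name
print(stmt)
# In a Databricks notebook the string could then be run with spark.sql(stmt).
```

Setting the property at the catalog level means schemas and tables underneath can inherit it, which matches the source's "enable ... and let it run" advice.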
Position in the BI serving stack¶
Consumers AI/BI Dashboards / Genie / notebooks / third-party BI
Semantic Metric Views (define metric ONCE)
Materialization Pre-aggregated results
Physical Gold star schema on UC Managed Tables
+ Liquid Clustering (with CLUSTER BY AUTO option)
+ Predictive Optimization ◄── this page
(auto-OPTIMIZE / VACUUM / stats; default-on)
Predictive Optimization is the physical-layer compounding lever: every layer above (materialization, semantic, consumer) benefits from the data-skipping and plan-quality wins. The architectural claim from the source: "every query benefits — before you've touched the semantic layer."
Where it shows up on the wiki¶
| Source / system | Use of Predictive Optimization |
|---|---|
| systems/uc-managed-tables | Default-on managed-table property; "up to 20× faster queries and 50% lower storage costs" (cited via UC managed-table page from earlier disclosure). |
| systems/liquid-clustering | Refreshes the cluster-key layout automatically; gains the CLUSTER BY AUTO option (key selection from observed query patterns). |
| systems/uc-otel-trace-tables | Substrate property; auto-liquid-clustered + Predictive-Optimization-managed. |
| sources/2026-05-27-databricks-bi-serving-pointers-maximizing-for-performance-and-tco | First wiki canonicalisation as a dedicated system page; 22% average improvement disclosed. |
Promotion note¶
Before this page existed, "Predictive Optimization" appeared as a tag on six systems and concepts pages (catalog-managed-commits, external-engine-write-to-managed-table, delta-kernel, delta-lake, liquid-clustering, uc-managed-tables) but had no dedicated page. The 2026-05-27 BI Serving Pointers source quotes the verbatim mechanism (auto-OPTIMIZE / VACUUM / stats collection inline with Photon writes + back-fill for existing tables + 22% average gain), which justified promotion.
Seen in¶
- sources/2026-05-27-databricks-bi-serving-pointers-maximizing-for-performance-and-tco
— first wiki disclosure of Predictive Optimization as a
distinct named system (previously implicit). Names the
three operation classes (`OPTIMIZE` / `VACUUM` / stats collection), the two statistics planes (data-skipping vs query optimizer), the Photon-write-time inline collection, the existing-table back-fill, the `CLUSTER BY AUTO` Liquid Clustering integration, the 22% average performance number, and the catalog-level enablement contract. Reserved for future ingests: the predictive scheduler's decision criteria, per-table opt-out semantics, the back-fill pacing under high table count, the relationship to `OPTIMIZE ZORDER BY` (which the article presents as superseded by liquid clustering + Predictive Optimization), and the "observed workloads" corpus behind the 22% figure.
- sources/2026-06-01-databricks-debunking-8-data-layout-myths-why-liquid-clustering-outperfo
— Predictive Optimization's load-bearing role at PB scale:
the headline 7.7× speedup on Arctic Wolf's 3.8 PB security
telemetry table is attributed to "Liquid Clustering on Unity
Catalog managed tables with Predictive Optimization" — the
three substrate properties working together. Direct disclosures:
(a) OPTIMIZE planning improved 12h → 23m on 10 PB tables;
(b) OPTIMIZE execution 5× faster on Medium DBSQL clusters;
(c) automatic clustering maintenance is the property that makes
data freshness improve from "hours to minutes" after migration
to Liquid Clustering. The post canonicalises the previously-implicit
prescription against `OPTIMIZE ZORDER BY`: that legacy layout maintenance technique is replaced by Liquid Clustering with Predictive-Optimization-managed incremental clustering on write.
Related¶
- systems/uc-managed-tables — the substrate Predictive Optimization runs on by default.
- systems/unity-catalog — the catalog where enablement is configured.
- systems/delta-lake — the underlying table format whose data-skipping statistics live in the transaction log.
- systems/liquid-clustering — the layout primitive Predictive Optimization maintains; gains `CLUSTER BY AUTO` automatic key selection.
- systems/photon — the write-path engine that emits stats inline.
- systems/databricks-metric-views — the semantic-layer primitive whose materialization sits on tables Predictive Optimization manages.
- concepts/automatic-table-optimization — the generalised concept.
- concepts/optimizer-statistics-as-skipping-substrate — the generalised principle: stats are not just plan-quality input, they are the substrate that makes data skipping possible.
- patterns/managed-table-as-default-storage-layer — the deployment pattern that makes Predictive Optimization the default.