SYSTEM Cited by 2 sources

Databricks Predictive Optimization

Predictive Optimization is a Databricks-managed substrate, enabled by default for Unity Catalog (UC) managed tables, that automatically runs OPTIMIZE, VACUUM, and statistics collection on tables that would benefit, so that (in the source's words) "you don't need to schedule these jobs yourself". It collects both Delta data-skipping statistics and query optimizer statistics during Photon writes, and back-fills stats for existing tables. The disclosed performance envelope: "In observed workloads, this delivered an average 22% performance improvement". For BI workloads with repetitive filter patterns, "the impact is especially significant", because "better statistics mean better data skipping and more efficient query plans". (Source: sources/2026-05-27-databricks-bi-serving-pointers-maximizing-for-performance-and-tco)

What it does

Three classes of automatic maintenance, all on a workload-aware schedule:

Operation | What it does | Why it matters for BI
OPTIMIZE | Compacts small files into larger ones; rewrites file layout to align with clustering keys. | Reduces metadata pressure; improves data skipping by clustering co-located rows.
VACUUM | Removes unreferenced files past the retention period. | Reduces storage cost and metadata-listing overhead.
Statistics collection | Computes and refreshes both Delta data-skipping statistics (per-file min/max/null counts on selected columns) and query-optimizer statistics (cardinalities, distributions). | Enables data skipping (read fewer files) and better join orders (cost-based optimisation with accurate stats).

Verbatim from the source: "Predictive Optimization automatically runs OPTIMIZE, VACUUM, and statistics collection on tables that would benefit from these operations — so you don't need to schedule these jobs yourself."

The "that would benefit from these operations" clause is the predictive part: the system observes workload patterns and schedules maintenance only where the cost is justified by the expected query speedup.
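The source does not disclose how the scheduler decides which tables "would benefit". As a minimal sketch of the cost-versus-benefit idea only, the following uses an invented rule: run compaction when the estimated daily query time saved exceeds the estimated compaction time. The signals (`small_file_count`, `daily_query_count`) and every constant are hypothetical and are not Databricks' criteria.

```python
from dataclasses import dataclass


@dataclass
class TableSignal:
    """Hypothetical per-table signals; not the scheduler's real inputs."""
    name: str
    small_file_count: int   # files below some target size
    daily_query_count: int  # how often the table is read per day


def expected_benefit_s(sig: TableSignal, saved_s_per_query_per_1k_files: float = 0.5) -> float:
    """Assumed model: each query saves time in proportion to the small-file backlog."""
    return sig.daily_query_count * (sig.small_file_count / 1000) * saved_s_per_query_per_1k_files


def maintenance_cost_s(sig: TableSignal, fixed_s: float = 30.0, s_per_1k_files: float = 60.0) -> float:
    """Assumed model: a fixed start-up cost plus a per-file rewrite cost."""
    return fixed_s + (sig.small_file_count / 1000) * s_per_1k_files


def should_optimize(sig: TableSignal) -> bool:
    """Run maintenance only when the expected saving outweighs the cost."""
    return expected_benefit_s(sig) > maintenance_cost_s(sig)


hot = TableSignal("fact_sales", small_file_count=4000, daily_query_count=500)
cold = TableSignal("archive_2019", small_file_count=4000, daily_query_count=10)
tidy = TableSignal("dim_region", small_file_count=0, daily_query_count=500)

decisions = {t.name: should_optimize(t) for t in (hot, cold, tidy)}
```

Under these made-up numbers the heavily queried, fragmented table is selected, while the rarely read table and the already-compact table are left alone, which is the shape of behaviour the "would benefit" clause describes.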

The two statistics planes

The source explicitly names a distinction not previously canonicalised on the wiki:

"It collects both Delta data-skipping statistics and query optimizer statistics during Photon writes, and back-fills stats for existing tables."

Two distinct purposes:

  • Data-skipping statistics — per-file min/max/null counts on configured columns, embedded in Delta transaction-log entries. Used during file-list construction to skip files whose min/max range cannot satisfy the query predicate. The query-time saving is I/O — fewer files read.
  • Query-optimizer statistics — table / column / partition cardinalities and value distributions used by the cost-based optimiser to choose join order, broadcast vs shuffle, filter push-down ordering. The query-time saving is plan quality.

Both are maintained by Predictive Optimization without user intervention. See concepts/optimizer-statistics-as-skipping-substrate for the generalised principle.
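To make the data-skipping plane concrete, here is a minimal sketch of min/max file pruning for a range predicate on a single column. The `FileStats` record and its field names are illustrative assumptions, not the Delta transaction-log schema, and the logic is a simplification of what a real engine does.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FileStats:
    """Illustrative per-file statistics for one column (hypothetical layout)."""
    path: str
    min_value: Optional[int]  # None means no statistic was recorded
    max_value: Optional[int]
    null_count: int
    num_records: int


def may_contain_range(stats: FileStats, lo: int, hi: int) -> bool:
    """Return False only when the stats prove no row satisfies lo <= col <= hi."""
    if stats.num_records > 0 and stats.null_count == stats.num_records:
        return False  # every value is NULL, so a range predicate cannot match
    if stats.min_value is None or stats.max_value is None:
        return True   # missing stats: nothing can be proven, so the file must be read
    return stats.max_value >= lo and stats.min_value <= hi


def files_to_read(files: List[FileStats], lo: int, hi: int) -> List[str]:
    """Build the file list for the scan, skipping files ruled out by min/max."""
    return [f.path for f in files if may_contain_range(f, lo, hi)]


files = [
    FileStats("part-a", 1, 10, 0, 100),
    FileStats("part-b", 11, 20, 0, 100),
    FileStats("part-c", None, None, 0, 100),  # no stats collected
    FileStats("part-d", None, None, 50, 50),  # all NULL
]
selected = files_to_read(files, lo=12, hi=15)
```

The file without statistics (`part-c`) still has to be read, which is the cost of missing stats that automatic collection and back-fill are meant to remove.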

Inline collection during Photon writes

The source makes a specific operational claim: "It collects both [stats classes] during Photon writes". This is significant because:

  • Stats are computed on the write path, not as a separate background job — so freshly-written data has fresh stats.
  • The compute cost of stats collection is amortised against the write itself, not as a separate scheduled compute event.
  • For tables with high write churn, stats stay current without a follow-up pass.

For tables that pre-existed Predictive Optimization, the substrate back-fills stats — "and back-fills stats for existing tables" — so the benefit is not gated on table re-creation.
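The write-path idea can be sketched as a single pass that both "writes" rows and accumulates per-column min/max/null counts, so no second scan is needed. This is an illustration of the amortisation argument only; `write_with_stats` is a hypothetical helper, and appending to a list stands in for a real file write. It says nothing about how Photon implements collection.

```python
from typing import Any, Dict, Iterable, List, Tuple


def write_with_stats(
    rows: Iterable[Dict[str, Any]], stats_columns: List[str]
) -> Tuple[List[Dict[str, Any]], Dict[str, Any]]:
    """Write rows and collect min/max/null-count stats in the same pass."""
    stats = {c: {"min": None, "max": None, "null_count": 0} for c in stats_columns}
    written: List[Dict[str, Any]] = []
    for row in rows:
        written.append(row)  # stand-in for writing the row to a data file
        for col in stats_columns:
            value = row.get(col)
            col_stats = stats[col]
            if value is None:
                col_stats["null_count"] += 1
                continue
            if col_stats["min"] is None or value < col_stats["min"]:
                col_stats["min"] = value
            if col_stats["max"] is None or value > col_stats["max"]:
                col_stats["max"] = value
    return written, {"num_records": len(written), "columns": stats}


rows = [
    {"day": 3, "region": "eu"},
    {"day": None, "region": "us"},
    {"day": 7, "region": "ap"},
]
written, file_stats = write_with_stats(rows, ["day", "region"])
```

A back-fill for a pre-existing table would correspond to running the same accumulation over files that were written before stats collection existed, at the price of one extra read of that data.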

Composition with managed tables

Predictive Optimization is a defining property of Unity Catalog managed tables: "Unity Catalog managed tables are the foundation for everything else in this stack. Unity Catalog manages all read, write, storage, and optimization responsibilities for managed tables. This unlocks automatic features you don't get with external tables: Predictive Optimization (covered below) is enabled by default."

The architectural shape: the substrate (UC) takes ownership of optimisation, so the user doesn't have to. This generalises at patterns/managed-table-as-default-storage-layer — choose managed tables by default; reserve external tables for the cases where you genuinely need customer-owned storage paths.

Composition with liquid clustering: CLUSTER BY AUTO

The source discloses a Predictive-Optimization-driven feature on Liquid Clustering that was not previously canonicalised on the wiki:

"If you're not sure which columns to choose, CLUSTER BY AUTO lets Predictive Optimization select keys based on observed query patterns."

This is a workload-aware automated decision: instead of the architect committing to clustering keys at table-creation time (which historically required workload prediction), the substrate observes query patterns over time and selects clustering keys automatically. This is consistent with the broader thesis: the substrate owns optimisation, the user owns intent.
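The source does not describe how keys are selected. As a rough sketch of what "select keys based on observed query patterns" could mean, the following ranks columns by how many observed queries filter on them. The frequency heuristic, the `suggest_clustering_keys` helper, and the sample query log are all assumptions made for illustration; the real selection logic is undisclosed and likely weighs more than filter frequency.

```python
from collections import Counter
from typing import Iterable, List


def suggest_clustering_keys(query_filters: Iterable[Iterable[str]], max_keys: int = 2) -> List[str]:
    """Rank columns by the number of queries that filter on them (hypothetical heuristic)."""
    counts: Counter = Counter()
    for columns in query_filters:
        counts.update(set(columns))  # count each column once per query
    # Sort by descending frequency, then by name so ties resolve deterministically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [column for column, _ in ranked[:max_keys]]


# Each entry lists the columns one observed query filtered on (invented sample).
observed = [
    ["order_date", "region"],
    ["order_date"],
    ["customer_id", "order_date"],
    ["region"],
]
keys = suggest_clustering_keys(observed)
```

The point of the sketch is the division of labour: the query log, not the architect's up-front guess, drives the choice, and the choice can change as the log changes.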

Performance envelope

The disclosed number: "average 22% performance improvement" in observed workloads. Caveats from the source itself:

  • "Observed workloads" — corpus / methodology not disclosed in this post (a linked separate post documents the figure).
  • "Average" — distribution shape unknown; some workloads see more, some less.
  • "For BI workloads with repetitive filter patterns, the impact is especially significant": this implies BI sits in the upper tail of the distribution, but the source gives no BI-only number.

The qualitative claim: better statistics → better data skipping + better query plans → less data scanned and better join orders → faster queries and lower compute cost.
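Two of these caveats can be made concrete with a little arithmetic. The source does not say whether "22% performance improvement" means 22% less runtime or 22% more throughput, and the two readings give different runtimes. The per-workload gains below are invented solely to show that very different distributions share the same average.

```python
from statistics import mean


def runtime_if_time_reduced(baseline_s: float, pct: float) -> float:
    """Reading 1: the query takes pct% less time."""
    return baseline_s * (1 - pct / 100)


def runtime_if_throughput_increased(baseline_s: float, pct: float) -> float:
    """Reading 2: the engine does pct% more work per unit time."""
    return baseline_s / (1 + pct / 100)


baseline = 100.0  # seconds, arbitrary
reading_1 = runtime_if_time_reduced(baseline, 22)
reading_2 = runtime_if_throughput_increased(baseline, 22)

# Invented per-workload gains (percent): both lists average to 22.
uniform_gains = [22, 22, 22]
skewed_gains = [5, 10, 51]
averages = (mean(uniform_gains), mean(skewed_gains))
```

For a 100-second query the first reading gives 78 seconds and the second roughly 82 seconds, so the headline number should be read as an order-of-magnitude signal until the methodology is known.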

Why this matters for BI specifically

The source's argument: BI queries are repetitive and filter-heavy, so the leverage of fresh stats compounds in three ways:

  1. The same filter predicates run thousands of times — every data-skipping decision compounds across query volume.
  2. Star-schema joins have a small number of join shapes — the optimizer-statistics improvements compound across all queries that hit the same shape.
  3. New data lands continuously — without auto-stats-collection, stats drift and the optimiser falls back to default heuristics that are wrong for filter-heavy workloads.

"For BI workloads with repetitive filter patterns, the impact is especially significant — better statistics mean better data skipping and more efficient query plans."

The source's recommendation: "Enable Predictive Optimization at the catalog level and let it run. Using Predictive Optimization is one of the highest-return, lowest-effort optimizations you can make."
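The source gives the recommendation but not the command. As a sketch, the helper below renders the `ALTER ... PREDICTIVE OPTIMIZATION` statement as it appears in Databricks SQL documentation to the best of this page's knowledge; verify the syntax against current docs before relying on it. The helper name, the catalog `bi_gold`, and the schema `bi_gold.scratch` are hypothetical.

```python
VALID_LEVELS = {"CATALOG", "SCHEMA"}
VALID_ACTIONS = {"ENABLE", "DISABLE", "INHERIT"}


def predictive_optimization_sql(level: str, name: str, action: str = "ENABLE") -> str:
    """Render a statement that sets Predictive Optimization at catalog or schema level."""
    level, action = level.upper(), action.upper()
    if level not in VALID_LEVELS:
        raise ValueError(f"unsupported level: {level}")
    if action not in VALID_ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    return f"ALTER {level} {name} {action} PREDICTIVE OPTIMIZATION"


statements = [
    # Enable once at the catalog level, as the source recommends.
    predictive_optimization_sql("catalog", "bi_gold"),
    # Hypothetical carve-out: opt a scratch schema out of the catalog default.
    predictive_optimization_sql("schema", "bi_gold.scratch", "disable"),
]
```

In a Databricks notebook, each string could then be passed to `spark.sql(...)`. Setting the property at the catalog and letting schemas inherit it keeps the "enable and let it run" posture while leaving room for explicit exceptions.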

Position in the BI serving stack

Consumers       AI/BI Dashboards / Genie / notebooks / third-party BI
Semantic        Metric Views (define metric ONCE)
Materialization Pre-aggregated results
Physical        Gold star schema on UC Managed Tables
                + Liquid Clustering (with CLUSTER BY AUTO option)
                + Predictive Optimization      ◄── this page
                  (auto-OPTIMIZE / VACUUM / stats; default-on)

Predictive Optimization is the physical-layer compounding lever: every layer above (materialization, semantic, consumer) benefits from the data-skipping and plan-quality wins. The architectural claim from the source: "every query benefits — before you've touched the semantic layer."

Where it shows up on the wiki

Source / system | Use of Predictive Optimization
systems/uc-managed-tables | Default-on managed-table property; "up to 20× faster queries and 50% lower storage costs" (cited via UC managed-table page from earlier disclosure).
systems/liquid-clustering | Refreshes the cluster-key layout automatically; gains the CLUSTER BY AUTO option (key selection from observed query patterns).
systems/uc-otel-trace-tables | Substrate property; auto-liquid-clustered + Predictive-Optimization-managed.
sources/2026-05-27-databricks-bi-serving-pointers-maximizing-for-performance-and-tco | First wiki canonicalisation as a dedicated system page; 22% average improvement disclosed.

Promotion note

Before this page existed, "Predictive Optimization" appeared as a tag on six systems and concepts pages (catalog-managed-commits, external-engine-write-to-managed-table, delta-kernel, delta-lake, liquid-clustering, uc-managed-tables) but had no dedicated page. The 2026-05-27 BI Serving Pointers source quotes the verbatim mechanism (auto-OPTIMIZE / VACUUM / stats collection inline with Photon writes + back-fill for existing tables + 22% average gain), which justified promotion.

Seen in

  • sources/2026-05-27-databricks-bi-serving-pointers-maximizing-for-performance-and-tco: first wiki disclosure of Predictive Optimization as a distinct named system (previously implicit). Names the three operation classes (OPTIMIZE / VACUUM / stats collection), the two statistics planes (data-skipping vs query optimizer), the Photon-write-time inline collection, the existing-table back-fill, the CLUSTER BY AUTO Liquid Clustering integration, the 22% average performance number, and the catalog-level enablement contract. Reserved for future ingests: the predictive scheduler's decision criteria, per-table opt-out semantics, the back-fill pacing under high table count, the relationship to OPTIMIZE ZORDER BY (which the article presents as superseded by liquid clustering + Predictive Optimization), and the "observed workloads" corpus behind the 22% figure.
  • sources/2026-06-01-databricks-debunking-8-data-layout-myths-why-liquid-clustering-outperfo: Predictive Optimization's load-bearing role at PB scale. The headline 7.7× speedup on Arctic Wolf's 3.8 PB security telemetry table is attributed to "Liquid Clustering on Unity Catalog managed tables with Predictive Optimization", that is, the three substrate properties working together. Direct disclosures: (a) OPTIMIZE planning improved 12h → 23m on 10 PB tables; (b) OPTIMIZE execution 5× faster on Medium DBSQL clusters; (c) automatic clustering maintenance is the property that makes data freshness improve from "hours to minutes" after migration to Liquid Clustering. The post canonicalises the previously implicit prescription against OPTIMIZE ZORDER BY: that legacy layout-maintenance technique is replaced by Liquid Clustering with Predictive-Optimization-managed incremental clustering on write.