CONCEPT Cited by 2 sources
Automatic table optimization¶
Definition¶
Automatic table optimization is the substrate property where
the data platform — not the user — owns scheduling and execution
of compaction (OPTIMIZE), retention cleanup (VACUUM), and
statistics collection on managed tables. The user expresses
intent (which tables, which clustering keys); the platform
handles execution (when to run, on which subset, with which
intensity), using workload patterns to predict where the
maintenance cost is justified by the expected query speedup.
The defining contrast: in the user-managed model, the user
schedules OPTIMIZE jobs in Airflow / dbt / cron, runs VACUUM
on a retention schedule, and reruns ANALYZE to refresh stats.
Each task is its own pipeline with its own monitoring and its own
failure modes. In the automatic model, the substrate runs
all three based on observed workload, back-fills stats for
existing tables, and removes the maintenance pipelines from the
user's responsibility surface.
The 2026-05-27 Databricks BI Serving Pointers source canonicalises this concept on the wiki via Predictive Optimization: "Predictive Optimization automatically runs OPTIMIZE, VACUUM, and statistics collection on tables that would benefit from these operations — so you don't need to schedule these jobs yourself."
What gets automated¶
Three categories of table-maintenance work, each with a distinct purpose:
| Operation | What it does | Without automation |
|---|---|---|
Compaction (OPTIMIZE) |
Merges small files into larger ones; rewrites file layout to align with clustering keys. | Scheduled job on a fixed cadence; small-file pressure builds between runs. |
Retention cleanup (VACUUM) |
Removes unreferenced files past the retention window. | Scheduled job; stale-file accumulation between runs increases storage cost and metadata-listing overhead. |
| Statistics collection | Computes / refreshes data-skipping statistics (per-file min/max/null) and query-optimizer statistics (cardinality, distribution). | ANALYZE runs scheduled or triggered manually; stats drift between runs and the optimiser falls back to default heuristics. |
The architectural value: each of these is a substrate-shape optimisation — the user has no domain knowledge that informs when to run them better than the substrate does. Automating them moves the work to where the information lives (the platform's view of write-path activity + query patterns).
The two statistics planes¶
The Databricks source distinguishes two types of statistics that both get collected automatically:
- Data-skipping statistics — per-file min/max/null counts on configured columns, stored in the Delta transaction-log entries. Used by the file-listing path to skip files whose stat range cannot satisfy the query predicate. Saves I/O.
- Query-optimizer statistics — table / column / partition cardinalities and value distributions used by the cost-based optimiser. Saves plan-quality (better join order, broadcast-vs-shuffle decisions, filter push-down ordering).
Generalised at concepts/optimizer-statistics-as-skipping-substrate. Both classes are computed inline during write-path execution (in Databricks' case, "during Photon writes"), so freshly-written data has fresh stats.
Why "predictive"¶
The substrate doesn't blanket-run maintenance on every table on a fixed schedule. It runs maintenance on tables that would benefit based on observed workload patterns:
- High-write-churn tables get more frequent compaction.
- Frequently-queried filter columns get statistics maintained more aggressively.
- Tables that have never been queried since
VACUUMlast ran don't need anotherVACUUM.
This is the "on tables that would benefit from these operations" clause in the source. The cost of maintenance is amortised against the expected query-time savings.
The same mechanism extends to layout decisions: with
CLUSTER BY AUTO (a Predictive-Optimization-driven option on
Liquid Clustering), the substrate
chooses the clustering keys based on observed query patterns
— turning clustering-key selection from an architect-decision-at-
table-creation into a workload-driven runtime adaptation.
The compounding shape on BI workloads¶
The Databricks source argues automatic optimisation has outsized value on BI workloads because BI is filter-heavy and repetitive: "For BI workloads with repetitive filter patterns, the impact is especially significant — better statistics mean better data skipping and more efficient query plans."
Three compounding mechanisms:
- The same filter predicates run thousands of times — every data-skipping decision compounds across query volume.
- Star-schema joins have a small number of join shapes — the optimiser-statistics improvements compound across all queries that hit the same shape.
- New data lands continuously — without automated stats collection, stats drift and the optimiser falls back to default heuristics that are particularly wrong for filter-heavy workloads.
The disclosed performance envelope: "average 22% performance improvement" in observed workloads.
Sibling: substrate-owned-optimization-as-managed-table-property¶
Automatic table optimization is one of several substrate-owned optimisation properties of managed tables:
- Auto-
OPTIMIZE/ auto-VACUUM/ auto-stats (this concept). - Automatic liquid clustering — the substrate maintains the file layout to honour clustering keys.
- Always-on metadata caching — "reducing cloud storage requests and speeding up query planning".
The architectural shape: the substrate takes ownership of everything except intent. The user owns which tables exist, which columns they hold, and which clustering keys matter. The substrate owns how to keep those tables performant. See patterns/managed-table-as-default-storage-layer for the deployment pattern.
Sibling shapes elsewhere on the wiki¶
| Domain | Sibling shape |
|---|---|
| Container filesystem | concepts/lazy-loading-container-filesystem — the substrate decides which layers to fetch, not the application. |
| Vector indexes | systems/spfresh / concepts/spfresh-index — the index maintains itself as data changes. |
| Stream processing | Spark AQE — the optimiser re-plans during execution using runtime statistics. |
| Predictive caching | concepts/preheating — the substrate warms caches before traffic arrives. |
The shared principle: information about what to optimise lives in the substrate's view of the workload, not in the user's hands. Push the decision to where the information is.
Failure modes¶
- Workload-blind maintenance. If the predictive scheduler misjudges a table's value (e.g. a low-traffic table that happens to back a critical dashboard), maintenance can lag. Mitigation: explicit per-table optimisation directives as an override surface.
- Stats backfill cost. "Back-fills stats for existing tables" — the back-fill itself can be expensive on a large existing fleet at first enablement. Mitigation: rate- pacing the back-fill (the source doesn't document the pacing).
- Hidden cost accounting. When maintenance runs as a
substrate property, the per-table cost of maintenance gets
embedded in the platform's cost rather than itemised in the
user's pipeline cost — making it harder to attribute and
optimise. Mitigation: telemetry exposure (the source mentions
system.billing.usagebut does not specifically address maintenance-cost attribution).
What this is not¶
- Not "no maintenance". The maintenance still happens; the user just doesn't schedule it. The substrate consumes compute to run it.
- Not user-disabled-able for free. Automatic optimisation is the default property of managed tables; turning it off means the user takes back the responsibility (and the failure modes that come with it).
- Not specific to Databricks. The architectural shape — substrate-owned compaction / vacuum / stats — appears across modern lakehouse engines. This page canonicalises the shape using the Predictive-Optimization disclosure as the wiki's reference instance.
Seen in¶
- sources/2026-05-27-databricks-bi-serving-pointers-maximizing-for-performance-and-tco
— first wiki canonicalisation as a concept via
Predictive Optimization on UC managed tables. Names the
three operation classes, the inline-during-Photon-writes
collection, the existing-table back-fill, the
CLUSTER BY AUTOextension to Liquid Clustering, and the 22% average improvement figure. Reserved for future ingests: the predictive scheduler's decision criteria, per-table opt-out semantics, the back-fill pacing under high table count, and cost-attribution surfaces for maintenance compute. - sources/2026-06-01-databricks-debunking-8-data-layout-myths-why-liquid-clustering-outperfo — Discloses the engineering improvements to OPTIMIZE itself that make automatic table optimisation viable at PB scale: planning phase 12h → 23m on 10 PB tables (31×), execution phase 5× faster on Medium DBSQL clusters. The "on tables that would benefit" predictive-scheduler clause matters more at PB scale precisely because the per-cycle cost is non-trivial — picking the right tables to maintain is more load-bearing than at GB scale. Arctic Wolf's 3.8 PB telemetry table is the worked example of automatic optimisation paying off at this scale (file count 4M → 2M; data freshness hours → minutes post-migration to Liquid + Predictive Optimization on UC managed tables).
Related¶
- systems/databricks-predictive-optimization — canonical instance.
- systems/uc-managed-tables — the substrate this property applies to.
- systems/liquid-clustering — composes via
CLUSTER BY AUTO. - systems/delta-lake — the underlying table format whose statistics live in the transaction log.
- concepts/optimizer-statistics-as-skipping-substrate — the generalised skipping-statistics principle.
- patterns/managed-table-as-default-storage-layer — the deployment pattern that makes automatic optimisation the default.