PATTERN Cited by 1 source
Cost-model-driven incrementalization strategy¶
Pattern summary¶
When an IVM engine refreshes a materialized view, it has multiple viable update strategies. No single strategy dominates across all workloads — the cheapest choice is workload- and run-dependent. The pattern: at each refresh, a cost model fed by both static plan information and dynamic prior-execution statistics chooses the incrementalisation strategy per MV per run.
"It uses a cost model that leverages plan information and prior executions to determine the most efficient incrementalization strategy." — Databricks at SIGMOD 2026
Strategy axes the cost model chooses across¶
Axis 1: update granularity — partition-level vs row-level¶
"Techniques that automatically determine if updates should be applied at partition level instead of row level thus reducing rewrite overheads."
| Strategy | When it wins |
|---|---|
| Partition-level rewrite | Affected fraction of rows in a partition is high (no per-row bookkeeping; one wholesale rewrite). |
| Row-level update | Affected rows are sparse within partitions (no full-partition rewrite cost). |
Misapplication is expensive: partition-level rewrite for sparse changes wastes IO on unchanged rows; row-level update for dense changes pays per-row metadata overhead. The cost model picks per run.
Axis 2: intermediate-result caching¶
"It selectively caches intermediate results to reduce IO costs."
For an MV with multiple stages (e.g. aggregate(join(filter(T1, T2)))),
the IVM engine can cache the output of any stage as a side table.
Cached intermediate results accelerate subsequent refreshes when
their inputs are unchanged. The cost model decides which
intermediates to cache based on:
- Cost to recompute (high → favour caching).
- Cache storage cost (high → disfavour caching).
- Reuse frequency (high → favour caching).
Axis 3 (implied): full refresh fallback¶
When changes are dense enough that incremental computation costs more than full recomputation, the cost model can fall back to a full refresh. Not explicitly named in the source but a standard member of the strategy axis.
Cost-model inputs¶
Two complementary input streams:
| Input | Source | Examples |
|---|---|---|
| Plan information (static) | The MV's logical / physical plan. | Number of joins; join cardinality estimates; window size; aggregation arity. |
| Prior execution statistics (dynamic) | Runtime stats collected on past refreshes. | Actual runtime per stage; cache hit rates; partition-rewrite cost ratios; data-skew observed. |
The pattern requires the IVM engine to persist runtime statistics across refreshes — a non-trivial engineering investment, but the basis for adaptive strategy selection.
Why this is a pattern, not just an optimisation¶
It is a structural choice in the engine's architecture: the engine separates strategy selection from strategy execution, and the strategy-selection layer is fed by both static and dynamic inputs. Engines without this pattern hard-code a strategy per MV shape (or per system); the pattern is the architectural commitment to adaptive, per-run choice.
Sibling instances of cost-model-driven adaptivity¶
The same architectural shape — separate strategy-selection layer fed by plan + prior-execution stats — recurs elsewhere in the wiki:
- Spark Adaptive Query Execution (AQE) — the same engine, applied to query execution: AQE uses runtime statistics from completed shuffles to re-plan downstream stages (join strategy, partition counts). Enzyme can be read as the AQE philosophy applied to MV maintenance.
- Optimiser statistics as skipping substrate — Databricks' Predictive Optimization uses statistics for data-skipping decisions; the same plan + runtime-stats input structure.
- Cost-based query optimisers generally — System R's pioneering work, modern PostgreSQL / MySQL / Snowflake optimisers — all instantiate this pattern at the query-planning layer.
Implementation guidance¶
To adopt the pattern in an IVM engine:
- Define the strategy enum — at minimum: full-refresh, partition-level, row-level. Optional: cache-intermediate-N.
- Persist a per-MV statistics ledger — refresh time, rows processed, cache hit rates, partition-rewrite ratios.
- Build a cost function — input: plan + ledger; output: chosen strategy + estimated cost.
- Provide an override mechanism — let users force a strategy when the cost model gets it wrong (operator escape hatch).
- Log strategy choices — observability surface for debugging regressions.
Caveats¶
- Cost model accuracy is workload-dependent. The model trains itself from history; cold-start and workload-shift cases are when the pattern misfires.
- Strategy execution must remain correct under all choices — the IVM result must not depend on which strategy was selected, only the cost.
- Persisting runtime statistics has its own cost; the ledger must not become a hot spot.
Seen in¶
- sources/2026-05-29-databricks-databricks-at-sigmod-2026 — first wiki disclosure of the pattern as an explicit IVM-engine contribution. Enzyme "uses a cost model that leverages plan information and prior executions to determine the most efficient incrementalization strategy"; named axes are partition-level vs row-level updates and selective intermediate-result caching.
Related¶
- concepts/incremental-view-maintenance — the parent technique.
- concepts/materialized-view — what is being maintained.
- systems/enzyme-ivm — the engine that uses this pattern.
- systems/lakeflow-spark-declarative-pipelines — the surface Enzyme powers.
- systems/spark-aqe — sibling instance of the same architectural shape, applied to query execution rather than IVM.
- concepts/optimizer-statistics-as-skipping-substrate — sibling use of plan + runtime-stats input structure for data skipping.