CONCEPT Cited by 1 source
Incremental View Maintenance (IVM)¶
Definition¶
Incremental view maintenance (IVM) is the database technique of keeping a materialized view up to date by applying only the deltas induced by changes to its inputs, rather than recomputing the view from scratch.
Given a view definition V = f(T1, T2, ..., Tn) and a set of input
deltas ΔT1, ΔT2, ..., ΔTn, IVM computes the corresponding output
delta ΔV and applies it to the persisted view, without re-running
f over the full input tables.
The classical formulation, for an aggregation MV, looks like:
V = SELECT region, SUM(orders) FROM customer_and_order_table GROUP BY region
ΔT = newly-inserted rows in customer_and_order_table
ΔV = SELECT region, SUM(orders) FROM ΔT GROUP BY region -- compute over delta only
V_new = V ⊕ ΔV -- merge into MV
The mechanism gets dramatically harder as the view shape moves from simple aggregations to:
- Multi-table joins — semi-naive evaluation, side-table maintenance.
- Window functions — partition recomputation, watermarking.
- Mixtures of join + window + aggregation in one MV.
- Non-deterministic functions — see concepts/non-deterministic-mv-maintenance.
- Updates and deletes in the input, not just inserts — retraction handling.
Why it matters: two distinct value propositions¶
1. Query acceleration (the traditional MV use case)¶
A dashboard reads a pre-aggregated table that's smaller and faster to scan than the raw fact table. Without IVM, the MV must be periodically rebuilt — expensive, stale between rebuilds. With IVM, the MV is kept fresh continuously, with cost proportional to the delta, not the full table.
2. ETL replacement (the SDP / Enzyme thesis)¶
"Materialized views (MVs) are popular for query acceleration — speeding up dashboards on data residing in data warehouses. When creating Spark Declarative Pipelines, we decided to go beyond query acceleration and apply materialized views to the extract-transform-load (ETL) use cases. Our key observation is that if MVs can be efficiently and incrementally maintained, it will significantly simplify ETL workloads which otherwise require writing complex custom code." — Databricks at SIGMOD 2026
The thesis: declarative MVs replace hand-written incremental ETL code if the IVM engine is general enough to cover production MV shapes. The data engineer writes what the derived table should contain (a SQL or Python view definition); the IVM engine handles how it stays in sync — merges, upserts, backfills, late-arriving data — all of which would otherwise be hand-coded.
The MV-grammar coverage axis¶
Industrial IVM systems vary widely in which MV shapes they support incrementally. Coverage is the primary value differentiator.
| Coverage axis | Limited industrial IVM | Full industrial IVM (e.g. Enzyme claims) |
|---|---|---|
| Single-table aggregations | ✅ | ✅ |
| Two-table joins | partial | ✅ |
| ≥3-table joins | sometimes rejected | ✅ |
| Window functions | recomputed in full | ✅ incrementally |
| Combinations (join + window + agg) | ❌ | ✅ |
Non-deterministic funcs (current_date(), AI) |
rejected or full recompute | ✅ (per Enzyme) |
| Multi-language MV definition | SQL only | Python + SQL (concepts/multi-language-materialized-view) |
Enzyme's industrial contribution is moving the production frontier on every row of this table; see the [[sources/2026-05-29-databricks-databricks-at-sigmod-2026|2026-05-29 SIGMOD source]].
Core challenges¶
Determinism¶
Standard IVM relies on delta_in → delta_out being a function of
the delta alone, not of when it is computed. Non-deterministic
functions (current_date(), randomised UDFs, AI/LLM calls) break
this; see concepts/non-deterministic-mv-maintenance.
Update / delete handling (retraction)¶
For aggregations: tracking sufficient state to undo a row's
contribution when it is deleted (e.g. count + sum to support avg
under deletes; sketches for distinct-count under deletes).
Incrementalisation strategy choice¶
Per refresh, the engine must choose between:
- Full recompute — cheapest if the input change is dense or the MV is small.
- Partition-level rewrite — cheapest when changes are localised to a few partitions.
- Row-level update — cheapest when changes are sparse and the MV supports per-row deltas.
This decision is workload-dependent and run-dependent, motivating cost-model-driven strategy selection.
Change detection on the MV definition¶
When the MV definition itself changes (the user edits the SQL or the Python function), the IVM engine must determine whether cached intermediate results remain valid. Easy for SQL; hard for Python — see concepts/multi-language-materialized-view.
Relationship to Change Data Capture¶
CDC supplies the input deltas to an
IVM engine — they are the inputs ΔT1, ..., ΔTn to the IVM
algorithm. CDC is the delivery mechanism for input changes; IVM is
the transformation that turns those input changes into output-MV
changes. Delta Change Data Feed
is one such CDC source on the lakehouse.
Seen in¶
- sources/2026-05-29-databricks-databricks-at-sigmod-2026 —
Databricks at SIGMOD 2026; first wiki disclosure of Enzyme
as the IVM engine behind SDP's
@dp.materialized_view. Names four novel claims: full grammar coverage, non-deterministic functions, multi-language MVs, cost-model-driven strategy. Argues IVM is an ETL primitive, not just a query-acceleration primitive.
Related¶
- concepts/materialized-view — the artifact IVM maintains.
- concepts/non-deterministic-mv-maintenance — the hardest IVM challenge.
- concepts/multi-language-materialized-view — the language-extension axis.
- concepts/change-data-capture — the input-delta delivery mechanism.
- concepts/delta-change-data-feed — Delta Lake's CDC source for IVM inputs.
- systems/enzyme-ivm — Databricks' IVM engine.
- systems/lakeflow-spark-declarative-pipelines — the surface that exposes Enzyme to users.
- patterns/cost-model-driven-incrementalization-strategy — strategy selection at runtime.