CONCEPT Cited by 1 source

Incremental View Maintenance (IVM)¶

Definition¶

Incremental view maintenance (IVM) is the database technique of keeping a materialized view up to date by applying only the deltas induced by changes to its inputs, rather than recomputing the view from scratch.

Given a view definition V = f(T1, T2, ..., Tn) and a set of input deltas ΔT1, ΔT2, ..., ΔTn, IVM computes the corresponding output delta ΔV and applies it to the persisted view, without re-running f over the full input tables.

The classical formulation, for an aggregation MV, looks like:

V       = SELECT region, SUM(orders) FROM customer_and_order_table GROUP BY region
ΔT      = newly-inserted rows in customer_and_order_table
ΔV      = SELECT region, SUM(orders) FROM ΔT GROUP BY region    -- compute over delta only
V_new   = V ⊕ ΔV                                                 -- merge into MV

The mechanism gets dramatically harder as the view shape moves from simple aggregations to:

Multi-table joins — semi-naive evaluation, side-table maintenance.
Window functions — partition recomputation, watermarking.
Mixtures of join + window + aggregation in one MV.
Non-deterministic functions — see concepts/non-deterministic-mv-maintenance.
Updates and deletes in the input, not just inserts — retraction handling.

Why it matters: two distinct value propositions¶

1. Query acceleration (the traditional MV use case)¶

A dashboard reads a pre-aggregated table that's smaller and faster to scan than the raw fact table. Without IVM, the MV must be periodically rebuilt — expensive, stale between rebuilds. With IVM, the MV is kept fresh continuously, with cost proportional to the delta, not the full table.

2. ETL replacement (the SDP / Enzyme thesis)¶

"Materialized views (MVs) are popular for query acceleration — speeding up dashboards on data residing in data warehouses. When creating Spark Declarative Pipelines, we decided to go beyond query acceleration and apply materialized views to the extract-transform-load (ETL) use cases. Our key observation is that if MVs can be efficiently and incrementally maintained, it will significantly simplify ETL workloads which otherwise require writing complex custom code." — Databricks at SIGMOD 2026

The thesis: declarative MVs replace hand-written incremental ETL code if the IVM engine is general enough to cover production MV shapes. The data engineer writes what the derived table should contain (a SQL or Python view definition); the IVM engine handles how it stays in sync — merges, upserts, backfills, late-arriving data — all of which would otherwise be hand-coded.

The MV-grammar coverage axis¶

Industrial IVM systems vary widely in which MV shapes they support incrementally. Coverage is the primary value differentiator.

Coverage axis	Limited industrial IVM	Full industrial IVM (e.g. Enzyme claims)
Single-table aggregations	✅	✅
Two-table joins	partial	✅
≥3-table joins	sometimes rejected	✅
Window functions	recomputed in full	✅ incrementally
Combinations (join + window + agg)	❌	✅
Non-deterministic funcs (`current_date()`, AI)	rejected or full recompute	✅ (per Enzyme)
Multi-language MV definition	SQL only	Python + SQL (concepts/multi-language-materialized-view)

Enzyme's industrial contribution is moving the production frontier on every row of this table; see the [[sources/2026-05-29-databricks-databricks-at-sigmod-2026|2026-05-29 SIGMOD source]].

Core challenges¶

Determinism¶

Standard IVM relies on delta_in → delta_out being a function of the delta alone, not of when it is computed. Non-deterministic functions (current_date(), randomised UDFs, AI/LLM calls) break this; see concepts/non-deterministic-mv-maintenance.

Update / delete handling (retraction)¶

For aggregations: tracking sufficient state to undo a row's contribution when it is deleted (e.g. count + sum to support avg under deletes; sketches for distinct-count under deletes).

Incrementalisation strategy choice¶

Per refresh, the engine must choose between:

Full recompute — cheapest if the input change is dense or the MV is small.
Partition-level rewrite — cheapest when changes are localised to a few partitions.
Row-level update — cheapest when changes are sparse and the MV supports per-row deltas.

This decision is workload-dependent and run-dependent, motivating cost-model-driven strategy selection.

Change detection on the MV definition¶

When the MV definition itself changes (the user edits the SQL or the Python function), the IVM engine must determine whether cached intermediate results remain valid. Easy for SQL; hard for Python — see concepts/multi-language-materialized-view.

Relationship to Change Data Capture¶

CDC supplies the input deltas to an IVM engine — they are the inputs ΔT1, ..., ΔTn to the IVM algorithm. CDC is the delivery mechanism for input changes; IVM is the transformation that turns those input changes into output-MV changes. Delta Change Data Feed is one such CDC source on the lakehouse.

Seen in¶

sources/2026-05-29-databricks-databricks-at-sigmod-2026 — Databricks at SIGMOD 2026; first wiki disclosure of Enzyme as the IVM engine behind SDP's @dp.materialized_view. Names four novel claims: full grammar coverage, non-deterministic functions, multi-language MVs, cost-model-driven strategy. Argues IVM is an ETL primitive, not just a query-acceleration primitive.

concepts/materialized-view — the artifact IVM maintains.
concepts/non-deterministic-mv-maintenance — the hardest IVM challenge.
concepts/multi-language-materialized-view — the language-extension axis.
concepts/change-data-capture — the input-delta delivery mechanism.
concepts/delta-change-data-feed — Delta Lake's CDC source for IVM inputs.
systems/enzyme-ivm — Databricks' IVM engine.
systems/lakeflow-spark-declarative-pipelines — the surface that exposes Enzyme to users.
patterns/cost-model-driven-incrementalization-strategy — strategy selection at runtime.