Enzyme (Databricks IVM engine)

Enzyme is the incremental-view-maintenance (IVM) engine that sits behind the @dp.materialized_view decorator in Spark Declarative Pipelines (SDP). It takes a user-defined materialized view (in SQL or Python) and automatically keeps it up to date as the underlying tables change, "hiding all the complexity of incremental processing" from the MV author. It is described in the SIGMOD 2026 honorable-mention award paper "Enzyme: Incremental View Maintenance for Data Engineering" (arXiv:2603.27775), presented by Ritwik Yadav.

Disambiguation: there is also a wiki page Enzyme (React testing utility). The two systems share a name only. Databricks Enzyme is a database query-engine layer; Airbnb's Enzyme is a JavaScript React-testing library. This page covers the Databricks IVM engine.

Position in the SDP architecture

SDP exposes two ways to author incremental data pipelines, and "customers can mix-and-match these within a pipeline":

                      SDP pipeline definition
         ┌────────────────────┴────────────────────┐
         ▼                                         ▼
  @dp.materialized_view                  @dp.table (streaming)
         │                                         │
         ▼                                         ▼
  ┌─────────────┐                         ┌──────────────────┐
  │  Enzyme     │                         │ Structured       │
  │  IVM engine │                         │ Streaming engine │
  │ (this page) │                         │ (VLDB 2026 paper)│
  └─────────────┘                         └──────────────────┘

The MV track (Enzyme) suits authors who want to describe what the result should be; the streaming track (Structured Streaming) suits authors who want to express how state evolves — stateful operators, watermarks, custom aggregations, etc.

What Enzyme is

A query-engine layer on top of Apache Spark that, given a materialized-view definition and a stream of input deltas, computes and applies the corresponding output deltas to the materialized view without rerunning the full MV computation.

The user-facing contract:

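from pyspark import pipelines as dp
from pyspark.sql import functions as F

@dp.materialized_view
def order_report():
    # `spark` is assumed to be the active SparkSession provided by the pipeline runtime.
    return (
        spark.read.table("customer_and_order_table")
             .groupBy("region")
             .agg(F.sum("orders"))
    )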

or equivalently in SQL:

CREATE MATERIALIZED VIEW order_report AS
SELECT region, sum(orders)
FROM customer_and_order_table
GROUP BY region

As new orders arrive in customer_and_order_table, Enzyme keeps order_report "up to date" without the user writing any merge, upsert, or backfill code.
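
To make "keeps it up to date without rerunning the full computation" concrete, here is a minimal plain-Python sketch of delta maintenance for this one view shape (sum per group). It is not Enzyme's implementation: the names `full_recompute` and `apply_delta` are hypothetical, and keeping a per-group row count alongside the sum is a textbook device assumed here so that deletes can remove a group once it is empty.

```python
def full_recompute(rows):
    """Baseline: rebuild {region: (sum, row_count)} from every input row."""
    state = {}
    for region, orders in rows:
        total, count = state.get(region, (0, 0))
        state[region] = (total + orders, count + 1)
    return state


def apply_delta(state, inserted=(), deleted=()):
    """Fold only the changed input rows into the stored view, in place."""
    for sign, batch in ((1, inserted), (-1, deleted)):
        for region, orders in batch:
            total, count = state.get(region, (0, 0))
            total, count = total + sign * orders, count + sign
            if count == 0:
                state.pop(region, None)  # last row of the group was deleted
            else:
                state[region] = (total, count)
    return state


base = [("EU", 3), ("US", 5), ("EU", 2)]
view = full_recompute(base)

# New orders arrive: touch only the affected groups.
apply_delta(view, inserted=[("US", 1), ("APAC", 4)])
assert view == full_recompute(base + [("US", 1), ("APAC", 4)])
```

The work done by `apply_delta` is proportional to the size of the delta rather than to the size of the base table, which is the property the user-facing contract relies on.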

The four novel claims

The 2026-05-29 source page disclosed four contributions that distinguish Enzyme from prior industrial IVM systems:

1. Full MV-grammar coverage (joins + windows + aggregations + combinations)

"Enzyme incrementally maintains complex MVs in production including those with joins, window functions, aggregations, and their combinations."

IVM literature traditionally publishes per-shape algorithms (delta maintenance for aggregations, semi-naive maintenance for joins, window-function reformulation as aggregation). The industrial contribution Enzyme claims is that MVs combining all three shapes are incrementally maintained by a single engine — e.g. an MV that joins three tables, windows by user-id, and aggregates via sum().
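
As an illustration of one per-shape algorithm, the sketch below applies the textbook insert-only delta rule for an equi-join, Δ(R ⋈ S) = (ΔR ⋈ S) ∪ (R ⋈ ΔS) ∪ (ΔR ⋈ ΔS), in plain Python. It is not taken from the Enzyme paper; the helper names `join` and `join_delta` and the sample tables are assumptions made for the example.

```python
from collections import Counter


def join(left, right):
    """Equi-join two lists of (key, value) rows; returns (key, lval, rval) rows."""
    by_key = {}
    for key, value in right:
        by_key.setdefault(key, []).append(value)
    return [(k, lv, rv) for k, lv in left for rv in by_key.get(k, [])]


def join_delta(r_old, s_old, d_r, d_s):
    """Rows added to R join S when d_r and d_s are appended (inserts only)."""
    return join(d_r, s_old) + join(r_old, d_s) + join(d_r, d_s)


customers = [(1, "EU"), (2, "US")]
orders = [(1, 10), (2, 7)]
new_customers = [(3, "APAC")]
new_orders = [(1, 5), (3, 2)]

view = join(customers, orders)
view += join_delta(customers, orders, new_customers, new_orders)

# The incrementally maintained view matches a full recomputation (as a bag).
full = join(customers + new_customers, orders + new_orders)
assert Counter(view) == Counter(full)
```

Composing rules like this one with aggregation and window maintenance inside a single plan is the part the source credits to Enzyme; the sketch covers only the join step.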

Captured on IVM as the MV-grammar coverage axis.

2. Non-deterministic functions: current_date() and AI functions

"Unlike other industry solutions, Enzyme also supports non-deterministic functions such as current_date() and AI specific functions."

This is the most architecturally interesting claim. Standard IVM relies on the determinism of the MV definition: the mapping delta_in → delta_out yields the same output delta whether it is evaluated now or an hour from now, which lets the engine compute over the input delta alone and apply the result to the persisted MV. Non-deterministic functions break this invariant:

  • current_date() evaluates differently on every run.
  • An AI function (ai_query, ai_classify, etc.) evaluates differently for the same input string at different times — model versions change, sampling adds randomness, retrieval-augmented invocations depend on the corpus state at call time.

Most industry IVM systems either (a) reject MVs that reference non-deterministic functions, or (b) recompute affected MVs in full on every refresh. Enzyme claims correctness under incremental maintenance. The blog post does not disclose the mechanism; the non-deterministic MV maintenance concept page enumerates plausible techniques.
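
Because the mechanism is undisclosed, the following is only a sketch of two generic techniques that could apply here: pinning current_date() to one value per refresh, and memoising AI-function results per input. Nothing in it is confirmed as Enzyme's approach, and `refresh`, `classify`, and `ai_cache` are hypothetical names.

```python
from datetime import date


def refresh(delta_rows, pinned_today, ai_cache, classify):
    """Process one input delta with the date pinned and AI results memoised."""
    out = []
    for text, due in delta_rows:
        if text not in ai_cache:  # call the model once per distinct input
            ai_cache[text] = classify(text)
        # Every row in this refresh sees the same "today".
        out.append((text, ai_cache[text], due < pinned_today))
    return out


calls = []


def fake_classify(text):
    """Stand-in for an AI function; records each invocation."""
    calls.append(text)
    return "complaint" if "late" in text else "other"


cache = {}
first = refresh(
    [("late parcel", date(2026, 1, 1)), ("thanks", date(2026, 3, 1))],
    date(2026, 2, 1), cache, fake_classify,
)
second = refresh([("late parcel", date(2026, 5, 1))], date(2026, 2, 2), cache, fake_classify)

assert calls == ["late parcel", "thanks"]  # the second refresh reused the cache
```

Pinning only makes a single refresh internally consistent. Rows materialised under an earlier pinned date can still go stale as the date advances, which is part of what makes the claim hard.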

3. Multi-language MVs: Python + SQL

"While most industry solutions just focus on SQL, Enzyme supports MVs specified in Python as well. Python is now the language of choice for most data engineering and AI workloads. Enzyme solves many interesting challenges that multi-language support entails such as accurately detecting changes in MV definition."

Two distinct difficulties layered into this claim:

  • MV semantics in Python. A Python MV is a function whose body uses the PySpark DataFrame API — the engine must translate the Python function to a logical plan that the IVM analyser can reason about (likely via Catalyst's existing PySpark-to-logical-plan pipeline).
  • Change detection. The engine must determine whether a Python MV's definition has changed in a way that invalidates cached intermediate results. SQL MVs admit relatively easy change detection (text diff, AST canonicalisation against a known grammar). Python MVs admit arbitrary control flow, helper-function calls, and external imports — change detection becomes a program-analysis problem. The blog calls this out as one of the "interesting challenges" Enzyme solves.
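
To show why change detection is harder for Python than for SQL text, here is a minimal sketch of one candidate technique, AST canonicalisation, using only the standard library. It is an assumption for illustration, not the method the source describes, and `definition_fingerprint` is a hypothetical name.

```python
import ast
import hashlib
import textwrap


def definition_fingerprint(source: str) -> str:
    """Hash the parsed AST so formatting and comment edits do not count as changes."""
    tree = ast.parse(textwrap.dedent(source))
    # Line and column attributes are excluded, so layout does not affect the dump.
    canonical = ast.dump(tree, include_attributes=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


v1 = '''
def order_report():
    return spark.read.table("customer_and_order_table").groupBy("region").agg(F.sum("orders"))
'''

v2 = '''
def order_report():
    # reformatted, same logic
    return (
        spark.read.table("customer_and_order_table")
             .groupBy("region")
             .agg(F.sum("orders"))
    )
'''

v3 = v1.replace('"region"', '"country"')

assert definition_fingerprint(v1) == definition_fingerprint(v2)  # cosmetic edit
assert definition_fingerprint(v1) != definition_fingerprint(v3)  # semantic edit
```

A fingerprint like this sees only the decorated function's own source. It misses changes in helper functions, imported modules, and captured variables, which is where the problem turns into program analysis.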

Captured on concepts/multi-language-materialized-view.

4. Cost-model-driven incrementalisation strategy

"Enzyme has multiple optimizations to reduce the amount of data that needs to be processed including techniques that automatically determine if updates should be applied at partition level instead of row level thus reducing rewrite overheads. It selectively caches intermediate results to reduce IO costs. It uses a cost model that leverages plan information and prior executions to determine the most efficient incrementalization strategy."

Three sub-mechanisms named:

  • Partition-level vs row-level update selection
      What it does: Per refresh, choose whether to rewrite affected partitions wholesale or apply per-row deltas.
      Why it matters: Partition-level rewrite is cheaper when the affected fraction of rows in a partition is high (no per-row bookkeeping); row-level update is cheaper when changes are sparse (no whole-partition rewrite cost).
  • Selective intermediate-result caching
      What it does: Cache the joins or aggregations whose recomputation cost exceeds storage cost.
      Why it matters: Reduces IO cost for repeated downstream consumers.
  • Cost model fed by plan information + prior executions
      What it does: Both static (plan shape, estimated cardinalities) and dynamic (actual runtime stats from past runs) inputs.
      Why it matters: Lets the engine adapt strategy as workloads evolve.
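
The partition-versus-row trade-off and the "prior executions" feedback can be sketched with a toy cost model. All of it is assumed for illustration: the function names, the linear cost shape, and the default unit costs are hypothetical and say nothing about Enzyme's actual cost model.

```python
def choose_update_strategy(changed_rows, partition_rows,
                           row_update_cost=5.0, row_rewrite_cost=1.0):
    """Pick the cheaper of per-row deltas and a whole-partition rewrite.

    Assumes a per-row update costs more than rewriting one row in bulk,
    so sparse changes favour row-level and dense changes favour partition-level.
    """
    row_level = changed_rows * row_update_cost
    partition_level = partition_rows * row_rewrite_cost
    return "row" if row_level <= partition_level else "partition"


def updated_unit_cost(previous, observed_seconds, rows_processed, alpha=0.3):
    """Blend the latest observed per-row cost into the running estimate (EMA)."""
    observed = observed_seconds / rows_processed
    return (1 - alpha) * previous + alpha * observed


# Sparse change: 10 of 1000 rows touched, so per-row deltas win.
assert choose_update_strategy(10, 1000) == "row"
# Dense change: 600 of 1000 rows touched, so rewriting the partition wins.
assert choose_update_strategy(600, 1000) == "partition"
```

With the default unit costs the break-even point is 20% of a partition's rows. Feeding `updated_unit_cost` with measured runtimes would move that threshold over time, which is one way to read "leverages plan information and prior executions".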

Captured as patterns/cost-model-driven-incrementalization-strategy.

Performance disclosure

The single performance figure in the post is a relative-speedup chart claiming "Enzyme has significantly better performance than another competing industry solution (name anonymized to CV-IVM due to licensing restrictions)." Absolute numbers, workload axes, and ablations are deferred to the arXiv paper.

The anonymisation suggests that the paper benchmarks Enzyme against a commercial product whose licence terms restrict publishing benchmark results under the product's name. Such restrictions are common in DBMS research, where they are often called DeWitt clauses.

Why ETL, not just dashboards

"Materialized views (MVs) are popular for query acceleration — speeding up dashboards on data residing in data warehouses. When creating Spark Declarative Pipelines, we decided to go beyond query acceleration and apply materialized views to the extract-transform-load (ETL) use cases. Our key observation is that if MVs can be efficiently and incrementally maintained, it will significantly simplify ETL workloads which otherwise require writing complex custom code."

The architectural thesis: declarative MVs replace hand-written incremental ETL code when the IVM engine is general enough to handle production MV shapes. This positions Enzyme as more than a dashboard accelerator — it is the substrate that lets SDP authors delete custom merge / upsert / backfill logic from their pipelines.

Open questions / not disclosed

  • Mechanism for non-deterministic-function correctness. The claim is made; the technique (timestamp pinning to a snapshot; per-row AI-function result caching with explicit invalidation; opt-out taint analysis on the call site) is not stated.
  • Python-MV change-detection technique. AST canonicalisation? Bytecode hash plus dependency closure? Symbolic execution? The source does not say.
  • Cost-model feature set. "Plan information and prior executions" is the only description; specific signals (cardinality estimates, partition statistics, prior-run timing, cache hit rates) are not enumerated.
  • Relationship to Catalyst. Enzyme is presumably a layer above or beside the Spark logical-plan optimiser; the integration model is not described in the blog post.
  • Workload-class boundaries. Coverage of joins of three or more tables, recursive CTEs, arbitrary UDFs, non-deterministic functions other than current_date() and AI functions, and MVs over foreign-Iceberg vs managed-table inputs is not characterised.
  • CV-IVM identity. Anonymised under licensing restrictions; cross-vendor comparison from the blog post alone is not possible.
