CONCEPT Cited by 1 source

Multi-language materialized view

Definition

A multi-language materialized view is a materialized view (MV) whose definition can be authored in more than one language, typically SQL plus a general-purpose programming language (Python in Enzyme's case), while the incremental view maintenance (IVM) engine treats the definitions as semantically equivalent units of incrementally maintainable computation.

The user-facing contract:

# Python MV
from pyspark import pipelines as dp
from pyspark.sql import functions as F

@dp.materialized_view
def order_report():
    # `spark` is the session supplied by the pipeline runtime
    return (
        spark.read.table("customer_and_order_table")
             .groupBy("region")
             .agg(F.sum("orders"))
    )

-- SQL MV
CREATE MATERIALIZED VIEW order_report AS
SELECT region, SUM(orders)
FROM customer_and_order_table
GROUP BY region;

Both forms are accepted by the IVM engine and result in the same incremental-maintenance behaviour.

Why this is hard

"While most industry solutions just focus on SQL, Enzyme supports MVs specified in Python as well. Python is now the language of choice for most data engineering and AI workloads. Enzyme solves many interesting challenges that multi-language support entails such as accurately detecting changes in MV definition." (sources/2026-05-29-databricks-databricks-at-sigmod-2026)

Multi-language MV support involves two distinct difficulties:

1. Translating to a common logical plan

The IVM engine reasons about the MV as a logical plan: a tree of relational operators (Project, Filter, Join, Aggregate, Window, …) that the incrementalisation algorithm understands. SQL maps onto this plan via a parser and analyser. Python (specifically PySpark DataFrame code) maps onto it via Spark's DataFrame front-end, which translates DataFrame method chains into the same Catalyst logical-plan IR used for SQL.

Once both languages produce the same IR, the IVM engine's incrementalisation logic is language-agnostic. The hard part is ensuring that the translation captures all the semantics relevant to incremental maintenance, including the MV function's calls to non-deterministic functions, its closure over outer-scope variables, and its use of helper functions defined elsewhere in the codebase.
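The "two front-ends, one IR" idea can be sketched in plain Python. This is a toy model under stated assumptions, not Spark's Catalyst and not Enzyme's implementation: `Scan`, `Aggregate`, `Frame`, and `plan_from_sql` are hypothetical names, and the SQL front-end handles only a single `SELECT … FROM … GROUP BY …` shape.

```python
import re
from dataclasses import dataclass
from typing import Tuple


# Hypothetical toy IR: two relational operators as immutable, comparable nodes.
@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Aggregate:
    child: object
    group_by: Tuple[str, ...]
    aggregates: Tuple[Tuple[str, str], ...]  # (function name, column)


class Frame:
    """Toy DataFrame-style front-end (not the PySpark API)."""

    def __init__(self, plan):
        self.plan = plan

    @staticmethod
    def read_table(name):
        return Frame(Scan(name))

    def group_by(self, *cols):
        return _Grouped(self.plan, tuple(cols))


class _Grouped:
    def __init__(self, plan, cols):
        self.plan, self.cols = plan, cols

    def agg(self, *aggs):
        return Frame(Aggregate(self.plan, self.cols, tuple(aggs)))


# Toy SQL front-end: recognises exactly one query shape.
_QUERY = re.compile(
    r"SELECT\s+(?P<cols>.+?)\s+FROM\s+(?P<table>\w+)"
    r"\s+GROUP\s+BY\s+(?P<keys>.+?)\s*;?\s*$",
    re.IGNORECASE | re.DOTALL,
)
_AGG_CALL = re.compile(r"(\w+)\((\w+)\)")


def plan_from_sql(sql):
    match = _QUERY.match(sql.strip())
    if match is None:
        raise ValueError("unsupported query shape")
    keys = tuple(k.strip() for k in match["keys"].split(","))
    aggs = tuple((fn.lower(), col) for fn, col in _AGG_CALL.findall(match["cols"]))
    return Aggregate(Scan(match["table"]), keys, aggs)


python_plan = (
    Frame.read_table("customer_and_order_table")
    .group_by("region")
    .agg(("sum", "orders"))
    .plan
)
sql_plan = plan_from_sql(
    """
    SELECT region, SUM(orders)
    FROM customer_and_order_table
    GROUP BY region;
    """
)
print(python_plan == sql_plan)
```

Because both front-ends emit the same node types, anything written against the IR (here just an equality check) never needs to know which language the definition came from. A real translation also has to carry the harder semantics listed above, which this sketch ignores.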

2. Change detection on the MV definition

When a user edits an MV's definition, the IVM engine must determine whether cached intermediate results remain valid. If the definition has changed only cosmetically (whitespace, a renamed local variable), the cache is still valid. If the definition has changed semantically (a join condition, a filter predicate), the cache must be invalidated.

| Language | Change-detection technique |
|---|---|
| SQL | Parse to AST → canonicalise → hash. Two SQL strings that produce the same canonical AST are semantically equivalent. Reasonably tractable. |
| Python | Hash of source-code bytes is too sensitive (whitespace breaks it). AST hash is closer but still false-positive-prone (variable renames). True semantic equivalence requires program analysis: closure tracking, helper-function dependency closure, import resolution, branch reachability. Hard. |
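The Python case can be made concrete with the standard-library `ast` module. The sketch below is illustrative only and is not Enzyme's mechanism: `source_hash`, `ast_hash`, and `normalised_ast_hash` are hypothetical helpers, and the local-variable normalisation is deliberately simplistic (it ignores parameters, nested scopes, and `global` declarations).

```python
import ast
import hashlib


def _sha256(text):
    return hashlib.sha256(text.encode()).hexdigest()


def source_hash(src):
    """Hash the raw source text: any whitespace or comment edit changes it."""
    return _sha256(src)


def ast_hash(src):
    """Hash the parsed AST: ast.dump omits line/column info by default."""
    return _sha256(ast.dump(ast.parse(src)))


def normalised_ast_hash(src):
    """AST hash after renaming assigned locals to positional placeholders."""
    tree = ast.parse(src)
    for fn in ast.walk(tree):
        if not isinstance(fn, ast.FunctionDef):
            continue
        mapping = {}
        # First pass: number every name that is assigned inside the function.
        for node in ast.walk(fn):
            if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
                mapping.setdefault(node.id, f"_v{len(mapping)}")
        # Second pass: rewrite all reads and writes of those names.
        for node in ast.walk(fn):
            if isinstance(node, ast.Name) and node.id in mapping:
                node.id = mapping[node.id]
    return _sha256(ast.dump(tree))


ORIGINAL = '''
def order_report(df):
    grouped = df.groupBy("region")
    return grouped.agg({"orders": "sum"})
'''

REFORMATTED = '''
def order_report( df ):
    # group first
    grouped = df.groupBy( "region" )

    return grouped.agg({"orders": "sum"})
'''

RENAMED = ORIGINAL.replace("grouped", "by_region")
CHANGED = ORIGINAL.replace('"region"', '"country"')

print(source_hash(ORIGINAL) == source_hash(REFORMATTED))          # differs
print(ast_hash(ORIGINAL) == ast_hash(REFORMATTED))                # matches
print(ast_hash(ORIGINAL) == ast_hash(RENAMED))                    # differs
print(normalised_ast_hash(ORIGINAL) == normalised_ast_hash(RENAMED))  # matches
print(normalised_ast_hash(ORIGINAL) == normalised_ast_hash(CHANGED))  # differs
```

Each step removes one class of false positive, but none of them sees an edit made inside a helper function that the MV calls, which is why the table marks the Python case as hard.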

The blog calls this out as one of the "interesting challenges that multi-language support entails". Enzyme's specific change-detection mechanism is not disclosed in the source.

Plausible techniques (none confirmed by the source):

| Technique | What it does | Trade-off |
|---|---|---|
| AST canonicalisation + hash | Parse Python to AST, normalise, hash. | Misses semantic-preserving refactors that change AST shape. |
| Bytecode hash | Hash compiled bytecode. | Sensitive to compiler-version differences; brittle across Python upgrades. |
| Plan-IR hash | Translate the function to its Spark logical plan, hash the plan. | Robust against most Python-level edits that don't change the plan; the canonical approach for engines that already produce the IR. |
| Dependency-closure hash | Plan hash + transitive hash of all referenced helpers. | Complete; expensive to compute at edit time. |
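To show why a transitive hash matters, here is a minimal sketch that combines the bytecode-hash and dependency-closure ideas from the table. It is an assumption-laden illustration, not something the source attributes to Enzyme: it hashes bytecode rather than a plan, all function names are hypothetical, and it follows only module-level helper functions referenced by name.

```python
import hashlib
import types


def _feed_code(code, digest):
    """Feed a code object's bytecode, referenced names, and constants."""
    digest.update(code.co_code)
    digest.update(repr(code.co_names).encode())
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            _feed_code(const, digest)  # nested lambdas, comprehensions
        else:
            digest.update(repr(const).encode())


def bytecode_hash(func):
    """Hash of the function's own compiled body only."""
    digest = hashlib.sha256()
    _feed_code(func.__code__, digest)
    return digest.hexdigest()


def dependency_closure(func):
    """Transitively collect module-level functions referenced by name.

    co_names also contains attribute names, so this may over-include;
    that errs on the side of invalidating too often.
    """
    found = {}
    pending = [func]
    while pending:
        current = pending.pop()
        if current.__qualname__ in found:
            continue
        found[current.__qualname__] = current
        for name in current.__code__.co_names:
            target = current.__globals__.get(name)
            if isinstance(target, types.FunctionType):
                pending.append(target)
    return found


def closure_hash(func):
    """Hash of the function plus every helper in its dependency closure."""
    digest = hashlib.sha256()
    for name, dep in sorted(dependency_closure(func).items()):
        digest.update(name.encode())
        _feed_code(dep.__code__, digest)
    return digest.hexdigest()


def load_revision(source):
    """Simulate loading one revision of a pipeline module."""
    namespace = {}
    exec(source, namespace)
    return namespace["keep_order"]


REV_1 = '''
def min_orders():
    return 10

def keep_order(row):
    return row["orders"] >= min_orders()
'''
REV_2 = REV_1.replace("return 10", "return 20")  # only the helper changes

rev1, rev2 = load_revision(REV_1), load_revision(REV_2)
print(bytecode_hash(rev1) == bytecode_hash(rev2))  # helper edit is invisible
print(closure_hash(rev1) == closure_hash(rev2))    # helper edit is detected
```

The per-function hash reports the two revisions as identical, while the closure hash does not. The cost noted in the table shows up in `dependency_closure`: every edit requires walking the whole set of reachable helpers.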

What multi-language MVs unlock

| Use case | Why Python is needed |
|---|---|
| Complex transformations | DataFrame chains with intermediate variables, conditional branches, and helper functions are clearer in Python than as nested SQL. |
| AI / ML pipelines | Model loading, feature extraction, embedding computation are Python-native. |
| Reuse of Python data libraries | NumPy, pandas, scikit-learn, PyTorch: the data-engineering Python ecosystem. |
| Type-checked transformations | pyspark.sql.types + mypy give static checking that SQL alone does not. |

Seen in

  • sources/2026-05-29-databricks-databricks-at-sigmod-2026: the first source in this wiki to present multi-language MV support as an explicit novelty claim for industrial IVM. Enzyme supports Python and SQL, whereas "most industry solutions just focus on SQL". The challenge the source names as solved is accurate change detection on MV definitions.