Databricks: Databricks at SIGMOD 2026
Summary
A short corporate-blog announcement (Tier-3 source) that nevertheless
discloses the first publicly named architecture of Databricks'
incremental-view-maintenance engine — Enzyme — which
powers the materialized-view track of Spark
Declarative Pipelines (SDP). The post announces two papers: the
SIGMOD 2026 paper "Enzyme: Incremental View Maintenance for Data Engineering"
(arXiv:2603.27775; honorable-mention award at the conference) and the
VLDB 2026 paper "A Decade of Apache Spark Structured Streaming: How We
Evolved the Architecture To Meet Real-world Needs". Together they
establish the two-track incremental-processing model in SDP: data
engineers can author pipelines as either (a) declarative materialized
views that Enzyme keeps incrementally up to date, or (b) explicit
streaming pipelines authored against Structured Streaming APIs (stateful
operators, watermarks, custom aggregations), and the two can be mixed
within a single pipeline. The novel claims for
Enzyme over the prior IVM literature are: (1) support for the full
SDP MV grammar in production (joins, window functions, aggregations, and
combinations of all three); (2) non-deterministic functions —
current_date() and AI/LLM-invoking functions — handled correctly under
incremental maintenance, where most prior industry IVM systems either reject
them outright or recompute the affected MV in full; (3) multi-language
MVs — Python and SQL, in contrast to "most industry solutions [that] just
focus on SQL"; (4) a cost-model-driven incrementalisation strategy
that uses plan information plus prior execution statistics to choose between
partition-level vs row-level updates and to selectively cache intermediate
results, "reducing rewrite overheads" and "IO costs". A single
performance figure is disclosed: a relative-speedup chart claiming Enzyme
"has significantly better performance than another competing industry
solution (name anonymized to CV-IVM due to licensing restrictions)". No
absolute numbers, no workload dimensions, no ablations are given in the
blog post; those are deferred to the paper. Bangalore, India is named
both as the SIGMOD 2026 venue (June 1–5), where Databricks attends as
Platinum Sponsor, and as a "large Databricks R&D hub"; Ritwik Yadav is
named as the Enzyme paper presenter.
Key takeaways
- Enzyme is the IVM engine behind SDP's materialized-view decorator; this is its first public naming on the wiki. "Data engineers can specify Materialized Views for transformations. The Enzyme engine incrementally maintains them as new data arrives. All the complexity of incremental processing is completely hidden from the creators of the materialized views." This canonicalises a substrate boundary that was previously elided in SDP documentation: `@dp.materialized_view` is the user-facing decorator; Enzyme is the engine that makes it work. The substrate's value proposition is that the complexity of incremental processing is "completely hidden" from MV authors. New page: systems/enzyme-ivm.
- SDP is fundamentally two engines, not one: declarative MVs (Enzyme) and explicit streaming (Structured Streaming), which can be mixed within a pipeline. "There are two ways to write incremental programs in Spark Declarative Pipelines (SDP), and customers can mix-and-match these within a pipeline." The first track is the `@dp.materialized_view` path: declarative SQL- or Python-defined views that Enzyme keeps incrementally up to date. The second is the streaming track: "Data engineers who are well versed in stream processing can instead use SDP's streaming engine to incrementally process data. The streaming APIs provide a wide variety of constructs — from stateful operators to watermarks, making it easy to express complicated business logic like custom aggregations." The two engines coexist behind a single SDP pipeline definition, letting authors choose the abstraction level per stage. This is a major extension to the SDP page and clarifies a previously implicit architectural fact.
- Enzyme extends MV use beyond query acceleration into ETL; the articulated thesis is that incremental MVs are an ETL primitive. "Materialized views (MVs) are popular for query acceleration — speeding up dashboards on data residing in data warehouses. When creating Spark Declarative Pipelines, we decided to go beyond query acceleration and apply materialized views to the extract-transform-load (ETL) use cases. Our key observation is that if MVs can be efficiently and incrementally maintained, it will significantly simplify ETL workloads which otherwise require writing complex custom code." This is the SDP thesis stated cleanly: declarative MVs replace hand-written incremental ETL code when the IVM engine is general enough to handle production MV shapes. Canonicalised as a key motivating axis on concepts/incremental-view-maintenance and as the ETL substitution claim for SDP.
- Enzyme's first novel claim: full MV-grammar coverage, including combinations. "Enzyme incrementally maintains complex MVs in production including those with joins, window functions, aggregations, and their combinations." IVM literature traditionally publishes per-shape algorithms (delta-based aggregation, semi-naive join maintenance, window-function reformulation); Enzyme's industrial contribution is production support for combinations of these shapes, so that an MV that both joins multiple tables and aggregates over a window can still be maintained incrementally as a single view. Captured on concepts/incremental-view-maintenance as the MV-grammar coverage axis.
- Enzyme's second novel claim: incremental maintenance over non-deterministic functions, including `current_date()` and AI functions. "Unlike other industry solutions, Enzyme also supports non-deterministic functions such as `current_date()` and AI specific functions." This is the most architecturally interesting disclosure in the post. Standard IVM relies on the determinism of the MV definition: `delta_in → delta_out` is computed by re-running the MV logic over the delta, and the result is the same whether you compute it now or an hour from now. Non-deterministic functions break this invariant: `current_date()` returns a different value depending on when the refresh runs, and an AI function can return different output for the same input string at different times (model versions, sampling, RAG retrieval freshness). Enzyme claims correctness under incremental maintenance for both. The blog post does not disclose the mechanism (plausible candidates: explicit binding of `current_date()` to a snapshot timestamp; AI-function results cached per row-id with an explicit invalidation policy; opt-out from incremental maintenance per non-deterministic call site). New concept page: concepts/non-deterministic-mv-maintenance.
- Enzyme's third novel claim: multi-language MVs (Python + SQL), distinguished from SQL-only industry solutions. "While most industry solutions just focus on SQL, Enzyme supports MVs specified in Python as well. Python is now the language of choice for most data engineering and AI workloads. Enzyme solves many interesting challenges that multi-language support entails such as accurately detecting changes in MV definition." The named challenge is change detection on the MV definition. For a SQL MV, change detection can be a textual or AST diff against a known grammar; for a Python MV, the function body can include arbitrary control flow, helper-function calls, and external imports, so determining whether a `@dp.materialized_view`-decorated function has changed in a way that invalidates cached intermediate results is a non-trivial program-analysis problem. New concept page: concepts/multi-language-materialized-view.
- Enzyme's fourth novel claim: a cost-based incrementalisation strategy, with partition-level vs row-level updates chosen automatically. "Enzyme has multiple optimizations to reduce the amount of data that needs to be processed including techniques that automatically determine if updates should be applied at partition level instead of row level thus reducing rewrite overheads. It selectively caches intermediate results to reduce IO costs. It uses a cost model that leverages plan information and prior executions to determine the most efficient incrementalization strategy." Three sub-mechanisms are named: (a) granularity selection, i.e. partition-level rewrite vs row-level update, chosen automatically; (b) selective intermediate-result caching, presumably of the joins or aggregations whose recomputation cost exceeds their storage cost; (c) a cost model fed by plan information and prior executions, i.e. both static (plan shape) and dynamic (runtime stats from past runs) inputs. New pattern page: patterns/cost-model-driven-incrementalization-strategy.
- The single performance disclosure: relative speedup vs an anonymised competitor (CV-IVM). "Figure 1: Enzyme has significantly better performance than another competing industry solution (name anonymized to CV-IVM due to licensing restrictions)." No absolute numbers, no workload axes, and no ablations are given; these are deferred to the paper. The licensing anonymisation ("CV-IVM") suggests that a real benchmark exists in the paper but that the competitor's licence terms forbid publishing benchmark results under its name; this is a common pattern in DBMS research. Captured as a caveat below.
- Companion VLDB 2026 paper covers Spark Structured Streaming's decade-long architectural evolution. "Key ideas in our streaming product will appear in the VLDB 2026 paper 'A Decade of Apache Spark Structured Streaming: How We Evolved the Architecture To Meet Real-world Needs'." This is the first wiki record that Spark Structured Streaming's architecture is the subject of a Databricks-authored academic paper at VLDB 2026. The paper itself is not yet available, but the blog confirms that the streaming side of the SDP two-track model has its own academic publication. Captured as a forward reference on systems/spark-streaming.
- Bangalore is named as a "large Databricks R&D hub" and as SIGMOD 2026's host city. "SIGMOD will take place in Bangalore, India which is also a large Databricks R&D hub." This is incidental, but it is the only company-geography disclosure on the wiki for Databricks' India presence; it is useful as a datapoint when other Databricks sources reference India-hosted teams.
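The post does not say how Enzyme handles non-deterministic functions. Purely as an illustration of the first candidate mentioned in the takeaways (binding `current_date()` to a snapshot timestamp), the following minimal Python sketch pins one date per refresh so that processing the base rows and the delta separately gives the same answer as a full recompute. `RefreshContext`, `days_open`, and `refresh` are hypothetical names invented for this sketch; none of them is an Enzyme or SDP API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RefreshContext:
    """Hypothetical per-refresh context: one pinned date for the whole run."""
    snapshot_date: date


def days_open(order_date: date, ctx: RefreshContext) -> int:
    # Stand-in for an MV expression that calls current_date():
    # it reads the pinned value instead of the wall clock.
    return (ctx.snapshot_date - order_date).days


def refresh(rows, ctx):
    """Evaluate the expression for each (order_id, order_date) row under one pinned date."""
    return {order_id: days_open(order_date, ctx) for order_id, order_date in rows}


ctx = RefreshContext(snapshot_date=date(2026, 6, 1))
base = [(1, date(2026, 5, 30)), (2, date(2026, 5, 25))]
delta = [(3, date(2026, 6, 1))]

# Full recompute over all rows.
full = refresh(base + delta, ctx)
# Incremental: result for the base rows merged with the result for the delta only.
incremental = {**refresh(base, ctx), **refresh(delta, ctx)}
print(full == incremental)
```

Pinning only makes a single refresh self-consistent. Rows computed under an earlier snapshot date still have to be revisited or invalidated when the date advances, and the post gives no indication of how (or whether) Enzyme does that.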
Architecture extracted
The SDP two-track model
            ┌──────────────────────────────────────────┐
            │    Spark Declarative Pipelines (SDP)     │
            │  Single user-facing pipeline definition  │
            └──────────────────┬───────────────────────┘
                               │
            ┌──────────────────┴──────────────────┐
            ▼                                     ▼
┌────────────────────────┐    ┌────────────────────────────┐
│ @dp.materialized_view  │    │ @dp.table (streaming)      │
│ (declarative MV)       │    │ + Structured Streaming     │
│                        │    │   APIs in pipeline body    │
│   ┌────────────────┐   │    │  ┌─────────────────────┐   │
│   │ Enzyme IVM     │   │    │  │ Structured          │   │
│   │ engine         │   │    │  │ Streaming engine    │   │
│   │                │   │    │  │ - stateful operators│   │
│   │ - joins        │   │    │  │ - watermarks        │   │
│   │ - windows      │   │    │  │ - custom aggregates │   │
│   │ - aggregations │   │    │  │ - micro-batch       │   │
│   │ - combinations │   │    │  └─────────────────────┘   │
│   │ - non-determ.  │   │    │  (VLDB 2026 paper)         │
│   │   funcs        │   │    │                            │
│   │ - Python + SQL │   │    │                            │
│   │ - cost model   │   │    │                            │
│   └────────────────┘   │    │                            │
│ (SIGMOD 2026 paper)    │    │                            │
└────────────────────────┘    └────────────────────────────┘
            ▲                                     ▲
            └─────── mix-and-match within ────────┘
                       a single pipeline
Enzyme architectural axes (four claims)
| Axis | Industry-typical | Enzyme |
|---|---|---|
| MV grammar | Per-shape (joins OR aggregations OR windows) | Joins + windows + aggregations + combinations, all incrementally maintained in one engine |
| Determinism | MV must be deterministic; non-determ. → full recompute or rejection | current_date() and AI functions handled correctly under incremental maintenance |
| Languages | SQL only | Python + SQL; accurate change detection on MV definitions is named as a challenge that Enzyme solves |
| Strategy selection | Per-MV static heuristic (partition-level only OR row-level only) | Cost-based runtime choice: partition-level vs row-level updates per run; selective intermediate-result caching; cost model fed by plan info + prior execution stats |
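The strategy-selection row can be made concrete with a toy cost comparison. The post names only "plan information and prior executions" as inputs, so everything in the following Python sketch is an assumption made for illustration: the `PartitionStats` fields, the per-row cost derived from made-up run history, and the two-way choice itself. It is not Enzyme's cost model.

```python
from dataclasses import dataclass


@dataclass
class PartitionStats:
    total_rows: int     # rows currently stored in the partition
    changed_rows: int   # rows touched by this refresh


def per_row_cost(prior_runs):
    """Average seconds per row over (rows_processed, seconds) pairs from earlier refreshes."""
    rows = sum(r for r, _ in prior_runs)
    seconds = sum(s for _, s in prior_runs)
    return seconds / rows


def choose_strategy(stats, row_level_runs, partition_runs):
    """Return the cheaper update granularity under this toy cost model."""
    row_cost = stats.changed_rows * per_row_cost(row_level_runs)
    partition_cost = stats.total_rows * per_row_cost(partition_runs)
    return "row-level" if row_cost < partition_cost else "partition-level"


# Made-up history: row-level updates cost more per row than sequential rewrites.
row_level_runs = [(1_000, 2.0), (4_000, 8.0)]   # 0.002 s per changed row
partition_runs = [(1_000_000, 100.0)]           # 0.0001 s per rewritten row

small_delta = choose_strategy(PartitionStats(1_000_000, 100), row_level_runs, partition_runs)
large_delta = choose_strategy(PartitionStats(1_000_000, 400_000), row_level_runs, partition_runs)
print(small_delta, large_delta)
```

With these invented numbers, a refresh that touches 100 rows of a million-row partition favours row-level updates, while one that touches 400,000 rows favours rewriting the partition. The crossover point depends entirely on the assumed per-row costs.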
Toy MV example (verbatim from the post)
CREATE MATERIALIZED VIEW order_report as
SELECT region, sum(orders)
FROM customer_and_order_table
GROUP by region
The author's framing: "While keeping the above toy MV updated seems simple, imagine if the MV needed to join data across multiple tables or had window functions or made calls to LLM functions." — i.e. the production MV grammar Enzyme handles is a strict superset of this illustrative example.
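To show why the toy case "seems simple", here is a minimal Python sketch of textbook delta-based maintenance for `sum(orders) GROUP BY region`. It keeps a per-region row count next to the sum so that a group disappears when its last row is deleted. The function names and the in-memory state layout are assumptions made for this sketch and say nothing about how Enzyme stores or updates MVs.

```python
def apply_delta(state, inserted=(), deleted=()):
    """Update {region: [sum, row_count]} from changed rows only, without rescanning the base table."""
    new_state = {region: list(entry) for region, entry in state.items()}
    for region, orders in inserted:
        entry = new_state.setdefault(region, [0, 0])
        entry[0] += orders
        entry[1] += 1
    for region, orders in deleted:
        entry = new_state[region]
        entry[0] -= orders
        entry[1] -= 1
        if entry[1] == 0:
            # Last row of the group is gone, so the group leaves the view.
            del new_state[region]
    return new_state


def view(state):
    """Project the maintained state to the MV's visible columns (region, sum)."""
    return {region: total for region, (total, _count) in state.items()}


base = [("EMEA", 10), ("EMEA", 5), ("APAC", 7)]
state = apply_delta({}, inserted=base)

# One refresh: two inserted rows and one deleted row.
state = apply_delta(state, inserted=[("AMER", 3), ("EMEA", 1)], deleted=[("APAC", 7)])

# Reference result: rebuild from the current base table contents.
current = [("EMEA", 10), ("EMEA", 5), ("AMER", 3), ("EMEA", 1)]
recomputed = apply_delta({}, inserted=current)
print(view(state) == view(recomputed))
```

SUM is easy because it can be updated from inserted and deleted rows alone. Joins, window functions, and non-deterministic calls do not have this property in general, which is the gap the post says Enzyme addresses.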
Numbers disclosed
| Quantity | Value | Notes |
|---|---|---|
| Conference | SIGMOD 2026 | Bangalore, India, June 1–5 |
| Award | Honorable mention | for the Enzyme paper |
| arXiv id | 2603.27775 | Enzyme paper |
| Companion paper | VLDB 2026 | Spark Structured Streaming evolution |
| Sponsorship | Platinum | Databricks at SIGMOD 2026 |
| Performance | "significantly better than CV-IVM" | No absolute numbers; competitor anonymised due to licensing |
Caveats
- Announcement-shape post, not architecture deep-dive. The technical claims are summarised at one paragraph each; mechanisms (especially for non-deterministic-function handling and the cost model's feature set) are deferred to the paper.
- No absolute performance numbers. The single chart is relative speedup vs CV-IVM; no QPS, latency, throughput, MV size, freshness SLA, or workload-mix disclosure.
- CV-IVM is anonymised ("name anonymized to CV-IVM due to licensing restrictions"), so cross-vendor comparison from the blog post alone is not possible; the post does not say whether the paper itself names the competitor.
- Non-determinism mechanism not disclosed. The claim of correct IVM under `current_date()` and AI functions is made; the technique (timestamp pinning? per-row caching with invalidation? opt-out taint analysis?) is not stated.
- Python-MV change-detection mechanism not disclosed. The claim is made that Enzyme "accurately detect[s] changes in MV definition" for Python; the technique (AST canonicalisation, byte-code hash, dependency closure) is not stated.
- Cost model features not enumerated. "Plan information and prior executions" is the only description; specific signals (cardinality estimates, partition statistics, prior-run timing, cache hit rates) are not listed.
- No workload class boundaries given. It is unclear whether Enzyme's claimed coverage extends to all of (a) joins of three or more tables, (b) recursive CTEs, (c) UDFs and non-deterministic functions other than `current_date()` and AI functions, (d) MVs over external Iceberg/Delta tables vs managed tables only.
- No relationship to the Spark optimiser disclosed. Enzyme is presumably a layer above or beside the Spark logical-plan optimiser (Catalyst); the integration model is not described.
- Streaming-side architecture deferred to VLDB paper. The blog post names the VLDB 2026 paper but does not summarise its contributions.
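One of the candidate techniques listed above for Python-MV change detection, AST canonicalisation, can be sketched with the standard library alone. This is an illustration of the general idea, not Enzyme's mechanism: `definition_fingerprint` is a hypothetical helper, and the sample MV bodies are invented strings that are parsed but never executed.

```python
import ast
import hashlib


def definition_fingerprint(source: str) -> str:
    """Hash the parsed AST so that comments and whitespace do not change the fingerprint."""
    tree = ast.parse(source)
    # include_attributes=False leaves out line and column numbers.
    canonical = ast.dump(tree, include_attributes=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


v1 = "def report(df):\n    return df.groupBy('region').sum('orders')\n"
# Same logic with a comment added and different spacing.
v2 = "def report(df):\n    # reformatted\n    return df.groupBy( 'region' ).sum( 'orders' )\n"
# Different logic: max instead of sum.
v3 = "def report(df):\n    return df.groupBy('region').max('orders')\n"

unchanged = definition_fingerprint(v1) == definition_fingerprint(v2)
changed = definition_fingerprint(v1) != definition_fingerprint(v3)
print(unchanged, changed)
```

A fingerprint of one function body does not see changes in helper functions, imported modules, or captured variables, which is why the dependency-closure question raised in the caveat remains the hard part.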
Source
- Original: https://www.databricks.com/blog/databricks-sigmod-2026
- Raw markdown: raw/databricks/2026-05-29-databricks-at-sigmod-2026-5e442c81.md
- Enzyme paper (arXiv): https://arxiv.org/abs/2603.27775
Related
- systems/enzyme-ivm — the IVM engine first publicly named in this post.
- systems/lakeflow-spark-declarative-pipelines — the user-facing surface that Enzyme powers.
- systems/spark-streaming — the second SDP track; subject of the companion VLDB 2026 paper.
- systems/apache-spark — the substrate beneath both engines.
- concepts/incremental-view-maintenance — the literature Enzyme extends.
- concepts/materialized-view — the abstraction Enzyme maintains.
- concepts/non-deterministic-mv-maintenance — the claim that distinguishes Enzyme from prior industrial IVM.
- concepts/multi-language-materialized-view — Python + SQL MVs.
- concepts/declarative-vs-imperative-stream-api — the two SDP tracks instantiate this distinction.
- patterns/cost-model-driven-incrementalization-strategy — Enzyme's runtime choice between partition-level and row-level updates.
- companies/databricks — publisher.