Lakeflow Spark Declarative Pipelines (SDP)
Lakeflow Spark Declarative Pipelines (SDP) is Databricks'
declarative ingestion-to-features layer on top of
Spark. Pipelines are authored in Python
(the pyspark.pipelines module, imported as dp) using two
decorators:
- @dp.table: declares a streaming table (bronze-tier ingest, continuously fed).
- @dp.materialized_view: declares a materialised view (a derived aggregate or transformation kept in sync with upstream tables).
The declarative model handles schema evolution, late-arriving events, and continuous aggregation, so pipeline authors describe what the pipeline produces rather than how state is maintained over time.
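To make "late-arriving events" and "continuous aggregation" concrete, here is a minimal plain-Python sketch of an event-time tumbling-window counter with a watermark. It is a toy model, not SDP or Structured Streaming code: the class name `WindowedCounter`, the window length, and the allowed-lateness value are all assumptions chosen for illustration, and the real runtime's watermark semantics are more involved.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class WindowedCounter:
    """Toy event-time tumbling-window counter with a watermark (illustrative only)."""

    window_s: int = 60            # tumbling window length in seconds (assumed value)
    allowed_lateness_s: int = 30  # how far behind the newest event we still accept
    max_event_time: int = 0
    counts: dict = field(default_factory=lambda: defaultdict(int))
    dropped: int = 0

    def watermark(self) -> int:
        # Events older than this threshold are considered too late.
        return self.max_event_time - self.allowed_lateness_s

    def ingest(self, event_time: int) -> None:
        if event_time < self.watermark():
            self.dropped += 1  # too late: do not touch any window
            return
        self.max_event_time = max(self.max_event_time, event_time)
        window_start = event_time - event_time % self.window_s
        self.counts[window_start] += 1  # late-but-accepted events update old windows


counter = WindowedCounter()
for t in [10, 70, 65, 20, 130]:  # 65 arrives late but in time; 20 arrives too late
    counter.ingest(t)
print(dict(counter.counts), counter.dropped)
```

In a declarative pipeline the author would not write this bookkeeping by hand; the point of the sketch is only to show the kind of state the runtime has to maintain on the author's behalf.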
Two-track incremental-processing architecture
Underneath the user-facing decorators, SDP is two engines, not one. The 2026-05-29 Databricks at SIGMOD 2026 post is the first wiki source to disclose this clearly:
"There are two ways to write incremental programs in Spark Declarative Pipelines (SDP), and customers can mix-and-match these within a pipeline."
| Track | Decorator | Engine | Companion paper |
|---|---|---|---|
| Materialized-view track | @dp.materialized_view | Enzyme IVM engine | SIGMOD 2026 honorable mention: "Enzyme: Incremental View Maintenance for Data Engineering" (arXiv:2603.27775) |
| Streaming track | @dp.table + Structured Streaming APIs (stateful operators, watermarks, custom aggregations) | Spark Structured Streaming | VLDB 2026: "A Decade of Apache Spark Structured Streaming: How We Evolved the Architecture To Meet Real-world Needs" |

                      SDP pipeline definition
                                     │
                ┌────────────────────┴────────────────────┐
                ▼                                         ▼
      @dp.materialized_view                     @dp.table (streaming)
                │                                         │
                ▼                                         ▼
         ┌─────────────┐                        ┌──────────────────┐
         │   Enzyme    │                        │    Structured    │
         │ IVM engine  │                        │ Streaming engine │
         └─────────────┘                        └──────────────────┘
                ▲                                         ▲
                └───────── mix-and-match within ──────────┘
                             a single pipeline

The two tracks instantiate the declarative-vs-imperative stream-API distinction inside one pipeline definition: MV authors describe what, streaming authors describe how.
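The idea behind the materialized-view track, incremental view maintenance, can be sketched in plain Python by comparing a full rebuild of an aggregate with an update driven only by the changed rows. This is a toy under stated assumptions, not Enzyme's algorithm: the function names `full_recompute` and `apply_delta` and the `(key, amount)` row shape are hypothetical, and the view keeps a per-key row count so that a key disappears only when its last row is deleted.

```python
def full_recompute(rows):
    """Rebuild the per-key (total, row_count) view by scanning every base row."""
    view = {}
    for key, amount in rows:
        total, count = view.get(key, (0, 0))
        view[key] = (total + amount, count + 1)
    return view


def apply_delta(view, inserted=(), deleted=()):
    """Return an updated copy of the view, touching only keys present in the delta."""
    updated = dict(view)
    for sign, batch in ((1, inserted), (-1, deleted)):
        for key, amount in batch:
            total, count = updated.get(key, (0, 0))
            total, count = total + sign * amount, count + sign
            if count == 0:
                updated.pop(key, None)  # last row for this key was deleted
            else:
                updated[key] = (total, count)
    return updated


base = [("a", 5), ("b", 3), ("a", 2)]
view = full_recompute(base)

# One refresh: two rows inserted, one row deleted upstream.
inserted, deleted = [("c", 4), ("a", 1)], [("b", 3)]
incremental = apply_delta(view, inserted, deleted)

new_base = [("a", 5), ("a", 2), ("c", 4), ("a", 1)]
assert incremental == full_recompute(new_base)  # same answer, less work
```

The final assertion states the correctness condition any incremental strategy has to meet: the maintained view must equal what a full recompute over the new base data would produce.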
Stub page. First wiki ingest naming Lakeflow SDP; the ingested source (Databricks multimodal post) uses it as the illustrative wearables-streaming tool inside the governed-Delta-tables-per-modality pattern.
Role in multimodal lakehouse architecture
"Wearables streams introduce operational requirements: schema evolution, late-arriving events, and continuous aggregation. Lakeflow Spark Declarative Pipelines (SDP) provides a robust ingestion-to-features pattern for streaming tables and materialized views." (Source: sources/2026-04-22-databricks-multimodal-data-integration-production-architectures-for-healthcare-ai)
Key properties:
- Declarative. Author describes outputs + their input dependencies; the runtime chooses incremental recompute vs full rebuild.
- Streaming-native. @dp.table on a streaming source yields a continuously-updated Delta table.
- Materialised-view semantics. @dp.materialized_view outputs reflect upstream changes without a manual refresh.
- Schema evolution + late events handled by the pipeline runtime, not the pipeline author.
Why it matters for sysdesign
Lakeflow SDP is the streaming-side complement to Delta Lake's batch shape: both present a table interface, both are UC-governed, both inherit the lakehouse's reproducibility story (time travel, lineage, MLflow integration). It lets the governed-Delta-tables-per-modality pattern cover streaming modalities (wearables, IoT, clickstreams) without introducing a separate stream-processing tier.
Syntax note from the source
"The pyspark.pipelines module (imported as dp) with @dp.table and @dp.materialized_view decorators follows current Databricks Lakeflow SDP Python semantics." (Source: sources/2026-04-22-databricks-multimodal-data-integration-production-architectures-for-healthcare-ai)
The explicit currency disclaimer — "current" Python semantics — flags that Databricks' decorator vocabulary has evolved and may evolve again.
Seen in
- sources/2026-04-22-databricks-multimodal-data-integration-production-architectures-for-healthcare-ai: Databricks names Lakeflow SDP as the wearables-streaming tool inside its multimodal lakehouse pattern; cited for schema evolution + late-event handling + continuous aggregation over wearables streams. First wiki ingest naming Lakeflow SDP.
- sources/2026-04-22-databricks-stop-hand-coding-change-data-capture-pipelines: SDP as the runtime host for AutoCDC, Databricks' declarative CDC / SCD API. Second wiki ingest naming Lakeflow SDP; canonicalises the runtime's load-bearing correctness properties that AutoCDC inherits: incremental-progress tracking, out-of-sequence arrival handling, reprocessing safety, schema evolution, failure recovery without lost or doubled changes. "Lakeflow Spark Declarative Pipelines automatically tracks incremental progress and handles out-of-sequence data. Pipelines can recover from failures, reprocess historical data, and evolve over time without double-applying or losing changes." The AutoCDC API adds CDC/SCD-specific authoring surface (dp.create_auto_cdc_flow with keys, sequence_by, apply_as_deletes, stored_as_scd_type parameters) atop the SDP runtime's general streaming guarantees. Composes with the @dp.view, @dp.table, @dp.materialized_view, dp.create_streaming_table primitives disclosed in the multimodal post. Perf gains disclosed as Databricks Runtime improvements since Nov 2025: 71% better perf-per-dollar on SCD Type 1, 96% on SCD Type 2 workloads, propagated universally to AutoCDC pipelines because the declarative API lets engine-level optimisations apply to every AutoCDC flow without author intervention. Named regulated-vertical adopters at production scale: Navy Federal Credit Union, Block, Valora Group. First wiki source to canonicalise SDP's CDC/SCD API surface (distinct from the streaming-wearables role in the prior source).
- sources/2026-05-29-databricks-databricks-at-sigmod-2026:
First wiki disclosure of the two-track architecture beneath SDP, and the first naming of Enzyme as the IVM engine behind @dp.materialized_view. "There are two ways to write incremental programs in Spark Declarative Pipelines (SDP), and customers can mix-and-match these within a pipeline." The two tracks: (a) the materialized-view track: @dp.materialized_view → Enzyme IVM engine (subject of the Databricks SIGMOD 2026 honorable-mention paper "Enzyme: Incremental View Maintenance for Data Engineering", arXiv:2603.27775); (b) the streaming track: @dp.table + Structured Streaming APIs with stateful operators, watermarks, and custom aggregations (subject of the companion VLDB 2026 paper "A Decade of Apache Spark Structured Streaming"). Authors can use either or both within a single pipeline. The MV track's thesis, quoted from the source, is MV-as-ETL-primitive: "Our key observation is that if MVs can be efficiently and incrementally maintained, it will significantly simplify ETL workloads which otherwise require writing complex custom code." Enzyme's four claimed industrial novelties on top of the prior IVM literature: full MV-grammar coverage including joins + windows + aggregations + combinations; non-deterministic function support (current_date(), AI functions); multi-language MVs (Python + SQL); cost-model-driven incrementalisation strategy (partition-level vs row-level updates per run, selective intermediate-result caching, plan-info + prior-execution-stats inputs). This source clarifies that the earlier-disclosed @dp.materialized_view decorator (multimodal / AutoCDC sources) is not a magic incantation but a surface that binds to a specific named IVM engine, Enzyme, with explicit paper-level documentation.
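The correctness properties attributed to AutoCDC in the list above (out-of-sequence handling, reprocessing without double-applying) can be illustrated with a toy SCD Type 1 apply loop in plain Python. This is a sketch under stated assumptions, not Databricks' implementation and not a call into dp.create_auto_cdc_flow: the function `apply_changes_scd1`, the `last_seq` bookkeeping dict, and the `is_delete` callable are hypothetical, and only the parameter names `keys` and `sequence_by` are borrowed from the source.

```python
def apply_changes_scd1(target, last_seq, changes, keys, sequence_by, is_delete):
    """Apply change records to `target`, keeping only the latest state per key.

    `last_seq` remembers the highest sequence value seen per key, including
    deletes, so stale or replayed changes are ignored.
    """
    for change in changes:
        key = tuple(change[k] for k in keys)
        seq = change[sequence_by]
        if seq <= last_seq.get(key, float("-inf")):
            continue  # out-of-sequence or already-applied change: skip it
        last_seq[key] = seq
        if is_delete(change):
            target.pop(key, None)
        else:
            target[key] = change
    return target


changes = [
    {"id": 1, "name": "Ann", "seq": 1, "op": "UPSERT"},
    {"id": 1, "name": "Anna", "seq": 3, "op": "UPSERT"},
    {"id": 1, "name": "Annie", "seq": 2, "op": "UPSERT"},  # arrives late, ignored
    {"id": 2, "name": "Bob", "seq": 1, "op": "UPSERT"},
    {"id": 2, "seq": 4, "op": "DELETE"},
    {"id": 2, "name": "Bobby", "seq": 2, "op": "UPSERT"},  # must not resurrect id 2
]

target, last_seq = {}, {}
for _ in range(2):  # second pass simulates reprocessing the same batch
    apply_changes_scd1(
        target, last_seq, changes,
        keys=["id"], sequence_by="seq",
        is_delete=lambda c: c["op"] == "DELETE",
    )
print(target)
```

Tracking the sequence value for deleted keys as well is what stops the late "Bobby" record from bringing the row back, and the same check makes a replay of the batch a no-op.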