
SYSTEM Cited by 3 sources

Lakeflow Spark Declarative Pipelines (SDP)

Lakeflow Spark Declarative Pipelines (SDP) is Databricks' declarative ingestion-to-features layer on top of Spark. Pipelines are authored in Python (the pyspark.pipelines module, imported as dp) using two decorators:

  • @dp.table — declares a streaming table (bronze-tier ingest, continuously fed).
  • @dp.materialized_view — declares a materialised view (a derived aggregate or transformation kept in sync with upstream tables).

The declarative model handles schema evolution, late-arriving events, and continuous aggregation, so pipeline authors describe what the pipeline produces rather than how state is maintained over time.
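The "describe what, not how" idea can be illustrated with a toy decorator registry. This is a minimal sketch in plain Python, not the pyspark.pipelines implementation: the ToyPipeline class, its run method, and the reads parameter are all hypothetical names invented for illustration. The real decorators do not necessarily take a reads argument.

```python
from graphlib import TopologicalSorter


class ToyPipeline:
    """Hypothetical stand-in for a declarative runtime (NOT pyspark.pipelines)."""

    def __init__(self):
        # dataset name -> (kind, producing function, upstream dataset names)
        self.datasets = {}

    def _register(self, kind, reads):
        def decorator(fn):
            self.datasets[fn.__name__] = (kind, fn, tuple(reads))
            return fn
        return decorator

    def table(self, reads=()):
        # `reads` is an illustrative assumption used to declare dependencies.
        return self._register("streaming_table", reads)

    def materialized_view(self, reads=()):
        return self._register("materialized_view", reads)

    def run(self):
        # The runtime, not the author, derives execution order from dependencies.
        graph = {name: set(reads) for name, (_, _, reads) in self.datasets.items()}
        results = {}
        for name in TopologicalSorter(graph).static_order():
            _, fn, reads = self.datasets[name]
            results[name] = fn(*(results[r] for r in reads))
        return results


toy = ToyPipeline()


# Declared out of order on purpose: the view is defined before its input.
@toy.materialized_view(reads=["heart_rate_bronze"])
def heart_rate_avg(rows):
    by_device = {}
    for device, bpm in rows:
        by_device.setdefault(device, []).append(bpm)
    return {device: sum(v) / len(v) for device, v in by_device.items()}


@toy.table()
def heart_rate_bronze():
    return [("watch-1", 60), ("watch-1", 80), ("watch-2", 70)]


results = toy.run()
print(results["heart_rate_avg"])  # {'watch-1': 70.0, 'watch-2': 70.0}
```

The author only names datasets and their inputs; ordering is resolved by the toy runtime, which is the property the decorators above are described as providing.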

Two-track incremental-processing architecture

Underneath the user-facing decorators, SDP is two engines, not one. The 2026-05-29 Databricks at SIGMOD 2026 post is the first wiki source to disclose this clearly:

"There are two ways to write incremental programs in Spark Declarative Pipelines (SDP), and customers can mix-and-match these within a pipeline."

| Track | Decorator | Engine | Companion paper |
|---|---|---|---|
| Materialized-view track | @dp.materialized_view | Enzyme IVM engine | SIGMOD 2026 honorable mention: "Enzyme: Incremental View Maintenance for Data Engineering" (arXiv:2603.27775) |
| Streaming track | @dp.table + Structured Streaming APIs (stateful operators, watermarks, custom aggregations) | Spark Structured Streaming | VLDB 2026: "A Decade of Apache Spark Structured Streaming: How We Evolved the Architecture To Meet Real-world Needs" |

                      SDP pipeline definition
         ┌────────────────────┴────────────────────┐
         ▼                                         ▼
  @dp.materialized_view                  @dp.table (streaming)
         │                                         │
         ▼                                         ▼
  ┌─────────────┐                         ┌──────────────────┐
  │  Enzyme     │                         │ Structured       │
  │  IVM engine │                         │ Streaming engine │
  └─────────────┘                         └──────────────────┘
                ▲                                     ▲
                └────── mix-and-match within ────────┘
                        a single pipeline

The two tracks instantiate the declarative-vs-imperative stream-API distinction inside one pipeline definition: MV authors describe what, streaming authors describe how.
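To make the materialized-view track concrete, here is a minimal sketch of incremental view maintenance for a per-key (sum, count) aggregate. It uses the textbook delta-propagation idea and is not Enzyme's algorithm, which the source describes only at a high level; the function names full_recompute and apply_delta are hypothetical.

```python
from collections import defaultdict


def full_recompute(rows):
    """Rebuild the per-key (sum, count) view from all base rows."""
    view = defaultdict(lambda: [0, 0])
    for key, value in rows:
        view[key][0] += value
        view[key][1] += 1
    return {k: tuple(v) for k, v in view.items()}


def apply_delta(view, inserted=(), deleted=()):
    """Return a new view computed from the old view plus changed rows only."""
    updated = {k: list(v) for k, v in view.items()}
    for sign, rows in ((+1, inserted), (-1, deleted)):
        for key, value in rows:
            agg = updated.setdefault(key, [0, 0])
            agg[0] += sign * value
            agg[1] += sign
    # Keys whose row count drops to zero disappear from the view.
    return {k: tuple(v) for k, v in updated.items() if v[1] > 0}


base = [("a", 10), ("a", 5), ("b", 7)]
view = full_recompute(base)

inserted = [("a", 1), ("c", 4)]
deleted = [("b", 7)]
incremental = apply_delta(view, inserted, deleted)

# The incremental result must match a rebuild over the changed base table.
new_base = [("a", 10), ("a", 5), ("a", 1), ("c", 4)]
print(incremental == full_recompute(new_base))  # True
```

The incremental path reads three changed rows instead of the whole base table, which is the kind of saving an IVM engine aims for when the delta is small relative to the data.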

Stub page. First wiki ingest naming Lakeflow SDP; the ingested source (Databricks multimodal post) uses it as the illustrative wearables-streaming tool inside the governed-Delta-tables-per-modality pattern.

Role in multimodal lakehouse architecture

"Wearables streams introduce operational requirements: schema evolution, late-arriving events, and continuous aggregation. Lakeflow Spark Declarative Pipelines (SDP) provides a robust ingestion-to-features pattern for streaming tables and materialized views." (Source: sources/2026-04-22-databricks-multimodal-data-integration-production-architectures-for-healthcare-ai)

Key properties:

  • Declarative. Author describes outputs + their input dependencies; the runtime chooses incremental recompute vs full rebuild.
  • Streaming-native. @dp.table on a streaming source yields a continuously-updated Delta table.
  • Materialised-view semantics. @dp.materialized_view outputs reflect upstream changes without a manual refresh.
  • Schema evolution + late events handled by the pipeline runtime, not the pipeline author.
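The late-event handling listed above can be pictured with a simplified watermark model. This is a sketch of the general watermarking idea used by stream processors, not the SDP or Structured Streaming implementation; the TumblingWindowCounter class and its attribute names are hypothetical, and event times are plain integers for brevity.

```python
from collections import defaultdict


class TumblingWindowCounter:
    """Counts events per tumbling window, tolerating bounded lateness."""

    def __init__(self, window_size, allowed_lateness):
        self.window_size = window_size
        self.allowed_lateness = allowed_lateness
        self.max_event_time = 0
        self.open_windows = defaultdict(int)  # window start -> running count
        self.finalized = {}                   # window start -> final count
        self.dropped = 0

    @property
    def watermark(self):
        # Everything older than this is assumed to have already arrived.
        return self.max_event_time - self.allowed_lateness

    def process(self, event_time):
        window_start = (event_time // self.window_size) * self.window_size
        if window_start + self.window_size <= self.watermark:
            self.dropped += 1  # window already closed: event is too late
            return
        self.open_windows[window_start] += 1
        self.max_event_time = max(self.max_event_time, event_time)
        # Close every window whose end the watermark has passed.
        for start in sorted(self.open_windows):
            if start + self.window_size <= self.watermark:
                self.finalized[start] = self.open_windows.pop(start)


counter = TumblingWindowCounter(window_size=10, allowed_lateness=5)
for event_time in [1, 3, 12, 8, 27, 4]:
    counter.process(event_time)

print(counter.finalized, dict(counter.open_windows), counter.dropped)
# {0: 3, 10: 1} {20: 1} 1
```

The event at time 8 arrives after the one at time 12 but is still counted because its window is open; the event at time 4 arrives after the watermark has passed its window and is dropped.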

Why it matters for sysdesign

Lakeflow SDP is the streaming-side complement to Delta Lake's batch shape: both present a table interface, both are governed by Unity Catalog (UC), and both inherit the lakehouse's reproducibility story (time travel, lineage, MLflow integration). It lets the governed-Delta-tables-per-modality pattern cover streaming modalities (wearables, IoT, clickstreams) without introducing a separate stream-processing tier.

Syntax note from the source

"The pyspark.pipelines module (imported as dp) with @dp.table and @dp.materialized_view decorators follows current Databricks Lakeflow SDP Python semantics." (Source: sources/2026-04-22-databricks-multimodal-data-integration-production-architectures-for-healthcare-ai)

The explicit currency disclaimer — "current" Python semantics — flags that Databricks' decorator vocabulary has evolved and may evolve again.

Seen in

  • sources/2026-04-22-databricks-multimodal-data-integration-production-architectures-for-healthcare-ai — Databricks names Lakeflow SDP as the wearables-streaming tool inside its multimodal lakehouse pattern; cited for schema evolution + late-event handling + continuous aggregation over wearables streams. First wiki ingest naming Lakeflow SDP.
  • sources/2026-04-22-databricks-stop-hand-coding-change-data-capture-pipelines: SDP as the runtime host for AutoCDC, Databricks' declarative CDC / SCD API. Second wiki ingest naming Lakeflow SDP. It canonicalises the runtime correctness properties that AutoCDC inherits: incremental-progress tracking, out-of-sequence arrival handling, reprocessing safety, schema evolution, and failure recovery without lost or doubled changes. "Lakeflow Spark Declarative Pipelines automatically tracks incremental progress and handles out-of-sequence data. Pipelines can recover from failures, reprocess historical data, and evolve over time without double-applying or losing changes." The AutoCDC API adds a CDC/SCD-specific authoring surface (dp.create_auto_cdc_flow with keys, sequence_by, apply_as_deletes, and stored_as_scd_type parameters) atop the SDP runtime's general streaming guarantees, and composes with the @dp.view, @dp.table, @dp.materialized_view, and dp.create_streaming_table primitives disclosed in the multimodal post. Perf gains are disclosed as Databricks Runtime improvements since Nov 2025: 71% better perf-per-dollar on SCD Type 1 and 96% on SCD Type 2 workloads. These gains reach all AutoCDC pipelines because the declarative API lets engine-level optimisations apply to every AutoCDC flow without author intervention. Named regulated-vertical adopters at production scale: Navy Federal Credit Union, Block, Valora Group. First wiki source to canonicalise SDP's CDC/SCD API surface (distinct from the streaming-wearables role in the prior source).
  • sources/2026-05-29-databricks-databricks-at-sigmod-2026: First wiki disclosure of the two-track architecture beneath SDP, and the first naming of Enzyme as the IVM engine behind @dp.materialized_view. "There are two ways to write incremental programs in Spark Declarative Pipelines (SDP), and customers can mix-and-match these within a pipeline." The two tracks: (a) the materialized-view track, in which @dp.materialized_view is backed by the Enzyme IVM engine (subject of the Databricks SIGMOD 2026 honorable-mention paper "Enzyme: Incremental View Maintenance for Data Engineering", arXiv:2603.27775); (b) the streaming track, @dp.table + Structured Streaming APIs with stateful operators, watermarks, and custom aggregations (subject of the companion VLDB 2026 paper "A Decade of Apache Spark Structured Streaming"). Authors can use either or both within a single pipeline. The MV track's thesis, quoted from the source, is MV-as-ETL-primitive: "Our key observation is that if MVs can be efficiently and incrementally maintained, it will significantly simplify ETL workloads which otherwise require writing complex custom code." Enzyme's four claimed industrial novelties on top of the prior IVM literature: full MV-grammar coverage including joins + windows + aggregations + combinations; non-deterministic function support (current_date(), AI functions); multi-language MVs (Python + SQL); cost-model-driven incrementalisation strategy (partition-level vs row-level updates per run, selective intermediate-result caching, plan-info + prior-execution-stats inputs). This source clarifies that the earlier-disclosed @dp.materialized_view decorator (multimodal / AutoCDC sources) is not a magic incantation but a surface that binds to a specific named IVM engine, Enzyme, with explicit paper-level documentation.
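The correctness properties listed in the AutoCDC entry above (out-of-sequence arrival handling, reprocessing without double-applying changes) can be sketched for the SCD Type 1 case in plain Python. This is an illustrative model only, not how dp.create_auto_cdc_flow is implemented: apply_changes_scd1 and visible_rows are hypothetical helpers, and the key, sequence_by, and is_delete arguments merely echo the parameter names quoted from the source.

```python
def apply_changes_scd1(target, changes, key, sequence_by, is_delete):
    """Return a new SCD Type 1 state after applying change records.

    `target` maps key -> (sequence, row_or_None). A None row is a tombstone,
    kept so that a late, older upsert cannot resurrect a deleted key.
    """
    state = dict(target)
    for change in changes:
        k, seq = change[key], change[sequence_by]
        current = state.get(k)
        if current is not None and current[0] >= seq:
            continue  # stale or already applied: skipping makes replays safe
        state[k] = (seq, None if is_delete(change) else change)
    return state


def visible_rows(state):
    """Rows a reader of the target table would see (tombstones hidden)."""
    return {k: row for k, (_, row) in state.items() if row is not None}


changes = [
    {"id": 1, "seq": 2, "op": "upsert", "name": "B"},
    {"id": 1, "seq": 1, "op": "upsert", "name": "A"},  # arrives late
    {"id": 2, "seq": 1, "op": "upsert", "name": "C"},
    {"id": 2, "seq": 3, "op": "delete", "name": None},
    {"id": 2, "seq": 2, "op": "upsert", "name": "D"},  # arrives after the delete
]


def delete_marker(change):
    return change["op"] == "delete"


state = apply_changes_scd1({}, changes, "id", "seq", delete_marker)
replayed = apply_changes_scd1(state, changes, "id", "seq", delete_marker)

print(visible_rows(state)[1]["name"])  # B
print(replayed == state)               # True
```

Ordering by the sequence column rather than by arrival order is what lets the late records be ignored, and comparing with >= is what makes reprocessing the same batch a no-op. In this sketch, two distinct records with the same sequence value for one key would keep whichever arrived first.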