
Cited by 2 sources

Lakeflow Spark Declarative Pipelines (SDP)

Lakeflow Spark Declarative Pipelines (SDP) is Databricks' declarative ingestion-to-features layer on top of Spark. Pipelines are authored in Python (the pyspark.pipelines module, imported as dp) using two decorators:

  • @dp.table — declares a streaming table (bronze-tier ingest, continuously fed).
  • @dp.materialized_view — declares a materialised view (a derived aggregate or transformation kept in sync with upstream tables).
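A minimal sketch of the two-decorator shape (table, source, and column names are illustrative assumptions; inside an SDP pipeline file the `spark` session is provided by the runtime):

```python
from pyspark import pipelines as dp

# Bronze streaming table: continuously fed from an upstream streaming
# source. "raw_events" is an assumed source table name.
@dp.table
def events_bronze():
    return spark.readStream.table("raw_events")

# Materialised view: a derived transformation the runtime keeps in sync
# with the bronze table; no refresh schedule is declared here.
@dp.materialized_view
def events_clean():
    return spark.read.table("events_bronze").filter("event_id IS NOT NULL")
```

The author declares only the outputs and their input dependencies; state maintenance and refresh strategy belong to the runtime.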

The declarative model handles schema evolution, late-arriving events, and continuous aggregation, so pipeline authors describe what the pipeline produces rather than how state is maintained over time.

Stub page. First wiki ingest naming Lakeflow SDP; the ingested source (Databricks multimodal post) uses it as the illustrative wearables-streaming tool inside the governed-Delta-tables-per-modality pattern.

Role in multimodal lakehouse architecture

"Wearables streams introduce operational requirements: schema evolution, late-arriving events, and continuous aggregation. Lakeflow Spark Declarative Pipelines (SDP) provides a robust ingestion-to-features pattern for streaming tables and materialized views." (Source: sources/2026-04-22-databricks-multimodal-data-integration-production-architectures-for-healthcare-ai)

Key properties:

  • Declarative. Author describes outputs + their input dependencies; the runtime chooses incremental recompute vs full rebuild.
  • Streaming-native. @dp.table on a streaming source yields a continuously-updated Delta table.
  • Materialised-view semantics. @dp.materialized_view outputs reflect upstream changes without a manual refresh.
  • Schema evolution + late events handled by the pipeline runtime, not the pipeline author.
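Applied to the wearables case from the source, a hedged sketch (the landing path, JSON format, schema, and window width are assumptions; Auto Loader's `cloudFiles` format is one common streaming source on Databricks):

```python
from pyspark import pipelines as dp
from pyspark.sql.functions import avg, window

# Streaming ingest of raw wearable readings; path and format are
# illustrative assumptions.
@dp.table
def wearables_raw():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/health/wearables/landing")
    )

# Continuous aggregation: hourly mean heart rate per device. The SDP
# runtime, not this code, decides how the result is maintained as late
# events arrive and the schema evolves.
@dp.materialized_view
def hourly_heart_rate():
    return (
        spark.read.table("wearables_raw")
        .groupBy("device_id", window("event_time", "1 hour"))
        .agg(avg("heart_rate").alias("avg_heart_rate"))
    )
```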

Why it matters for sysdesign

Lakeflow SDP is the streaming-side complement to Delta Lake's batch shape: both present a table interface, both are UC-governed, both inherit the lakehouse's reproducibility story (time travel, lineage, MLflow integration). It lets the governed-Delta-tables-per-modality pattern cover streaming modalities (wearables, IoT, clickstreams) without introducing a separate stream-processing tier.

Syntax note from the source

"The pyspark.pipelines module (imported as dp) with @dp.table and @dp.materialized_view decorators follows current Databricks Lakeflow SDP Python semantics." (Source: sources/2026-04-22-databricks-multimodal-data-integration-production-architectures-for-healthcare-ai)

The explicit currency disclaimer — "current" Python semantics — flags that Databricks' decorator vocabulary has evolved and may evolve again.

Seen in

  • sources/2026-04-22-databricks-multimodal-data-integration-production-architectures-for-healthcare-ai — Databricks names Lakeflow SDP as the wearables-streaming tool inside its multimodal lakehouse pattern; cited for schema evolution + late-event handling + continuous aggregation over wearables streams. First wiki ingest naming Lakeflow SDP.
  • sources/2026-04-22-databricks-stop-hand-coding-change-data-capture-pipelines — SDP as the runtime host for AutoCDC, Databricks' declarative CDC / SCD API. Second wiki ingest naming Lakeflow SDP; canonicalises the runtime's load-bearing correctness properties that AutoCDC inherits: incremental-progress tracking, out-of-sequence arrival handling, reprocessing safety, schema evolution, failure recovery without lost or doubled changes — "Lakeflow Spark Declarative Pipelines automatically tracks incremental progress and handles out-of-sequence data. Pipelines can recover from failures, reprocess historical data, and evolve over time without double-applying or losing changes." The AutoCDC API adds CDC/SCD-specific authoring surface (dp.create_auto_cdc_flow with keys, sequence_by, apply_as_deletes, stored_as_scd_type parameters) atop the SDP runtime's general streaming guarantees. Composes with the @dp.view, @dp.table, @dp.materialized_view, dp.create_streaming_table primitives disclosed in the multimodal post. Perf gains disclosed as Databricks Runtime improvements since Nov 2025: 71% better perf-per-dollar on SCD Type 1, 96% on SCD Type 2 workloads — propagated universally to AutoCDC pipelines because the declarative API lets engine-level optimisations apply to every AutoCDC flow without author intervention. Named regulated-vertical adopters at production scale: Navy Federal Credit Union, Block, Valora Group. First wiki source to canonicalise SDP's CDC/SCD API surface (distinct from the streaming-wearables role in the prior source).
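The AutoCDC surface named above can be sketched as follows. The parameter names come from the ingested source; the target/source names, key column, ordering column, and the delete-predicate syntax are illustrative assumptions, not a confirmed signature:

```python
from pyspark import pipelines as dp

# Target streaming table that the CDC flow maintains (name illustrative).
dp.create_streaming_table("customers_scd2")

# Declarative CDC flow: keys identify rows, sequence_by orders
# out-of-sequence events, apply_as_deletes marks delete records,
# stored_as_scd_type=2 keeps full row history (SCD Type 2).
dp.create_auto_cdc_flow(
    target="customers_scd2",
    source="customers_cdc_feed",       # assumed upstream CDC feed
    keys=["customer_id"],              # assumed primary key
    sequence_by="event_ts",            # assumed ordering column
    apply_as_deletes="op = 'DELETE'",  # assumed delete predicate
    stored_as_scd_type=2,
)
```

Because the authoring surface is declarative, the SDP runtime's guarantees (incremental progress, out-of-sequence handling, reprocessing safety) and engine-level optimisations apply to this flow without changes to the code above.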