SYSTEM Cited by 12 sources

Delta Lake

Delta Lake is an open-source concepts/open-table-format built over systems/apache-parquet on object storage. It is one of the three canonical OTFs alongside systems/apache-iceberg and Apache Hudi, and the table format native to Databricks' Data Intelligence Platform.

Minimum viable framing for this wiki: it plays the same architectural role as Iceberg — ACID transactions, schema evolution, time-travel, snapshot-based metadata over immutable columnar files. See concepts/open-table-format for the shared shape and the gap the format class fills above concepts/immutable-object-storage.

Features cited in ingested sources

  • Deep Clone. Incrementally materialises a snapshot of a Delta table (typically backed by another Delta table, or a Delta-Sharing share) as a new, physically separate Delta table in the clone's object store. Subsequent deep-clone runs transfer only the delta since the previous clone. This is the replication primitive Mercedes-Benz's cross-cloud Sync Jobs use — it's the thing that makes patterns/cross-cloud-replica-cache economically viable at 60 TB.
  • VACUUM. Removes data files that are referenced only by old snapshots once the retention period has passed. Mercedes-Benz leans on VACUUM on the replicated Delta tables to enforce GDPR right-to-be-forgotten: deletions on the source propagate through the next Deep Clone sync; old files are then vacuumed out of ADLS on the recipient side.

(Source: sources/2026-04-20-databricks-mercedes-benz-cross-cloud-data-mesh)
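
The interaction between Deep Clone and VACUUM described above can be sketched with a toy in-memory model. This is an illustration only, not Delta Lake's implementation: `ToyTable`, `deep_clone_sync`, and `vacuum` are hypothetical names, a table is reduced to a set of live file names, a source-side delete is modelled as a file leaving the live set, and retention is a plain integer-time cutoff.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy stand-in for a Delta table (hypothetical, for illustration)."""
    live: set = field(default_factory=set)     # files referenced by the current snapshot
    store: dict = field(default_factory=dict)  # file -> time it became unreferenced (None if live)


def deep_clone_sync(source: ToyTable, replica: ToyTable, now: int) -> int:
    """Bring the replica to the source's current snapshot; return files copied."""
    to_copy = source.live - replica.live
    for name in to_copy:
        replica.store[name] = None             # physically copied and live
    for name in replica.live - source.live:
        replica.store[name] = now              # no longer referenced, but still on disk
    replica.live = set(source.live)
    return len(to_copy)


def vacuum(table: ToyTable, now: int, retention: int) -> list:
    """Physically delete files unreferenced for at least `retention` time units."""
    expired = sorted(
        name for name, since in table.store.items()
        if since is not None and now - since >= retention
    )
    for name in expired:
        del table.store[name]
    return expired


source = ToyTable(live={"a.parquet", "b.parquet"})
replica = ToyTable()

first = deep_clone_sync(source, replica, now=0)    # initial clone copies everything

# An erasure request removes b.parquet on the source; c.parquet is new data.
source.live = {"a.parquet", "c.parquet"}
second = deep_clone_sync(source, replica, now=1)   # only the difference is copied

removed = vacuum(replica, now=10, retention=7)     # b.parquet leaves the replica's storage
```

In this model the second sync copies one file rather than two, and the deleted file stays on the replica's storage until `vacuum` runs after the retention window, which is why the sync alone is not enough for the erasure use case.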

Seen in

  • sources/2026-05-28-databricks-advancing-apache-iceberg-on-databricks-iceberg-v3-ga-open-sharing-and-unified-governance: Format-co-evolution-with-Iceberg face (eleventh Delta Lake face on the wiki). Two distinct disclosures land together: (a) cross-format compatibility for deletion vectors / row tracking / VARIANT type is named explicitly in the context of Iceberg v3 reaching parity: "these features also work seamlessly across both Delta and Iceberg tables, enabling interoperability without rewriting data." Delta had analogues of all three earlier; v3 brings Iceberg to parity, and the cross-format compatibility is the architectural precondition for (b), the forward-looking proposal that Delta 5.0 adopt the same adaptive metadata tree structure that Iceberg v4 will introduce. Verbatim: "With Iceberg v4, we are rethinking the core metadata structure from the ground up […] we are also proposing that the next version of Delta, Delta 5.0, adopts the adaptive metadata tree structure." Canonicalised as concepts/format-co-evolution-iceberg-delta. Tier-3 marketing-roundup framing acknowledged in the source page; Delta 5.0 is a proposal, not a shipping spec; no mechanism depth on the adaptive metadata tree internals (deferred to conference session). Architectural significance: the wiki's first canonical disclosure of an explicit OTF-format-convergence direction, with Delta and Iceberg sharing core metadata internals and catalog-side bridges (UC bi-format Delta Sharing / managed Iceberg / cross-engine ABAC) making the format choice operational rather than strategic.

  • sources/2026-05-23-databricks-scaling-for-mhhs-octopus-energy-50x-cost-reduction: Multi-terabyte upstream-substrate-with-CDF face (tenth Delta Lake face on the wiki). Delta Lake reframed as the shared multi-grain source-of-truth substrate under a three-stream grain-aligned data pipeline, with the CDF feature applied to that substrate as the load-bearing optimisation. The Octopus Energy MHHS rebuild canonicalises both ends of the pattern: (a) the unified multi-terabyte multi-grain source-of-truth Delta tables that consolidate meter reads, smart meter data, and industry flows under three independently-tunable streams (Settlement / Half-Hourly / Monthly), and (b) the Delta CDF face that turns full-table-overwrite-on-every-run into change-driven incremental processing, dropping rows per run from 25 B → 300 M (98.8% reduction) with weekly → daily freshness. New Liquid Clustering face surfaces explicitly: "Liquid clustering dynamically co-locates related records on the specified clustering keys without requiring fixed partition boundaries. Liquid clustering avoids the small-file problem, higher memory consumption, and I/O overhead that come from over-partitioning." Canonicalised as patterns/cdf-incremental-replacing-full-rescan + systems/octopus-margin-data-pipeline + systems/liquid-clustering.

  • sources/2026-05-14-databricks-expanded-interoperability-with-unity-catalog-open-apis: External-engine-managed-write face (ninth Delta Lake face on the wiki). Delta Lake reframed as the substrate that external engines (Apache Spark, Apache Flink, DuckDB) can now create, read, write, and stream to/from as UC Managed Tables under Unity Catalog governance, preserving Predictive Optimization ("up to 20× faster queries and 50% lower storage costs") and Liquid Clustering across the engine boundary. The two architectural enablers: (a) Delta Kernel, the open-source Java + Rust library, abstracts the Delta protocol so "connector developers can focus on UC integration, not Delta implementation"; (b) catalog commits route every commit through UC, producing serialized commits + complete auditability + multi-table-transaction substrate. Predictive Optimization is engine-boundary-transparent: "Predictive Optimization continues to run seamlessly, even on tables accessed by external engines." Auth-side complement: UC Credential Vending (M2M OAuth + auto-refresh) for the data-path access. Canonical instance of concepts/external-engine-write-to-managed-table. Beta version pinning: Delta-Spark 4.2 + Unity Catalog 0.4.1.

  • sources/2026-05-13-databricks-the-rosetta-stone-of-cps-clarotys-ai-powered-library: Append-only-Bronze + Change-Data-Feed face. Eighth Delta Lake face on the wiki: not just analytical-table-storage / observability-storage / sketch-column-storage / multimodal-substrate / document-extraction-output / governed-SHAP-attribution-storage (the prior seven faces) but the append-only Bronze substrate driving an entity-resolution pipeline through Delta Change Data Feed. "Raw, heterogeneous JSON payloads are captured in append-only Delta tables. From there, a promotion pipeline — reading from Delta Change Data Feed (CDF) — dynamically applies a mapping registry to transform raw evidence into a governed, canonical schema. By utilizing Delta Lake's schema evolution and time travel, Claroty maintains an unbreakable chain of custody; every asset record is traceable back to its original raw artifact and the specific mapping version that classified it." The two load-bearing properties for ER audit chain: (a) CDF is the layer-transition trigger between Bronze raw and Silver canonical (see concepts/delta-change-data-feed); (b) schema evolution + time travel anchors both data lineage and classifier lineage (the mapping-registry version that produced the canonical record). Canonical wiki home for the CDF face of Delta Lake at concepts/delta-change-data-feed; composes with patterns/hybrid-classical-er-plus-genai in systems/claroty-cps-library (17M+ asset CPS catalog).

  • sources/2026-05-13-databricks-clinical-operations-intelligence-belongs-on-the-lakehouse: Governed-SHAP-attribution-storage face. New Delta Lake face on the wiki: not just analytical-table-storage / observability-storage / sketch-column-storage / multimodal-substrate / document-extraction-output (the prior six faces) but the storage substrate for per-prediction ML SHAP attributions in regulated decision-support apps. "Every prediction carries a SHAP attribution stored as a governed Unity Catalog Delta table — versioned in MLflow, lineaged through Unity Catalog, queryable — the rationale behind a site selection is as auditable as the score itself." The load-bearing Delta properties: (a) ACID writes mean each per-prediction attribution row is durable and consistent with the prediction; (b) schema evolution lets the attribution schema track feature-set changes across model retrains; (c) time-travel allows population queries that span model versions; (d) the table is queryable in SQL so a regulator's question ("why was this site recommended?") becomes a SELECT, and a fairness audit ("are community sites systematically under-weighted?") becomes a GROUP BY. Reference implementation: systems/site-feasibility-workbench writes SHAP attributions for TA-segmented LightGBM site-feasibility predictions into UC-governed Delta tables. Canonical wiki substrate for the patterns/shap-attribution-as-governed-delta-table pattern and the concepts/governed-shap-attribution-table concept.

  • sources/2026-05-05-databricks-10-trillion-samples-a-day-scaling-beyond-traditional-monitoring: Lakehouse-native observability-storage face. Delta Lake as the storage substrate for Hydra's 20 billion unaggregated active timeseries from millions of nodes worldwide. Spark Structured Streaming + Auto Loader write metrics into Delta tables with exactly-once semantics, using per-region-partitioned ingestion jobs that autoscale independently and reduce cross-region blast radius. Claimed ~50× cheaper storage than Thanos for the raw tier, ~5 minute end-to-end freshness. The columnar scan + Unity-Catalog-governance + joinable-with-enterprise-datasets properties are all load-bearing: observability data becomes a first-class analytical asset. PromQL-to-SQL translation runs Grafana dashboards against the Delta tables unmodified. Canonical instance of concepts/lakehouse-native-observability.

  • sources/2026-04-22-databricks-multimodal-data-integration-production-architectures-for-healthcare-ai: Multimodal-substrate face. Delta is the one storage format every modality lands in inside Databricks' governed-Delta-tables-per-modality pattern: genomics (via Glow), imaging-derived feature embeddings (indexed by Mosaic AI Vector Search), NLP-extracted clinical-notes entities, and wearables streaming tables + materialised views produced by Lakeflow SDP. ACID + time travel are called out as the mechanism powering reproducibility ("consistent training sets and re-analysis") across the full multimodal dataset.
  • sources/2026-04-20-databricks-mercedes-benz-cross-cloud-data-mesh — Delta format for the local replica tier; Deep Clone as the incremental-sync primitive; VACUUM as the GDPR-compliance hook on the replica side.
  • sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2 — named as one of the three open table formats Amazon BDT's Ray compactor (systems/deltacat) is being extended to support, alongside systems/apache-iceberg and systems/apache-hudi. The BDT in-house copy-on-write compactor predates these OTFs and gave its design back to DeltaCAT as the Flash Compactor.
  • sources/2026-04-17-databricks-governing-coding-agent-sprawl-with-unity-ai-gateway: telemetry-destination face. Unity AI Gateway auto-lands coding-agent OpenTelemetry metrics + traces into Unity-Catalog-managed Delta tables, making AI-tool telemetry a first-class Lakehouse dataset joinable with HR / PR-velocity / capacity-planning data. See patterns/telemetry-to-lakehouse.
  • sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics: Sketch-column-storage face. Delta tables as the intended resting place for DataSketches BLOB columns (KLL, Theta, approx top-K, Tuple) built once during ETL and merged at read time. See patterns/precomputed-sketch-column-in-delta-table. Delta's schema evolution + snapshot isolation + time travel make it a natural home for append-mostly hourly rollup tables whose BLOB-column sketches are merged on dashboard read.
  • sources/2026-05-11-databricks-unlocking-the-archives: document-extraction-output substrate face. Delta tables hold the multi-stage pipeline output: page-level classification rows (Dewey Decimal + geographies + water flag), inline judge scores + written justifications, document-level aggregated classifications, and the 299 structured well/borehole records emitted via schema-constrained ai_query. ACID + lineage + schema-evolution properties give the pipeline a typed, auditable output surface that composes with downstream MapAid WellMapr models.

  • sources/2026-06-01-databricks-debunking-8-data-layout-myths-why-liquid-clustering-outperfo: Transaction-log-based-pruning face (twelfth Delta Lake face on the wiki). The architectural core of the post is the load-bearing claim about how pruning works on Delta: "Delta, for example, uses a transaction log to track every data file along with per-column statistics, and pruning happens against those statistics, not the directory structure. The engine never lists directories to plan a query. It reads the transaction log, evaluates filters against statistics, and skips files that don't match." This canonicalises concepts/file-level-data-skipping as a wiki concept and positions it as the only pruning mechanism that exists on modern OTFs; directory-level pruning is a myth, not a benefit. Pairs with the disclosure that the same per-file min/max statistics power both data skipping and metadata-only operations (DELETE / COUNT / DISTINCT / GROUP BY): "the engine uses the same per-file min/max stats it uses for data skipping to determine when a query's answer can be computed from metadata alone". Operational disclosures: ~90% faster metadata-only DELETEs; up to 27× faster aggregate queries. Adjacent canonicalisations: concepts/row-level-concurrency (Delta format property explicitly named: "Liquid provides row-level concurrency. Two writers updating different rows no longer conflict, even if those rows live in the same file") and concepts/z-ordering (the older Delta clustering technique that Liquid Clustering supersedes; its structural problems, "poor clustering quality" + "unnecessary rewrites", are canonicalised as a deprecated mechanism). Architectural significance: this is the wiki's first source that takes Delta's transaction-log-based pruning model as the load-bearing architectural fact (vs. earlier sources that treated it as background context). The post's whole case against Hive partitioning rests on it.
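
The pruning model quoted in the last entry (filters evaluated against per-file statistics in the log, never against directory listings) can be sketched in a few lines. This is a toy model under stated assumptions, not Delta's planner: `FileEntry`, `files_to_scan`, and `plan_delete` are hypothetical names, the log tracks min/max for a single integer column, and nulls are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    """One hypothetical log entry: a data file plus stats for a single column."""
    path: str
    num_rows: int
    min_val: int
    max_val: int


def files_to_scan(log, lo, hi):
    """Data skipping: keep files whose [min, max] overlaps the filter lo <= x <= hi."""
    return [f.path for f in log if f.max_val >= lo and f.min_val <= hi]


def plan_delete(log, lo, hi):
    """Classify files for DELETE WHERE lo <= x <= hi using stats alone."""
    drop, rewrite, keep = [], [], []
    for f in log:
        if lo <= f.min_val and f.max_val <= hi:
            drop.append(f.path)     # every row matches: metadata-only removal
        elif f.max_val >= lo and f.min_val <= hi:
            rewrite.append(f.path)  # some rows may match: the file must be read
        else:
            keep.append(f.path)     # no row can match: untouched
    return drop, rewrite, keep


log = [
    FileEntry("part-0.parquet", 100, 1, 10),
    FileEntry("part-1.parquet", 100, 11, 20),
    FileEntry("part-2.parquet", 100, 21, 30),
]

scan = files_to_scan(log, 12, 18)              # only part-1 can contain matches
drop, rewrite, keep = plan_delete(log, 11, 25)
total_rows = sum(f.num_rows for f in log)      # an unfiltered COUNT needs no data read
```

The same min/max comparison drives both functions, which mirrors the source's point that data skipping and metadata-only operations share one set of statistics.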

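The change-data-feed entries above describe replacing a full rescan with change-driven incremental processing. A minimal sketch of that idea follows, assuming an append-only table and a per-key running total as the downstream result. `ToyChangeFeed`, `full_recompute`, and `incremental_update` are hypothetical names and do not correspond to any Delta Lake API; the final line only re-derives the 98.8% figure from the row counts cited in the Octopus Energy entry.

```python
from collections import defaultdict


class ToyChangeFeed:
    """Hypothetical append-only table where each commit records the rows it wrote."""

    def __init__(self):
        self.commits = []  # commits[i] holds the (key, value) rows written by version i

    def commit(self, rows):
        self.commits.append(list(rows))
        return len(self.commits) - 1  # version number of this commit

    def changes_since(self, version):
        """Rows written by versions strictly after `version`."""
        return [row for c in self.commits[version + 1:] for row in c]


def full_recompute(feed):
    """Rebuild totals from every row ever written; return (totals, rows_read)."""
    totals, rows_read = defaultdict(int), 0
    for c in feed.commits:
        for key, value in c:
            totals[key] += value
            rows_read += 1
    return dict(totals), rows_read


def incremental_update(feed, totals, last_version):
    """Fold only new rows into existing totals; return (rows_read, new_version)."""
    rows = feed.changes_since(last_version)
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value
    return len(rows), len(feed.commits) - 1


feed = ToyChangeFeed()
v0 = feed.commit([("meter-1", 5), ("meter-2", 7), ("meter-3", 1)])
totals, read_full = full_recompute(feed)

feed.commit([("meter-1", 2)])
read_inc, last = incremental_update(feed, totals, v0)   # reads one row
check, read_again = full_recompute(feed)                # reads all four rows

# Row counts cited in the source: 25 B rows per run before, 300 M after.
reduction = 1 - 300_000_000 / 25_000_000_000
```

Both paths reach the same totals; the incremental one reads only what changed since the last processed version, which is the saving the source attributes to CDF.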
Last updated · 542 distilled / 1,571 read