CONCEPT Cited by 3 sources
Data Lakehouse¶
A data lakehouse is a data-platform architectural class that combines:
- The low-cost, scalable, open-format storage of a data lake (columnar files like Parquet / ORC on object storage like S3 / GCS / ADLS), with
- The table semantics + transactional guarantees of a data warehouse (ACID row-level inserts/updates/deletes, schema evolution, time travel, consistent reads), supplied by an open table format layer (Iceberg, Delta Lake, or Apache Hudi).
The term was popularised by Databricks to name the architectural endpoint after ~a decade of incremental convergence: data lakes added transactional table formats; data warehouses added compute-storage separation (concepts/compute-storage-separation) and open columnar formats. The lakehouse is the point where they meet.
Why it emerged¶
Pre-lakehouse, the canonical two-tier analytics architecture was:
- Data lake (cheap, open-format, massive-scale object storage) — good for append, bad for updates / governance / low-latency query.
- Data warehouse (expensive, proprietary-format, columnar engine) — good for query, bad for ML / unstructured / streaming.
Organisations ran both, with ETL pipelines moving curated data from the lake into the warehouse. Cost: duplicate storage, duplicate compute, schema drift between tiers, pipeline maintenance burden.
The lakehouse collapses the two by making the lake queryable with warehouse semantics:
- Open file format → open table format → compute-engine-agnostic queries.
- One storage layer; many engines read it (Databricks, Snowflake, Trino, Presto, ClickHouse, Dremio, Apache Spark, Apache Flink).
Typical data flow¶
A lakehouse usually organises data by quality tier, not just by raw bucket. The canonical progression is the Medallion Architecture:
- Bronze — raw ingested data (audit trail).
- Silver — cleaned, enriched, deduplicated.
- Gold — aggregated, BI-ready.
Transformations between tiers are canonically ELT, authored as SQL in dbt and scheduled by Airflow / Dagster. For latency-sensitive workloads the alternative is streaming ETL, using Flink (through Iceberg's Flink connector) or Spark Structured Streaming.
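The Bronze → Silver → Gold progression can be sketched in plain Python. This is a minimal illustration rather than how dbt, Spark, or Flink express it: the `bronze` rows, the field names, and the `to_silver` / `to_gold` helpers are all hypothetical, and a real pipeline would read and write lakehouse tables instead of in-memory lists.

```python
from collections import defaultdict

# Hypothetical raw events as they might land in a Bronze table (all strings, unvalidated).
bronze = [
    {"order_id": "1", "country": " us ", "amount": "10.50"},
    {"order_id": "1", "country": " us ", "amount": "10.50"},       # duplicate delivery
    {"order_id": "2", "country": "DE", "amount": "7.00"},
    {"order_id": "3", "country": "de", "amount": "not-a-number"},  # malformed amount
    {"order_id": "4", "country": "US", "amount": "4.50"},
]


def to_silver(rows):
    """Clean and deduplicate: parse amount, normalise country, keep the first row per order_id."""
    seen, silver = set(), []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip unparseable rows; Bronze still holds them as the audit trail
        if row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        silver.append({
            "order_id": row["order_id"],
            "country": row["country"].strip().upper(),
            "amount": amount,
        })
    return silver


def to_gold(rows):
    """Aggregate to a BI-ready shape: total revenue per country."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["country"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Bronze is never mutated here, which mirrors its role as the audit trail: Silver and Gold can be rebuilt from it if the cleaning rules change.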
Platform primitives that matter¶
- Object storage (S3 / GCS / ADLS) — the base-layer storage.
- Columnar file format (Parquet, ORC) — the object-level data representation.
- Open table format (Iceberg, Delta Lake, Apache Hudi) — the metadata layer adding ACID + schema evolution + snapshots.
- Catalog — the naming / governance layer. Examples: Unity Catalog (Databricks), Apache Polaris (originally Snowflake's Polaris Catalog), S3 Tables (AWS), AWS Glue Data Catalog.
- Compute engines — Spark, Flink, Trino, Presto, Snowflake, Databricks, ClickHouse, etc. All read the same underlying tables.
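To make the role of the table-format layer concrete, here is a toy model of a metadata layer over immutable data files. It is a sketch under simplifying assumptions and not the Iceberg, Delta Lake, or Hudi metadata layout: `ToyTable`, `Snapshot`, and the file names are hypothetical. It illustrates only the general idea that each commit publishes a new snapshot (a list of files), that readers can pin an older snapshot, and that a writer working from a stale snapshot is rejected.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    data_files: tuple  # the immutable files that make up the table at this version


@dataclass
class ToyTable:
    """Hypothetical metadata layer: an append-only log of snapshots."""
    snapshots: list = field(default_factory=lambda: [Snapshot(0, ())])

    def current(self):
        return self.snapshots[-1]

    def commit(self, base_id, add=(), remove=()):
        # Optimistic concurrency: fail if another writer committed after base_id.
        if base_id != self.current().snapshot_id:
            raise RuntimeError(f"conflict: table changed since snapshot {base_id}")
        kept = tuple(f for f in self.current().data_files if f not in remove)
        snap = Snapshot(base_id + 1, kept + tuple(add))
        self.snapshots.append(snap)  # earlier snapshots stay readable
        return snap

    def files_as_of(self, snapshot_id):
        # Time travel: ids are assigned 0, 1, 2, ... so they index the log directly.
        return self.snapshots[snapshot_id].data_files


table = ToyTable()
s1 = table.commit(0, add=["part-000.parquet"])
s2 = table.commit(s1.snapshot_id, add=["part-001.parquet"])
s3 = table.commit(s2.snapshot_id, add=["part-002.parquet"], remove=["part-000.parquet"])

try:
    table.commit(s1.snapshot_id, add=["part-late.parquet"])  # writer with a stale base
    conflict = False
except RuntimeError:
    conflict = True

print(table.current().data_files, table.files_as_of(1), conflict)
```

Because data files are never edited in place, any engine that can read the snapshot log and the files sees the same table version, which is what lets several compute engines share one storage layer.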
Costs / caveats¶
- Compaction + GC are customer-side operational burdens under externalised table formats; see systems/apache-iceberg for the canonical framing.
- Governance fragmentation — multiple engines reading the same tables need a common catalog for consistent ACL / schema semantics; catalog interop is still evolving.
- Query-latency floor — lakehouses target batch / analytical latency; millisecond-class OLTP workloads still belong on dedicated transactional engines (see concepts/oltp-vs-olap).
- Small-file problem — streaming writes produce many small Parquet files; readers pay scan overhead until compaction runs.
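The small-file caveat above can be illustrated with a compaction planner that groups small files into rewrite batches. This is a hedged sketch of the general technique, not the algorithm any particular table format or engine uses: `plan_compaction`, the 128 MB target, the 32 MB "small" threshold, and the file names are all assumptions chosen for the example.

```python
def plan_compaction(file_sizes_mb, target_mb=128, small_threshold_mb=32):
    """Group small files into batches of at most roughly target_mb each.

    Returns a list of batches; each batch lists the files to rewrite into one
    larger file. Files at or above small_threshold_mb are left untouched.
    """
    small = sorted(
        ((name, size) for name, size in file_sizes_mb.items() if size < small_threshold_mb),
        key=lambda item: item[1],
        reverse=True,
    )
    batches, current, current_size = [], [], 0
    for name, size in small:
        # Close the current batch when adding this file would exceed the target.
        if current and current_size + size > target_mb:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if len(current) > 1:  # a trailing single-file batch would gain nothing
        batches.append(current)
    return batches


# Hypothetical table state: twenty 8 MB files from streaming writes plus one large file.
files = {f"f{i}.parquet": 8 for i in range(20)}
files["big.parquet"] = 256

batches = plan_compaction(files)
print([len(b) for b in batches])
```

In this example the twenty small files collapse into two rewrite batches, so a reader would open three files instead of twenty-one once the rewrite is committed.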
Seen in¶
- sources/2026-04-22-databricks-multimodal-data-integration-production-architectures-for-healthcare-ai — Lakehouse-as-multimodal-substrate face. Databricks positions the lakehouse as the architectural unifier across modalities (genomics, imaging, clinical notes, wearables) in opposition to the "specialty store per modality" anti-pattern (FHIR store + omics store + imaging store + vector store). All modalities land in governed Delta tables under one Unity Catalog governance surface; modality-specific tooling (Glow for genomics, Mosaic AI Vector Search for imaging-similarity over embeddings, NLP for clinical-notes entities, Lakeflow SDP for wearables streams) sits above the substrate. See patterns/governed-delta-tables-per-modality. First wiki ingest framing the lakehouse specifically as a multimodal-integration substrate rather than an analytics-query substrate.
- sources/2026-01-06-redpanda-build-a-real-time-lakehouse-architecture-with-redpanda-and-databricks — Historical arc of the open lakehouse at joint-Redpanda-Databricks altitude. Walks Hadoop-era data-lake framing ("schema-on-read, flexible ELT workflows, support for multi-structured data, all while significantly lowering costs with cloud object storage") → governance-sprawl problem ("sprawl became a serious challenge, and multiple teams operating on the same datasets introduced issues around governance and reliability") → Iceberg as Netflix-originated resolution ("Iceberg provides a foundation that looks and behaves like a warehouse table, while remaining open and cloud-native") → REST-catalog standardisation → streaming-broker-native integration ("the stream is the table") → Unity-Catalog governance hub. First wiki ingest bracketing the full lake→lakehouse + file-catalog→REST-catalog arcs as one narrative; ends with the "streaming data is analytics-ready by default" framing.
- sources/2025-01-21-redpanda-implementing-the-medallion-architecture-with-redpanda — frames the lakehouse as the natural home of the Medallion Architecture, with Iceberg as the table-format layer and Parquet / ORC as the file-format layer. Positions Redpanda Iceberg topics as a mechanism that collapses the streaming-broker → lakehouse-Bronze integration step into a broker config flip.
Related¶
- concepts/medallion-architecture — the canonical lakehouse data-organisation pattern.
- concepts/open-table-format · concepts/open-file-format — the storage substrate.
- concepts/compute-storage-separation — the architectural shift that made the lakehouse economically viable.
- concepts/elt-vs-etl — the transformation model lakehouses use.
- systems/databricks · systems/snowflake · systems/apache-iceberg · systems/s3-tables · systems/unity-catalog — canonical implementations.