CONCEPT Cited by 2 sources

Data Lakehouse

A data lakehouse is an architectural pattern for data platforms that combines:

  • The low-cost, scalable, open-format storage of a data lake (columnar files like Parquet / ORC on object storage like S3 / GCS / ADLS), with
  • The table semantics + transactional guarantees of a data warehouse (ACID row-level inserts/updates/deletes, schema evolution, time travel, consistent reads), supplied by an open table format layer (Iceberg, Delta Lake, or Apache Hudi).

The term was popularised by Databricks to name the architectural endpoint after ~a decade of incremental convergence: data lakes added transactional table formats; data warehouses added compute-storage separation (concepts/compute-storage-separation) and open columnar formats. The lakehouse is the point where they meet.
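The snapshot mechanics behind those warehouse-style guarantees can be sketched in plain Python. This is a toy model, not any real format's API: an open table format is, at its core, an append-only log of snapshots, each pointing at an immutable set of data files; atomic commits give consistent reads, and keeping old snapshots gives time travel. All names and file paths below are invented stand-ins.

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    snapshot_id: int
    files: tuple  # immutable list of data-file names (stand-ins for Parquet files)

class TableMetadata:
    """Toy model of an open table format's metadata log."""

    def __init__(self):
        self.snapshots = []

    def commit(self, files):
        # Atomically publish a new table state: readers only ever see a
        # complete snapshot, never a half-written file list.
        snap = Snapshot(len(self.snapshots), tuple(files))
        self.snapshots.append(snap)
        return snap.snapshot_id

    def current(self):
        return self.snapshots[-1].files

    def as_of(self, snapshot_id):
        # Time travel: read the file list of an older snapshot.
        return self.snapshots[snapshot_id].files

table = TableMetadata()
table.commit(["a.parquet"])                # initial load
table.commit(["a.parquet", "b.parquet"])   # append
table.commit(["a2.parquet", "b.parquet"])  # row-level update rewrites a.parquet

print(table.current())  # ('a2.parquet', 'b.parquet')
print(table.as_of(0))   # ('a.parquet',)
```

Note the update in the last commit: because data files are immutable, a row-level delete or update is expressed by rewriting the affected file and committing a new snapshot, which is why these formats work on plain object storage.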

Why it emerged

Pre-lakehouse, the canonical two-tier analytics architecture was:

  • Data lake (cheap, open-format, massive-scale object storage) — good for append, bad for updates / governance / low-latency query.
  • Data warehouse (expensive, proprietary-format, columnar engine) — good for query, bad for ML / unstructured / streaming.

Organisations ran both, with ETL pipelines moving curated data from the lake into the warehouse. Cost: duplicate storage, duplicate compute, schema drift between tiers, pipeline maintenance burden.

The lakehouse collapses the two by making the lake queryable with warehouse semantics:

  • Open file format → open table format → compute-engine-agnostic queries.
  • One storage layer; many engines read it (Databricks, Snowflake, Trino, Presto, ClickHouse, Dremio, Apache Spark, Apache Flink).

Typical data flow

A lakehouse usually organises data by quality tier, not just raw bucket. The canonical progression is the Medallion Architecture:

  • Bronze — raw ingested data (audit trail).
  • Silver — cleaned, enriched, deduplicated.
  • Gold — aggregated, BI-ready.

Transformations between tiers are canonically ELT — authored as SQL in dbt and scheduled by Airflow / Dagster — or, for latency-sensitive workloads, streaming ETL via Flink (via Iceberg's Flink connector) or Spark Structured Streaming.
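As a toy illustration of the tier progression, with plain Python standing in for dbt SQL or Spark jobs (the records and field names are invented): Bronze keeps everything as ingested, Silver deduplicates and drops unusable rows, Gold aggregates for BI.

```python
# Bronze: raw events exactly as ingested, duplicates and nulls included.
bronze = [
    {"order_id": 1, "amount": 10.0, "country": "DE"},
    {"order_id": 1, "amount": 10.0, "country": "DE"},  # duplicate ingest
    {"order_id": 2, "amount": None, "country": "FR"},  # unusable row
    {"order_id": 3, "amount": 7.5, "country": "DE"},
]

# Silver: deduplicate on the business key, drop rows with missing amounts.
silver = list({r["order_id"]: r for r in bronze if r["amount"] is not None}.values())

# Gold: BI-ready aggregate, e.g. revenue per country.
gold = {}
for r in silver:
    gold[r["country"]] = gold.get(r["country"], 0.0) + r["amount"]

print(gold)  # {'DE': 17.5}
```

The same shape in production would be two dbt models (silver, gold) materialised as tables in the lakehouse, with Bronze retained untouched as the audit trail.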

Platform primitives that matter

  • Object storage (S3 / GCS / ADLS) — the base-layer storage.
  • Columnar file format (Parquet, ORC) — the object-level data representation.
  • Open table format (Iceberg, Delta Lake, Apache Hudi) — the metadata layer adding ACID + schema evolution + snapshots.
  • Catalog — the naming / governance layer. Examples: Unity Catalog (Databricks), Snowflake Polaris, S3 Tables (AWS), AWS Glue Data Catalog.
  • Compute engines — Spark, Flink, Trino, Presto, Snowflake, Databricks, ClickHouse, etc. All read the same underlying tables.
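A minimal sketch of why the catalog primitive matters, assuming nothing beyond a shared name-to-metadata-pointer mapping (the table name and paths are invented; a real REST catalog adds auth, namespaces, and ACLs on top): any engine that resolves a table name through the same catalog sees the same committed state.

```python
# Toy catalog: table name -> current metadata-file pointer.
catalog = {}

def commit(table, metadata_file):
    # Publishing a new table version is an atomic pointer swap in the catalog.
    catalog[table] = metadata_file

def resolve(table):
    return catalog[table]

commit("sales.orders", "s3://lake/orders/metadata/v1.json")
commit("sales.orders", "s3://lake/orders/metadata/v2.json")  # new snapshot committed

# A "Trino" reader and a "Spark" reader both resolve through the catalog,
# so both see v2 — one storage layer, many engines.
print(resolve("sales.orders"))  # s3://lake/orders/metadata/v2.json
```

The governance-fragmentation caveat below is exactly what happens when engines do not share this mapping: two catalogs holding different pointers (or different ACLs) for the same physical files.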

Costs / caveats

  • Compaction + GC are customer-side operational burdens under externalised table formats; see systems/apache-iceberg for the canonical framing.
  • Governance fragmentation — multiple engines reading the same tables need a common catalog for consistent ACL / schema semantics; catalog interop is still evolving.
  • Query-latency floor — lakehouses target batch / analytical latency; millisecond-class OLTP workloads still belong on dedicated transactional engines (see concepts/oltp-vs-olap).
  • Small-file problem — streaming writes produce many small Parquet files; readers pay scan overhead until compaction runs.
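The compaction fix for the small-file problem is essentially bin-packing: rewrite many small files into few files near a target size. A toy sketch (file names and byte counts are illustrative; real engines expose this as a maintenance procedure rather than user code, e.g. Iceberg's `rewrite_data_files`):

```python
TARGET_BYTES = 128  # stand-in for a realistic ~128 MB target file size

# 30 tiny files, e.g. one per streaming micro-batch.
small_files = [("f%02d.parquet" % i, 10) for i in range(30)]

def compact(files, target):
    """Greedily pack small files into bins of at least `target` bytes;
    each bin becomes one rewritten output file."""
    bins, current, size = [], [], 0
    for name, nbytes in files:
        current.append(name)
        size += nbytes
        if size >= target:
            bins.append((current, size))
            current, size = [], 0
    if current:
        bins.append((current, size))  # leftover partial bin
    return bins

bins = compact(small_files, TARGET_BYTES)
print(len(small_files), "->", len(bins), "files")  # 30 -> 3 files
```

Until such a job runs, readers pay per-file open and metadata overhead on every scan, which is why compaction is listed above as an operational burden rather than a one-off fix.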

Seen in

  • sources/2026-01-06-redpanda-build-a-real-time-lakehouse-architecture-with-redpanda-and-databricks — Historical arc of the open lakehouse at joint-Redpanda-Databricks altitude. Walks Hadoop-era data-lake framing ("schema-on-read, flexible ELT workflows, support for multi-structured data, all while significantly lowering costs with cloud object storage") → governance-sprawl problem ("sprawl became a serious challenge, and multiple teams operating on the same datasets introduced issues around governance and reliability") → Iceberg as Netflix-originated resolution ("Iceberg provides a foundation that looks and behaves like a warehouse table, while remaining open and cloud-native") → REST-catalog standardisation → streaming-broker-native integration ("the stream is the table") → Unity-Catalog governance hub. First wiki ingest bracketing the full lake→lakehouse + file-catalog→REST-catalog arcs as one narrative; ends with the "streaming data is analytics-ready by default" framing.
  • sources/2025-01-21-redpanda-implementing-the-medallion-architecture-with-redpanda — frames the lakehouse as the natural home of the Medallion Architecture, with Iceberg as the table-format layer and Parquet / ORC as the file-format layer. Positions Redpanda Iceberg topics as a mechanism that collapses the streaming-broker → lakehouse-Bronze integration step into a broker config flip.