
Liquid Clustering¶

Liquid Clustering is a Delta Lake feature that dynamically co-locates related records on user-specified clustering keys without requiring fixed partition boundaries. It replaces the legacy choice between "partition by column X" (which ages badly when the workload changes) and "don't partition" (which forces full-table scans on common predicates). For the wiki it is the partition-replacement primitive that turns up in nearly every recent Databricks-source page in the corpus — UC OTel trace tables, Zerobus-managed tables, UC managed tables under external-engine write, and the Octopus Energy margin data pipeline. (Source: sources/2026-05-23-databricks-scaling-for-mhhs-octopus-energy-50x-cost-reduction)

Why it exists — the over-partitioning trap¶

Traditional table partitioning forces an architect to commit, at table-creation time, to a fixed column choice (date, region, customer-tier, etc.) that defines a directory-shaped layout in object storage. Three failure modes emerge at scale:

Failure mode	What goes wrong
Small-file problem	Each partition becomes a directory; a high-cardinality partition column produces many directories, each holding only a few small files. Read planning, list operations, and metadata IO all grow with the file count.
Higher memory consumption	Many small partitions → many partition descriptors held in memory by the query planner → driver-side memory pressure and slower planning.
I/O overhead from over-partitioning	When the chosen column is finer-grained than queries need, every read touches more partition boundaries than necessary.

The Octopus Energy MHHS rebuild names all three explicitly:

"Liquid clustering avoids the small-file problem, higher memory consumption, and I/O overhead that come from over-partitioning."
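The small-file failure mode can be put in rough numbers. The sketch below is a back-of-envelope estimate, not a measurement: the 100 GiB table size, the 128 MiB target file size, the two partition cardinalities, and the even spread of data across partitions are all assumptions chosen for illustration.

```python
import math

GiB = 1024 ** 3
MiB = 1024 ** 2

def partitioned_file_stats(total_bytes, distinct_partition_values, target_file_bytes):
    """Estimate (file count, average file size) for a directory-partitioned table.

    Assumes data is spread evenly across partition values and that every
    partition directory holds at least one file.
    """
    bytes_per_partition = total_bytes / distinct_partition_values
    files_per_partition = max(1, math.ceil(bytes_per_partition / target_file_bytes))
    total_files = files_per_partition * distinct_partition_values
    return total_files, total_bytes / total_files

total = 100 * GiB      # hypothetical table size
target = 128 * MiB     # hypothetical target file size

coarse = partitioned_file_stats(total, 365, target)        # e.g. one partition per day
fine = partitioned_file_stats(total, 1_000_000, target)    # e.g. one partition per customer

print(f"daily partitions:    {coarse[0]:>9,} files, avg {coarse[1] / MiB:.1f} MiB")
print(f"customer partitions: {fine[0]:>9,} files, avg {fine[1] / 1024:.1f} KiB")
```

Under these assumptions the high-cardinality choice leaves one tiny file per directory, which is the shape the table above describes.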

What it does¶

Liquid clustering co-locates related records on the clustering keys at storage-layout time without requiring those keys to define directory partitions. The architect specifies which columns matter for filter/join performance; Delta arranges the file layout so that queries on those columns can prune effectively, without committing to a fixed bucketing scheme that ages badly.

Three load-bearing properties for the wiki:

No fixed partition boundaries. Records are co-located by proximity on the clustering keys; the system can rebalance as data arrives without forcing a re-partition operation.
Multi-column clustering. Multiple clustering keys can be set; the layout optimises for queries that filter on any of them.
Compatible with Predictive Optimization. The Databricks managed-storage layer (UC managed tables) refreshes the layout automatically.
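Why co-location helps pruning can be shown with a toy model. The sketch below is not the Liquid Clustering layout algorithm, which the sources do not describe; it uses a plain sort on a single key as a stand-in for "co-located on the clustering key", and the names `write_files` and `files_to_read` are hypothetical helpers. It only illustrates that per-file min/max statistics become selective once related records sit in the same files.

```python
import random

def write_files(keys, rows_per_file):
    """Split keys into 'files' and record each file's min/max, the stats a reader can prune on."""
    files = []
    for i in range(0, len(keys), rows_per_file):
        chunk = keys[i:i + rows_per_file]
        files.append({"min": min(chunk), "max": max(chunk)})
    return files

def files_to_read(files, lo, hi):
    """Count files whose [min, max] range overlaps the predicate lo <= key <= hi."""
    return sum(1 for f in files if f["max"] >= lo and f["min"] <= hi)

random.seed(0)
keys = [random.randrange(1_000_000) for _ in range(100_000)]

arrival_order = write_files(keys, 1_000)       # rows land in arrival order, no clustering
clustered = write_files(sorted(keys), 1_000)   # stand-in for co-location on the key

unclustered_reads = files_to_read(arrival_order, 500_000, 510_000)
clustered_reads = files_to_read(clustered, 500_000, 510_000)

print(f"files read without clustering: {unclustered_reads} of {len(arrival_order)}")
print(f"files read with clustering:    {clustered_reads} of {len(clustered)}")
```

In the arrival-order layout every file spans almost the whole key range, so the predicate cannot exclude any of them; in the sorted layout only the handful of files covering the requested range survive pruning.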

Where it shows up on the wiki¶

The recurring shape across Databricks-source pages: liquid clustering is enabled across multiple tables for columns frequently used in filters and joins, replacing what would have historically been partition-by clauses.

Source / system	Clustering use
systems/octopus-margin-data-pipeline	"Liquid clustering was enabled across multiple tables for columns frequently used in filters and joins" — applied across the three-stream pipeline's tables
systems/uc-otel-trace-tables	Auto-liquid-clustered following a recent product update; keys not disclosed (likely trace_id / time-based)
systems/uc-managed-tables	One of the "managed-table-only properties" preserved across external-engine writes
systems/zerobus-ingest	The receiving Delta tables are auto-liquid-clustered

In the Octopus case, liquid clustering appears in the "Join and partition tuning" category alongside broadcast joins for reference tables <500 MB. The combination — broadcast on the small side, liquid clustering on the large side — is the disclosed recipe for the multi-key joins with date ranges that drive the margin pipeline.

Eight myths debunked (2026-06-01)¶

The 2026-06-01 "Debunking 8 data layout myths" post is the most architecturally dense Liquid Clustering disclosure on the wiki — it addresses each of the eight defender-of-partitioning arguments in turn and supplies operational evidence at PB scale. Summary table:

Myth	Reality	Wiki canonical
#1: Partitioning is faster — directories prune without opening files	"Directory-pruning does not exist on modern open table formats like Delta and Iceberg" — pruning is per-file via transaction-log statistics	concepts/file-level-data-skipping
#2: Partitioning is better on low-cardinality columns	Liquid applies a low-cardinality optimization automatically: each file holds a single value of the low-cardinality column, with rows sorted on the high-cardinality keys within it; 35% lower clustering time, 22% faster queries	concepts/low-cardinality-clustering-optimization
#3: Liquid Clustering doesn't support metadata-only operations	DELETEs / COUNT / DISTINCT / GROUP BY all supported; ~90% faster metadata-only DELETEs; up to 27× aggregate speedup	concepts/metadata-only-operation
#4: Liquid Clustering doesn't work at PB scale	OPTIMIZE planning 12h → 23m on 10 PB tables; execution 5× faster; "dozens of customers now have PB-scale Liquid Clustered tables in production"	(this page)
#5: Liquid Clustering only benefits Databricks readers	Write-side optimization producing standard Parquet+stats; "Any compatible reader (e.g. Apache Spark, DuckDB, etc.) can use those stats to skip files"	(this page)
#6: Concurrent ETL needs partition-as-write-boundary	Row-level concurrency — "two writers updating different rows no longer conflict, even if those rows live in the same file"	concepts/row-level-concurrency
#7: Z-ORDER patches what partitioning misses	Z-ORDER has "poor clustering quality" + "unnecessary rewrites"; Liquid clusters incrementally on write, layout stays optimal without periodic rebuilds	concepts/z-ordering · patterns/incremental-clustering-on-write
#8: Selective overwrite needs Dynamic Partition Overwrite	REPLACE USING / REPLACE ON work on any layout (clustered, partitioned, unclustered) and any compute (classic, SQL warehouses, Serverless); atomic, any-column matching	patterns/replace-using-and-replace-on-for-selective-overwrite

The post canonicalises concepts/over-partitioning as the load-bearing failure mode ("more than 75% of cases" on the Databricks customer base) and the clustering-keys-as-engine-input inversion as the architectural answer.
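Myths #2 and #3 interact: when each file holds a single value of a low-cardinality column, a DELETE on that column can be planned from file statistics alone. The sketch below is a toy model under stated assumptions, not Delta's implementation: `plan_delete` and the file dictionaries are hypothetical, nulls are ignored, and the only input is the per-file min/max that the myths table refers to.

```python
def plan_delete(files, value):
    """Classify files for DELETE WHERE key = value using only per-file min/max stats.

    drop:    min == max == value, so every row matches and the file can be
             removed from the table metadata without being opened
    skip:    value lies outside [min, max], so the file cannot contain matches
    rewrite: the file may mix matching and non-matching rows and must be read
    """
    plan = {"drop": [], "skip": [], "rewrite": []}
    for f in files:
        if f["min"] == f["max"] == value:
            plan["drop"].append(f["name"])
        elif value < f["min"] or value > f["max"]:
            plan["skip"].append(f["name"])
        else:
            plan["rewrite"].append(f["name"])
    return plan

# Hypothetical stats on a low-cardinality 'region' column.
single_value_files = [
    {"name": "f1", "min": "eu", "max": "eu"},
    {"name": "f2", "min": "us", "max": "us"},
    {"name": "f3", "min": "apac", "max": "apac"},
]
mixed_files = [
    {"name": "g1", "min": "apac", "max": "us"},
    {"name": "g2", "min": "eu", "max": "us"},
]

clean_plan = plan_delete(single_value_files, "eu")
mixed_plan = plan_delete(mixed_files, "eu")
print(clean_plan)
print(mixed_plan)
```

With one value per file the whole operation resolves to dropping and skipping files; with mixed files every candidate has to be opened and rewritten.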

Production case studies at PB scale (2026-06-01)¶

The 2026-06-01 source supplies three production case studies of the partition→Liquid migration shape:

Arctic Wolf — 3.8+ PB security telemetry¶

Scale: 3.8+ PB; 1+ trillion events/day
Workload: Threat hunters running 90-day queries on high-cardinality filter columns (user, host, IP, hash)
Outcome: 90-day queries 51s → 6.6s (7.7×); file count 4M → 2M; data freshness hours → minutes
Built on UC managed tables + Predictive Optimization + Delta + Liquid
Canonicalised at systems/arctic-wolf-security-telemetry-table

Bolt — TB-scale CDC table¶

Migration: in-place Liquid Conversion (Private Preview) — ALTER TABLE .. REPLACE PARTITIONED BY WITH CLUSTER BY
Outcome: Write throughput +138%; read time −21% avg / −63% max across 9 representative queries; zero downtime alongside live ingestion
The zero-downtime live-ingestion property validates the in-place conversion pattern on a workload class (CDC) that cannot tolerate cutover downtime.

Databricks-internal — 1.1 PB → 0.8 PB dashboard table¶

Scale: 1.1 PB → 0.8 PB (−27% storage reduction post-Liquid)
Pre-migration: partitioned by (date, hour); workload also filtered on source and id but those columns were too high-cardinality to partition on
Post-migration: clustered by (date, hour, source, id) — validates multi-dimensional clustering on mixed cardinalities
Outcome: Wall clock 406s → 70s (5.9×); bytes read 3.5 TB → 0.48 TB (−86%); table size −27%

The internal case discloses the structural reason the migration paid off: queries needed multi-dimensional filter pruning that partitioning's single-column choice couldn't deliver.
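The disclosed ratios can be re-derived from the before/after figures quoted above. This is plain arithmetic on the rounded numbers as they appear on this page, so small differences from the published multipliers are to be expected.

```python
def speedup(before, after):
    """How many times faster 'after' is than 'before'."""
    return before / after

def pct_reduction(before, after):
    """Percentage by which 'after' is smaller than 'before'."""
    return (before - after) / before * 100

arctic_wolf_query = speedup(51, 6.6)                       # seconds, 90-day queries
arctic_wolf_files = pct_reduction(4_000_000, 2_000_000)    # file count
internal_wall_clock = speedup(406, 70)                     # seconds
internal_bytes_read = pct_reduction(3.5, 0.48)             # TB
internal_table_size = pct_reduction(1.1, 0.8)              # PB

print(f"Arctic Wolf query speedup: {arctic_wolf_query:.1f}x")
print(f"Arctic Wolf file count:    -{arctic_wolf_files:.0f}%")
print(f"Internal wall clock:       {internal_wall_clock:.1f}x")
print(f"Internal bytes read:       -{internal_bytes_read:.0f}%")
print(f"Internal table size:       -{internal_table_size:.0f}%")
```

The rounded inputs give roughly 5.8× for the internal wall-clock figure against the disclosed 5.9×; the gap is presumably rounding in the published seconds rather than a discrepancy in the result.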

OPTIMIZE engineering improvements (2026-06-01)¶

The 2026-06-01 source quantifies the engineering work behind PB-scale viability. Two-phase OPTIMIZE breakdown:

Phase	Pre-improvement worst case	Post-improvement	Improvement
Planning (phase 1)	Up to 12 hours on 10 PB tables	23 minutes	31×
Execution (phase 2)	Baseline	5× faster on Medium DBSQL clusters	5×

This is the engineering disclosure behind the "Liquid Clustering works at PB scale" claim. Without these improvements, the periodic-OPTIMIZE-cycle cost on PB tables was prohibitive; the post-improvement envelope (23-minute planning, 5× faster execution) makes daily-or-better OPTIMIZE cycles tractable.

Forward: co-clustered joins (Private Preview, 2026-06-01)¶

The 2026-06-01 source discloses co-clustered joins as a Private Preview primitive: when both join sides are Liquid-clustered on the join key, shuffle is eliminated. Disclosed envelope on a real-world data warehousing benchmark: ~51% faster wall clock (28 minutes → 14 minutes) and 87% less shuffle data (1.2 TiB → 150 GiB).

This is architecturally significant because it makes the multi-dimensional clustering payoff compound across both filter and join workloads. The same column-choice decision that supports filter pruning also supports join shuffle elimination, provided the join key is among the clustering keys.
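The intuition behind shuffle elimination can be sketched with a classic merge join. The source does not describe how the engine implements co-clustered joins, so the code below is only an analogy: `merge_join` is a hypothetical helper, and "already sorted by the join key" stands in for "both sides Liquid-clustered on the join key". The point is that when both inputs are already organised on the key, matching rows can be paired in one forward pass with no redistribution step.

```python
def merge_join(left, right):
    """Inner-join two lists of (key, value) pairs that are each already sorted by key.

    A single forward pass over both inputs; neither side is re-sorted or
    repartitioned, which is the work a shuffle would otherwise do.
    """
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of rows sharing this key on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Emit every left/right pairing for the shared key.
            for _, lv in left[i:i_end]:
                for _, rv in right[j:j_end]:
                    out.append((lk, lv, rv))
            i, j = i_end, j_end
    return out

# Hypothetical inputs, both ordered on the join key.
orders = [(1, "o-a"), (2, "o-b"), (2, "o-c"), (4, "o-d")]
customers = [(1, "alice"), (2, "bob"), (3, "carol")]

joined = merge_join(orders, customers)
print(joined)
```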

When to choose liquid clustering vs traditional partitioning¶

Choose	When
Liquid clustering	Filter/join columns evolve over time; cardinality is too high for clean directory partitioning; small-file pressure is already a problem; the workload mixes filters across multiple columns
Traditional partitioning	Single, low-cardinality, stable filter column (e.g., daily ingest date with retention boundary aligned to it); pre-existing tooling expects directory layout

The Octopus rebuild sits firmly in the first row: multi-key joins with date ranges, evolving filter patterns across three streams, and pre-existing small-file pressure that fixed partitioning would have worsened.

Promotion note¶

Before this page existed, "liquid clustering" appeared as a tag on seven pages (systems/delta-lake, systems/uc-otel-trace-tables, systems/uc-managed-tables, systems/zerobus-ingest, concepts/lakehouse-native-observability, patterns/telemetry-to-lakehouse, patterns/managed-otel-ingestion-direct-to-lakehouse) but had no dedicated page. The Octopus Energy MHHS source is the first one to quote the small-file-and-over-partitioning rationale verbatim, which justified promotion.

`CLUSTER BY AUTO` — workload-aware key selection¶

The 2026-05-27 BI Serving Pointers source discloses an extension of liquid clustering not previously canonicalised on the wiki: automatic key selection driven by Predictive Optimization.

"If you're not sure which columns to choose, CLUSTER BY AUTO lets Predictive Optimization select keys based on observed query patterns." — BI Serving Pointers

This turns clustering-key selection from an architect-decision-at-table-creation (which historically required predicting the workload) into a workload-driven runtime adaptation: the substrate observes which columns are used most in filter and join predicates, and chooses clustering keys accordingly. The same predictive scheduler that runs auto-OPTIMIZE / VACUUM / stats collection also owns the clustering-key decision.

The BI-workload guidance from the same source: "For BI workloads, cluster on your most common filter and join columns — date keys, region, product category. You can select up to four columns, and if two columns are highly correlated, include only one." Even with CLUSTER BY AUTO, the architectural envelope holds — Liquid Clustering supports up to four keys, and correlated columns should not double-count.
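The guidance above (most common filter and join columns, at most four, only one of any correlated pair) can be written down as a simple heuristic. This is explicitly not the algorithm behind CLUSTER BY AUTO, which the source does not disclose; `suggest_keys`, the query log, and the correlated-pair set are hypothetical and exist only to make the stated rules concrete.

```python
from collections import Counter

MAX_KEYS = 4  # the four-column limit quoted in the BI guidance

def suggest_keys(query_columns, correlated_pairs):
    """Pick up to MAX_KEYS columns by how often they appear in observed predicates,
    skipping any column correlated with one that has already been chosen."""
    counts = Counter(col for cols in query_columns for col in cols)
    chosen = []
    for col, _ in counts.most_common():
        if any(frozenset((col, kept)) in correlated_pairs for kept in chosen):
            continue  # a correlated column is already represented
        chosen.append(col)
        if len(chosen) == MAX_KEYS:
            break
    return chosen

# Hypothetical log: the filter/join columns used by each observed query.
observed = [
    ("order_date", "region", "product_category"),
    ("order_date", "region", "order_month"),
    ("order_date", "region", "order_month"),
    ("order_date", "region", "order_month", "product_category"),
    ("order_date", "customer_id"),
]
# Hypothetical knowledge that order_month is derived from order_date.
correlated = {frozenset(("order_date", "order_month"))}

keys = suggest_keys(observed, correlated)
print(keys)
```

Here `order_month` is the third most frequent column but is passed over because `order_date` already covers it, which mirrors the "include only one" advice.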

Seen in¶

sources/2026-05-23-databricks-scaling-for-mhhs-octopus-energy-50x-cost-reduction — verbatim disclosure of the over-partitioning rationale; applied "across multiple tables for columns frequently used in filters and joins" in the Octopus three-stream margin data pipeline rebuild. Paired with patterns/broadcast-join-for-small-reference-tables in the "join and partition tuning" optimisation category.
sources/2026-05-27-databricks-bi-serving-pointers-maximizing-for-performance-and-tco — disclosure of CLUSTER BY AUTO (Predictive-Optimization-driven automatic key selection from observed query patterns) and the four-column / correlation-deduplication architectural envelope. Names liquid clustering as the partition-replacement primitive for the Gold-layer star-schema serving tier: "Liquid clustering replaces static partitioning and manual Z-ORDER — and unlike those approaches, you can redefine clustering keys without rewriting existing data."
sources/2026-06-01-databricks-debunking-8-data-layout-myths-why-liquid-clustering-outperfo — Eight-myth debunking face. The most architecturally dense Liquid Clustering disclosure on the wiki: each of the eight defender-of-partitioning arguments gets paired with the verbatim reality from the Databricks engineering team, and three production case studies (Arctic Wolf 3.8 PB security telemetry; Bolt TB-scale CDC; Databricks-internal 1.1 PB dashboard table) supply operational evidence at PB scale. Canonicalises eight new wiki primitives:
concepts/over-partitioning — "more than 75% of cases" failure rate disclosed.
concepts/file-level-data-skipping — the architectural fact that on modern OTFs "directory-pruning does not exist" (Myth #1 debunked).
concepts/metadata-only-operation — DELETE / COUNT / DISTINCT / GROUP BY computed from per-file min/max stats; "~90% faster" DELETEs, "up to 27×" aggregate speedup (Myth #3 debunked).
concepts/row-level-concurrency — "two writers updating different rows no longer conflict, even if those rows live in the same file"; partition-as-write-boundary is a workaround for an older concurrency model (Myth #6 debunked).
concepts/z-ordering — "poor clustering quality" + "unnecessary rewrites" structural problems (Myth #7 debunked).
concepts/multi-dimensional-clustering — internal 1.1 PB case: clustering on (date, hour, source, id) simultaneously, impossible under partitioning's cardinality limits.
concepts/co-clustered-join — Private Preview shuffle elimination; "~51% faster, 87% less shuffle data".
concepts/low-cardinality-clustering-optimization — automatic per-file=single-low-cardinality-value layout; 35% / 22% benchmark improvement.
Pairs with four new wiki patterns:
patterns/clustering-keys-as-engine-input — verbatim framing: "Liquid treats clustering keys as input that the engine uses to guide optimal file organization."
patterns/incremental-clustering-on-write — "Liquid clusters incrementally, including at write time, so the layout stays optimal without unnecessary rewrites."
patterns/in-place-partitioned-to-clustered-conversion — ALTER TABLE .. REPLACE PARTITIONED BY WITH CLUSTER BY (Private Preview) validated by Bolt's zero-downtime CDC migration.
patterns/replace-using-and-replace-on-for-selective-overwrite — REPLACE USING / ON layout-agnostic, compute-agnostic selective overwrite (Myth #8 debunked).
Plus first wiki canonicalisation of the systems/arctic-wolf-security-telemetry-table case (3.8+ PB, 1T+ events/day, 7.7× query speedup post-migration).
Operational numbers disclosed: 75%+ over-partitioning rate; 12h → 23m OPTIMIZE planning improvement on 10 PB tables; 5× OPTIMIZE execution speedup on Medium DBSQL clusters; ~90% faster metadata-only DELETEs; 27× aggregate speedup; Arctic Wolf 51s → 6.6s (7.7×) on 90-day queries; Bolt +138% write throughput / −21% avg read time; internal 5.9× speedup with 86% bytes-read reduction and 27% storage shrinkage.
Reserved for future ingests: corpus behind the 75% figure, GA timelines for co-clustered joins and Liquid Conversion, the conversion algorithm internals, attribution between Liquid Clustering / Predictive Optimization / UC managed tables across the Arctic Wolf result, the column-selection algorithm behind CLUSTER BY AUTO.