SYSTEM Cited by 3 sources
Liquid Clustering¶
Liquid Clustering is a Delta Lake feature that dynamically co-locates related records on user-specified clustering keys without requiring fixed partition boundaries. It replaces the legacy choice between "partition by column X" (which ages badly when the workload changes) and "don't partition" (which forces full-table scans on common predicates). For the wiki it is the partition-replacement primitive that turns up in nearly every recent Databricks-source page in the corpus — UC OTel trace tables, Zerobus-managed tables, UC managed tables under external-engine write, and the Octopus Energy margin data pipeline. (Source: sources/2026-05-23-databricks-scaling-for-mhhs-octopus-energy-50x-cost-reduction)
Why it exists — the over-partitioning trap¶
Traditional table partitioning forces an architect to commit, at table-creation time, to a fixed column choice (date, region, customer-tier, etc.) that defines a directory-shaped layout in object storage. Three failure modes emerge at scale:
| Failure mode | What goes wrong |
|---|---|
| Small-file problem | Each partition becomes a directory; high-cardinality partition columns produce many directories with few small files. Read planning, list operations, and metadata IO all explode. |
| Higher memory consumption | Many small partitions → many partition descriptors held in memory by the query planner → driver-side memory pressure and slower planning. |
| I/O overhead from over-partitioning | When the chosen column is finer-grained than queries need, every read touches more partition boundaries than necessary. |
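The small-file row above can be made concrete with a little arithmetic. This is a minimal sketch: the 1 TB table size, the one-file-per-partition write pattern, and the 10,000-customer cardinality are illustrative assumptions, not figures from any source on the wiki.

```python
def avg_file_size_mb(table_size_gb: float, partitions: int, files_per_partition: int = 1) -> float:
    """Average file size in MB if data is spread evenly over every partition directory."""
    total_files = partitions * files_per_partition
    return table_size_gb * 1024 / total_files


# Hypothetical 1 TB table (1024 GB), written with one file per partition.
by_date = avg_file_size_mb(1024, partitions=365)                        # one year of daily partitions
by_date_and_customer = avg_file_size_mb(1024, partitions=365 * 10_000)  # date x 10,000 customers

print(f"partition by date:            {by_date:,.1f} MB per file")
print(f"partition by date + customer: {by_date_and_customer:,.3f} MB per file")
```

Adding one high-cardinality column to the partition spec shrinks the average file from gigabytes to well under a megabyte in this toy setup, which is the shape of the small-file problem the table describes.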
The Octopus Energy MHHS rebuild names all three explicitly:
"Liquid clustering avoids the small-file problem, higher memory consumption, and I/O overhead that come from over-partitioning."
What it does¶
Liquid clustering co-locates related records on the clustering keys at storage-layout time without requiring those keys to define directory partitions. The architect specifies which columns matter for filter/join performance; Delta arranges the file layout so that queries on those columns can prune effectively, without committing to a fixed bucketing scheme that ages badly.
Three load-bearing properties for the wiki:
- No fixed partition boundaries. Records are co-located by proximity on the clustering keys; the system can rebalance as data arrives without forcing a re-partition operation.
- Multi-column clustering. Multiple clustering keys can be set; the layout optimises for queries that filter on any of them.
- Compatible with Predictive Optimization. The Databricks managed-storage layer (UC managed tables) refreshes the layout automatically.
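Why co-location helps pruning can be sketched with a toy model of per-file min/max statistics. This is not Delta's implementation: the `DataFile` class, the integer key, and the 100-rows-per-file size are assumptions made for illustration, and a plain sort stands in for whatever layout algorithm the engine actually uses.

```python
import random
from dataclasses import dataclass


@dataclass
class DataFile:
    min_key: int
    max_key: int
    rows: list


def write_files(keys, rows_per_file):
    """Cut a sequence of keys into files and record min/max stats per file."""
    files = []
    for i in range(0, len(keys), rows_per_file):
        chunk = keys[i:i + rows_per_file]
        files.append(DataFile(min(chunk), max(chunk), chunk))
    return files


def files_to_scan(files, lo, hi):
    """Keep only files whose [min, max] range can overlap the predicate lo <= key <= hi."""
    return [f for f in files if f.max_key >= lo and f.min_key <= hi]


random.seed(0)
keys = list(range(10_000))
random.shuffle(keys)

unclustered = write_files(keys, rows_per_file=100)        # arrival order
clustered = write_files(sorted(keys), rows_per_file=100)  # co-located on the key

lo, hi = 4_200, 4_299
unclustered_hits = files_to_scan(unclustered, lo, hi)
clustered_hits = files_to_scan(clustered, lo, hi)
print(len(unclustered_hits), "of", len(unclustered), "files scanned (unclustered)")
print(len(clustered_hits), "of", len(clustered), "files scanned (clustered)")
```

Both layouts hold the same rows in the same number of files. Only the clustered one has narrow per-file ranges, so only it lets the reader skip most files for a selective predicate.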
Where it shows up on the wiki¶
The recurring shape across Databricks-source pages: liquid clustering is enabled across multiple tables for columns frequently used in filters and joins, replacing what would have historically been partition-by clauses.
| Source / system | Clustering use |
|---|---|
| systems/octopus-margin-data-pipeline | "Liquid clustering was enabled across multiple tables for columns frequently used in filters and joins" — applied across the three-stream pipeline's tables |
| systems/uc-otel-trace-tables | Auto-liquid-clustered since a recent product update; keys not disclosed (likely trace_id / time-based) |
| systems/uc-managed-tables | One of the "managed-table-only properties" preserved across external-engine writes |
| systems/zerobus-ingest | The receiving Delta tables are auto-liquid-clustered |
In the Octopus case, liquid clustering appears in the "Join and partition tuning" category alongside broadcast joins for reference tables <500 MB. The combination — broadcast on the small side, liquid clustering on the large side — is the disclosed recipe for the multi-key joins with date ranges that drive the margin pipeline.
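The broadcast half of that recipe can be sketched as a hash join in which the small reference table is held in memory and the large table is streamed past it. This is a conceptual sketch only: the `tariffs` and `readings` tables and their columns are hypothetical, and the engine's real broadcast join runs across executors rather than in one process.

```python
def broadcast_hash_join(large_rows, small_rows, key):
    """Inner join: build a hash table on the small side, then stream the large side."""
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)
    for row in large_rows:
        for match in lookup.get(row[key], []):
            yield {**match, **row}


# Hypothetical small reference table and large fact table.
tariffs = [{"tariff_id": 1, "rate": 0.25}, {"tariff_id": 2, "rate": 0.31}]
readings = [
    {"meter": "m1", "tariff_id": 1, "kwh": 10.0},
    {"meter": "m2", "tariff_id": 2, "kwh": 4.0},
    {"meter": "m3", "tariff_id": 9, "kwh": 7.0},  # no matching tariff; dropped by the inner join
]

joined = list(broadcast_hash_join(readings, tariffs, "tariff_id"))
for row in joined:
    print(row)
```

The large side is never re-sorted or redistributed here, which is why the pairing works: clustering decides which files of the large table are read at all, and the broadcast avoids moving those rows afterwards.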
Eight myths debunked (2026-06-01)¶
The 2026-06-01 "Debunking 8 data layout myths" post is the most architecturally dense Liquid Clustering disclosure on the wiki — it addresses each of the eight defender-of-partitioning arguments in turn and supplies operational evidence at PB scale. Summary table:
| Myth | Reality | Wiki canonical |
|---|---|---|
| #1: Partitioning is faster — directories prune without opening files | "Directory-pruning does not exist on modern open table formats like Delta and Iceberg" — pruning is per-file via transaction-log statistics | concepts/file-level-data-skipping |
| #2: Partitioning is better on low-cardinality columns | Liquid applies a low-cardinality optimization automatically: each file holds a single value of the low-cardinality key, with rows sorted on the high-cardinality keys inside it; 35% lower clustering time, 22% faster queries | concepts/low-cardinality-clustering-optimization |
| #3: Liquid Clustering doesn't support metadata-only operations | DELETEs / COUNT / DISTINCT / GROUP BY all supported; ~90% faster metadata-only DELETEs; up to 27× aggregate speedup | concepts/metadata-only-operation |
| #4: Liquid Clustering doesn't work at PB scale | OPTIMIZE planning 12h → 23m on 10 PB tables; execution 5× faster; "dozens of customers now have PB-scale Liquid Clustered tables in production" | (this page) |
| #5: Liquid Clustering only benefits Databricks readers | Write-side optimization producing standard Parquet+stats; "Any compatible reader (e.g. Apache Spark, DuckDB, etc.) can use those stats to skip files" | (this page) |
| #6: Concurrent ETL needs partition-as-write-boundary | Row-level concurrency — "two writers updating different rows no longer conflict, even if those rows live in the same file" | concepts/row-level-concurrency |
| #7: Z-ORDER patches what partitioning misses | Z-ORDER has "poor clustering quality" + "unnecessary rewrites"; Liquid clusters incrementally on write, layout stays optimal without periodic rebuilds | concepts/z-ordering · patterns/incremental-clustering-on-write |
| #8: Selective overwrite needs Dynamic Partition Overwrite | REPLACE USING / REPLACE ON work on any layout (clustered, partitioned, unclustered) and any compute (classic, SQL warehouses, Serverless); atomic, any-column matching | patterns/replace-using-and-replace-on-for-selective-overwrite |
The post canonicalises concepts/over-partitioning as the load-bearing failure mode ("more than 75% of cases" on the Databricks customer base) and the clustering-keys-as-engine-input inversion as the architectural answer.
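The metadata-only row (Myth #3) is easier to follow with a toy model of per-file statistics. A minimal sketch, assuming each file entry carries a record count and min/max values for one column. The `FileStats` class and the `day` column are hypothetical, and this is not the engine's planner.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    path: str
    num_records: int
    min_day: int
    max_day: int


def metadata_only_count(files):
    """COUNT(*) answered from per-file record counts, without opening any file."""
    return sum(f.num_records for f in files)


def plan_delete(files, lo, hi):
    """Classify files for DELETE WHERE lo <= day <= hi using stats only."""
    drop, rewrite, keep = [], [], []
    for f in files:
        if lo <= f.min_day and f.max_day <= hi:
            drop.append(f.path)     # every row matches: the file can be removed as a whole
        elif f.max_day < lo or f.min_day > hi:
            keep.append(f.path)     # no row can match: untouched
        else:
            rewrite.append(f.path)  # partial overlap: rows must be read and filtered
    return drop, rewrite, keep


files = [
    FileStats("f0", 1000, 1, 10),
    FileStats("f1", 1000, 11, 20),
    FileStats("f2", 1000, 21, 30),
    FileStats("f3", 500, 28, 40),
]

print(metadata_only_count(files))         # 3500
print(plan_delete(files, lo=11, hi=30))   # (['f1', 'f2'], ['f3'], ['f0'])
```

The tighter the per-file ranges, the more files land in the `drop` or `keep` buckets and the fewer need a rewrite. That is one plausible reading of why a clustered layout and metadata-only operations reinforce each other.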
Production case studies at PB scale (2026-06-01)¶
The 2026-06-01 source supplies three production case studies of the partition→Liquid migration shape:
Arctic Wolf — 3.8+ PB security telemetry¶
- Scale: 3.8+ PB; 1+ trillion events/day
- Workload: Threat hunters running 90-day queries on high-cardinality filter columns (user, host, IP, hash)
- Outcome: 90-day queries 51s → 6.6s (7.7×); file count 4M → 2M; data freshness hours → minutes
- Built on UC managed tables + Predictive Optimization + Delta + Liquid
- Canonicalised at systems/arctic-wolf-security-telemetry-table
Bolt — TB-scale CDC table¶
- Migration: in-place Liquid Conversion (Private Preview) via `ALTER TABLE .. REPLACE PARTITIONED BY WITH CLUSTER BY`
- Outcome: Write throughput +138%; read time −21% avg / −63% max across 9 representative queries; zero downtime alongside live ingestion
- The zero-downtime live-ingestion property validates the in-place conversion pattern on a workload class (CDC) that cannot tolerate cutover downtime.
Databricks-internal — 1.1 PB → 0.8 PB dashboard table¶
- Scale: 1.1 PB → 0.8 PB (−27% storage reduction post-Liquid)
- Pre-migration: partitioned by (date, hour); workload also filtered on `source` and `id` but those columns were too high-cardinality to partition on
- Post-migration: clustered by `(date, hour, source, id)`, which validates multi-dimensional clustering on mixed cardinalities
- Outcome: Wall clock 406s → 70s (5.9×); bytes read 3.5 TB → 0.48 TB (−86%); table size −27%
The internal case discloses the structural reason the migration paid off: queries needed multi-dimensional filter pruning that partitioning's single-column choice couldn't deliver.
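The structural point can be reproduced in miniature. The sketch below compares a layout grouped only by `date` with one ordered on `(date, source)`, then counts the files a two-column filter has to read. The row counts, the integer-coded columns, and the plain lexicographic sort are all assumptions for illustration; the internal table's real layout algorithm is not disclosed here.

```python
import random


def to_files(rows, rows_per_file=100):
    """Cut (date, source) rows into files, keeping per-column min/max stats."""
    files = []
    for i in range(0, len(rows), rows_per_file):
        chunk = rows[i:i + rows_per_file]
        dates = [d for d, _ in chunk]
        sources = [s for _, s in chunk]
        files.append({"date": (min(dates), max(dates)),
                      "source": (min(sources), max(sources))})
    return files


def files_read(files, date, source):
    """Count files whose stats cannot rule out date = ... AND source = ..."""
    def may_contain(bounds, value):
        return bounds[0] <= value <= bounds[1]
    return sum(may_contain(f["date"], date) and may_contain(f["source"], source)
               for f in files)


random.seed(1)
rows = [(d, s) for d in range(10) for s in range(100) for _ in range(10)]

# Layout 1: grouped by date only; source order inside each date is arbitrary.
by_date = []
for d in range(10):
    day_rows = [r for r in rows if r[0] == d]
    random.shuffle(day_rows)
    by_date.extend(day_rows)
partitioned = to_files(by_date)

# Layout 2: rows co-located on both columns.
clustered = to_files(sorted(rows))

partitioned_hits = files_read(partitioned, date=3, source=42)
clustered_hits = files_read(clustered, date=3, source=42)
print("date-only layout:    ", partitioned_hits, "files read")
print("(date, source) layout:", clustered_hits, "files read")
```

The date-only layout prunes to one day's files but cannot narrow further, because every file in that day spans nearly the whole `source` range. The two-column layout narrows on both predicates.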
OPTIMIZE engineering improvements (2026-06-01)¶
The 2026-06-01 source quantifies the engineering work behind PB-scale viability. Two-phase OPTIMIZE breakdown:
| Phase | Pre-improvement worst case | Post-improvement | Improvement |
|---|---|---|---|
| Planning (phase 1) | Up to 12 hours on 10 PB tables | 23 minutes | 31× |
| Execution (phase 2) | Baseline | 5× faster on Medium DBSQL clusters | 5× |
This is the engineering disclosure behind the "Liquid Clustering works at PB scale" claim. Without these improvements, the periodic-OPTIMIZE-cycle cost on PB tables was prohibitive; the post-improvement envelope (23-minute planning, 5× faster execution) makes daily-or-better OPTIMIZE cycles tractable.
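The tractability claim follows from simple arithmetic on the disclosed planning figures. The sketch below only restates the numbers in the table; the 24-hour window used as a denominator is an assumption standing in for a daily OPTIMIZE cadence.

```python
planning_before_min = 12 * 60   # "up to 12 hours" worst case, in minutes
planning_after_min = 23         # post-improvement planning time, in minutes
day_min = 24 * 60               # assumed daily OPTIMIZE cadence

planning_speedup = planning_before_min / planning_after_min
share_before = planning_before_min / day_min
share_after = planning_after_min / day_min

print(f"planning speedup: {planning_speedup:.1f}x")           # 31.3x
print(f"share of a day spent planning: {share_before:.0%} -> {share_after:.1%}")
```

At the old worst case, planning alone would consume half of a daily cycle before any file was rewritten. At 23 minutes it is under 2% of the day.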
Forward: co-clustered joins (Private Preview, 2026-06-01)¶
The 2026-06-01 source discloses co-clustered joins as a Private Preview primitive: when both join sides are Liquid-clustered on the join key, shuffle is eliminated. Disclosed envelope on a real-world data warehousing benchmark: ~51% faster wall clock (28 minutes → 14 minutes) and 87% less shuffle data (1.2 TiB → 150 GiB).
This is architecturally significant because it makes the multi-dimensional clustering payoff compound across both filter and join workloads. The same column-choice decision that supports filter pruning also supports join shuffle elimination, provided the join key is among the clustering keys.
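The intuition behind shuffle elimination can be sketched as follows. If both tables keep files with narrow, ordered key ranges, only file pairs with overlapping ranges can produce matches, and each pair can be joined locally. The source does not describe the Private Preview implementation, so everything below (the file model, `orders`, `payments`, the pairing rule) is an illustrative assumption.

```python
def make_files(rows, rows_per_file):
    """Store (key, value) rows in key order and record each file's key range."""
    rows = sorted(rows)
    files = []
    for i in range(0, len(rows), rows_per_file):
        chunk = rows[i:i + rows_per_file]
        files.append({"min": chunk[0][0], "max": chunk[-1][0], "rows": chunk})
    return files


def overlapping_pairs(left_files, right_files):
    """Only file pairs with overlapping key ranges can contain matching keys."""
    return [(l, r) for l in left_files for r in right_files
            if l["min"] <= r["max"] and r["min"] <= l["max"]]


def local_join(left_rows, right_rows):
    """Hash join of two files; no rows leave the pair being processed."""
    index = {}
    for key, value in right_rows:
        index.setdefault(key, []).append(value)
    return [(key, lv, rv) for key, lv in left_rows for rv in index.get(key, [])]


# Hypothetical tables, both laid out on the join key.
orders = make_files([(k, f"order-{k}") for k in range(100)], rows_per_file=25)
payments = make_files([(k, f"pay-{k}") for k in range(0, 100, 2)], rows_per_file=10)

pairs = overlapping_pairs(orders, payments)
joined = [row for l, r in pairs for row in local_join(l["rows"], r["rows"])]

print(len(pairs), "of", len(orders) * len(payments), "file pairs joined")
print(len(joined), "matching rows")
```

In this toy layout, 8 of the 20 possible file pairs need to be examined, and no row is redistributed by key before joining. A real engine still has to schedule those pairs across workers, which this sketch does not model.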
When to choose liquid clustering vs traditional partitioning¶
| Choose | When |
|---|---|
| Liquid clustering | Filter/join columns evolve over time; cardinality is too high for clean directory partitioning; small-file pressure is already a problem; the workload mixes filters across multiple columns |
| Traditional partitioning | Single, low-cardinality, stable filter column (e.g., daily ingest date with retention boundary aligned to it); pre-existing tooling expects directory layout |
The Octopus rebuild sits firmly in the first row: multi-key joins with date ranges, evolving filter patterns across three streams, and pre-existing small-file pressure that fixed partitioning would have worsened.
Promotion note¶
Before this page existed, "liquid clustering" appeared as a tag on seven pages (systems/delta-lake, systems/uc-otel-trace-tables, systems/uc-managed-tables, systems/zerobus-ingest, concepts/lakehouse-native-observability, patterns/telemetry-to-lakehouse, patterns/managed-otel-ingestion-direct-to-lakehouse) but had no dedicated page. The Octopus Energy MHHS source is the first one to quote the small-file-and-over-partitioning rationale verbatim, which justified promotion.
CLUSTER BY AUTO — workload-aware key selection¶
The 2026-05-27 BI Serving Pointers source discloses an extension of liquid clustering not previously canonicalised on the wiki: automatic key selection driven by Predictive Optimization.
"If you're not sure which columns to choose, `CLUSTER BY AUTO` lets Predictive Optimization select keys based on observed query patterns." (BI Serving Pointers)
This turns clustering-key selection from an
architect-decision-at-table-creation (which historically
required predicting the workload) into a workload-driven
runtime adaptation: the substrate observes which columns are
used most in filter and join predicates, and chooses clustering
keys accordingly. The same predictive scheduler that runs
auto-OPTIMIZE / VACUUM / stats collection also owns the
clustering-key decision.
The BI-workload guidance from the same source: "For BI
workloads, cluster on your most common filter and join columns —
date keys, region, product category. You can select up to four
columns, and if two columns are highly correlated, include only
one." Even with CLUSTER BY AUTO, the architectural envelope
holds — Liquid Clustering supports up to four keys, and
correlated columns should not double-count.
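The quoted guidance (most common filter and join columns, at most four, one column per correlated pair) can be written down as a small heuristic. This is explicitly not the Predictive Optimization algorithm, which the sources do not disclose. The `suggest_keys` function, the query log format, and the column names are all hypothetical.

```python
from collections import Counter

MAX_KEYS = 4  # "You can select up to four columns"


def suggest_keys(query_log, correlated_groups=()):
    """Rank columns by predicate frequency, keep at most MAX_KEYS,
    and keep only one column from each group of correlated columns."""
    usage = Counter(col for predicates in query_log
                    for col in dict.fromkeys(predicates))  # count each column once per query
    group_of = {col: i for i, group in enumerate(correlated_groups) for col in group}
    chosen, used_groups = [], set()
    for col, _ in usage.most_common():
        group = group_of.get(col)
        if group is not None and group in used_groups:
            continue  # a correlated column is already selected
        chosen.append(col)
        if group is not None:
            used_groups.add(group)
        if len(chosen) == MAX_KEYS:
            break
    return chosen


# Hypothetical log: the filter/join columns seen in each observed query.
query_log = [
    ["order_date", "region", "product_category"],
    ["order_date", "region", "product_category"],
    ["order_date", "region", "product_category", "region_code"],
    ["order_date", "region", "product_category", "region_code"],
    ["order_date", "region", "region_code", "customer_id"],
    ["order_date", "customer_id", "channel"],
]

keys = suggest_keys(query_log, correlated_groups=[("region", "region_code")])
print(keys)  # ['order_date', 'region', 'product_category', 'customer_id']
```

Here `region_code` is used more often than `customer_id` but is skipped because it is declared correlated with `region`, which mirrors the "include only one" advice in the quote.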
Seen in¶
- sources/2026-05-23-databricks-scaling-for-mhhs-octopus-energy-50x-cost-reduction — verbatim disclosure of the over-partitioning rationale; applied "across multiple tables for columns frequently used in filters and joins" in the Octopus three-stream margin data pipeline rebuild. Paired with patterns/broadcast-join-for-small-reference-tables in the "join and partition tuning" optimisation category.
- sources/2026-05-27-databricks-bi-serving-pointers-maximizing-for-performance-and-tco: disclosure of `CLUSTER BY AUTO` (Predictive-Optimization-driven automatic key selection from observed query patterns) and the four-column / correlation-deduplication architectural envelope. Names liquid clustering as the partition-replacement primitive for the Gold-layer star-schema serving tier: "Liquid clustering replaces static partitioning and manual Z-ORDER — and unlike those approaches, you can redefine clustering keys without rewriting existing data."
- sources/2026-06-01-databricks-debunking-8-data-layout-myths-why-liquid-clustering-outperfo: Eight-myth debunking face. The most architecturally dense Liquid Clustering disclosure on the wiki: each of the eight defender-of-partitioning arguments gets paired with the verbatim reality from the Databricks engineering team, and three production case studies (Arctic Wolf 3.8 PB security telemetry; Bolt TB-scale CDC; Databricks-internal 1.1 PB dashboard table) supply operational evidence at PB scale. Canonicalises eight new wiki primitives:
  - concepts/over-partitioning: "more than 75% of cases" failure rate disclosed.
  - concepts/file-level-data-skipping: the architectural fact that on modern OTFs "directory-pruning does not exist" (Myth #1 debunked).
  - concepts/metadata-only-operation: DELETE / COUNT / DISTINCT / GROUP BY computed from per-file min/max stats; "~90% faster" DELETEs, "up to 27×" aggregate speedup (Myth #3 debunked).
  - concepts/row-level-concurrency: "two writers updating different rows no longer conflict, even if those rows live in the same file"; partition-as-write-boundary is a workaround for an older concurrency model (Myth #6 debunked).
  - concepts/z-ordering: "poor clustering quality" + "unnecessary rewrites" structural problems (Myth #7 debunked).
  - concepts/multi-dimensional-clustering: internal 1.1 PB case: clustering on `(date, hour, source, id)` simultaneously, impossible under partitioning's cardinality limits.
  - concepts/co-clustered-join: Private Preview shuffle elimination; "~51% faster, 87% less shuffle data".
  - concepts/low-cardinality-clustering-optimization: automatic per-file=single-low-cardinality-value layout; 35% / 22% benchmark improvement.

  Pairs with four new wiki patterns:
  - patterns/clustering-keys-as-engine-input: verbatim framing: "Liquid treats clustering keys as input that the engine uses to guide optimal file organization."
  - patterns/incremental-clustering-on-write: "Liquid clusters incrementally, including at write time, so the layout stays optimal without unnecessary rewrites."
  - patterns/in-place-partitioned-to-clustered-conversion: `ALTER TABLE .. REPLACE PARTITIONED BY WITH CLUSTER BY` (Private Preview) validated by Bolt's zero-downtime CDC migration.
  - patterns/replace-using-and-replace-on-for-selective-overwrite: REPLACE USING / ON layout-agnostic, compute-agnostic selective overwrite (Myth #8 debunked).

Plus first wiki canonicalisation of the
systems/arctic-wolf-security-telemetry-table case (3.8+ PB,
1T+ events/day, 7.7× query speedup post-migration). Operational
numbers disclosed: 75%+ over-partitioning rate; 12h → 23m OPTIMIZE
planning improvement on 10 PB tables; 5× OPTIMIZE execution
speedup on Medium DBSQL clusters; ~90% faster metadata-only
DELETEs; 27× aggregate speedup; Arctic Wolf 51s → 6.6s (7.7×) on
90-day queries; Bolt +138% write throughput / −21% avg read time;
internal 5.9× speedup with 86% bytes-read reduction and 27%
storage shrinkage. Reserved for future ingests: corpus behind
the 75% figure, GA timelines for co-clustered joins and Liquid
Conversion, the conversion algorithm internals, attribution
between Liquid Clustering / Predictive Optimization / UC managed
tables across the Arctic Wolf result, the column-selection
algorithm behind `CLUSTER BY AUTO`.
Related¶
- Systems: systems/delta-lake · systems/octopus-margin-data-pipeline · systems/uc-managed-tables · systems/uc-otel-trace-tables · systems/zerobus-ingest · systems/databricks-predictive-optimization · systems/arctic-wolf-security-telemetry-table
- Concepts: concepts/partition-strategy · concepts/partition-skew-data-skew · concepts/partition-pruning · concepts/automatic-table-optimization · concepts/over-partitioning · concepts/file-level-data-skipping · concepts/metadata-only-operation · concepts/row-level-concurrency · concepts/z-ordering · concepts/multi-dimensional-clustering · concepts/co-clustered-join · concepts/low-cardinality-clustering-optimization
- Patterns: patterns/grain-aligned-stream-split · patterns/broadcast-join-for-small-reference-tables · patterns/managed-table-as-default-storage-layer · patterns/clustering-keys-as-engine-input · patterns/incremental-clustering-on-write · patterns/in-place-partitioned-to-clustered-conversion · patterns/replace-using-and-replace-on-for-selective-overwrite