Debunking 8 data layout myths: why Liquid Clustering outperforms partitioning

Summary

Databricks blog post (Tier-3, 2026-06-01) that debunks eight persistent myths about Liquid Clustering vs. Hive-style partitioning on Lakehouse tables, then walks through three production case studies of partition→liquid migrations at scale and previews two forward-looking primitives (co-clustered joins and in-place Liquid Conversion). The architectural through-line: on modern open table formats like Delta and Iceberg, partition pruning never worked the way the myth claims. Pruning has always happened against per-file statistics in the transaction log, not against directory structure, so the "directory shortcut" that defenders of partitioning invoke is fictional. Once that's accepted, every supposed advantage of partitioning collapses to a specific case that Liquid Clustering handles better: low-cardinality columns, metadata-only operations, petabyte scale, non-Databricks readers, concurrent ETL, Z-Ordering, and selective overwrites. The post discloses concrete operational numbers (75%+ over-partitioning rate in observed cases; 22% query speedup on a benchmark; 27× metadata-only-aggregate speedup; 90% faster DELETEs; 5× OPTIMIZE execution improvement; 7.7× speedup on Arctic Wolf's 3.8 PB security telemetry table; 138% write throughput increase at Bolt; 5.9× speedup + 27% storage reduction on an internal 1.1 PB table).

Key takeaways

Hive-style partitioning over-partitions in 75%+ of observed cases. Verbatim: "in our analysis, Hive-style partitioning leads to over-partitioning and small-file problems in more than 75% of cases." The over-partitioning trap is not a usage error but a structural property of forcing column-choice-at-table-creation (see concepts/over-partitioning). The architect commits to a physical layout in the file structure; cardinality mistakes produce billions of tiny files; column-choice mistakes slow queries instead of speeding them up; "either way, you're stuck rewriting the table."
Directory-pruning is a myth on modern open table formats. Verbatim debunking of Myth #1: "Directory-pruning does not exist on modern open table formats like Delta and Iceberg. Delta, for example, uses a transaction log to track every data file along with per-column statistics, and pruning happens against those statistics, not the directory structure. The engine never lists directories to plan a query." This is the load-bearing architectural fact behind the entire post — once it's accepted, the case for partitioning at the storage-layout layer evaporates. Canonicalised at concepts/file-level-data-skipping.
Liquid Clustering automatically applies low-cardinality optimizations. Myth #2 debunked: when clustering by (date, user_id) with low-cardinality date, "the system aims for each file to contain rows from only a single date. Higher-cardinality columns, like user_id, are then automatically used for finer-grained sorting within each date's files, without having to rely on other sorting techniques like Z-Ordering." Disclosed benchmarks: "35% lower time for clustering and 22% faster query times" on a real-world data warehousing benchmark. Canonicalised at concepts/low-cardinality-clustering-optimization.
Metadata-only operations work on Liquid Clustered tables. Myth #3 debunked: "Liquid Clustering also supports metadata-only operations including DELETEs, COUNT, DISTINCT, and GROUP BY queries. The engine uses the same per-file min/max stats it uses for data skipping to determine when a query's answer can be computed from metadata alone." Numbers: "metadata-only DELETEs on Liquid Clustered tables ran ~90% faster than full-rewrite DELETEs. Other metadata-only aggregate queries saw up to 27× speedups." The mechanism — per-file min/max statistics — simultaneously powers data skipping (file pruning) AND metadata-only computation (no file reads at all). See concepts/metadata-only-operation.
OPTIMIZE planning improved 12 hours → 23 minutes; execution 5× faster. Myth #4 (PB-scale infeasibility) debunked with specific engineering wins: "Two years ago, planning, the first phase of OPTIMIZE, could take up to 12 hours on a 10 PB Liquid table in some cases. We've spent the time since reducing planning time down to 23 minutes. Execution, the second phase of OPTIMIZE, got 5× faster on a Medium DBSQL cluster." The headline data point: "dozens of customers now have PB-scale Liquid Clustered tables in production."
Liquid Clustering is a write-side optimization — any reader benefits. Myth #5 debunked: "Liquid Clustering is a write-side optimization. It's how the engine organizes files for efficient data skipping. The output is standard Parquet files with min/max stats, written into open table formats like Delta/Iceberg. Any compatible reader (e.g. open-source Apache Spark, DuckDB, etc.) can use those stats to skip files." The architectural framing is that Liquid Clustering is how files are arranged on write, not a runtime feature — so file-skipping benefit accrues to any reader of the resulting Parquet+stats artifacts.
Row-level concurrency removes one of the structural reasons teams partitioned tables. Myth #6 debunked: "Operating at partition granularity was a workaround for an older concurrency model. Unlike partitioning which only has file-level concurrency, Liquid provides row-level concurrency. Two writers updating different rows no longer conflict, even if those rows live in the same file. This removes one of the main reasons teams partitioned tables: maintaining write boundaries to avoid serialization." Canonicalised at concepts/row-level-concurrency.
Z-Ordering doesn't save partitioning; it has its own structural problems. Myth #7 is debunked by naming two flaws of Z-Order, quoted verbatim:
"poor clustering quality. Z-Order doesn't maintain a true ordering across the table. Values for the same column can get spread across many files, so per-file min/max ranges are wider and queries skip fewer files than they would with Liquid."
"unnecessary rewrites. Z-Order has to be rerun periodically as new data lands, and each rerun rewrites large amounts of old, possibly already-clustered data to restore clustering quality. With continuous ingestion, the cost of keeping data well-clustered with Z-Order grows along with the table."
Liquid's contrasting property: "Liquid clusters incrementally, including at write time, so the layout stays optimal without unnecessary rewrites." See concepts/z-ordering + patterns/incremental-clustering-on-write.
Selective overwrites work on Liquid via REPLACE USING / ON. Myth #8 debunked: REPLACE USING / REPLACE ON support selective overwrites on any data layout (Liquid clustered, partitioned, or plain unclustered) and run on any compute (classic clusters, SQL warehouses, Serverless), unlike Dynamic Partition Overwrite which "requires a Spark config" and is partitioning-specific. "The operation is atomic and matches on any column you choose." See patterns/replace-using-and-replace-on-for-selective-overwrite.

Three production case studies confirm the partition→liquid migration shape. Concrete operational data points:

Case	Scale	Outcome
Arctic Wolf security telemetry	3.8+ PB, 1+ trillion events/day	90-day queries 51s → 6.6s (7.7×); file count 4M → 2M; freshness hours → minutes
Bolt CDC table	TB-scale	Write throughput +138%; read times −21% avg (up to −63%); zero-downtime in-place conversion
Databricks internal dashboard table	1.1 PB → 0.8 PB	Wall clock 406s → 70s (5.9×); bytes read 3.5 TB → 0.48 TB (−86%); table size −27%

The internal case discloses a load-bearing architectural pattern: the partition design (date+hour) was "incomplete" because real queries also filtered on source and id, but adding those as partitions wasn't viable — "there were too many distinct values. This would have created billions of tiny files. Liquid Clustering removed the trade-off, allowing clustering on time and the additional identifier columns simultaneously, while maintaining good file sizes." Canonicalises concepts/multi-dimensional-clustering.
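The "billions of tiny files" trade-off in the internal case can be reproduced with back-of-envelope arithmetic. The sketch below is illustrative only: the post does not disclose how many distinct source or id values the table has, so every cardinality here is a hypothetical stand-in, and the even-spread assumption is a simplification.

```python
# Back-of-envelope sketch of how Hive-style partition counts multiply.
# All cardinalities are hypothetical; the post does not disclose the real ones.

PB = 10**15


def partition_count(cardinalities):
    """Upper bound on partition directories: product of distinct values per column."""
    total = 1
    for distinct_values in cardinalities.values():
        total *= distinct_values
    return total


def avg_bytes_per_partition(table_bytes, n_partitions):
    """Average data per partition if rows were spread evenly (a simplification)."""
    return table_bytes / n_partitions


# One year of data partitioned by date + hour only.
time_only = {"date": 365, "hour": 24}
# The same table if source and id were also promoted to partition columns.
with_ids = {"date": 365, "hour": 24, "source": 500, "id": 10_000}

n_time = partition_count(time_only)
n_ids = partition_count(with_ids)

per_part_time = avg_bytes_per_partition(1.1 * PB, n_time)
per_part_ids = avg_bytes_per_partition(1.1 * PB, n_ids)

print(f"date+hour:           {n_time:>14,} partitions, ~{per_part_time / 1e9:,.0f} GB each")
print(f"date+hour+source+id: {n_ids:>14,} partitions, ~{per_part_ids / 1e3:,.0f} KB each")
```

Under these assumed cardinalities, the two extra partition columns turn thousands of partitions into tens of billions, each holding only kilobytes. Clustering keys avoid this because they guide the sort order inside well-sized files instead of multiplying directories.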

Co-clustered joins (Private Preview) eliminate shuffle. Forward-looking disclosure: "Today, joining Liquid tables on their clustering columns can require a full data shuffle, even when the data is already organized by those columns. Co-clustered joins (now in Private Preview) remove that shuffle automatically. On a real-world data warehousing benchmark, a Liquid-to-Liquid join ran ~51% faster (28 minutes → 14 minutes) and shuffled 87% less data (1.2 TiB → 150 GiB) than the same query without the optimization." Canonicalised at concepts/co-clustered-join.
In-place Liquid Conversion (Private Preview) eliminates the rewrite. Forward-looking disclosure: "converting a partitioned table to Liquid Clustering required a full table rewrite and downstream breaking changes with REPLACE TABLE or a cutover with dual writes and planned downtime. We're introducing a new command (now in Private Preview) that makes this conversion easier, minimizing both downtime and rewrites." Bolt's case validates the operational property: "we ran the conversion from partitioning alongside live ingestion with zero downtime". The SQL surface: ALTER TABLE .. REPLACE PARTITIONED BY WITH CLUSTER BY. Canonicalised at patterns/in-place-partitioned-to-clustered-conversion.
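The mechanism behind Myths #1 and #3 above, per-file min/max statistics driving both file pruning and metadata-only answers, can be sketched in a few lines. This is a toy model, not the Delta log format or the Databricks planner: `FileEntry`, `LOG`, and the three functions are hypothetical names introduced for illustration, and a real log carries statistics for many columns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    """One data file as a transaction log might describe it (simplified, hypothetical)."""
    path: str
    num_rows: int
    min_date: str  # ISO dates compare correctly as plain strings
    max_date: str


LOG = [
    FileEntry("part-000.parquet", 1_000, "2024-01-01", "2024-01-01"),
    FileEntry("part-001.parquet", 1_200, "2024-01-02", "2024-01-02"),
    FileEntry("part-002.parquet", 900, "2024-01-02", "2024-01-03"),
    FileEntry("part-003.parquet", 1_100, "2024-01-04", "2024-01-04"),
]


def files_to_scan(log, lo, hi):
    """Data skipping: keep only files whose [min, max] range overlaps [lo, hi].

    No directory is listed; the decision uses log entries alone.
    """
    return [f for f in log if f.min_date <= hi and f.max_date >= lo]


def count_rows(log, lo, hi):
    """COUNT(*) WHERE date BETWEEN lo AND hi.

    Files entirely inside the range are answered from num_rows; files that
    only partially overlap would still have to be read.
    Returns (rows known from metadata, paths that still need a scan).
    """
    from_metadata, need_scan = 0, []
    for f in files_to_scan(log, lo, hi):
        if lo <= f.min_date and f.max_date <= hi:
            from_metadata += f.num_rows
        else:
            need_scan.append(f.path)
    return from_metadata, need_scan


def delete_range(log, lo, hi):
    """DELETE WHERE date BETWEEN lo AND hi.

    Files entirely inside the range are dropped from the log unopened;
    partially overlapping files would need a rewrite.
    Returns (untouched entries, dropped paths, paths needing a rewrite).
    """
    untouched, dropped, to_rewrite = [], [], []
    for f in log:
        if f.max_date < lo or f.min_date > hi:
            untouched.append(f)
        elif lo <= f.min_date and f.max_date <= hi:
            dropped.append(f.path)
        else:
            to_rewrite.append(f.path)
    return untouched, dropped, to_rewrite


print([f.path for f in files_to_scan(LOG, "2024-01-02", "2024-01-02")])
print(count_rows(LOG, "2024-01-01", "2024-01-02"))
print(delete_range(LOG, "2024-01-04", "2024-01-04")[1:])
```

The tighter each file's min/max range, the more often a query falls into the "entirely inside" or "no overlap" branches. That is why a layout aiming for one date per file helps skipping and metadata-only operations at the same time.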

Architecture and operational numbers

Datum	Value	Source quote
Over-partitioning rate (analysed cases)	>75%	"Hive-style partitioning leads to over-partitioning and small-file problems in more than 75% of cases."
Low-cardinality benchmark — clustering time	−35%	"35% lower time for clustering and 22% faster query times"
Low-cardinality benchmark — query time	−22%	(same)
Metadata-only DELETE speedup	~90% faster	"metadata-only DELETEs on Liquid Clustered tables ran ~90% faster than full-rewrite DELETEs"
Metadata-only aggregate speedup	up to 27×	"Other metadata-only aggregate queries saw up to 27× speedups"
OPTIMIZE planning on 10 PB table	12h → 23m	"reducing planning time down to 23 minutes"
OPTIMIZE execution speedup	5×	"Execution, the second phase of OPTIMIZE, got 5× faster on a Medium DBSQL cluster"
Arctic Wolf table scale	3.8+ PB, 1T+ events/day	"runs a 3.8+ PB security telemetry table ingesting 1+ trillion events per day"
Arctic Wolf 90-day query latency	51s → 6.6s (7.7×)	"90-day queries drop from 51 seconds to 6.6 seconds"
Arctic Wolf file count	4M → 2M	"File count dropped from 4M to 2M"
Arctic Wolf data freshness	hours → minutes	"Data freshness improved from hours to minutes"
Bolt write throughput	+138%	"Write throughput (rows/sec) increased by 138%"
Bolt read time avg / max	−21% avg / −63% max	"Read times were reduced by up to 63%, with an average of 21% reduction across 9 representative queries"
Bolt downtime during conversion	zero	"we ran the conversion from partitioning alongside live ingestion with zero downtime"
Internal table size	1.1 PB → 0.8 PB (−27%)	"Total size dropped from 1.1 PB to 0.8 PB, a 27% reduction with no change in the underlying data"
Internal wall clock (16 queries)	406s → 70s (5.9×)	"Wall Clock Time: 406s → 70s — 5.9x speedup"
Internal bytes read	3.5 TB → 0.48 TB (−86%)	"Bytes Read: 3.5 TB → 0.48 TB — 86% fewer bytes read"
Co-clustered join speedup (preview)	~51% faster	"Liquid-to-Liquid join ran ~51% faster (28 minutes → 14 minutes)"
Co-clustered join shuffle reduction	−87% (1.2 TiB → 150 GiB)	"shuffled 87% less data (1.2 TiB → 150 GiB)"
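The derived ratios in the table can be recomputed from the before/after pairs the post quotes. The snippet below is a small sanity check; the only assumption is the 1 TiB = 1024 GiB conversion, and the helper names are introduced here for illustration.

```python
def speedup(before, after):
    """How many times faster: before / after."""
    return before / after


def pct_reduction(before, after):
    """Percentage drop from before to after."""
    return 100.0 * (before - after) / before


GIB_PER_TIB = 1024

checks = [
    ("Arctic Wolf 90-day query (51s -> 6.6s)", "x", speedup(51, 6.6)),
    ("Internal wall clock (406s -> 70s)", "x", speedup(406, 70)),
    ("OPTIMIZE planning (12h -> 23min)", "x", speedup(12 * 60, 23)),
    ("Arctic Wolf file count (4M -> 2M)", "%", pct_reduction(4_000_000, 2_000_000)),
    ("Internal bytes read (3.5 TB -> 0.48 TB)", "%", pct_reduction(3.5, 0.48)),
    ("Internal table size (1.1 PB -> 0.8 PB)", "%", pct_reduction(1.1, 0.8)),
    ("Co-clustered join time (28min -> 14min)", "%", pct_reduction(28, 14)),
    ("Co-clustered shuffle (1.2 TiB -> 150 GiB)", "%", pct_reduction(1.2 * GIB_PER_TIB, 150)),
]

for label, unit, value in checks:
    print(f"{label:44s} {value:6.1f}{unit}")
```

Most rows agree with the quoted figures to rounding. Two are slightly off: 406s → 70s works out to 5.8× rather than the quoted 5.9×, and 28 → 14 minutes is a 50% reduction rather than ~51%. Presumably the published inputs are themselves rounded.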

Systems extracted

systems/liquid-clustering — central system; eight myths debunked.
systems/delta-lake — "Delta, for example, uses a transaction log to track every data file along with per-column statistics, and pruning happens against those statistics, not the directory structure." — the load-bearing architectural fact for the whole post.
systems/apache-iceberg — named alongside Delta as the modern open table formats where directory-pruning is myth.
systems/databricks-predictive-optimization — Arctic Wolf's enabling substrate for managed-table auto-clustering / auto-stats / auto-OPTIMIZE.
systems/uc-managed-tables — Arctic Wolf's table substrate ("Unity Catalog managed tables with Predictive Optimization").
systems/arctic-wolf-security-telemetry-table — production 3.8 PB+ case study; 1T+ events/day.
systems/apache-spark — named as one of the "compatible readers" for Liquid Clustering's standard-Parquet-with-stats output.
systems/duckdb — named as another "compatible reader" for the same output.

Concepts extracted

concepts/over-partitioning — "more than 75% of cases"; the load-bearing failure mode the post canonicalises against.
concepts/file-level-data-skipping — "pruning happens against those statistics, not the directory structure".
concepts/metadata-only-operation — DELETE / COUNT / DISTINCT / GROUP BY computed from per-file min/max stats without scanning data files.
concepts/row-level-concurrency — "Two writers updating different rows no longer conflict, even if those rows live in the same file".
concepts/z-ordering — "poor clustering quality" + "unnecessary rewrites"; the older sibling Liquid Clustering supersedes.
concepts/multi-dimensional-clustering — "clustering on time and the additional identifier columns simultaneously" in the internal 1.1 PB case.
concepts/co-clustered-join — shuffle elimination when both sides are clustered on the join key.
concepts/low-cardinality-clustering-optimization — automatic per-file=single-low-cardinality-value layout, with higher-cardinality keys nested.
concepts/small-file-problem-on-object-storage — sibling pathology paired with over-partitioning.
concepts/write-amplification — Z-Order's "unnecessary rewrites" are explicit write-amplification.
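The distinction behind concepts/row-level-concurrency can be illustrated with a toy conflict check. This is a sketch under simplifying assumptions, not how Delta or Databricks detect conflicts: each write is modelled as a set of (file, row index) pairs, and `Write`, `conflicts_file_level`, and `conflicts_row_level` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    """Rows a transaction modifies, as (file path, row index) pairs. Hypothetical model."""
    name: str
    touched: frozenset


def files_of(write):
    """The set of files a write touches, ignoring which rows."""
    return {path for path, _ in write.touched}


def conflicts_file_level(a, b):
    """File-level check: any shared file is a conflict, even if the rows differ."""
    return bool(files_of(a) & files_of(b))


def conflicts_row_level(a, b):
    """Row-level check: only an overlap on the same row of the same file conflicts."""
    return bool(a.touched & b.touched)


w1 = Write("update-user-17", frozenset({("part-000.parquet", 0)}))
w2 = Write("update-user-42", frozenset({("part-000.parquet", 7)}))
w3 = Write("delete-user-17", frozenset({("part-000.parquet", 0)}))

print(conflicts_file_level(w1, w2), conflicts_row_level(w1, w2))
print(conflicts_file_level(w1, w3), conflicts_row_level(w1, w3))
```

In this model `w1` and `w2` share a file but not a row, so only the file-level check rejects them. That is the case partition boundaries used to work around by keeping concurrent writers in separate files.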

Patterns extracted

patterns/clustering-keys-as-engine-input — "Liquid treats clustering keys as input that the engine uses to guide optimal file organization. Keys can be changed at any time"; intent separated from layout.
patterns/incremental-clustering-on-write — "Liquid clusters incrementally, including at write time, so the layout stays optimal without unnecessary rewrites".
patterns/in-place-partitioned-to-clustered-conversion — Bolt's ALTER TABLE .. REPLACE PARTITIONED BY WITH CLUSTER BY (Private Preview); zero-downtime.
patterns/replace-using-and-replace-on-for-selective-overwrite — substitute for Dynamic Partition Overwrite; works on any data layout and any compute.
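The selective-overwrite pattern can be modelled in plain Python to show what "matches on any column you choose" means in practice. The sketch assumes REPLACE USING matches by equality on the listed columns, drops the matching target rows, and inserts the incoming rows; the post does not spell out those semantics, so treat them as an assumption. `replace_using` is a hypothetical helper, not a Databricks API.

```python
def replace_using(target, source, keys):
    """Toy selective overwrite over lists of dict rows.

    Assumed semantics (not confirmed by the post): every target row whose
    values for `keys` equal those of any source row is removed, then all
    source rows are added. A new list is returned and `target` is left
    untouched, mimicking an all-or-nothing commit.
    """
    incoming = {tuple(row[k] for k in keys) for row in source}
    kept = [row for row in target if tuple(row[k] for k in keys) not in incoming]
    return kept + list(source)


# Hypothetical table with no partition columns at all.
target = [
    {"date": "2024-01-01", "user_id": 1, "amount": 10},
    {"date": "2024-01-02", "user_id": 1, "amount": 20},
    {"date": "2024-01-02", "user_id": 2, "amount": 30},
]
# Corrected data for a single day.
source = [
    {"date": "2024-01-02", "user_id": 1, "amount": 25},
]

result = replace_using(target, source, keys=["date"])
for row in result:
    print(row)
```

Because the match runs on ordinary column values rather than partition directories, the same call shape works whether the table is Liquid clustered, partitioned, or unclustered. That is the property the post contrasts with Dynamic Partition Overwrite.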

Caveats

Tier-3 source on borderline product-marketing scope. The post is structured as "X common myths debunked" — a recognised consultative-listicle shape — and is fundamentally a Liquid Clustering pitch. It passes scope decisively because the technical content is dense (transaction-log-based pruning mechanism, per-file min/max statistics powering both file skipping and metadata-only operations, OPTIMIZE engineering improvements with specific 12h→23min planning numbers, three production case studies with concrete numbers, structural critique of Z-Order rewrites, row-level vs file-level concurrency distinction, co-clustered join shuffle elimination) and offers several first-class wiki primitives not previously canonicalised (over-partitioning as a named pathology, file-level data skipping as a wiki concept, metadata-only operations as a class, row-level concurrency as a wiki concept, Z-ordering as a wiki concept disambiguated against liquid clustering).
The 75% over-partitioning figure is unscoped. "in our analysis" — corpus, methodology, and what counts as "over-partitioning" are not disclosed. The number is plausible given operational experience but should be treated as Databricks' customer-base observation, not an industry statistic.
The 22% / 35% benchmark numbers are unscoped. The benchmark behind "a real-world data warehousing benchmark" is not identified (TPC-DS is a plausible guess, but the post does not confirm it). Workload axes, query mix, and dataset scale are not enumerated.
The 27× metadata-only speedup is "up to" — distribution unknown. Specific query shapes, dataset characteristics, and what's typical vs upper-tail are not disclosed.
The OPTIMIZE 12h→23min disclosure is missing context. Specifically: "on a 10 PB Liquid table in some cases" — the cases where 12h was reachable, the algorithmic improvements that brought it down, and the worst-case behaviour at the new envelope are not disclosed.
Co-clustered joins are Private Preview. GA timeline, eligible workload classes, predicate-pushdown interaction, and behaviour under skewed clustering keys not disclosed.
In-place Liquid Conversion is Private Preview. SQL surface shown (ALTER TABLE .. REPLACE PARTITIONED BY WITH CLUSTER BY); GA timeline, conversion-time write-amplification envelope, reversibility, and behaviour on tables with foreign-key-style partition designs not disclosed.
Arctic Wolf numbers tie to Predictive Optimization. "After migrating from partitioning to Liquid Clustering on Unity Catalog managed tables with Predictive Optimization": the headline numbers reflect the combination of Liquid Clustering + auto-stats + auto-OPTIMIZE + auto-VACUUM, not Liquid Clustering alone. Attribution between the four substrate properties is not disclosed.
The internal 1.1 PB→0.8 PB shrinkage is interesting and underexplained. "Better-clustered files compress more efficiently, and the small-file tax that comes with over-partitioning disappears" — qualitative explanation only; the relative contributions of compression vs deduplicated metadata vs lower file-overhead are not quantified.
Bolt's −63% read-time max is a single query. Average is −21% across 9 queries; the spread (and the queries with smaller improvements or regressions) is not enumerated.
No comparison data against Iceberg-native clustering / file pruning. The post asserts that Liquid is a write-side optimization that produces standard Parquet+stats consumable by any reader, but does not benchmark Liquid-on-Delta against equivalent Iceberg-native layouts on the same workload.
Myth #1 is overstated. "The engine never lists directories to plan a query" is accurate for log-based query planning on modern Delta/Iceberg, but some legacy paths and certain partition-discovery operations still touch directory listings. The post elides these for clarity.
Z-Order critique is asymmetric. Z-Order is correctly characterised as having clustering-quality and rewrite issues at scale, but the post elides the cases where Z-Order on partitioned tables remains a workable choice (small tables, infrequent updates, legacy pipelines that can't migrate) — the implicit prescription "never Z-Order" is broader than the disclosed evidence supports.
No QPS / concurrency / lock-contention numbers for any of the three case studies; "row-level concurrency" is named verbatim but the actual concurrent-writer scaling envelope on the named customers is not disclosed.