CONCEPT Cited by 2 sources

Multi-dimensional clustering

Definition

Multi-dimensional clustering is a table-layout technique in which data files are organised to support data skipping on several columns at once. Queries that filter on any subset of the clustering keys benefit from file-level skipping, and the table does not have to commit to a single fixed partition column at creation time. It is the structural alternative to single-column partitioning for workloads where filter predicates evolve or compose across multiple columns.

The 2026-06-01 Databricks "Debunking 8 data layout myths" post canonicalises the architectural payoff through the internal 1.1 PB table case study:

"Adding source and id as partitions wasn't viable, because there were too many distinct values. This would have created billions of tiny files. Liquid Clustering removed the trade-off, allowing clustering on time and the additional identifier columns simultaneously, while maintaining good file sizes."

Multi-dimensional clustering removes the either-or: instead of "partition by date OR partition by source OR partition by id", the table is clustered on (date, hour, source, id) and queries on any subset benefit.
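The "billions of tiny files" concern in the quote can be illustrated with rough arithmetic. The sketch below counts the upper bound on partition directories as the product of distinct values per partition column. Only the 1.1 PB table size comes from the post; every cardinality is a hypothetical placeholder, and the function names are illustrative.

```python
from math import prod

def partition_upper_bound(distinct_counts):
    """Upper bound on partition directories: one per combination of values."""
    return prod(distinct_counts.values())

def avg_bytes_per_partition(table_bytes, distinct_counts):
    """Average data per partition if every combination were populated."""
    return table_bytes / partition_upper_bound(distinct_counts)

# Hypothetical cardinalities (assumptions, not figures from the post).
by_time = {"date": 365, "hour": 24}
by_all = {"date": 365, "hour": 24, "source": 50, "id": 1_000_000}

TABLE_BYTES = 1.1e15  # 1.1 PB, decimal bytes

print(partition_upper_bound(by_time))   # 8760
print(partition_upper_bound(by_all))    # 438000000000
print(avg_bytes_per_partition(TABLE_BYTES, by_all))
```

Under these assumed cardinalities, partitioning on all four columns leaves only a few kilobytes per partition on average, which is the tiny-file regime the quote describes. Real tables populate only a fraction of the combinations, so the product is an upper bound rather than a prediction.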

The four-column envelope

The 2026-05-27 Databricks BI Serving Pointers post (Source: sources/2026-05-27-databricks-bi-serving-pointers-maximizing-for-performance-and-tco) discloses the practical envelope:

"For BI workloads, cluster on your most common filter and join columns — date keys, region, product category. You can select up to four columns, and if two columns are highly correlated, include only one."

Two architectural rules:

  • Up to four clustering keys. The engine optimises for multi-column skipping within this bound. Beyond four, the layout algorithm's effectiveness degrades.
  • Correlation deduplication. Two highly-correlated columns (e.g., country and currency) double-count the same axis; pick one.
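The two rules above can be sketched as a small key-selection helper. This is an illustrative heuristic, not how the engine or any Databricks tool chooses keys: the `redundancy` score, the `0.9` threshold, and the sample rows are all assumptions. The score compares distinct value counts, so it only makes sense for low-to-moderate cardinality columns; two unrelated unique-ID columns would also score 1.0.

```python
MAX_KEYS = 4  # the bound quoted above

def redundancy(rows, a, b):
    """Score in (0, 1]; 1.0 means one column fully determines the other."""
    distinct_a = {r[a] for r in rows}
    distinct_b = {r[b] for r in rows}
    distinct_pairs = {(r[a], r[b]) for r in rows}
    return max(len(distinct_a), len(distinct_b)) / len(distinct_pairs)

def pick_clustering_keys(rows, candidates, threshold=0.9):
    """Walk candidates in priority order, skipping any that duplicate a chosen key."""
    chosen = []
    for col in candidates:
        if len(chosen) == MAX_KEYS:
            break
        if all(redundancy(rows, col, kept) < threshold for kept in chosen):
            chosen.append(col)
    return chosen

# Hypothetical sample of a BI fact table.
rows = [
    {"order_date": "2026-01-01", "country": "DE", "currency": "EUR", "category": "toys"},
    {"order_date": "2026-01-01", "country": "US", "currency": "USD", "category": "books"},
    {"order_date": "2026-01-02", "country": "DE", "currency": "EUR", "category": "books"},
    {"order_date": "2026-01-02", "country": "JP", "currency": "JPY", "category": "toys"},
    {"order_date": "2026-01-03", "country": "US", "currency": "USD", "category": "toys"},
    {"order_date": "2026-01-03", "country": "JP", "currency": "JPY", "category": "books"},
]

keys = pick_clustering_keys(rows, ["order_date", "country", "currency", "category"])
print(keys)  # ['order_date', 'country', 'category']
```

In this sample `currency` is dropped because it is fully determined by `country`, which mirrors the country/currency example in the rule.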

How files become multi-dimensionally clustered

The 2026-06-01 source disambiguates between two clustering shapes:

Multi-column with low cardinality on the lead key

"Liquid Clustering automatically detects when to apply low-cardinality optimizations. For example, if you cluster by (date, user_id), and date has low cardinality, the system aims for each file to contain rows from only a single date. Higher-cardinality columns, like user_id, are then automatically used for finer-grained sorting within each date's files, without having to rely on other sorting techniques like Z-Ordering."

Each file holds rows from a single date, and within each file rows are sorted by user_id. Both WHERE date = X and WHERE user_id BETWEEN Y AND Z therefore prune effectively: the first skips every file for other dates, and the second relies on the narrow user_id range each file covers. See concepts/low-cardinality-clustering-optimization.
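A toy model of this layout, assuming a fixed number of rows per file in place of a real target file size. The function names and the in-memory "file" dictionaries are hypothetical; the sketch only shows why single-date files with a secondary sort let both predicates skip files.

```python
from collections import defaultdict

def layout_files(rows, rows_per_file):
    """Group rows by date, sort each group by user_id, then cut into files."""
    by_date = defaultdict(list)
    for row in rows:
        by_date[row["date"]].append(row)
    files = []
    for date in sorted(by_date):
        group = sorted(by_date[date], key=lambda r: r["user_id"])
        for start in range(0, len(group), rows_per_file):
            chunk = group[start:start + rows_per_file]
            files.append({
                "date": date,
                "min_user": chunk[0]["user_id"],
                "max_user": chunk[-1]["user_id"],
                "rows": chunk,
            })
    return files

def files_to_read(files, date=None, user_range=None):
    """Keep only files whose statistics could satisfy the predicates."""
    keep = []
    for f in files:
        if date is not None and f["date"] != date:
            continue
        if user_range is not None:
            lo, hi = user_range
            if f["max_user"] < lo or f["min_user"] > hi:
                continue
        keep.append(f)
    return keep

# Hypothetical data: two dates, 100 users each, 25 rows per file.
rows = [{"date": d, "user_id": u}
        for d in ("2026-01-01", "2026-01-02") for u in range(100)]
files = layout_files(rows, rows_per_file=25)

print(len(files))                                               # 8
print(len(files_to_read(files, date="2026-01-01")))             # 4
print(len(files_to_read(files, user_range=(10, 20))))           # 2
print(len(files_to_read(files, "2026-01-01", (10, 20))))        # 1
```

Filtering on either column alone already skips most files, and combining both narrows the read to a single file in this toy data set.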

Multi-column with high cardinality on all keys

The internal 1.1 PB case study covers the high-cardinality case. Files contain rows from a multi-column-clustered region of the 4-dimensional (date, hour, source, id) space, sized to the target file-size envelope. Queries filtering on any subset of those columns prune effectively because per-file min/max ranges on each clustered column are tight.
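The pruning step described here can be sketched as a range-overlap check against per-file min/max statistics. The clustering step itself is not shown, and the file names and statistics below are invented for illustration; they are not taken from the case study.

```python
def may_match(stats, predicates):
    """True if the file's (min, max) range overlaps every predicate range."""
    for column, (lo, hi) in predicates.items():
        file_lo, file_hi = stats[column]
        if file_hi < lo or file_lo > hi:
            return False
    return True

def prune(files, predicates):
    """Return the names of files that cannot be skipped."""
    return [name for name, stats in files.items() if may_match(stats, predicates)]

# Hypothetical per-file (min, max) statistics on the four clustering columns.
files = {
    "f1": {"date": ("2026-01-01", "2026-01-01"), "hour": (0, 11),
           "source": ("a", "f"), "id": (1, 500)},
    "f2": {"date": ("2026-01-01", "2026-01-01"), "hour": (12, 23),
           "source": ("g", "z"), "id": (501, 1000)},
    "f3": {"date": ("2026-01-02", "2026-01-02"), "hour": (0, 11),
           "source": ("a", "f"), "id": (1001, 1500)},
    "f4": {"date": ("2026-01-02", "2026-01-02"), "hour": (12, 23),
           "source": ("g", "z"), "id": (1501, 2000)},
}

print(prune(files, {"id": (600, 700)}))                       # ['f2']
print(prune(files, {"source": ("b", "b")}))                   # ['f1', 'f3']
print(prune(files, {"date": ("2026-01-02", "2026-01-02"),
                    "hour": (13, 14)}))                        # ['f4']
```

The same check works for any subset of the columns because each predicate is tested independently. How many files survive depends entirely on how tight the per-file ranges are, which is what the clustering is meant to deliver.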

The disclosed outcome: 5.9× speedup, 86% bytes-read reduction, 27% storage reduction on a 1.1 PB table previously partitioned by (date, hour).

Composition with co-clustered joins

Multi-dimensional clustering on the join key enables the forward-looking co-clustered join optimisation (Private Preview as of 2026-06-01). When both join sides are clustered on the same key, the join can be executed without shuffling data — see concepts/co-clustered-join for the architectural mechanism.
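One way to picture a shuffle-free join, as a rough sketch only: if both sides keep tight per-file ranges on the join key, files can be paired by overlapping ranges and joined locally, so no row needs to be repartitioned by key. This is an assumption about the general idea, not the mechanism of the Private Preview feature; all names and data are hypothetical.

```python
def overlapping_pairs(left_files, right_files):
    """Pair files whose [min, max] join-key ranges intersect."""
    pairs = []
    for lf in left_files:
        for rf in right_files:
            if lf["min"] <= rf["max"] and rf["min"] <= lf["max"]:
                pairs.append((lf, rf))
    return pairs

def co_clustered_join(left_files, right_files):
    """Join each overlapping file pair locally with a small hash table."""
    out = []
    for lf, rf in overlapping_pairs(left_files, right_files):
        right_by_key = {}
        for key, value in rf["rows"]:
            right_by_key.setdefault(key, []).append(value)
        for key, value in lf["rows"]:
            for other in right_by_key.get(key, []):
                out.append((key, value, other))
    return out

def make_file(rows):
    """Build a toy file with min/max statistics on the join key."""
    keys = [k for k, _ in rows]
    return {"min": min(keys), "max": max(keys), "rows": rows}

# Hypothetical tables, both clustered on the join key.
orders = [make_file([(1, "o1"), (2, "o2")]), make_file([(7, "o7"), (9, "o9")])]
customers = [make_file([(1, "alice"), (3, "carol")]),
             make_file([(7, "gina"), (8, "hank")])]

print(len(overlapping_pairs(orders, customers)))  # 2
print(co_clustered_join(orders, customers))       # [(1, 'o1', 'alice'), (7, 'o7', 'gina')]
```

Only two of the four possible file pairs are examined, because the other two have disjoint key ranges.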

The 2026-06-01 source's disclosed envelope, for a Liquid-to-Liquid join on a real-world data warehousing benchmark: ~51% faster join runtime and 87% less shuffle data (1.2 TiB → 150 GiB).

Why multi-dimensional clustering needs incremental maintenance

A multi-dimensional layout is harder to maintain than a single-column sort: every new write potentially affects the locality on every clustering dimension. Periodic full rewrites (Z-Order's approach) become economically infeasible at scale because each maintenance run rewrites bytes proportional to the table size, not to the new-data volume.

Liquid Clustering's answer: incremental clustering at write time. New writes get placed into files that maintain locality on the clustering dimensions; periodic maintenance compacts and rebalances rather than fully rewriting.
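The cost argument can be made concrete with a deliberately simple model. All numbers and the `overhead` factor are hypothetical. The model assumes a full rewrite touches the whole table after every batch, while incremental maintenance touches only the new batch plus some rebalancing overhead.

```python
def full_rewrite_cost(initial_size, batch_size, batches):
    """Each maintenance run rewrites the whole table as it stands after the batch."""
    total, size = 0, initial_size
    for _ in range(batches):
        size += batch_size
        total += size
    return total

def incremental_cost(batch_size, batches, overhead=1.0):
    """Each run touches only the new batch, scaled by an assumed overhead factor."""
    return int(batch_size * overhead * batches)

# Hypothetical workload: 1000 GB table, 30 daily batches of 10 GB.
INITIAL_GB, BATCH_GB, BATCHES = 1000, 10, 30

full = full_rewrite_cost(INITIAL_GB, BATCH_GB, BATCHES)
incremental = incremental_cost(BATCH_GB, BATCHES, overhead=1.5)

print(full)         # 34650
print(incremental)  # 450
```

In this model the full-rewrite total grows with table size times the number of runs, while the incremental total grows only with the volume of new data. The gap widens as the table gets larger relative to each batch.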

The myth defenders cite

A common defender's argument: "partition by the most-filtered column; Z-Order on the others" (Myth #7 in the 2026-06-01 post). The post's debunking:

"Z-Ordering doesn't save partitioning. In fact, it has its own structural problems", namely poor clustering quality (per-file min/max ranges are wider than with Liquid) and unnecessary rewrites (full re-Z-Order on every batch of new data, with cost growing linearly with table size).

Multi-dimensional clustering via Liquid is structurally different: keys can be changed without rewrite, layout maintenance is incremental, and high-cardinality keys are first-class.

Sibling layout choices

| Choice | Multi-dim filter support | Maintenance cost | Cardinality limits |
| --- | --- | --- | --- |
| Single-column partition | One column free; rest scan | None ongoing; rebuild on column change | Low cardinality required |
| Z-Order on multiple columns | All Z-ordered columns; mixed quality | Periodic full rewrite | Effective skipping on low-cardinality keys |
| Liquid Clustering on multiple columns | All clustered columns; first-class | Incremental write-time | High-cardinality first-class |
| CLUSTER BY AUTO | All columns observed | Predictive Optimization scheduled | High-cardinality first-class |
