CONCEPT Cited by 1 source

Low-cardinality clustering optimization

Definition

The low-cardinality clustering optimization is the layout specialisation Liquid Clustering applies when one of the clustering keys has low cardinality: each data file is laid out to contain rows from only a single value of the low-cardinality key, while higher-cardinality keys provide finer-grained sorting within each file.

The 2026-06-01 Databricks "Debunking 8 data layout myths" post canonicalises the optimisation:

"Liquid Clustering automatically detects when to apply low-cardinality optimizations. For example, if you cluster by (date, user_id), and date has low cardinality, the system aims for each file to contain rows from only a single date. Higher- cardinality columns, like user_id, are then automatically used for finer-grained sorting within each date's files, without having to rely on other sorting techniques like Z-Ordering."

The key word is automatically: the architect declares the clustering keys; Liquid detects which are low-cardinality and applies the per-file=single-value layout without explicit direction.
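The layout shape described above can be sketched as a toy model in plain Python. This is an illustration only, not Databricks' implementation: the function name `layout_files`, the `rows_per_file` parameter, and the sample rows are all hypothetical, and the real engine targets file sizes rather than fixed row counts.

```python
from collections import defaultdict

def layout_files(rows, low_key, high_key, rows_per_file):
    """Toy layout: each file holds one low_key value, sorted by high_key.

    Hypothetical helper for illustration; not a Liquid Clustering API.
    """
    by_value = defaultdict(list)
    for row in rows:
        by_value[row[low_key]].append(row)

    files = []
    for value in sorted(by_value):
        # Sort on the higher-cardinality key within one low-cardinality value.
        ordered = sorted(by_value[value], key=lambda r: r[high_key])
        # Cut the sorted run into files; a file never spans two values.
        for start in range(0, len(ordered), rows_per_file):
            files.append(ordered[start:start + rows_per_file])
    return files

rows = [
    {"date": "2026-01-02", "user_id": 7},
    {"date": "2026-01-01", "user_id": 9},
    {"date": "2026-01-01", "user_id": 3},
    {"date": "2026-01-02", "user_id": 1},
    {"date": "2026-01-01", "user_id": 5},
]
files = layout_files(rows, "date", "user_id", rows_per_file=2)
for f in files:
    print([(r["date"], r["user_id"]) for r in f])
```

With these sample rows the sketch yields three files: two for 2026-01-01 (user_id 3 and 5, then 9) and one for 2026-01-02 (user_id 1 and 7).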

Why this is structurally interesting

The low-cardinality optimisation is the layout shape that partitioning was reaching for. When teams partition by date, they're aiming for "each directory contains a single date; each file in that directory contains rows from that single date" — a specific case of per-file=single-low-cardinality-value.

The difference with Liquid Clustering is the mechanism:

| Property | Hive PARTITION BY date | Liquid CLUSTER BY (date, user_id) with low-cardinality date |
|---|---|---|
| Per-file value layout | One value per file (forced by directory) | One value per file (auto-detected) |
| Within-file sort | None (or manually sorted) | Auto-sorted on user_id |
| Cardinality flexibility | Low cardinality required | Mixed cardinalities composable |
| Layout change | Full rewrite | Change CLUSTER BY keys; incremental |
| File-size guarantee | Per-file size depends on partition value distribution | Engine targets file-size envelope |

The 2026-06-01 source's framing: Liquid Clustering subsumes the partitioning shape on the low-cardinality dimension while adding within-file sorting on the high-cardinality dimension — a layout that combines the best of partitioning's value-isolation with Z-Ordering's intra-file ordering.

The performance disclosure

The 2026-06-01 source's specific benchmark numbers for the low-cardinality optimisation:

"We saw the following improvements while benchmarking this Liquid optimization on a real-world data warehousing benchmark: 35% lower time for clustering and 22% faster query times."

Two-axis improvement:

  • 35% lower clustering time. The optimisation is not just a query-time win: the file layout itself takes less time to compute when the engine can use the low-cardinality key as a top-level bucket.
  • 22% faster queries. Per-file min/max ranges on the low-cardinality column become trivial (one value per file), and per-file sort on the high-cardinality column tightens its min/max ranges.
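The skipping argument in the second bullet can be illustrated with a small min/max pruning sketch. It assumes a simplified model in which a reader keeps a file only if the predicate value falls inside that file's min/max range on every filtered column; the file contents and the `files_to_scan` helper are hypothetical, and the sketch says nothing about the benchmark figures quoted above.

```python
def min_max(file_rows, col):
    """Min and max of one column (tuple index) across a file's rows."""
    values = [row[col] for row in file_rows]
    return min(values), max(values)

def files_to_scan(files, date, user_id):
    """Indexes of files whose min/max stats cannot rule out the predicate."""
    hits = []
    for i, file_rows in enumerate(files):
        d_lo, d_hi = min_max(file_rows, 0)
        u_lo, u_hi = min_max(file_rows, 1)
        if d_lo <= date <= d_hi and u_lo <= user_id <= u_hi:
            hits.append(i)
    return hits

# Rows are (date, user_id). Same eight rows, two hypothetical layouts.
clustered = [  # one date per file, user_id sorted within each date
    [("2026-01-01", 1), ("2026-01-01", 4)],
    [("2026-01-01", 6), ("2026-01-01", 9)],
    [("2026-01-02", 2), ("2026-01-02", 3)],
    [("2026-01-02", 7), ("2026-01-02", 8)],
]
mixed = [  # arrival order, dates interleaved
    [("2026-01-01", 9), ("2026-01-02", 2)],
    [("2026-01-02", 8), ("2026-01-01", 1)],
    [("2026-01-01", 6), ("2026-01-02", 3)],
    [("2026-01-02", 7), ("2026-01-01", 4)],
]

print(files_to_scan(clustered, "2026-01-02", 7))  # [3]
print(files_to_scan(mixed, "2026-01-02", 7))      # [0, 1, 3]
```

In the clustered layout the date range of each file collapses to a single value and the user_id ranges do not overlap within a date, so only one file survives pruning; in the mixed layout three of four files must be read.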

When it kicks in

The 2026-06-01 source uses "the system aims for" — implying the optimisation is a soft target, not a hard guarantee. Practical considerations:

  • File-size lower bound. If a single low-cardinality value has too few rows for one file, multiple values may share a file.
  • Skewed value distributions. A low-cardinality column with one dominant value (e.g., country = 'US' for a US-centric dataset) may produce many files for the dominant value while other values share files.
  • Cardinality threshold. The source doesn't disclose the threshold at which the optimisation activates. Reasonable presumption: ~hundreds-to-low-thousands of distinct values is "low cardinality"; millions is not.
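The first two considerations can be pictured with a toy packing sketch. Everything here is an assumption made for illustration: the source does not describe the packing logic, and `plan_files`, `min_rows_per_file`, and the row counts are hypothetical. The sketch only shows how a minimum file size could force small values to share a file while a dominant value stays isolated.

```python
def plan_files(row_counts, min_rows_per_file):
    """Group key values into file groups under a hypothetical minimum size.

    Values with enough rows get their own group (which in practice may span
    many files); smaller values are pooled until the pool reaches the minimum.
    """
    plan, shared, shared_rows = [], [], 0
    for value, count in sorted(row_counts.items()):
        if count >= min_rows_per_file:
            plan.append([value])  # large enough to stay isolated
        else:
            shared.append(value)
            shared_rows += count
            if shared_rows >= min_rows_per_file:
                plan.append(shared)
                shared, shared_rows = [], 0
    if shared:
        plan.append(shared)  # leftover small values still need a file
    return plan

# Hypothetical skewed distribution for a country column.
counts = {"US": 9_000_000, "DE": 40_000, "FR": 35_000, "NZ": 30_000, "IS": 2_000}
plan = plan_files(counts, min_rows_per_file=100_000)
print(plan)  # [['DE', 'FR', 'IS', 'NZ'], ['US']]
```

Here the dominant value US keeps per-value isolation, while the four small values end up sharing files, matching the "soft target" reading of the source's wording.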

Composition with multi-dimensional clustering

The low-cardinality optimisation is how Liquid Clustering handles a common BI workload pattern: clustering on (date, region, customer_id), where date is bounded by retention (low cardinality in steady state), region is a small fixed enum (low cardinality), and customer_id is high-cardinality.

The optimisation produces:

  • Per-file = single-(date, region) tuple
  • Within-file rows sorted by customer_id
  • Multi-dimensional skipping: predicates on date / region / customer_id all prune effectively
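A minimal sketch of that three-column layout, assuming each (date, region) tuple is split into two files by customer_id range and that pruning uses per-file min/max statistics only. The file statistics, the `scan` helper, and the sample values are hypothetical.

```python
from itertools import product

dates = ["2026-01-01", "2026-01-02"]
regions = ["APAC", "EMEA"]

# Per-file (min, max) statistics. Each file holds one (date, region) tuple,
# so those ranges collapse to a single value; sorting by customer_id gives
# each file a disjoint customer_id range within its tuple.
files = []
for d, r in product(dates, regions):
    for lo, hi in [(0, 499), (500, 999)]:
        files.append({"date": (d, d), "region": (r, r), "customer_id": (lo, hi)})

def scan(files, **predicates):
    """Files whose min/max ranges admit every equality predicate."""
    return [
        f for f in files
        if all(f[col][0] <= val <= f[col][1] for col, val in predicates.items())
    ]

print(len(files))                                  # 8
print(len(scan(files, date="2026-01-01")))         # 4
print(len(scan(files, region="APAC")))             # 4
print(len(scan(files, customer_id=42)))            # 4
print(len(scan(files, date="2026-01-01", region="APAC", customer_id=42)))  # 1
```

Each single-column predicate halves the files to read in this toy setup, and combining all three narrows the scan to one file out of eight.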

This is the structural mechanism behind Liquid Clustering's claim that "clustering on a high-cardinality column" works well — the high-cardinality column's role is in-file sort, not per-file isolation. See concepts/multi-dimensional-clustering for the broader composition principle.

Failure modes

  • All keys high-cardinality. If no clustering key is low-cardinality, the optimisation doesn't apply; per-file layout uses the standard multi-dimensional clustering algorithm.
  • Low-cardinality values too small for the file-size target. If the row count per low-cardinality value is far smaller than the file-size target, files will mix multiple low-cardinality values to reach that target. The optimisation degrades gracefully rather than failing.
  • High-cardinality column with no useful sort order. If the "higher-cardinality column" has no semantic ordering useful to queries, the within-file sort produces no skipping benefit.

Seen in

  • sources/2026-06-01-databricks-debunking-8-data-layout-myths-why-liquid-clustering-outperfo
    First wiki canonicalisation. Names the "per-file=single-date" layout shape and the auto-detection / auto-sort mechanism. Discloses the 35% / 22% benchmark improvement on a real-world data warehousing benchmark. Reserved for future ingests: the cardinality threshold for activation, the skewed-distribution behaviour, the file-size minimum that overrides per-value isolation.