CONCEPT Cited by 1 source

Z-Ordering

Definition

Z-Ordering is a multi-dimensional clustering technique on Delta Lake that uses a space-filling curve (the Z-order curve) to map multiple-column key tuples onto a linear ordering, then rewrites table files so rows close in that ordering live in the same files. The intended outcome: queries filtering on any of the Z-ordered columns benefit from file-level data skipping because the per-file min/max ranges on those columns become narrower.
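
The mapping from key tuples to a linear ordering can be illustrated with bit interleaving (a Morton code), the textbook construction of a Z-order curve. This is a minimal sketch for two small non-negative integer keys; the function name `interleave_bits` is hypothetical, and nothing here is meant to describe how Delta Lake computes its ordering internally.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Return the Z-order (Morton) code of (x, y) by interleaving their bits."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # bits of x land on even positions
        code |= ((y >> i) & 1) << (2 * i + 1)  # bits of y land on odd positions
    return code


# Order a small 4x4 grid of key tuples along the curve.
points = [(x, y) for x in range(4) for y in range(4)]
z_sorted = sorted(points, key=lambda p: interleave_bits(p[0], p[1]))

print(z_sorted[:4])  # [(0, 0), (1, 0), (0, 1), (1, 1)]
```

Tuples that are neighbours in both dimensions tend to receive nearby codes, which is why writing rows out in code order places them in the same files.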

Z-Ordering predates Liquid Clustering and was the canonical multi-column clustering mechanism on Databricks before Liquid arrived. The 2026-06-01 Databricks "Debunking 8 data layout myths" post is the wiki's canonical disclosure of its structural problems and the case for Liquid Clustering as its replacement.

The user-facing surface

OPTIMIZE my_table ZORDER BY (date, region, customer_id)

The command rewrites the table's files in Z-order on the named columns. Subsequent queries filtering on any of those columns get better data-skipping than they would on randomly-laid-out files.
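
File-level data skipping itself can be sketched as a range-overlap test against per-file min/max statistics. The `FileStats` type, the `files_to_read` helper, and the sample statistics below are all hypothetical, chosen only to show why narrower per-file ranges let a query skip more files.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FileStats:
    path: str
    min_value: int
    max_value: int


def files_to_read(files: List[FileStats], lo: int, hi: int) -> List[str]:
    """Keep only files whose [min, max] range overlaps the predicate range [lo, hi]."""
    return [f.path for f in files if f.max_value >= lo and f.min_value <= hi]


# Hypothetical per-file statistics on one column, before and after clustering.
narrow = [FileStats("a", 0, 9), FileStats("b", 10, 19), FileStats("c", 20, 29)]
wide = [FileStats("a", 0, 25), FileStats("b", 3, 29), FileStats("c", 1, 28)]

print(files_to_read(narrow, 12, 14))  # ['b']
print(files_to_read(wide, 12, 14))    # ['a', 'b', 'c']
```

With narrow ranges the predicate touches one file; with wide, overlapping ranges nothing can be skipped.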

The two structural problems

The 2026-06-01 Databricks post addresses Myth #7 ("Z-Ordering makes up for partitioning's shortcomings") by enumerating two failure modes that are structural, not implementation bugs:

Problem 1: Poor clustering quality

"poor clustering quality. Z-Order doesn't maintain a true ordering across the table. Values for the same column can get spread across many files, so per-file min/max ranges are wider and queries skip fewer files than they would with Liquid."

The Z-order curve compromises between multiple dimensions. A query filtering on a single column gets weaker pruning than a sort on that column alone would deliver, because the layout is optimised for multi-column locality along the curve. In practice this means:

  • Per-file min/max ranges on each Z-ordered column are wider than they would be with sort-by-column-only.
  • Data skipping prunes fewer files per predicate.
  • The benefit on multi-predicate queries (combining two or three Z-ordered columns) is real but smaller than naive intuition suggests.
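
The compromise can be made concrete with a toy simulation. This is a sketch under simplifying assumptions: a 16x16 grid of two integer columns, 16 rows per file, and a bit-interleaving curve standing in for Z-order. The helper names are hypothetical, and real tables with skewed data will behave differently.

```python
def morton(x: int, y: int, bits: int = 4) -> int:
    """Interleave the bits of x and y into a single Z-order code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def split_into_files(rows, rows_per_file):
    """Cut an ordered row list into consecutive fixed-size 'files'."""
    return [rows[i:i + rows_per_file] for i in range(0, len(rows), rows_per_file)]


def files_scanned(files, column, value):
    """Count files whose min/max range on `column` contains `value`."""
    scanned = 0
    for f in files:
        values = [row[column] for row in f]
        if min(values) <= value <= max(values):
            scanned += 1
    return scanned


rows = [(x, y) for x in range(16) for y in range(16)]  # 256 rows
linear = split_into_files(sorted(rows), 16)            # sorted by x, then y
z_ordered = split_into_files(sorted(rows, key=lambda r: morton(*r)), 16)

results = {
    "linear_x": files_scanned(linear, 0, 5),
    "linear_y": files_scanned(linear, 1, 5),
    "zorder_x": files_scanned(z_ordered, 0, 5),
    "zorder_y": files_scanned(z_ordered, 1, 5),
}
print(results)  # {'linear_x': 1, 'linear_y': 16, 'zorder_x': 4, 'zorder_y': 4}
```

In this toy setup the single-column sort reads 1 of 16 files for a filter on its sort column and all 16 for the other column, while the interleaved layout reads 4 files for either: better balanced, but weaker on any one column than a dedicated sort.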

Problem 2: Unnecessary rewrites

"unnecessary rewrites. Z-Order has to be rerun periodically as new data lands, and each rerun rewrites large amounts of old, possibly already-clustered data to restore clustering quality. With continuous ingestion, the cost of keeping data well-clustered with Z-Order grows along with the table."

Each OPTIMIZE ZORDER BY run rewrites at least entire files and, in practice, often entire partitions' worth of files, including data that was already correctly clustered before the new ingest. Rewrite cost grows linearly with table size while the new data's share of the table shrinks, so Z-Order becomes less economical as the table scales.

This is explicit write amplification: the bytes physically rewritten on each maintenance run far exceed the bytes of newly-ingested data the maintenance is responding to.
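
A back-of-the-envelope sketch of that ratio, assuming the worst case in which every maintenance run rewrites the whole table and each ingest batch has the same size. The function names and the 10 GB batch size are hypothetical.

```python
from typing import List


def write_amplification(bytes_rewritten: float, bytes_ingested: float) -> float:
    """Bytes physically rewritten per byte of newly ingested data."""
    return bytes_rewritten / bytes_ingested


def amplification_by_run(batch_gb: float, runs: int) -> List[float]:
    """Amplification of each run, ASSUMING every run rewrites the full table."""
    table_gb = 0.0
    ratios = []
    for _ in range(runs):
        table_gb += batch_gb  # new data lands
        ratios.append(write_amplification(table_gb, batch_gb))
    return ratios


print(amplification_by_run(10.0, 5))  # [1.0, 2.0, 3.0, 4.0, 5.0]
```

Under these assumptions the amplification of the n-th run is n: the ratio keeps climbing for as long as the table keeps growing.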

Why it became a load-bearing critique

The Databricks post frames Z-Order's structural problems specifically as a counter to defenders who claim "partitioning + ZORDER covers all the cases":

Myth: "Partitioning handles the partition column's filters, and Z-Ordering handles the rest. By running OPTIMIZE ZORDER BY, the engine sorts data for optimal skipping on filters that don't align with the partition scheme."

The reality, per the source: "Z-Ordering doesn't save partitioning. In fact, it has its own structural problems." The defender's claim that Z-Order patches partitioning's column-rigidity ("partition by date, ZORDER by everything else") carries Z-Order's structural costs into production while not actually fixing partitioning's column-choice rigidity at scale.

Liquid Clustering's contrast

The 2026-06-01 source's framing of why Liquid Clustering doesn't share Z-Order's two problems:

"Liquid clusters incrementally, including at write time, so the layout stays optimal without unnecessary rewrites."

The structural differences, side by side:

Property | Z-Ordering | Liquid Clustering
Maintenance trigger | Periodic, full rewrite | Incremental, write-time
Rewrite cost | O(table size) per rerun | O(new data) per write
Clustering quality | Compromised across multi-column curve | Per-key-list optimised; auto-applies low-cardinality optimisations
Cardinality limits | Effective clustering quality degrades on high-cardinality keys | Designed to handle high-cardinality keys ("always tries to create files of a good size")
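
The two rewrite-cost models can be compared over a run of equal-sized ingest batches. This is a deliberately simplified sketch: it treats a periodic rerun as rewriting the whole table and an incremental write as touching only the new batch, which idealises both sides. The function name and the figures are hypothetical.

```python
def cumulative_rewrite_gb(batch_gb: float, batches: int, full_rewrite: bool) -> float:
    """Total GB written by layout maintenance across `batches` ingests.

    full_rewrite=True  models O(table size) per rerun.
    full_rewrite=False models O(new data) per write.
    """
    total = 0.0
    table = 0.0
    for _ in range(batches):
        table += batch_gb
        total += table if full_rewrite else batch_gb
    return total


periodic_total = cumulative_rewrite_gb(10.0, 100, full_rewrite=True)
incremental_total = cumulative_rewrite_gb(10.0, 100, full_rewrite=False)

print(periodic_total, incremental_total)  # 50500.0 1000.0
```

The periodic model grows quadratically in the number of batches while the incremental one grows linearly, which is the asymmetry the rewrite-cost row describes.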

A separate Databricks source positions the upgrade verbatim: "Liquid clustering replaces static partitioning and manual Z-ORDER — and unlike those approaches, you can redefine clustering keys without rewriting existing data" (Source: sources/2026-05-27-databricks-bi-serving-pointers-maximizing-for-performance-and-tco).

When Z-Ordering remains workable

The 2026-06-01 source does not enumerate this directly, but architectural reasoning suggests:

  • Small static tables. A table that doesn't grow doesn't pay the rewrite-cost-with-table-size tax.
  • Tables with infrequent updates. Periodic Z-Order on a monthly-rebuilt table costs once per rebuild, not continuously.
  • Legacy pipelines that can't migrate. Pre-existing pipelines with deep dependencies on OPTIMIZE ZORDER BY may not be cheaply convertible.

For greenfield Lakehouse work and high-write-velocity tables, the 2026-06-01 source's implicit prescription is: don't pick Z-Order; pick Liquid Clustering.

Sibling clustering / pruning mechanisms

Mechanism | Where it lives | Trade-off
Hive partitioning | Filesystem directories | Rigid column choice; over-partitioning trap
Z-Ordering | Per-file rewrite via OPTIMIZE | Periodic rewrite cost grows with table
Liquid Clustering | Incremental write-time layout | Engine-managed; flexible keys
Sort-by-single-column | Per-file row sort | Strong skipping on one column; weak on others
Bloom filters | Per-file probabilistic index | Set-membership only; not range

The progression — partitioning → Z-Order → Liquid Clustering — moves toward engine-owned, incremental, multi-dimensional layout. Each step hands more control to the engine and removes a structural cost from the user.

Seen in

  • sources/2026-06-01-databricks-debunking-8-data-layout-myths-why-liquid-clustering-outperfo: First wiki canonicalisation as a deprecated-in-favour-of-Liquid multi-dimensional clustering technique. Names the two structural problems verbatim (poor clustering quality, unnecessary rewrites) and positions Liquid Clustering's incremental-write-time layout as the structural fix. Frames Z-Ordering's combination with partitioning ("partition by date, ZORDER by everything else") as carrying Z-Order's costs forward without fixing partitioning's column-rigidity at scale.