Skip to content

CONCEPT Cited by 1 source

Co-clustered join

Definition

A co-clustered join is a join optimisation that eliminates data shuffle when both join sides are clustered on the join key. The query planner recognises that physical files on each side already contain co-locatable rows on the join column, so partitions can be joined locally without redistributing data across the cluster's network.

The 2026-06-01 Databricks "Debunking 8 data layout myths" post canonicalises the optimisation as a forward-looking primitive on Liquid Clustering (Private Preview):

"Today, joining Liquid tables on their clustering columns can require a full data shuffle, even when the data is already organized by those columns. Co-clustered joins (now in Private Preview) remove that shuffle automatically. On a real-world data warehousing benchmark, a Liquid-to-Liquid join ran ~51% faster (28 minutes → 14 minutes) and shuffled 87% less data (1.2 TiB → 150 GiB) than the same query without the optimization."

The mechanism

Without co-clustered join, a join on a clustered key still incurs the standard shuffle:

Table A (Liquid clustered on customer_id)
Table B (Liquid clustered on customer_id)

Standard execution:
1. Read Table A → repartition by hash(customer_id) → broadcast
2. Read Table B → repartition by hash(customer_id) → broadcast
3. Co-located join across cluster
4. Result: full shuffle of both A and B even though both were
   already organised on customer_id

With co-clustered join, the planner recognises both sides are already organised on the join key:

Table A (Liquid clustered on customer_id)
Table B (Liquid clustered on customer_id)

Co-clustered execution:
1. Read Table A → preserve clustering layout
2. Read Table B → preserve clustering layout
3. Pair up files from A and B whose customer_id ranges overlap
4. Local join per-file-pair
5. Result: ~zero shuffle; only metadata-level coordination

The planner-level requirement: detect that both tables share the same clustering key and the layouts are compatible enough to pair files.

The disclosed performance envelope

Metric Without co-clustered join With co-clustered join Improvement
Wall clock 28 minutes 14 minutes ~51% faster
Shuffle data 1.2 TiB 150 GiB 87% reduction

The shuffle reduction is the load-bearing number: shuffle-bound joins at PB scale dominate query cost, so eliminating the shuffle (or reducing it by 87%) eliminates the dominant cost component.

The 51% wall-clock speedup is less than the shuffle reduction because non-shuffle work (file IO, join compute, result materialisation) becomes the new bottleneck once shuffle is eliminated. Standard tail-bottleneck dynamic.

Why this matters

Joins are the dominant cost on fact-to-fact joins at scale — two large tables joined on a high-cardinality key (customer_id, order_id, etc.). The classical optimisations don't apply:

  • Broadcast join (patterns/broadcast-join-for-small-reference-tables) needs one side to fit in memory; doesn't apply when both sides are large.
  • Sort-merge join is the standard fallback but pays the full shuffle cost.
  • Co-clustered join specifically targets the case where both sides are clustered on the join key — the case where the data layout already encodes the locality the join needs.

Composition with multi-dimensional clustering

The 2026-06-01 source frames co-clustered joins as a multi-dimensional clustering payoff: clustering decisions made to support filter pruning also support join shuffle elimination, provided the join key is among the clustering keys. The architect's column-choice decision compounds across both query patterns.

For a star-schema BI workload — fact table joined to several dimension tables on dimension-key columns — clustering the fact table on the dimension keys (one of the four allowed Liquid clustering keys) enables co-clustered joins to those dimensions when they are also clustered on the same keys.

What this is not

  • Not a join algorithm. Co-clustered join is a planner-level optimisation that elides shuffle ahead of any join algorithm (broadcast, sort-merge, hash). The actual join can still be a hash join or a sort-merge join on the local file-pair.
  • Not specific to Liquid Clustering. Other systems implement similar optimisations (Spark's bucketed join requires both sides bucketed identically; columnar query engines like Snowflake have micro-partition-aware joins). The 2026-06-01 source is the wiki's Liquid Clustering canonicalisation specifically.
  • Not free. The optimisation requires the planner to verify layout compatibility and may fall back to standard shuffle if the two tables have incompatible clustering states (one stale, one fresh; different key ordering; etc.).

Failure modes

  • Skewed clustering keys. If the join key has heavy hot keys (one customer with 10× the rows), file-pair joining concentrates load on the hot files. Standard skew mitigations apply.
  • Stale clustering on one side. If Liquid Clustering hasn't fully laid out one side after recent ingest, the planner falls back to shuffle.
  • Different clustering key shapes. Co-clustered join assumes both sides cluster on identical keys with compatible layouts; if one side is CLUSTER BY (date, customer_id) and the other is CLUSTER BY (customer_id), the planner may or may not recognise the partial compatibility (source doesn't disclose).

Sibling shuffle-elimination optimisations

  • Bucketed join (Spark) — shared-bucket-count join with pre-shuffled-and-stored layout.
  • Co-located partitions (Postgres FDW, sharded SQL engines) — shard-key-aligned joins skip cross-shard shuffle.
  • Z-Order multi-key alignment (limited Z-Order benefit on multi-column joins; usually less effective than dedicated co-clustered join optimisation).

Seen in

  • sources/2026-06-01-databricks-debunking-8-data-layout-myths-why-liquid-clustering-outperfoFirst wiki canonicalisation. Forward-looking disclosure as a Private Preview Liquid Clustering primitive. Names the disclosed performance envelope (~51% faster, 87% less shuffle data on a real-world data warehousing benchmark) and frames the mechanism as "removing shuffle that exists today even when data is already organized by the clustering columns". Reserved for future ingests: GA timeline, eligible workload classes, predicate- pushdown interaction, behaviour under skewed clustering keys, layout-compatibility detection algorithm.
Last updated · 542 distilled / 1,571 read