
PATTERN

Cluster a table by WHERE-clause columns to reduce data-scanned cost

When to use

You have an OLAP data warehouse — BigQuery, Snowflake, Redshift, Athena, Databricks — where per-query billing is proportional to bytes scanned, and one or more hot queries are scanning significantly more data than they logically need to.

The canonical example

Shopify's 2022 engineering post "Reducing BigQuery Costs: How We Fixed A $1 Million Query" (summarized in the High Scalability Dec-2022 roundup):

  • Before: one query scanned 75 GB per execution; at its execution rate, the projected cost was $949,218.75/month.
  • After: the table was clustered on two feature columns from the query's WHERE clause, and the same query scanned 508.1 MB (~150× reduction).
  • Projected cost: ~$1,370.67/month — 99.86% reduction, same query plan, no application-level changes.
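The per-execution arithmetic behind those figures can be checked directly. A small sketch, assuming BigQuery's classic $5-per-TiB on-demand rate (the rate in effect around the time of the post; it is an assumption here, not from the summary) and binary units for GB/MB:

```python
# Per-query cost arithmetic for the Shopify numbers, assuming the classic
# BigQuery on-demand rate of $5 per TiB scanned (assumed rate; check
# current pricing) and binary units for GB/MB.
PRICE_PER_TIB = 5.00
TIB = 1024**4

before_bytes = 75 * 1024**3       # 75 GB scanned per execution
after_bytes = 508.1 * 1024**2     # 508.1 MB after clustering

before_cost = before_bytes / TIB * PRICE_PER_TIB
after_cost = after_bytes / TIB * PRICE_PER_TIB

print(f"before: ${before_cost:.4f} per query")                # ~$0.37
print(f"after:  ${after_cost:.6f} per query")
print(f"scan reduction: ~{before_bytes / after_bytes:.0f}x")  # ~151x
```

The ~151× byte reduction matches the "~150×" figure in the summary; the monthly totals additionally depend on the execution rate, which the summary does not state.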

The mechanism

BigQuery (and equivalents on other warehouses) physically groups rows with matching cluster-column values into adjacent storage blocks. When a query has a WHERE predicate on a clustered column, the query planner can skip blocks that don't match — bytes not read, bytes not billed.

Table without clustering:
  blocks: [mixed] [mixed] [mixed] [mixed] [mixed]
  query scans all blocks

Table clustered on col_A:
  blocks: [A=1-50] [A=51-100] [A=101-200] ...
  query WHERE A=75 scans 1 block, skips the others
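The skipping in the sketch above can be simulated with per-block min/max metadata (the zone-map idea behind block pruning). A minimal toy model over one integer column — all names and sizes are illustrative, not any warehouse's actual storage format:

```python
import random

BLOCK_SIZE = 100  # rows per storage block (illustrative)

def build_blocks(rows, clustered):
    """Split rows into blocks; keep each block's (min, max) as a zone map."""
    if clustered:
        rows = sorted(rows)  # physically group matching values together
    blocks = [rows[i:i + BLOCK_SIZE] for i in range(0, len(rows), BLOCK_SIZE)]
    return [(min(b), max(b), b) for b in blocks]

def scan_where_equal(blocks, value):
    """Evaluate `WHERE col_A = value`: skip blocks whose range can't match."""
    touched = [b for lo, hi, b in blocks if lo <= value <= hi]
    matches = [r for b in touched for r in b if r == value]
    return len(touched), matches

random.seed(0)
rows = [random.randrange(1000) for _ in range(10_000)]  # 100 blocks

read_u, m_u = scan_where_equal(build_blocks(rows, clustered=False), 75)
read_c, m_c = scan_where_equal(build_blocks(rows, clustered=True), 75)

# Same result rows either way; far fewer blocks read when clustered.
print(f"unclustered: {read_u}/100 blocks read")
print(f"clustered:   {read_c}/100 blocks read")
```

With random placement nearly every block's range straddles the predicate value, so almost nothing prunes; after sorting, the matching rows sit in one or two adjacent blocks and the rest are skipped.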

Preconditions

  • The warehouse must support clustering / physical layout hints — BigQuery CLUSTER BY, Snowflake auto-clustering, Redshift sort keys, Delta Lake ZORDER.
  • The query must filter on the clustered columns — the optimization is useless if the predicate doesn't prune.
  • Cluster column cardinality matters — high-cardinality columns give finer-grained pruning; a low-cardinality column (~10 distinct values) lets an equality predicate skip at most ~90% of the data, i.e. ~10× at best.
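A back-of-envelope way to see the cardinality point, assuming uniformly distributed values and a perfectly clustered table (both idealizations):

```python
def best_case_scan_fraction(distinct_values, num_blocks):
    """Best case for an equality predicate on a clustered column:
    the matching rows occupy ~1/distinct_values of the table, but a scan
    can never read less than one block."""
    return max(1 / distinct_values, 1 / num_blocks)

# Low cardinality: ~10 distinct values -> at best 10% of data read (~10x).
print(best_case_scan_fraction(10, 1000))          # 0.1
# High cardinality: 1M distinct values over 1000 blocks -> a single block.
print(best_case_scan_fraction(1_000_000, 1000))   # 0.001
```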

Why this is the first optimization to try

On any data-scanned-billed warehouse, a single correctly clustered table can change a query from "million-dollar query" to "negligible cost" with:

  • Zero client-side code change.
  • One DDL statement (or a table rebuild).
  • No change in query plan shape — the optimizer still picks the same plan, just with pruning.

The ROI vs. engineering cost is lopsided enough that for every expensive recurring query, the diagnostic is:

  1. EXPLAIN to see what the query is actually scanning.
  2. If the scanned bytes ≫ logical working set → check the physical layout.
  3. If the table isn't clustered on the predicate columns → cluster it.
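The checklist above can be sketched as a function; the 10× threshold and all names here are illustrative assumptions, not warehouse behavior:

```python
def diagnose(scanned_bytes, working_set_bytes, predicate_cols, cluster_cols,
             factor=10):
    """Illustrative encoding of the three-step checklist.
    The 10x `factor` threshold is an arbitrary assumption, not a rule."""
    if scanned_bytes < factor * working_set_bytes:
        return "scan roughly matches the working set; look elsewhere"
    if not set(predicate_cols) & set(cluster_cols):
        return "cluster the table on the predicate columns"
    return "layout looks right; check whether something blocks pruning"

# Hypothetical hot query: scans 75 GiB for a ~512 MiB working set,
# and the table has no clustering at all.
print(diagnose(75 * 1024**3, 512 * 1024**2,
               predicate_cols=["feature_a", "feature_b"], cluster_cols=[]))
```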

Trade-offs

  • Ingestion-time overhead: new data either needs to be pre-sorted or the warehouse performs background re-clustering; both add cost, usually small.
  • One clustering order per table — if different queries filter on different columns, clustering helps only the queries whose predicate columns are in the clustering key. A table can usually have only 1-4 clustering columns.
  • Won't help aggregation-heavy queries that need to visit every row regardless.
