Skip to content

CONCEPT Cited by 3 sources

Over-partitioning

Definition

Over-partitioning is the structural failure mode where a Hive-style partitioned table is laid out at finer granularity than the workload requires, producing many small files, driver-side memory pressure from millions of partition descriptors, and planner overhead from per-partition metadata operations. The pathology is not a usage bug — it is a property of the partition-column-at-table- creation decision: the architect must commit to a fixed column choice before the workload is fully characterised, and a cardinality mistake in either direction produces failures.

The 2026-06-01 Databricks "Debunking 8 data layout myths" post canonicalises the failure rate verbatim:

"in our analysis, Hive-style partitioning leads to over-partitioning and small-file problems in more than 75% of cases."

That is, on the Databricks customer base where partition design is visible, more than three out of four partitioned tables exhibit the pathology — strong evidence that over-partitioning is the rule, not the exception.

How it happens

Three distinct mistake classes all converge on the over-partitioning shape:

Mistake Mechanism Failure mode
High-cardinality partition column Each distinct value becomes a directory; many distinct values × few rows per value → tiny files Billions of tiny files; metadata bloat; planner OOM
Wrong partition column for actual filters Queries filter on column Y but table is partitioned on column X Every query scans every partition; "queries may get slower, not faster"
Over-fine partition granularity Partition by hour instead of date for queries that span days Every query opens 24× more directories than needed

The Databricks post:

"Pick a column with too high cardinality and you get billions of tiny files. Pick the wrong column and queries may get slower, not faster. Either way, you're stuck rewriting the table."

The "stuck rewriting" clause is the load-bearing one: Hive-style partitioning is not changeable without a full table rewrite, so the cost of the mistake compounds over the table's lifetime.

Symptoms

  • File count drift over time. A partition design that was fine at 100 GB stops being fine at 10 TB; nothing changed in the workload but the small-file count crossed a planner-listing threshold.
  • OPTIMIZE / VACUUM cost dominating ETL cost. The maintenance overhead to keep small files compacted exceeds the cost of the ETL writing them.
  • Query planning time exceeding query execution time. When the planner has to walk millions of partition entries to compute the scan set, planning becomes the bottleneck.
  • Driver-side OOMs on innocuous queries. Many partition descriptors held in memory by the query planner produce driver-memory pressure that surfaces as plan failures rather than scan failures.
  • Storage bloat without data growth. Per-file Parquet footers and per-partition metadata files accumulate; total storage exceeds the logical data size by 1.5×–3× for severely over-partitioned tables.

The Octopus Energy disclosure

The 2026-05-23 Octopus Energy MHHS rebuild (Source: sources/2026-05-23-databricks-scaling-for-mhhs-octopus-energy-50x-cost-reduction) names all three over-partitioning consequences explicitly when explaining why the team chose Liquid Clustering from the start:

"Liquid clustering avoids the small-file problem, higher memory consumption, and I/O overhead that come from over-partitioning."

That source is the operational evidence of the over-partitioning failure modes; the 2026-06-01 source is the industry-rate disclosure (75%+) and the structural diagnosis (it's the column-choice-at-creation contract, not a usage error).

Why partitioning amplifies the small-file problem

The interaction with small-file problems on object storage is structural: each partition is a directory, and writes within a partition produce files in that directory. The smaller the partitions get (high cardinality), the fewer rows each file contains, the smaller each file is. Object stores like S3 / GCS / ADLS add per-file listing cost on top of per-file Parquet footer cost on top of per-partition metadata cost — small files accumulate three layers of overhead, each O(N) in file count.

Liquid Clustering structurally avoids this because clustering keys do not define directory partitions — the system "always tries to create files of a good size" even on high-cardinality clustering keys. The 2026-06-01 source: "Liquid Clustering is designed to be better than partitioning when clustering on a high-cardinality column, as it always tries to create files of a good size."

The myth defenders cite

Defenders of partitioning often cite directory-pruning as a performance advantage that Liquid Clustering forfeits. The 2026-06-01 Databricks post debunks this directly (Myth #1):

"Directory-pruning does not exist on modern open table formats like Delta and Iceberg. Delta, for example, uses a transaction log to track every data file along with per-column statistics, and pruning happens against those statistics, not the directory structure. The engine never lists directories to plan a query."

In other words: the "performance benefit" of partitioning that defenders invoke to justify accepting the over-partitioning cost does not exist on modern OTFs. Pruning happens via per-file min/max statistics in the transaction log (concepts/file-level-data-skipping) regardless of whether those files live in date=x/hour=y/ directories or a flat directory of clustered files. Once that's accepted, partitioning's cost side remains (over-partitioning, structural rigidity, full-rewrite cost of column changes) without a corresponding benefit.

When over-partitioning is acceptable

  • Single, low-cardinality, stable filter column. A daily-ingest date column whose value range is bounded and whose retention policy aligns with the partition boundary is structurally safe.
  • Pre-existing tooling that requires directory layout. Some legacy readers or external systems consume tables by directory scan and cannot use file-level statistics.
  • Sub-100GB tables. At small scale, the small-file problem doesn't bite hard enough to matter.

For everything else, the 2026-06-01 source makes the recommendation explicit: "In 2026, the layout should be an implementation detail of the table, with every engine that reads or writes benefitting from it."

Mitigations on existing partitioned tables

  • In-place Liquid Conversion. The 2026-06-01 source discloses ALTER TABLE .. REPLACE PARTITIONED BY WITH CLUSTER BY (Private Preview); Bolt's case validates "zero downtime" — see patterns/in-place-partitioned-to-clustered-conversion.
  • Compaction. OPTIMIZE compacts small files into larger ones; doesn't fix the root cause but reduces symptoms. Predictive Optimization automates this on managed tables.
  • Re-partitioning. Full table rewrite to a different partition column. Operationally painful (downtime, dual-write coordination, consumer cutover) but sometimes the only choice on legacy external tables.

Sibling failure modes on the wiki

Sibling Domain Shared shape
concepts/small-file-problem-on-object-storage Streaming sinks Many small files; per-file listing cost; planner overhead
concepts/partition-skew-data-skew Skewed partitioning Some partitions oversized, others tiny; symmetric to over-partitioning's small-file pathology but at the opposite tail
Cardinality-misestimation (sibling concept space; not yet on wiki) Index design Wrong cardinality estimate produces wrong physical layout choice

The shared principle: physical-layout commitments made before workload characterisation are bets, not designs. Over-partitioning is what happens when the bet loses on the cardinality dimension; partition-skew is what happens when it loses on the distribution dimension.

Seen in

DynamicTimeSliceConfigWorker:
  namespace: my_dataset_1
  Observed: TimeSlices have p99 partitions below configured target of 10MB.
  Proposed: time_bucket interval: 60s -> 604800s

This is the wiki's first canonical instance of over-partitioning remedied without a full table rewrite — only future slices get the new strategy; past slices keep their broken shape and age out via retention TTL. Significant because the 2026-06-01 Databricks source frames over-partitioning as "stuck rewriting" on Hive tables; on Cassandra's slice-bounded shape with an auto-tuner, the rewrite is not required.

Last updated · 542 distilled / 1,571 read