
CONCEPT Cited by 2 sources

File-level data skipping

Definition

File-level data skipping is the query-planning technique whereby a query engine, given a predicate, eliminates whole data files from the scan set by comparing the predicate against per-file statistics (typically min, max, and null count) without opening the files. The mechanism lives in the table format's metadata layer (the Delta transaction log or the Iceberg manifest files), not in the filesystem directory structure. The engine reads metadata, evaluates the predicate against each file's statistic ranges, and opens only the files whose statistics could satisfy the predicate.

The 2026-06-01 Databricks "Debunking 8 data layout myths" post is the wiki's canonical disclosure of the architectural fact that on modern open table formats, this is the only pruning that exists:

"Directory-pruning does not exist on modern open table formats like Delta and Iceberg. Delta, for example, uses a transaction log to track every data file along with per-column statistics, and pruning happens against those statistics, not the directory structure. The engine never lists directories to plan a query. It reads the transaction log, evaluates filters against statistics, and skips files that don't match. Liquid Clustering uses the same mechanism. Whether your data lives in date=x/hour=y/ or a flat directory of clustered files, the engine prunes at file granularity. There is no directory-level shortcut to lose."

This is load-bearing: it means the "directory pruning" benefit that defenders of partitioning invoke doesn't exist on Delta or Iceberg. The pruning was always file-level, always statistics-driven, always table-format-mediated.

Mechanism

Where the statistics live

For Delta: each add action in the transaction log carries a stats JSON blob with numRecords for the file, plus minValues, maxValues, and nullCount for each column on which statistics are collected.
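As a rough illustration of that shape, the sketch below parses one hypothetical add action and reads its statistics. The file path, column names, and values are invented for the example. In the JSON log the stats field is itself a JSON-encoded string, which is why it is decoded a second time.

```python
import json

# One hypothetical line from a Delta commit file (all values are made up).
log_line = json.dumps({
    "add": {
        "path": "part-00000.parquet",
        "size": 1024,
        "dataChange": True,
        "stats": json.dumps({
            "numRecords": 3,
            "minValues": {"event_date": "2026-01-01", "amount": 5},
            "maxValues": {"event_date": "2026-01-03", "amount": 40},
            "nullCount": {"event_date": 0, "amount": 1},
        }),
    }
})

action = json.loads(log_line)["add"]
stats = json.loads(action["stats"])  # stats is a JSON string nested in the action

print(stats["numRecords"])           # 3
print(stats["minValues"]["amount"])  # 5
print(stats["maxValues"]["amount"])  # 40
```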

For Iceberg: each data-file entry in a manifest file carries lower_bounds / upper_bounds / null_value_counts / nan_value_counts maps, keyed by column (field) ID.
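A minimal sketch of the column-ID keying follows. The field IDs, column names, and values are hypothetical, and the dictionary is only a stand-in for a manifest entry. The sketch uses the plural field names from the Iceberg spec and assumes the spec's single-value serialization for long columns (8-byte little-endian); other types decode differently.

```python
import struct

# Hypothetical schema mapping: Iceberg field ID -> column name.
field_names = {1: "order_id", 2: "amount"}

# Hypothetical per-file stats as they might appear in one manifest entry.
# Bounds are keyed by field ID and stored as serialized bytes.
data_file = {
    "lower_bounds": {1: struct.pack("<q", 1000), 2: struct.pack("<q", 5)},
    "upper_bounds": {1: struct.pack("<q", 1999), 2: struct.pack("<q", 40)},
    "null_value_counts": {1: 0, 2: 3},
}

def decode_long(raw: bytes) -> int:
    """Decode an 8-byte little-endian signed integer."""
    return struct.unpack("<q", raw)[0]

# Resolve field IDs to names and decode each column's (lower, upper) pair.
ranges = {
    field_names[fid]: (decode_long(lo), decode_long(data_file["upper_bounds"][fid]))
    for fid, lo in data_file["lower_bounds"].items()
}
print(ranges)  # {'order_id': (1000, 1999), 'amount': (5, 40)}
```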

In both cases, the statistics are per-file (not per-row, not per-row-group within a file — that's a Parquet-internal detail below the file granularity).

Query-planning loop

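def plan_scan(files, col, lower, upper):
    # Keep only files whose [min, max] range on col can overlap [lower, upper].
    scan_set = []
    for file in files:
        stats = file["stats"]
        if stats["min"][col] > upper:
            continue  # every value in the file is above the predicate range: skip
        if stats["max"][col] < lower:
            continue  # every value in the file is below the predicate range: skip
        scan_set.append(file)  # ranges overlap: the file may contain matches
    return scan_set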

The mechanism is conceptually the same as zone maps or block-range (min/max) indexes in classical databases, but it uses whole files as the unit of granularity instead of blocks or page ranges. Within a surviving file, Parquet's row-group statistics provide a second skip level, but that is distinct from the table-format-level file pruning this concept page addresses.
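The two skip levels can be sketched together. The example assumes plain dictionaries as stand-ins for table-format file stats and Parquet row-group stats. All names and values are hypothetical, and this is not how any particular reader is implemented.

```python
def overlaps(lo, hi, q_lo, q_hi):
    """True if the range [lo, hi] can contain a value in [q_lo, q_hi]."""
    return lo <= q_hi and hi >= q_lo

# Hypothetical table: file-level (min, max) plus per-row-group (min, max).
files = [
    {"path": "a.parquet", "range": (0, 99),    "row_groups": [(0, 49), (50, 99)]},
    {"path": "b.parquet", "range": (100, 199), "row_groups": [(100, 149), (150, 199)]},
    {"path": "c.parquet", "range": (200, 299), "row_groups": [(200, 249), (250, 299)]},
]

def plan(files, q_lo, q_hi):
    """Return (path, row_group_index) pairs that must actually be read."""
    reads = []
    for f in files:
        if not overlaps(*f["range"], q_lo, q_hi):
            continue  # level 1: file skipped from metadata, never opened
        for i, rg in enumerate(f["row_groups"]):
            if overlaps(*rg, q_lo, q_hi):  # level 2: inside an opened file
                reads.append((f["path"], i))
    return reads

print(plan(files, 160, 180))  # [('b.parquet', 1)]
```

Only b.parquet survives the first level, and only its second row group survives the second.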

Statistics maintenance

Stats are computed and embedded on the write path by the engine producing the file. On Databricks, this happens "during Photon writes" (Source: sources/2026-05-27-databricks-bi-serving-pointers-maximizing-for-performance-and-tco). Predictive Optimization "back-fills stats for existing tables" so the benefit isn't gated on table re-creation.
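What "computed on the write path" amounts to can be sketched as a single pass over the rows destined for one file. This is an illustrative assumption about the general shape, not a description of how Photon or any specific writer does it. The function name and the sample rows are hypothetical.

```python
def compute_file_stats(rows, columns):
    """Compute numRecords plus per-column min, max, and null count for one file."""
    stats = {"numRecords": len(rows), "minValues": {}, "maxValues": {}, "nullCount": {}}
    for col in columns:
        values = [row.get(col) for row in rows]
        non_null = [v for v in values if v is not None]
        stats["nullCount"][col] = len(values) - len(non_null)
        if non_null:  # an all-null column has no min/max to record
            stats["minValues"][col] = min(non_null)
            stats["maxValues"][col] = max(non_null)
    return stats

# Hypothetical rows about to be written into a single data file.
rows = [
    {"id": 7, "city": "Oslo"},
    {"id": 3, "city": None},
    {"id": 9, "city": "Lima"},
]
stats = compute_file_stats(rows, ["id", "city"])
print(stats["minValues"], stats["maxValues"], stats["nullCount"])
```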

Why this changes the partitioning argument

Traditional partitioning's "pruning benefit" was structurally identical to file pruning: skip data we know cannot match. The only differences were the granularity unit (directory vs. file) and the metadata source (filesystem path vs. transaction log). On modern open table formats (OTFs):

| Aspect | Directory pruning (myth) | File-level data skipping (reality) |
| --- | --- | --- |
| Granularity | Whole directory (= whole partition = many files) | Single file |
| Metadata source | Filesystem listing of date=x/hour=y/ | Transaction log / manifest stats |
| Multi-column | Only the fixed partition key column(s) | All columns with stats can prune simultaneously |
| Layout flexibility | Fixed at table creation | Layout independent of table semantics |

The "pruning" benefit defenders attribute to directory partitioning is actually file-level skipping — and file-level skipping works identically with Liquid-clustered files in flat directories. The directory layout was incidental, not load-bearing.
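The multi-column row of the comparison can be sketched as a conjunctive check: under AND, a single column whose range is disjoint from its predicate is enough to skip the file, and a column without stats simply cannot contribute. The column names, file names, and values below are hypothetical.

```python
def may_match(file_stats, predicates):
    """Return False only if some predicate provably excludes the whole file.

    predicates maps column -> (lower, upper), combined with AND.
    """
    for col, (q_lo, q_hi) in predicates.items():
        col_range = file_stats.get(col)
        if col_range is None:
            continue  # no stats for this column: cannot prune on it
        f_lo, f_hi = col_range
        if f_lo > q_hi or f_hi < q_lo:
            return False  # one disjoint column is enough under AND
    return True

# Hypothetical per-file (min, max) ranges for two columns.
files = {
    "f1": {"day": (1, 10), "region_id": (1, 3)},
    "f2": {"day": (1, 10), "region_id": (7, 9)},
    "f3": {"day": (11, 20), "region_id": (1, 3)},
    "f4": {"day": (1, 10)},  # region_id has no stats in this file
}
query = {"day": (5, 6), "region_id": (2, 2)}
scan_set = [name for name, stats in files.items() if may_match(stats, query)]
print(scan_set)  # ['f1', 'f4']
```

Here f2 is skipped on region_id and f3 on day, while f4 must be kept because nothing is known about its region_id values.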

Composition with clustering layout

File-level data skipping's selectivity depends on the file layout: tightly co-located files have narrow min/max ranges and prune effectively; randomly-laid-out files have wide ranges and prune poorly.
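A small simulation makes the layout dependence concrete. It is a sketch under simplified assumptions: one integer column, 1,000 distinct values, 100 rows per file, and a fixed random seed. The helper names are hypothetical.

```python
import random

def file_ranges(values, rows_per_file):
    """Split values into consecutive files and return each file's (min, max)."""
    chunks = [values[i:i + rows_per_file] for i in range(0, len(values), rows_per_file)]
    return [(min(c), max(c)) for c in chunks]

def files_scanned(ranges, q_lo, q_hi):
    """Count files whose range overlaps the predicate [q_lo, q_hi]."""
    return sum(1 for lo, hi in ranges if lo <= q_hi and hi >= q_lo)

values = list(range(1000))
clustered = file_ranges(values, 100)    # sorted: each file covers 100 adjacent values

shuffled = values[:]
random.Random(0).shuffle(shuffled)      # fixed seed so the run is repeatable
scattered = file_ranges(shuffled, 100)  # random: each file spans most of the domain

clustered_count = files_scanned(clustered, 250, 260)
scattered_count = files_scanned(scattered, 250, 260)
print(clustered_count, scattered_count)
```

With the sorted layout the predicate 250..260 touches a single file. With the shuffled layout nearly every file's min/max range straddles the predicate, so almost nothing is skipped.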

This is the architectural role of Liquid Clustering — it is "how the engine organizes files for efficient data skipping" (Source: sources/2026-06-01-databricks-debunking-8-data-layout-myths-why-liquid-clustering-outperfo). Liquid Clustering does not invent a new pruning mechanism; it makes the existing file-level skipping work better by ensuring that files contain rows clustered on the dimensions queries actually filter on.

The 2026-06-01 source states this explicitly: "Liquid Clustering is a write-side optimization. It's how the engine organizes files for efficient data skipping. The output is standard Parquet files with min/max stats, written into open table formats like Delta/Iceberg. Any compatible reader (e.g. open-source Apache Spark, DuckDB, etc.) can use those stats to skip files." The skipping is a property of the standard format; clustering improves the skipping selectivity but does not gate which readers can benefit.

Composition with metadata-only operations

Per-file min/max statistics double as the substrate for metadata-only operations: DELETEs aligned with file boundaries, and COUNTs / DISTINCTs / GROUP BYs that can be answered from stats alone. The 2026-06-01 source states, verbatim: "The engine uses the same per-file min/max stats it uses for data skipping to determine when a query's answer can be computed from metadata alone."

This is a load-bearing architectural property: the same metadata that powers query-time file pruning also powers data-mutation acceleration, so the ROI on maintaining good statistics compounds across both query types. See concepts/optimizer-statistics-as-skipping-substrate for the generalised principle.
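A hedged sketch of the idea, using hypothetical file entries with a single day column. It assumes the stats are exact and that the column has no nulls in any file; a real engine would have to check both before trusting a metadata-only answer.

```python
# Hypothetical per-file metadata for one column, "day" (assumed exact, no nulls).
files = [
    {"path": "f1", "numRecords": 100, "min": 1,  "max": 10},
    {"path": "f2", "numRecords": 250, "min": 11, "max": 20},
    {"path": "f3", "numRecords": 80,  "min": 18, "max": 30},
]

# COUNT(*), MIN(day), and MAX(day) answered without opening any file.
total_rows = sum(f["numRecords"] for f in files)
min_day = min(f["min"] for f in files)
max_day = max(f["max"] for f in files)

def classify_for_delete(f, q_lo, q_hi):
    """Classify a file for DELETE WHERE day BETWEEN q_lo AND q_hi."""
    if f["min"] > q_hi or f["max"] < q_lo:
        return "untouched"        # no row can match
    if q_lo <= f["min"] and f["max"] <= q_hi:
        return "drop_whole_file"  # every row matches: metadata-only removal
    return "rewrite"              # partial overlap: data must be read

plan = {f["path"]: classify_for_delete(f, 11, 20) for f in files}
print(total_rows, min_day, max_day)  # 430 1 30
print(plan)  # {'f1': 'untouched', 'f2': 'drop_whole_file', 'f3': 'rewrite'}
```

Only f3 needs its data read: the delete range aligns with f2's boundaries, so that file can be removed from the table by a metadata change alone.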

Sibling skipping mechanisms on the wiki

| Sibling | Domain | What gets skipped |
| --- | --- | --- |
| concepts/partition-pruning | MySQL partitioning | Whole partitions (server-side) |
| concepts/clickhouse-data-part | ClickHouse data parts | Whole parts (similar shape, different terminology) |
| Bloom filters | Lookup queries | Whole files where bloom misses |
| Skip indexes (ClickHouse MinMaxIdx) | Per-block min/max | Block reads within a part |
| Index range pruning | B-tree / B+-tree | Tree branches |
| Parquet row-group statistics | Within-file pruning | Row groups within a single file |

The shared principle across all of these: metadata that summarises content lets the engine prune content reads. File-level data skipping is the modern open-table-format incarnation, operating at the file granularity in the transaction log / manifest layer.

Failure modes

  • Stale stats. When stats drift (writes without stats collection, or back-filled stats with wrong column selection), pruning silently degrades and queries revert to full scans. Mitigation: automatic stats collection on the write path (concepts/automatic-table-optimization).
  • Wide stats due to bad layout. Random row distribution within files produces min/max ranges that cover the full domain on every file → pruning eliminates nothing. Mitigation: clustering (systems/liquid-clustering) or sorting on filter-relevant columns.
  • Stats-less columns. Stats only collected on configured columns; predicates on stats-less columns can't prune. Mitigation: configure stats columns to match the workload's predicate vocabulary.
  • Wrong-column statistics. If the substrate maintains stats on columns that queries never filter on, the cost of collecting and storing them is wasted. Mitigation: workload-driven stats column selection (Predictive Optimization).
  • High-cardinality stats inflation. Per-file min/max on high-cardinality columns can bloat the transaction log / manifest if the file count is high. Mitigation: choose stat columns by predicate prevalence + selectivity, not by raw cardinality.

What this is not

  • Not partition pruning. Partition pruning operates on filesystem directory structure; file-level data skipping operates on table-format metadata. They achieve similar query-time outcomes through different mechanisms.
  • Not Parquet row-group skipping. Parquet's internal row-group statistics let the reader skip row groups within a single opened file. File-level data skipping prevents the file from being opened at all. They compose: file-level skipping happens first, row-group skipping happens within surviving files.
  • Not bloom-filter pruning. Bloom filters answer set-membership queries probabilistically; min/max statistics answer range queries deterministically. They serve different predicate shapes.
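The difference in predicate shape between min/max statistics and Bloom filters can be sketched with a toy filter. TinyBloom below is a hypothetical illustration written for this page, not the Bloom filter implementation of Parquet, Delta, or any engine, and the values are made up.

```python
import hashlib

class TinyBloom:
    """A toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit set stored in a single Python integer

    def _positions(self, value):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))

file_values = [10, 500, 990]  # sparse values spanning a wide range
lo, hi = min(file_values), max(file_values)
bloom = TinyBloom()
for v in file_values:
    bloom.add(v)

probe = 123  # inside [lo, hi] but not present in the file
minmax_can_skip = not (lo <= probe <= hi)        # False: the range covers 123
bloom_can_skip = not bloom.might_contain(probe)  # usually True, barring a false positive
print(minmax_can_skip, bloom_can_skip)
```

For an equality lookup on a value that falls inside a wide min/max range, the range alone cannot exclude the file, whereas a membership structure usually can. For a range predicate the roles reverse, since a Bloom filter cannot answer "any value between a and b".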
