CONCEPT Cited by 2 sources
File-level data skipping¶
Definition¶
File-level data skipping is the query-planning technique whereby a query engine, given a predicate, eliminates whole data files from the scan set by comparing the predicate against per-file statistics (typically min/max/null-count) without opening the files. The mechanism lives in the table format's metadata layer — the Delta transaction log or the Iceberg manifest list — not in the filesystem directory structure. The engine reads metadata, evaluates the predicate against each file's statistic ranges, and only opens files whose statistics could possibly satisfy the predicate.
The 2026-06-01 Databricks "Debunking 8 data layout myths" post is the wiki's canonical disclosure of the architectural fact that on modern open table formats, this is the only pruning that exists:
"Directory-pruning does not exist on modern open table formats like Delta and Iceberg. Delta, for example, uses a transaction log to track every data file along with per-column statistics, and pruning happens against those statistics, not the directory structure. The engine never lists directories to plan a query. It reads the transaction log, evaluates filters against statistics, and skips files that don't match. Liquid Clustering uses the same mechanism. Whether your data lives in date=x/hour=y/ or a flat directory of clustered files, the engine prunes at file granularity. There is no directory-level shortcut to lose."
This is load-bearing: it means the "directory pruning" benefit that defenders of partitioning invoke doesn't exist on Delta or Iceberg. The pruning was always file-level, always statistics-driven, always table-format-mediated.
Mechanism¶
Where the statistics live¶
For Delta: each Add action in the transaction log carries a stats JSON blob with numRecords, minValues, maxValues, nullCount per indexed column.
For Iceberg: each data-file entry in a manifest file carries lower_bounds / upper_bounds / null_value_counts / nan_value_counts maps, keyed by column ID.
In both cases, the statistics are per-file (not per-row, not per-row-group within a file — that's a Parquet-internal detail below the file granularity).
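To make the Delta case concrete, here is a minimal sketch of pulling per-file statistics out of a single log line. The log line is hand-written for illustration (the path, sizes, and column names are made up), and `parse_file_stats` is a hypothetical helper, not part of any Delta library. The one detail worth noticing is that `stats` is stored as a JSON string nested inside the action, so it needs a second decode.

```python
import json
from typing import Optional

# Hypothetical, hand-written log line shaped like a Delta `add` action.
# The path and values are invented; `stats` is itself a JSON-encoded string.
log_line = json.dumps({
    "add": {
        "path": "part-00000.parquet",
        "size": 1024,
        "dataChange": True,
        "stats": json.dumps({
            "numRecords": 3,
            "minValues": {"event_date": "2024-01-01", "user_id": 10},
            "maxValues": {"event_date": "2024-01-03", "user_id": 42},
            "nullCount": {"event_date": 0, "user_id": 1},
        }),
    }
})


def parse_file_stats(line: str) -> Optional[dict]:
    """Return the per-file stats from one log line, or None if there are none."""
    action = json.loads(line)
    add = action.get("add")
    if add is None or not add.get("stats"):
        return None  # not an add action, or the writer recorded no stats
    return json.loads(add["stats"])  # second decode: stats is a nested JSON string


stats = parse_file_stats(log_line)
print(stats["minValues"]["user_id"], stats["maxValues"]["user_id"])  # 10 42
```

A line that is not an `add` action, or an `add` action without stats, yields `None`, which a planner would have to treat as "cannot prune this file".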
Query-planning loop¶
    for file in table.metadata.files:
        if file.stats.min[col] > query.predicate.upper:
            skip
        elif file.stats.max[col] < query.predicate.lower:
            skip
        else:
            include in scan set
The mechanism is conceptually similar to the min/max block-range summaries found in classical databases (zone maps, or BRIN indexes in PostgreSQL), but it operates on whole files as the unit of granularity instead of disk blocks or pages. Within a surviving file, Parquet's row-group statistics provide a second skip level, but that is distinct from the table-format-level file pruning this concept page addresses.
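The planning loop can be written as a small runnable sketch. Everything here is illustrative: `FileStats` and `plan_scan` are hypothetical names, and the in-memory list stands in for whatever the engine reads from the transaction log or manifests. Two cases the pseudocode leaves implicit are handled explicitly: open-ended predicates, and files with no statistics for the filtered column, which must be kept because they cannot be ruled out.

```python
from dataclasses import dataclass, field


@dataclass
class FileStats:
    """Hypothetical in-memory view of one file's min/max statistics."""
    path: str
    min_values: dict = field(default_factory=dict)
    max_values: dict = field(default_factory=dict)


def plan_scan(files, col, lower=None, upper=None):
    """Return paths whose [min, max] range for `col` may overlap [lower, upper]."""
    scan_set = []
    for f in files:
        lo = f.min_values.get(col)
        hi = f.max_values.get(col)
        if lo is None or hi is None:
            scan_set.append(f.path)  # no stats: the file cannot be excluded
            continue
        if upper is not None and lo > upper:
            continue  # every value in the file is above the predicate range
        if lower is not None and hi < lower:
            continue  # every value in the file is below the predicate range
        scan_set.append(f.path)
    return scan_set


files = [
    FileStats("a.parquet", {"id": 1}, {"id": 10}),
    FileStats("b.parquet", {"id": 11}, {"id": 20}),
    FileStats("c.parquet", {"id": 21}, {"id": 30}),
    FileStats("d.parquet"),  # written without stats on `id`
]
print(plan_scan(files, "id", lower=12, upper=15))  # ['b.parquet', 'd.parquet']
```

Skipping is conservative by construction: a file is dropped only when its statistics prove that no row can match, so a missing or wide statistic costs performance but never correctness.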
Statistics maintenance¶
Stats are computed and embedded on the write path by the engine producing the file. On Databricks, this happens "during Photon writes" (Source: sources/2026-05-27-databricks-bi-serving-pointers-maximizing-for-performance-and-tco). Predictive Optimization "back-fills stats for existing tables" so the benefit isn't gated on table re-creation.
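As a rough illustration of what computing stats on the write path involves, the sketch below derives a stats record from rows held in memory just before a file would be written. The function name and the row representation (a list of dicts) are assumptions made for this example; it is not a description of how Photon or any particular writer is implemented. The output keys mirror the Delta field names listed above.

```python
def compute_file_stats(rows, columns):
    """Compute record count plus per-column min, max, and null count."""
    stats = {"numRecords": len(rows), "minValues": {}, "maxValues": {}, "nullCount": {}}
    for col in columns:
        values = [row.get(col) for row in rows]
        non_null = [v for v in values if v is not None]
        stats["nullCount"][col] = len(values) - len(non_null)
        if non_null:
            # An all-null column gets no min/max entry at all.
            stats["minValues"][col] = min(non_null)
            stats["maxValues"][col] = max(non_null)
    return stats


rows = [
    {"id": 3, "name": "b"},
    {"id": 1, "name": None},
    {"id": 7, "name": "a"},
]
file_stats = compute_file_stats(rows, ["id", "name"])
print(file_stats["minValues"], file_stats["nullCount"])
```

Only the columns passed in `columns` get statistics, which is the same trade-off the failure modes below describe: a column left out at write time cannot contribute to pruning later.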
Why this changes the partitioning argument¶
Traditional partitioning's "pruning benefit" was structurally identical to file pruning: skip data we know cannot match. The only difference was the granularity unit (directory vs. file) and the metadata source (filesystem path vs. transaction log). On modern OTFs:
| Aspect | Directory pruning (myth) | File-level data skipping (reality) |
|---|---|---|
| Granularity | Whole directory (= whole partition = many files) | Single file |
| Metadata source | Filesystem listing of date=x/hour=y/ | Transaction log / manifest stats |
| Multi-column | One fixed column (the partition key) | All columns with stats can prune simultaneously |
| Layout flexibility | Fixed at table creation | Layout independent of semantic |
The "pruning" benefit defenders attribute to directory partitioning is actually file-level skipping — and file-level skipping works identically with Liquid-clustered files in flat directories. The directory layout was incidental, not load-bearing.
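The multi-column row of the table can be illustrated with a short sketch, assuming a conjunction (AND) of range predicates. The file names, column names, and the `may_match` helper are hypothetical. Each column with statistics gets a chance to exclude the file, so a filter on a second column can prune files that a date-only layout would have kept.

```python
def may_match(file_stats, predicates):
    """file_stats: {col: (min, max)}; predicates: {col: (lower, upper)}, ANDed."""
    for col, (lower, upper) in predicates.items():
        if col not in file_stats:
            continue  # no stats on this column: it cannot exclude the file
        lo, hi = file_stats[col]
        if lo > upper or hi < lower:
            return False  # one disjoint column is enough to skip the file
    return True


files = {
    "f1.parquet": {"event_date": ("2024-01-01", "2024-01-01"), "user_id": (1, 500)},
    "f2.parquet": {"event_date": ("2024-01-01", "2024-01-02"), "user_id": (501, 900)},
    "f3.parquet": {"event_date": ("2024-01-03", "2024-01-04"), "user_id": (1, 900)},
}

date_only = {"event_date": ("2024-01-01", "2024-01-01")}
date_and_user = {"event_date": ("2024-01-01", "2024-01-01"), "user_id": (600, 700)}

by_date = [p for p, s in files.items() if may_match(s, date_only)]
by_both = [p for p, s in files.items() if may_match(s, date_and_user)]
print(by_date)  # ['f1.parquet', 'f2.parquet']
print(by_both)  # ['f2.parquet']
```

ISO-formatted date strings compare correctly as plain strings, which is why the example can use tuple bounds without a date type.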
Composition with clustering layout¶
File-level data skipping's selectivity depends on the file layout: tightly co-located files have narrow min/max ranges and prune effectively; randomly-laid-out files have wide ranges and prune poorly.
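A toy experiment shows how much layout matters. The sketch below places the same 1,000 integer values into 10 files in two ways: contiguous sorted chunks (a stand-in for a clustered layout) and round-robin assignment (a deterministic stand-in for a random layout). The numbers and helper names are invented for illustration, and the skipping logic is identical in both cases.

```python
def file_ranges(files):
    """Per-file (min, max), i.e. the statistics a writer would record."""
    return [(min(f), max(f)) for f in files]


def files_to_scan(ranges, lower, upper):
    """Count files whose range overlaps the predicate [lower, upper]."""
    return sum(1 for lo, hi in ranges if not (lo > upper or hi < lower))


values = list(range(1000))
n_files = 10

# Clustered: each file holds a contiguous slice, so ranges are narrow.
clustered = [values[i * 100:(i + 1) * 100] for i in range(n_files)]
# Scattered: value v lands in file v % 10, so every file spans almost the full domain.
scattered = [values[i::n_files] for i in range(n_files)]

clustered_hits = files_to_scan(file_ranges(clustered), 250, 260)
scattered_hits = files_to_scan(file_ranges(scattered), 250, 260)
print(clustered_hits, scattered_hits)  # 1 10
```

With the clustered layout the predicate touches one file out of ten; with the scattered layout every file's range covers the predicate and nothing is skipped.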
This is the architectural role of Liquid Clustering — it is "how the engine organizes files for efficient data skipping" (Source: sources/2026-06-01-databricks-debunking-8-data-layout-myths-why-liquid-clustering-outperfo). Liquid Clustering does not invent a new pruning mechanism; it makes the existing file-level skipping work better by ensuring that files contain rows clustered on the dimensions queries actually filter on.
The 2026-06-01 source states this explicitly: "Liquid Clustering is a write-side optimization. It's how the engine organizes files for efficient data skipping. The output is standard Parquet files with min/max stats, written into open table formats like Delta/Iceberg. Any compatible reader (e.g. open-source Apache Spark, DuckDB, etc.) can use those stats to skip files." The skipping is a property of the standard format; clustering improves the skipping selectivity but does not gate which readers can benefit.
Composition with metadata-only operations¶
Per-file min/max statistics double as the substrate for metadata-only operations: DELETEs aligned with file boundaries, and COUNTs / DISTINCTs / GROUP BYs that can be answered from stats alone. In the words of the 2026-06-01 source: "The engine uses the same per-file min/max stats it uses for data skipping to determine when a query's answer can be computed from metadata alone."
This is a load-bearing architectural property: the same metadata that powers query-time file pruning also powers data-mutation acceleration, so the ROI on maintaining good statistics compounds across both query types. See concepts/optimizer-statistics-as-skipping-substrate for the generalised principle.
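One way to picture a file-aligned DELETE is as a three-way classification of each file against the delete predicate. This is a sketch under stated assumptions, not the engine's actual algorithm: the stats layout, the `classify_for_delete` name, and the three labels are hypothetical, and every file is assumed to have min/max values for the column. A file is removable through metadata alone only when its whole range falls inside the predicate and it contains no nulls, since a null never satisfies a range predicate.

```python
def classify_for_delete(stats, lower, upper):
    """Classify one file for `DELETE WHERE col BETWEEN lower AND upper`."""
    lo, hi, nulls = stats["min"], stats["max"], stats["null_count"]
    if lo > upper or hi < lower:
        return "untouched"   # no row can match: ordinary data skipping
    if lower <= lo and hi <= upper and nulls == 0:
        return "drop_file"   # every row matches: remove the file in metadata only
    return "rewrite"         # mixed file: must be opened and rewritten


files = {
    "a.parquet": {"min": 1, "max": 10, "null_count": 0},
    "b.parquet": {"min": 11, "max": 20, "null_count": 0},
    "c.parquet": {"min": 21, "max": 30, "null_count": 0},
    "d.parquet": {"min": 11, "max": 20, "null_count": 2},
}
delete_plan = {path: classify_for_delete(s, 11, 25) for path, s in files.items()}
print(delete_plan)
```

The first branch is the same overlap test used for query-time skipping, which is the sense in which one set of statistics serves both purposes.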
Sibling skipping mechanisms on the wiki¶
| Sibling | Domain | What gets skipped |
|---|---|---|
| concepts/partition-pruning | MySQL partitioning | Whole partitions (server-side) |
| concepts/clickhouse-data-part | ClickHouse data parts | Whole parts (similar shape, different terminology) |
| Bloom filters | Lookup queries | Whole files where bloom misses |
| Skip indexes (ClickHouse MinMaxIdx) | Per-block min/max | Block reads within a part |
| Index range pruning | B-tree / B+-tree | Tree branches |
| Parquet row-group statistics | Within-file pruning | Row groups within a single file |
The shared principle across all of these: metadata that summarises content lets the engine prune content reads. File-level data skipping is the modern open-table-format incarnation, operating at the file granularity in the transaction log / manifest layer.
Failure modes¶
- Stale stats. When stats drift (writes without stats collection, or back-filled stats with wrong column selection), pruning silently degrades and queries revert to full scans. Mitigation: automatic stats collection on the write path (concepts/automatic-table-optimization).
- Wide stats due to bad layout. Random row distribution within files produces min/max ranges that cover the full domain on every file → pruning eliminates nothing. Mitigation: clustering (systems/liquid-clustering) or sorting on filter-relevant columns.
- Stats-less columns. Stats are collected only on configured columns, so predicates on columns without stats can't prune. Mitigation: configure stats columns to match the workload's predicate vocabulary.
- Wrong-column statistics. If the substrate maintains stats on columns not used as filter predicates, the cost is wasted. Mitigation: workload-driven stats column selection (Predictive Optimization).
- High-cardinality stats inflation. Per-file min/max on high-cardinality columns can bloat the transaction log / manifest if the file count is high. Mitigation: choose stat columns by predicate prevalence + selectivity, not by raw cardinality.
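The stats-less and wrong-column failure modes are two sides of the same mismatch, and a simple set comparison is enough to surface both. The helper and the column names below are hypothetical; in practice the predicate columns would come from query history and the stats columns from table configuration.

```python
def audit_stats_columns(stats_columns, predicate_columns):
    """Compare columns that have stats with columns the workload filters on."""
    stats_columns = set(stats_columns)
    predicate_columns = set(predicate_columns)
    return {
        # Filtered on, but no stats: these predicates cannot prune any file.
        "unprunable_predicates": sorted(predicate_columns - stats_columns),
        # Stats maintained, but never filtered on: collection cost is wasted.
        "unused_stats": sorted(stats_columns - predicate_columns),
    }


report = audit_stats_columns(
    stats_columns=["event_date", "user_id", "payload_size"],
    predicate_columns=["event_date", "country"],
)
print(report)
```

An empty report on both keys means the configured stats columns match the workload's predicate vocabulary exactly.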
What this is not¶
- Not partition pruning. Partition pruning operates on filesystem directory structure; file-level data skipping operates on table-format metadata. They achieve similar query-time outcomes through different mechanisms.
- Not Parquet row-group skipping. Parquet's internal row-group statistics let the reader skip row groups within a single opened file. File-level data skipping prevents the file from being opened at all. They compose: file-level skipping happens first, row-group skipping happens within surviving files.
- Not bloom-filter pruning. Bloom filters answer set-membership queries probabilistically; min/max statistics answer range queries deterministically. They serve different predicate shapes.
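The composition of file-level skipping with Parquet row-group skipping can be sketched as two nested applications of the same overlap test. The table structure below is a made-up in-memory model: real file ranges come from the table format's metadata, while row-group ranges are read from each Parquet file's footer only after the file has survived the first pass.

```python
def overlaps(lo, hi, lower, upper):
    """True if [lo, hi] may contain a value in [lower, upper]."""
    return not (lo > upper or hi < lower)


# Hypothetical table: per-file range plus per-row-group ranges inside each file.
table = {
    "f1.parquet": {"range": (1, 100), "row_groups": [(1, 50), (51, 100)]},
    "f2.parquet": {"range": (101, 200), "row_groups": [(101, 150), (151, 200)]},
}


def two_level_plan(table, lower, upper):
    """Return {file: [row-group indexes to read]} for files that survive."""
    result = {}
    for path, meta in table.items():
        if not overlaps(*meta["range"], lower, upper):
            continue  # level 1: the file is never opened
        result[path] = [
            i for i, (lo, hi) in enumerate(meta["row_groups"])
            if overlaps(lo, hi, lower, upper)  # level 2: skip row groups in the file
        ]
    return result


print(two_level_plan(table, 60, 70))  # {'f1.parquet': [1]}
```

For the predicate 60 to 70, the second file is eliminated without being opened, and within the first file only the second row group is read.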
Seen in¶
- sources/2026-06-01-databricks-debunking-8-data-layout-myths-why-liquid-clustering-outperfo — First wiki canonicalisation of the architectural fact that on modern OTFs, file-level data skipping is the only pruning that exists. The verbatim disclosure that "directory-pruning does not exist on modern open table formats" and that "the engine never lists directories to plan a query. It reads the transaction log, evaluates filters against statistics, and skips files that don't match" is load-bearing for the post's case against Hive-style partitioning. Names the dual role of per-file statistics in skipping AND metadata-only operations.
- sources/2026-05-27-databricks-bi-serving-pointers-maximizing-for-performance-and-tco — Earlier disclosure of the inline-during-Photon-writes statistics collection mechanism + back-fill for existing tables; canonicalised more abstractly at concepts/optimizer-statistics-as-skipping-substrate.
Related¶
- systems/delta-lake — transaction log carries per-file stats.
- systems/apache-iceberg — manifest entries carry per-file bounds.
- systems/liquid-clustering — write-side optimization that improves stats selectivity.
- systems/databricks-predictive-optimization — automates stats collection + back-fill.
- concepts/optimizer-statistics-as-skipping-substrate — the generalised principle that statistics are the substrate making skipping possible.
- concepts/over-partitioning — the failure mode that arises when teams treat directory layout as the pruning mechanism.
- concepts/metadata-only-operation — the dual use of the same stats for mutation acceleration.
- concepts/partition-pruning — the directory-level cousin in classical relational systems.