CONCEPT Cited by 1 source
Predicate pushdown
Predicate pushdown is the query-optimization technique of pushing filter predicates (WHERE clauses) down into the storage layer so that irrelevant data is never read into the execution engine. In columnar formats like Parquet, this means consulting per-row-group min/max statistics to skip entire row groups that cannot satisfy the predicate.
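The skipping logic can be illustrated with a minimal sketch. It does not use a real Parquet reader: `RowGroup`, `may_match_range`, and `scan_with_pushdown` are hypothetical names introduced here, and the min/max values are computed from in-memory lists rather than read from a file footer. The point is only that the decision to skip a row group is made from its statistics, before any of its values are touched.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """Hypothetical stand-in for a Parquet row group and its min/max statistics."""
    values: List[int]

    @property
    def min_value(self) -> int:
        return min(self.values)

    @property
    def max_value(self) -> int:
        return max(self.values)


def may_match_range(rg: RowGroup, lo: int, hi: int) -> bool:
    """Return True if the row group could contain a value in [lo, hi]."""
    return rg.max_value >= lo and rg.min_value <= hi


def scan_with_pushdown(row_groups: List[RowGroup], lo: int, hi: int) -> Tuple[List[int], int]:
    """Return the matching values and the number of row groups actually read."""
    matches: List[int] = []
    groups_read = 0
    for rg in row_groups:
        if not may_match_range(rg, lo, hi):
            continue  # skipped on statistics alone; values are never scanned
        groups_read += 1
        matches.extend(v for v in rg.values if lo <= v <= hi)
    return matches, groups_read


# Three row groups with non-overlapping value ranges (illustrative data).
row_groups = [
    RowGroup([1, 5, 9]),
    RowGroup([10, 14, 19]),
    RowGroup([20, 25, 29]),
]

# Equivalent to: WHERE col BETWEEN 12 AND 18
matches, groups_read = scan_with_pushdown(row_groups, 12, 18)
```

Here only the middle row group overlaps the predicate range, so the other two are skipped without being scanned. Note that the statistics can only rule a row group out; a row group that passes the check still has to be read and filtered.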
Effectiveness depends on file size
Row group statistics work best when row groups are substantial. Tiny Parquet files (e.g. 500 KB written every 30 seconds from a streaming flush) often have row groups too small to prune meaningfully — the statistics span so little data that most predicates match anyway. Larger files (32+ MB) with well-populated row groups enable significant data-skipping savings (Source: sources/2026-06-23-redpanda-bridge-queries-in-redpanda-sql).
Relationship to the small file problem
The small file problem directly undermines predicate pushdown: thousands of tiny files not only multiply S3 request costs but also defeat statistics-based pruning. Flushing less often (enabled by flush/freshness decoupling) produces the large files that make pushdown effective.
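A rough back-of-the-envelope calculation shows the scale of the difference. The sketch below reuses the figures mentioned on this page (500 KB files flushed every 30 seconds, 32 MB target files) and assumes a steady flush rate over 24 hours, decimal units (1 MB = 1,000 KB), and a single writer; the function names are hypothetical. Real workloads will differ.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def files_per_day(flush_interval_s: int) -> int:
    """Number of files one writer produces per day at a fixed flush interval."""
    return SECONDS_PER_DAY // flush_interval_s


def file_count_at_target_size(total_kb: int, target_file_kb: int) -> int:
    """Number of files needed to hold the same data at a larger target size."""
    return math.ceil(total_kb / target_file_kb)


# Assumed workload: one 500 KB file every 30 seconds.
small_files = files_per_day(30)
total_kb = small_files * 500

# Same data volume written as 32 MB files (32,000 KB in decimal units).
large_files = file_count_at_target_size(total_kb, 32_000)
```

Under these assumptions the writer produces 2,880 small files per day, holding 1,440,000 KB in total, where 45 files of 32 MB would hold the same data. Since a reader has to open each file to consult its statistics, the file count is a lower bound on the number of object-store requests a full scan issues.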
Seen in
- systems/apache-iceberg — Iceberg manifest-level statistics enable partition pruning and data file skipping
- systems/apache-parquet — row group footer statistics power column-level pushdown
- systems/redpanda-sql — benefits from large Parquet files written by decoupled flush intervals
- sources/2026-06-23-redpanda-bridge-queries-in-redpanda-sql — describes effectiveness scaling with file size