Index-candidate filtering


Index-candidate filtering is the practice of using workload telemetry to decide which query patterns are candidates for indexing at all, before running any index-tuning logic (human, LLM, or algorithmic) against them. The filter exists because indexes are not free — every secondary index costs storage, memory, and write-path overhead — and the population of queries on a production database is dominated by patterns that either don't benefit from indexing or aren't run often enough to justify the write tax.

The canonical formulation comes from PlanetScale's Insights AI-powered index suggestions feature (sources/2026-04-21-planetscale-ai-powered-postgres-index-suggestions):

When we manually examine a database to find queries that might benefit from an index, we typically look for query patterns with a high ratio of rows read to rows returned. If a query is reading a much larger set of rows than it is returning, it's an indication that an index (or a more selective index) could improve response times.

We also filter the set of query patterns to those [that] are using significant resources. In particular we require that the query pattern is responsible for at least 0.1% of the aggregated runtime of all queries, and that it has been run a minimum number of times.

The three filter axes

  1. Rows-read to rows-returned ratio. A high ratio (reading many rows, returning few) signals a scan that a more-selective index could short-circuit. A low ratio means the query is already retrieving close to what it reads — an index isn't going to help much.

  2. Minimum workload runtime share. Only query patterns responsible for a non-trivial fraction of total runtime are candidates. PlanetScale's threshold: ≥ 0.1% of the aggregated runtime of all queries. Below that, an index can't pay off even when it would speed the query up: the total time saved is noise.

  3. Minimum frequency. The pattern must have been run at least "a minimum number of times." Ad-hoc queries — a one-off analytics question, a debugging session — aren't worth indexing, because the index's write tax applies forever but the read benefit applies once.
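The three axes can be sketched as a single filter over per-pattern workload stats. This is a minimal illustration, not PlanetScale's implementation: the field names are hypothetical (in Postgres they would come from something like pg_stat_statements), and every threshold except the 0.1% runtime share is made up, since the source does not publish its ratio or frequency cutoffs.

```python
from dataclasses import dataclass

@dataclass
class QueryPattern:
    # Illustrative per-pattern aggregates; not PlanetScale's actual schema.
    sql: str
    rows_read: int
    rows_returned: int
    total_runtime_ms: float
    call_count: int

def index_candidates(patterns, *, min_read_ratio=100.0,
                     min_runtime_share=0.001, min_calls=50):
    """Keep only patterns that pass all three filter axes.

    min_runtime_share=0.001 is the 0.1% threshold from the source;
    min_read_ratio and min_calls are placeholder values.
    """
    workload_runtime = sum(p.total_runtime_ms for p in patterns) or 1.0
    out = []
    for p in patterns:
        ratio = p.rows_read / max(p.rows_returned, 1)        # axis 1
        share = p.total_runtime_ms / workload_runtime        # axis 2
        if (ratio >= min_read_ratio
                and share >= min_runtime_share
                and p.call_count >= min_calls):              # axis 3
            out.append(p)
    return out
```

Only patterns surviving all three predicates are handed to the downstream suggester, whatever form it takes.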

All three filters encode the same economic truth: indexes pay a write-side tax on every row mutation, forever, in exchange for read-side wins that only accrue while the query keeps running. If the query doesn't run often, or doesn't contribute meaningfully to workload cost, or isn't reading much more than it returns, the trade doesn't pencil out.
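The trade-off above can be made concrete with back-of-envelope arithmetic. All the numbers here are invented for illustration; the point is only the shape of the break-even calculation.

```python
# Illustrative cost model: a candidate index on a write-heavy table.
index_write_cost_us = 15        # extra maintenance time per INSERT/UPDATE (assumed)
writes_per_day = 2_000_000      # mutations that must also update the index (assumed)
read_saving_ms = 40             # per-execution speedup the index would give (assumed)

# The write tax is paid on every mutation, whether or not the query runs.
daily_write_tax_ms = index_write_cost_us * writes_per_day / 1000

# The index only pays off once the query runs often enough that its
# read-side savings exceed that permanent write-side tax.
break_even_runs_per_day = daily_write_tax_ms / read_saving_ms
```

With these numbers the query must run about 750 times a day just to break even, which is exactly why a minimum-frequency filter exists: an ad-hoc query that runs twice never recoups the tax.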

Why you want this filter before the LLM

For LLM-based index suggesters, the filter is structurally load-bearing:

  • LLMs over-recommend when asked to decide if anything needs changing. Handing the model every query pattern in the workload biases it toward finding something to optimise. Pre-filtering converts the prompt from "is there an opportunity here?" to "given that there is an opportunity, what index would help?" — a question LLMs handle well.
  • Prompt cost scales with query count. Every query pattern handed to the LLM takes tokens. Filtering keeps prompts short and focused.
  • Signal-to-noise matters for validation. If the LLM is asked to cite which queries each proposed index helps, and most of the input queries can't benefit from any index, most citations will be false positives that the validator (e.g. HypoPG evaluation) rejects — wasted LLM effort + wasted validation effort.
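A minimal sketch of the reframing the first bullet describes: the prompt is assembled only from pre-filtered candidates, so the model is asked "what index helps?" rather than "does anything need changing?". The function and its wording are hypothetical, not PlanetScale's actual prompt.

```python
def build_prompt(candidate_sql: list[str]) -> str:
    # Presupposes an opportunity exists for each query; the model's job
    # is only to name the index and cite which query it helps.
    lines = [
        "Each query below reads far more rows than it returns.",
        "For each, propose the single most useful index and cite",
        "which query it helps:",
    ]
    for i, sql in enumerate(candidate_sql, 1):
        lines.append(f"{i}. {sql}")
    return "\n".join(lines)
```

Because the input list is already small, prompt size stays bounded and the downstream validator sees mostly plausible citations.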

See patterns/workload-aware-llm-prompting for the prompt-engineering pattern built on top of this filter.

Seen in

  • sources/2026-04-21-planetscale-ai-powered-postgres-index-suggestions — canonical wiki instance. Three axes (rows-read ratio + ≥ 0.1% runtime share + minimum frequency) applied to PlanetScale Postgres workloads before any LLM or rule-based suggester is invoked. Rationale stated directly: "since indexes incur storage, memory and write overhead, we want to avoid suggesting indexes for ad-hoc or infrequently run queries."