CONCEPT Cited by 1 source

Intelligent Document Sampling

Intelligent Document Sampling is a cost-reduction strategy for LLM-driven document classification: instead of running the model over every page of a long document, sample only the most informative sections (title pages, introductions, conclusions, summary tables) and aggregate page-level results back to a document-level classification. Short documents are processed in full; long documents are processed selectively.

Why it matters

Multimodal LLM inference is expensive per page, in both dollars and wall-clock time. Document corpora have a heavy-tailed page distribution: a few documents with thousands of pages dominate the total page count. Naively running the model on every page of every document spends most of the budget on those few very long documents while delivering diminishing returns, because most of the classification signal in a long technical report sits in the title page, executive summary, introduction, and conclusion.

The sampling strategy exploits two properties:

- Document type signal is concentrated. A geological survey's first 5 pages and last 5 pages disclose what it is and what it covered far more reliably than a random middle page.
- Aggregation tolerates misses. Page-level classifications aggregated up to a document-level label are robust to a handful of misclassified or sampled-out pages, because the document gets one final tag set, not per-page tags.
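The aggregation step can be sketched as a simple support threshold over page-level tags. This is a minimal illustration, not the MapAid implementation: the function name `aggregate_tags`, the input shape, and the `min_support` knob are all assumptions made for the example.

```python
from collections import Counter


def aggregate_tags(page_tags, min_support=0.3):
    """Roll page-level tag lists up into one document-level tag set.

    page_tags:   list of tag lists, one per sampled page (hypothetical shape).
    min_support: fraction of sampled pages that must carry a tag for it to
                 survive (an assumed knob, not a value from the source).
    """
    if not page_tags:
        return set()
    # Count each tag at most once per page so a repeated tag on one page
    # cannot outvote the other pages.
    counts = Counter(tag for tags in page_tags for tag in set(tags))
    needed = min_support * len(page_tags)
    return {tag for tag, n in counts.items() if n >= needed}


# Five sampled pages; one page carries a stray "finance" tag.
sampled = [
    ["groundwater", "geology"],
    ["groundwater"],
    ["groundwater", "finance"],
    ["geology"],
    ["groundwater"],
]
document_tags = aggregate_tags(sampled)
```

Here the stray `finance` tag appears on one page out of five and falls below the support threshold, so it does not reach the document-level tag set.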

The MapAid groundwater pipeline reports a reduction of more than 70% in AI processing volume "while preserving classification quality." The quality claim rests on the inline LLM judge, whose excellent/good rate is 95%.
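To see how a heavy-tailed page distribution turns sampling into a large volume reduction, the arithmetic can be sketched as follows. The threshold of 20 pages, the 5-page head and tail, and the page counts are illustrative assumptions, not figures from the MapAid pipeline.

```python
def pages_processed(page_count, threshold=20, head=5, tail=5):
    """Pages sent to the model for one document.

    threshold, head and tail are assumed values for illustration only.
    """
    if page_count <= threshold:
        return page_count  # short document: processed in full
    return min(page_count, head + tail)  # long document: sampled


def volume_reduction(page_counts, **kwargs):
    """Fraction of total pages that sampling avoids processing."""
    total = sum(page_counts)
    if total == 0:
        return 0.0
    processed = sum(pages_processed(n, **kwargs) for n in page_counts)
    return 1 - processed / total


# Made-up corpus: three short documents and one 400-page report.
corpus = [3, 8, 12, 400]
reduction = volume_reduction(corpus)  # 1 - 33/423
```

In this toy corpus the single long report accounts for almost all pages, so sampling it removes most of the work even though three of the four documents are still processed in full.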

Sampling strategy in the MapAid pipeline

"Shorter documents are analyzed in full, while longer documents are sampled from their most informative sections (title pages, introductions, and conclusions). This reduced AI processing volume by more than 70% while preserving classification quality." (Source: sources/2026-05-11-databricks-unlocking-the-archives)

The source discloses neither the threshold between "short" and "long" nor the exact sampling fraction within long documents. The structural shape is what's portable: decide-by-length, sample informative sections, aggregate up.
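The decide-by-length and sample steps can be sketched as a page-index selector. Because the source does not disclose its parameters, `threshold`, `head`, and `tail` below are placeholder values, and `select_pages` is a hypothetical helper name.

```python
def select_pages(page_count, threshold=20, head=5, tail=5):
    """Return 0-based page indices to send to the classifier.

    threshold, head and tail are placeholders; the source does not
    disclose the real values.
    """
    if page_count <= threshold:
        # Short document: analyze every page.
        return list(range(page_count))
    # Long document: first `head` pages plus last `tail` pages.
    first = range(0, min(head, page_count))
    # Start the tail no earlier than `head` so no page is selected twice.
    last = range(max(page_count - tail, head), page_count)
    return list(first) + list(last)


short_doc = select_pages(8)
long_doc = select_pages(100)
```

Positional head and tail sampling is the simplest stand-in for "title pages, introductions, and conclusions"; it needs no knowledge of page content, only the page count.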

When this works vs when it doesn't

Works:
- Document classification (what is this document about, where does it apply).
- Document routing (does this document warrant a deeper extraction pass?).
- Document tagging for searchability.

Doesn't work:
- Per-page structured extraction (well coordinates appear on page 47, not in the introduction). The MapAid pipeline acknowledges this by switching strategy at the second pass: once a document is flagged water-relevant, every page is processed. Sampling is for the classification pass only.
- Documents where signal is uniformly distributed (logs, time-series reports, transcripts): sampling title pages misses the actual content.
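The switch from a sampled classification pass to a full extraction pass can be sketched as below. `classify_page` and `extract_page` are hypothetical callables standing in for the model calls, the sampling parameters are assumed, and the "any sampled page is relevant" rule is one possible aggregation choice, not the documented one.

```python
from typing import Callable, List


def two_pass(pages: List[str],
             classify_page: Callable[[str], bool],
             extract_page: Callable[[str], dict],
             threshold: int = 20, head: int = 5, tail: int = 5) -> List[dict]:
    """Pass 1 classifies a sample; pass 2 extracts from every page."""
    # Pass 1: sample only for classification (parameters are assumptions).
    if len(pages) <= threshold:
        sample = pages
    else:
        sample = pages[:head] + pages[max(len(pages) - tail, head):]
    # Assumed aggregation rule: relevant if any sampled page says so.
    if not any(classify_page(p) for p in sample):
        return []
    # Pass 2: no sampling, because the target may sit on any page.
    return [extract_page(p) for p in pages]


# Toy 50-page document with a well record on page 47 (index 46).
doc = ["Groundwater survey, title page"] + ["filler"] * 49
doc[46] = "well at 12.5, 34.1"
records = two_pass(
    doc,
    classify_page=lambda p: "groundwater" in p.lower(),
    extract_page=lambda p: {"has_well": p.startswith("well at")},
)
```

The well record sits outside the sampled head and tail, so it is only found because the second pass ignores the sample and reads every page.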

Distinction from random sampling

This isn't sample(N) on the page set. It is structurally informed sampling: the sampler has a model of which sections of a long document carry the classification signal. Title pages exist because authors put the document's identity there. Introductions exist because authors put the document's scope there. Conclusions exist because authors put the document's findings there. The sampler is exploiting authoring conventions.
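One way to make the sampler structure-aware rather than random is to pick pages whose text starts a recognizable section. The sketch below assumes cheap page text is already available (for example from a PDF text layer), which the source does not state; the heading pattern and the helper name `informative_pages` are illustrative.

```python
import re

# Assumed heading vocabulary; real corpora would need their own list.
SECTION_PATTERN = re.compile(
    r"^\s*(?:\d+(?:\.\d+)*\.?\s+)?"
    r"(abstract|executive summary|introduction|summary|conclusions?)\b",
    re.IGNORECASE | re.MULTILINE,
)


def informative_pages(page_texts):
    """Return sorted indices of pages that look like title, intro or conclusion pages."""
    if not page_texts:
        return []
    picked = {0}  # treat the first page as the title page
    for index, text in enumerate(page_texts):
        if SECTION_PATTERN.search(text):
            picked.add(index)
    return sorted(picked)


pages = [
    "Geological Survey of Region X",
    "1. Introduction\nThis report covers...",
    "borehole log data 12 34 56",
    "Table of results",
    "5 Conclusions\nWe find that...",
]
chosen = informative_pages(pages)
```

Unlike a random draw over page indices, the selection here depends on authoring conventions: the pages chosen are the ones where the headings say the document states its identity, scope, and findings.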

Seen in

sources/2026-05-11-databricks-unlocking-the-archives — canonical wiki instance. ~700 documents / 5,570 pages classified with >70% volume reduction; quality preserved per the inline LLM judge.