CONCEPT Cited by 1 source
Smart chunking¶
Smart chunking is a document-structure-preserving chunking strategy for LLM-based document intelligence: chunks are constructed to retain hierarchy and one-to-many relationships rather than being produced by sliding a fixed token window over the document's flat text.
The term is canonicalised on the wiki via AArete's Doczy.ai disclosure (sources/2026-06-02-aws-automating-contract-intelligence-with-doczyai-on-aws), which describes it as "a proprietary approach that goes far beyond pulling words off a page. Rather than treating a document as a flat sequence of text, smart chunking preserves hierarchical structure and one-to-many relationships within documents. It uses a combination of semantic and keyword search to decompose text into meaningful, context-aware chunks, applying dynamic parameters to maintain logical relationships throughout. Sequential identifiers and metadata-driven grouping organize these chunks into field groups, detecting overlaps and removing duplications while keeping the document's natural flow intact."
What "smart" means here¶
Naive chunking strategies for LLM input:
| Strategy | Mechanism | Failure mode |
|---|---|---|
| Fixed-token-window | Slide a 4K / 8K-token window over text | Cuts mid-clause; loses hierarchy |
| Sentence-split | Split on sentence boundaries | Loses paragraph + section context |
| Paragraph-split | Split on `\n\n` | Loses table / list structure |
| Page-split | One chunk per OCR page | Page boundaries are layout, not semantic |
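
The fixed-token-window failure mode in the table can be reproduced with a few lines of Python. This is a minimal sketch, not any vendor's implementation: whitespace-separated words stand in for real tokenizer tokens, the window is tiny so the effect is visible, and the sample contract text and function names are made up for illustration.

```python
def fixed_window_chunks(text: str, window: int, step: int) -> list[str]:
    """Naive baseline: slide a fixed-size window over whitespace tokens."""
    tokens = text.split()
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


def paragraph_chunks(text: str) -> list[str]:
    """Naive baseline: split on blank lines."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


# Hypothetical two-clause contract fragment.
doc = (
    "1. Payment. The Vendor shall invoice monthly.\n\n"
    "1.1 Late fees. Late payments accrue interest at the agreed rate."
)

fixed = fixed_window_chunks(doc, window=5, step=5)
paras = paragraph_chunks(doc)

# The second window straddles clause 1 and sub-clause 1.1.
print(fixed[1])  # invoice monthly. 1.1 Late fees.
```

The paragraph splitter keeps each clause intact here, but neither baseline records that 1.1 is a child of 1, which is the gap the disclosed properties below address.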
Smart chunking adds structural awareness on top of these — the chunk boundaries are placed where the document's own structure suggests, not where the token window happens to land. The disclosed properties:
- Hierarchy preservation — parent / child relationships (clause → sub-clause → exhibit → schedule) survive into chunk metadata.
- One-to-many relationship preservation — a parent chunk that references multiple child chunks retains those references.
- Semantic + keyword decomposition — both an embedding similarity signal and a keyword-match signal contribute to where chunk boundaries land.
- Dynamic parameters — chunk size / overlap / boundary criteria are not fixed constants but adapt to the document being chunked.
- Sequential identifiers + metadata-driven grouping — chunks carry IDs that let downstream consumers reconstruct the document's logical structure from a flat list of chunks.
- Overlap detection + duplication removal — overlapping chunks (introduced by sliding-window approaches) are detected and collapsed.
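
As a rough illustration of the hierarchy, one-to-many, and sequential-identifier properties listed above, the sketch below attaches structural metadata to each chunk. Doczy.ai's actual scheme is not disclosed, so everything here is an assumption: the `Chunk` fields, the dotted clause labels ("1", "1.1", "1.1.1"), and the rule that a parent appears before its children.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Chunk:
    seq_id: int                  # sequential identifier: position in document order
    label: str                   # hypothetical dotted clause label, e.g. "1.1"
    text: str
    parent_label: Optional[str]  # hierarchy: the enclosing clause, if any
    depth: int                   # nesting depth derived from the label
    child_labels: list[str] = field(default_factory=list)  # one-to-many links


def build_chunks(sections: list[tuple[str, str]]) -> dict[str, Chunk]:
    """Turn (label, text) pairs into chunks that carry structural metadata.

    Assumes each parent section is listed before its children.
    """
    chunks: dict[str, Chunk] = {}
    for seq_id, (label, text) in enumerate(sections):
        parts = label.split(".")
        parent = ".".join(parts[:-1]) or None
        chunks[label] = Chunk(seq_id, label, text, parent, len(parts))
        if parent in chunks:
            chunks[parent].child_labels.append(label)
    return chunks


def ancestry(chunks: dict[str, Chunk], label: str) -> list[str]:
    """Walk parent links to rebuild the path from the root down to `label`."""
    path = []
    current: Optional[str] = label
    while current is not None and current in chunks:
        path.append(current)
        current = chunks[current].parent_label
    return path[::-1]


# Hypothetical section list, as a structure-aware parser might emit it.
sections = [
    ("1", "Payment terms"),
    ("1.1", "Late fees"),
    ("1.1.1", "Interest rate"),
    ("1.2", "Currency"),
]
chunks = build_chunks(sections)
print(chunks["1"].child_labels)   # ['1.1', '1.2']
print(ancestry(chunks, "1.1.1"))  # ['1', '1.1', '1.1.1']
```

A downstream consumer that receives only the flat list of chunks can sort by `seq_id` to recover document order and follow `parent_label` / `child_labels` to recover the tree.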
Why structure-preservation matters for LLM extraction¶
A contract's meaning often depends on where a clause lives: "a three-nested-level exhibit carries fundamentally different implications than a straightforward attached schedule" (source). A flat-text chunker hands the LLM a clause without that nesting context, so the LLM has to reconstruct it from text cues alone (or get it wrong). A smart chunker hands the LLM the clause plus metadata stating "this lives at exhibit-level depth 3 under schedule X under section Y", so the LLM can ground its extraction in the document's actual structure.
This is the structural argument behind the accuracy step Doczy.ai discloses on the same contract corpus, from 55% with a rules-based approach to 99% with an LLM plus smart chunking: rules can't reason about structural nesting, and a naive-chunked LLM has to guess at it, whereas a smart-chunked LLM has it as input.
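
One simple way to hand the LLM "the clause plus metadata" is to prefix the chunk text with its location in the hierarchy. The sketch below is an assumed format, not Doczy.ai's prompt layout; the function name, the bracketed header, and the sample clause are all hypothetical.

```python
def render_with_context(path: list[str], text: str) -> str:
    """Prefix a clause with its location in the document hierarchy."""
    breadcrumb = " > ".join(path)
    return f"[location: {breadcrumb} | depth: {len(path)}]\n{text}"


# Hypothetical clause and path, mirroring the "section Y / schedule X" example.
clause = "Termination requires 90 days written notice."
path = ["Section Y", "Schedule X", "Exhibit 3"]

prompt_chunk = render_with_context(path, clause)
print(prompt_chunk)
```

The same clause text rendered under a different path would produce a different prompt, which is the point: the nesting becomes explicit input rather than something the model has to infer.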
Sibling chunking strategies on the wiki¶
| Strategy | Source | What it preserves |
|---|---|---|
| Smart chunking | Doczy.ai (2026-06-02) | Hierarchy + one-to-many across chunks |
| concepts/metadata-only-embedding | Yelp CS Chatbot (2026-05-27) | Metadata as embedding signal; whole article as retrieval unit |
| concepts/whole-article-retrieval | Yelp (2026-05-27) | Article boundary as retrieval unit |
| concepts/intelligent-document-sampling | Databricks Unlocking Archives (2026-05-11) | Sample from large doc collection |
| concepts/multimodal-document-understanding | Databricks (2026-05-11) | Document as image, not text |
Smart chunking is input-side preprocessing; metadata-only embedding is embedding-signal selection; whole-article retrieval is retrieval-unit selection. The three are orthogonal axes of RAG-pipeline design.
Composition with dual clustering¶
In Doczy.ai, smart chunking output feeds into the dual clustering engine — chunks are then analysed both semantically (embed and group similar meanings) and structurally (pattern-recognise clause types, table layouts, hierarchical depth), and the two views are projected into a unified document model that grounds the LLM prompt. Smart chunking is the substrate that makes dual clustering meaningful — clusters are over chunks-with-metadata, not over flat text.
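
The two views can be illustrated with a toy example. The clustering algorithms Doczy.ai uses are not disclosed, so this sketch substitutes deliberately simple stand-ins: greedy cosine-similarity grouping over hand-written 2-D vectors for the semantic view, and a `(kind, depth)` key for the structural view. All names, vectors, and the threshold are assumptions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_clusters(vectors: dict[str, list[float]], threshold: float) -> dict[str, int]:
    """Greedy grouping: join the first cluster whose seed vector is similar enough."""
    seeds: list[list[float]] = []
    assignment: dict[str, int] = {}
    for chunk_id, vec in vectors.items():
        for idx, seed in enumerate(seeds):
            if cosine(vec, seed) >= threshold:
                assignment[chunk_id] = idx
                break
        else:
            seeds.append(vec)  # no similar cluster found: start a new one
            assignment[chunk_id] = len(seeds) - 1
    return assignment


# Toy embeddings and structural metadata for three chunks (made-up values).
vectors = {"c1": [1.0, 0.0], "c2": [0.9, 0.1], "c3": [0.0, 1.0]}
structure = {"c1": ("clause", 1), "c2": ("table", 2), "c3": ("clause", 1)}

semantic = semantic_clusters(vectors, threshold=0.9)

# Unified view: each chunk carries both its semantic and structural grouping.
unified = {cid: {"semantic": semantic[cid], "structural": structure[cid]} for cid in vectors}
print(unified["c2"])  # {'semantic': 0, 'structural': ('table', 2)}
```

In this toy data `c1` and `c2` mean similar things but differ structurally, while `c1` and `c3` share a structural shape but differ in meaning. Neither view alone captures both facts, and the structural view only exists because the chunks carry metadata.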
Mitigates / sibling failure modes¶
- concepts/embedding-signal-dilution — chunking too large dilutes embedding signal; smart chunking's dynamic parameters and semantic decomposition prevent this on the embedding side.
- Lost-context failure of fixed-window chunking — smart chunking's hierarchy preservation is the structural fix.
- Overlap duplication of sliding-window chunking — smart chunking's overlap detection collapses these.
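
Overlap collapse for sliding-window output can be sketched as follows. This is one plausible approach under stated assumptions, not the disclosed mechanism: chunks arrive in document order, overlap is an exact token match between the end of one chunk and the start of the next, and whitespace-separated words stand in for tokens.

```python
def overlap_len(prev: list[str], nxt: list[str]) -> int:
    """Length of the longest suffix of `prev` that is also a prefix of `nxt`."""
    for k in range(min(len(prev), len(nxt)), 0, -1):
        if prev[-k:] == nxt[:k]:
            return k
    return 0


def collapse_overlaps(chunks: list[str]) -> list[str]:
    """Trim from each chunk the tokens it repeats from the previous chunk."""
    result: list[str] = []
    prev_tokens: list[str] = []
    for chunk in chunks:
        tokens = chunk.split()
        trimmed = tokens[overlap_len(prev_tokens, tokens):]
        if trimmed:  # drop chunks that were pure duplication
            result.append(" ".join(trimmed))
        prev_tokens = tokens
    return result


# Hypothetical sliding-window output with overlapping windows.
windows = [
    "The Vendor shall invoice monthly",
    "invoice monthly within ten days",
    "within ten days",
]
deduped = collapse_overlaps(windows)
print(deduped)  # ['The Vendor shall invoice monthly', 'within ten days']
```

Joining the collapsed chunks with spaces reproduces the original sentence once, with no repeated tokens.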
When to apply¶
- Document-intelligence pipelines on contracts, regulatory filings, legal agreements, vendor agreements, healthcare-claims documents — anywhere structure is load-bearing for downstream extraction.
- Production scale ≥ 100 K documents/period where the accuracy-vs-volume curve makes engineering investment in chunker quality worthwhile.
When not to apply¶
- Short, flat documents (single-paragraph tickets, social posts) — hierarchy is absent or trivial.
- Pure semantic-search use cases over flat text — naive chunking + metadata-only embedding may suffice.
- Multimodal-direct pipelines that feed document images to a multimodal LLM (patterns/visual-first-document-extraction) — the LLM sees layout directly; pre-chunking is unnecessary.
Mechanism caveat¶
Smart chunking's mechanism (algorithm class, dynamic-parameter tuning, sequential-identifier scheme, projection to field groups) is AArete IP / patented and is not disclosed in any greater depth in the AWS Architecture Blog post. The wiki captures the disclosed properties; implementation internals are out of scope unless and until AArete (or another team replicating the approach) publishes more detail.
Seen in¶
- sources/2026-06-02-aws-automating-contract-intelligence-with-doczyai-on-aws — first canonical wiki disclosure of the term "smart chunking" and the document-structure-preserving chunking concept; 99% extraction accuracy on contracts, with smart chunking + dual clustering named as the load-bearing accuracy contributors.
Related¶
- concepts/dual-clustering-document-intelligence — downstream consumer of smart-chunking output
- concepts/embedding-signal-dilution — sibling chunking-failure concept
- concepts/metadata-only-embedding — sibling chunking-strategy concept
- concepts/multimodal-document-understanding — alternative approach (skip OCR + chunking by feeding images to multimodal LLM)
- systems/doczy-ai
- systems/amazon-textract — typical upstream OCR provider
- patterns/managed-ai-document-intelligence-pipeline-on-aws
- patterns/visual-first-document-extraction — alternative pipeline shape