CONCEPT Cited by 1 source

Smart chunking

Smart chunking is a document-structure-preserving chunking strategy for LLM-based document intelligence: chunks are constructed to retain hierarchy and one-to-many relationships rather than being cut by sliding a flat token window over the document's text.

The term is canonicalised on the wiki via AArete's Doczy.ai disclosure (sources/2026-06-02-aws-automating-contract-intelligence-with-doczyai-on-aws), which describes it as "a proprietary approach that goes far beyond pulling words off a page. Rather than treating a document as a flat sequence of text, smart chunking preserves hierarchical structure and one-to-many relationships within documents. It uses a combination of semantic and keyword search to decompose text into meaningful, context-aware chunks, applying dynamic parameters to maintain logical relationships throughout. Sequential identifiers and metadata-driven grouping organize these chunks into field groups, detecting overlaps and removing duplications while keeping the document's natural flow intact."

What "smart" means here

Naive chunking strategies for LLM input:

Strategy	Mechanism	Failure mode
Fixed-token-window	Slide a 4 K / 8 K window over text	Cuts mid-clause; loses hierarchy
Sentence-split	Split on sentence boundaries	Loses paragraph + section context
Paragraph-split	Split on `\n\n`	Loses table / list structure
Page-split	One chunk per OCR page	Page boundaries are layout, not semantic
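The fixed-token-window failure mode can be reproduced in a few lines. The following is a minimal sketch, assuming whitespace tokens stand in for model tokens; the function name `fixed_window_chunks` and the sample contract text are illustrative, not taken from the source.

```python
def fixed_window_chunks(text: str, window: int, stride: int) -> list[str]:
    """Naive chunker: take `window` whitespace tokens, then advance by `stride`."""
    tokens = text.split()
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(" ".join(tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


# Illustrative contract text (hypothetical).
contract = (
    "Section 2 Payment. 2.1 The Payer shall pay within 30 days. "
    "Exhibit A Rates. The rate in Schedule 1 applies to Section 2.1."
)

chunks = fixed_window_chunks(contract, window=8, stride=8)
for chunk in chunks:
    print(repr(chunk))
```

The second chunk starts in the middle of clause 2.1 and runs on into Exhibit A, so neither the clause boundary nor the section/exhibit hierarchy survives into the chunk.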

Smart chunking adds structural awareness on top of these — the chunk boundaries are placed where the document's own structure suggests, not where the token window happens to land. The disclosed properties:

Hierarchy preservation — parent / child relationships (clause → sub-clause → exhibit → schedule) survive into chunk metadata.
One-to-many relationship preservation — a parent chunk that references multiple child chunks retains those references.
Semantic + keyword decomposition — both an embedding similarity signal and a keyword-match signal contribute to where chunk boundaries land.
Dynamic parameters — chunk size / overlap / boundary criteria are not fixed constants but adapt to the document being chunked.
Sequential identifiers + metadata-driven grouping — chunks carry IDs that let downstream consumers reconstruct the document's logical structure from a flat list of chunks.
Overlap detection + duplication removal — overlapping chunks (introduced by sliding-window approaches) are detected and collapsed.
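Several of these properties (hierarchy preservation, one-to-many references, sequential identifiers) can be illustrated with a small data structure. This is a sketch under stated assumptions, not AArete's algorithm: it assumes an upstream parsing step has already split the document into units with a known nesting depth, and the names `Chunk`, `chunk_by_structure`, and `path_of` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Chunk:
    seq_id: int                      # sequential identifier, document order
    depth: int                       # nesting depth, 1 = top level
    title: str
    text: str
    parent_id: Optional[int] = None  # hierarchy: link to the enclosing unit
    child_ids: list[int] = field(default_factory=list)  # one-to-many links


def chunk_by_structure(units: list[tuple[int, str, str]]) -> list[Chunk]:
    """Turn (depth, title, text) units into chunks that keep parent/child links."""
    chunks: list[Chunk] = []
    stack: list[Chunk] = []  # currently open ancestors, shallowest first
    for seq_id, (depth, title, text) in enumerate(units):
        # Close every open unit that is not an ancestor of the new one.
        while stack and stack[-1].depth >= depth:
            stack.pop()
        parent = stack[-1] if stack else None
        chunk = Chunk(seq_id, depth, title, text,
                      parent.seq_id if parent else None)
        if parent:
            parent.child_ids.append(seq_id)
        chunks.append(chunk)
        stack.append(chunk)
    return chunks


def path_of(chunk: Chunk, chunks: list[Chunk]) -> list[str]:
    """Rebuild the chunk's position in the hierarchy from the flat chunk list."""
    titles = []
    node: Optional[Chunk] = chunk
    while node is not None:
        titles.append(node.title)
        node = chunks[node.parent_id] if node.parent_id is not None else None
    return list(reversed(titles))


# Hypothetical pre-parsed units: (depth, title, text).
units = [
    (1, "Section 2 Payment", "Payment terms."),
    (2, "2.1 Due date", "Payer shall pay within 30 days."),
    (2, "2.2 Late fees", "Late payments accrue interest."),
    (1, "Exhibit A Rates", "Rates are set out in the schedules."),
    (2, "Schedule 1", "Base rate table."),
]
chunks = chunk_by_structure(units)
print(chunks[0].child_ids)
print(path_of(chunks[4], chunks))
```

Because every chunk carries its own identifier and parent link, a downstream consumer that only receives the flat list can still recover where each chunk sat in the document.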

Why structure-preservation matters for LLM extraction

A contract's meaning often depends on where a clause lives: "a three-nested-level exhibit carries fundamentally different implications than a straightforward attached schedule" (source). A flat-text chunker hands the LLM a clause without that nesting context, so the LLM has to reconstruct it from text cues alone and may get it wrong. A smart chunker hands the LLM the clause plus metadata stating "this lives at exhibit-level depth 3 under schedule X under section Y", so the LLM can ground its extraction in the document's actual structure.
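The difference between the two inputs can be made concrete. The sketch below is illustrative only: the bracketed header format and the function name `render_chunk_for_prompt` are assumptions, since the source does not disclose how Doczy.ai serialises chunk metadata into a prompt.

```python
def render_chunk_for_prompt(path: list[str], text: str) -> str:
    """Prefix a clause with its position in the document hierarchy."""
    location = " > ".join(path)   # outermost container first
    depth = len(path)             # nesting depth of the clause's container
    return (
        f"[location: {location}]\n"
        f"[nesting depth: {depth}]\n"
        f"{text}"
    )


# What a flat-text chunker would hand to the LLM (hypothetical clause).
flat_input = "The rate increases 3% annually."

# What a structure-preserving chunker could hand over instead.
structured_input = render_chunk_for_prompt(
    ["Section Y", "Schedule X", "Exhibit 1"],
    flat_input,
)
print(structured_input)
```

With the flat input the model sees only the clause; with the structured input the nesting is stated explicitly rather than inferred from text cues.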

This is the structural argument behind the accuracy step Doczy.ai discloses on the same contract corpus, from 55% with a rules-based approach to 99% with an LLM plus smart chunking: rules can't reason about structural nesting, and a naive-chunked LLM has to guess at it, whereas a smart-chunked LLM receives it as input.

Sibling chunking strategies on the wiki

Strategy	Source	What it preserves
Smart chunking	Doczy.ai (2026-06-02)	Hierarchy + one-to-many across chunks
concepts/metadata-only-embedding	Yelp CS Chatbot (2026-05-27)	Metadata as embedding signal; whole article as retrieval unit
concepts/whole-article-retrieval	Yelp (2026-05-27)	Article boundary as retrieval unit
concepts/intelligent-document-sampling	Databricks Unlocking Archives (2026-05-11)	Sample from large doc collection
concepts/multimodal-document-understanding	Databricks (2026-05-11)	Document as image, not text

Smart chunking is input-side preprocessing; metadata-only embedding is embedding-signal selection; whole-article retrieval is retrieval-unit selection. The three are orthogonal axes of RAG-pipeline design.

Composition with dual clustering

In Doczy.ai, smart chunking output feeds into the dual clustering engine. Chunks are analysed both semantically (embed and group similar meanings) and structurally (pattern-recognise clause types, table layouts, hierarchical depth), and the two views are projected into a unified document model that grounds the LLM prompt. Smart chunking is the substrate that makes dual clustering meaningful: clusters are formed over chunks-with-metadata, not over flat text.
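A toy version of the two views shows why chunk metadata matters as clustering input. Everything here is an assumption made for illustration: the keyword-based `semantic_label` is a stand-in for embedding-based grouping, the `(kind, depth)` key is a stand-in for structural pattern recognition, and none of it reflects the undisclosed Doczy.ai engine.

```python
from collections import defaultdict

# Hypothetical chunks-with-metadata, as a structure-aware chunker might emit.
chunks = [
    {"id": 0, "kind": "clause", "depth": 2, "text": "Payer shall pay within 30 days."},
    {"id": 1, "kind": "table", "depth": 3, "text": "Base rate per claim: 12.50"},
    {"id": 2, "kind": "clause", "depth": 2, "text": "Late payment accrues interest."},
]

# Stand-in for an embedding model: topics defined by keyword sets.
TOPIC_KEYWORDS = {
    "payment": {"pay", "payment", "payer"},
    "rates": {"rate", "rates"},
}


def semantic_label(text: str) -> str:
    """Pick the topic whose keywords overlap most with the chunk's words."""
    words = {w.strip(".,:").lower() for w in text.split()}
    best = max(TOPIC_KEYWORDS, key=lambda t: len(words & TOPIC_KEYWORDS[t]))
    return best if words & TOPIC_KEYWORDS[best] else "other"


def dual_view(chunks: list[dict]):
    """Group chunks by meaning and by structure, then join both per chunk."""
    semantic = defaultdict(list)
    structural = defaultdict(list)
    unified = {}
    for c in chunks:
        sem_key = semantic_label(c["text"])
        struct_key = (c["kind"], c["depth"])  # only available with metadata
        semantic[sem_key].append(c["id"])
        structural[struct_key].append(c["id"])
        unified[c["id"]] = {"semantic": sem_key, "structural": struct_key}
    return dict(semantic), dict(structural), unified


semantic, structural, unified = dual_view(chunks)
print(semantic)
print(structural)
print(unified[1])
```

The structural key can only be computed because each chunk carries `kind` and `depth`; over flat text, only the semantic view would be available.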

Mitigates / sibling failure modes

concepts/embedding-signal-dilution — chunking too large dilutes embedding signal; smart chunking's dynamic parameters and semantic decomposition prevent this on the embedding side.
Lost-context failure of fixed-window chunking — smart chunking's hierarchy preservation is the structural fix.
Overlap duplication of sliding-window chunking — smart chunking's overlap detection collapses these.
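Overlap collapse for sliding-window output can be sketched as a suffix/prefix comparison between consecutive chunks. This is a minimal illustration, assuming whitespace tokens and exact-match overlaps; the function names are hypothetical and the source does not describe how Doczy.ai detects overlaps.

```python
def overlap_len(left: list[str], right: list[str]) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left[-size:] == right[:size]:
            return size
    return 0


def collapse_overlaps(chunks: list[str]) -> list[str]:
    """Drop from each chunk the tokens it repeats from the end of the previous one."""
    result = []
    previous: list[str] = []
    for chunk in chunks:
        tokens = chunk.split()
        keep = tokens[overlap_len(previous, tokens):]
        if keep:  # a chunk fully covered by its predecessor disappears
            result.append(" ".join(keep))
        previous = tokens
    return result


# Sliding-window chunks with a two-token overlap (hypothetical text).
windowed = [
    "Payer shall pay within",
    "pay within 30 days of",
    "days of invoice receipt",
]
deduplicated = collapse_overlaps(windowed)
print(deduplicated)
print(" ".join(deduplicated))
```

Exact matching keeps the sketch simple, but it would also trim text that legitimately repeats across a chunk boundary, so a production approach would need a more careful overlap test.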

When to apply

Document-intelligence pipelines on contracts, regulatory filings, legal agreements, vendor agreements, healthcare-claims documents — anywhere structure is load-bearing for downstream extraction.
Production scale of ≥ 100K documents per period, where the accuracy-vs-volume curve makes engineering investment in chunker quality worthwhile.

When not to apply

Short, flat documents (single-paragraph tickets, social posts) — hierarchy is absent or trivial.
Pure semantic-search use cases over flat text — naive chunking + metadata-only embedding may suffice.
Multimodal-direct pipelines that feed document images to a multimodal LLM (patterns/visual-first-document-extraction) — the LLM sees layout directly; pre-chunking is unnecessary.

Mechanism caveat

Smart chunking's mechanism (algorithm class, dynamic-parameter tuning, sequential-identifier scheme, projection to field groups) is AArete IP / patented and is not disclosed in any greater depth in the AWS Architecture Blog post. The wiki captures the disclosed properties; implementation internals are out of scope unless and until AArete (or another team replicating the approach) publishes more detail.

Seen in

sources/2026-06-02-aws-automating-contract-intelligence-with-doczyai-on-aws — first canonical wiki disclosure of the term "smart chunking" and the document-structure-preserving chunking concept; 99% extraction accuracy on contracts, with smart chunking + dual clustering named as the load-bearing accuracy contributors.

concepts/dual-clustering-document-intelligence — downstream consumer of smart-chunking output
concepts/embedding-signal-dilution — sibling chunking-failure concept
concepts/metadata-only-embedding — sibling chunking-strategy concept
concepts/multimodal-document-understanding — alternative approach (skip OCR + chunking by feeding images to multimodal LLM)
systems/doczy-ai
systems/amazon-textract — typical upstream OCR provider
patterns/managed-ai-document-intelligence-pipeline-on-aws
patterns/visual-first-document-extraction — alternative pipeline shape