
CONCEPT Cited by 1 source

Smart chunking

Smart chunking is a document-structure-preserving chunking strategy for LLM-based document intelligence — chunks are constructed to retain hierarchy and one-to-many relationships rather than being produced from a flat token-window slide over the document's text.

The term is canonicalised on the wiki via AArete's Doczy.ai disclosure (sources/2026-06-02-aws-automating-contract-intelligence-with-doczyai-on-aws), which describes it as "a proprietary approach that goes far beyond pulling words off a page. Rather than treating a document as a flat sequence of text, smart chunking preserves hierarchical structure and one-to-many relationships within documents. It uses a combination of semantic and keyword search to decompose text into meaningful, context-aware chunks, applying dynamic parameters to maintain logical relationships throughout. Sequential identifiers and metadata-driven grouping organize these chunks into field groups, detecting overlaps and removing duplications while keeping the document's natural flow intact."

What "smart" means here

Naive chunking strategies for LLM input:

| Strategy | Mechanism | Failure mode |
|---|---|---|
| Fixed-token-window | Slide a 4K / 8K-token window over text | Cuts mid-clause; loses hierarchy |
| Sentence-split | Split on sentence boundaries | Loses paragraph + section context |
| Paragraph-split | Split on \n\n | Loses table / list structure |
| Page-split | One chunk per OCR page | Page boundaries are layout, not semantic |
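
To make the first failure mode concrete, here is a minimal sketch of a fixed-window chunker. It counts whitespace-separated words rather than real model tokens, and the function name, window size, and sample contract text are all illustrative assumptions, not anything taken from the source.

```python
def fixed_window_chunks(text: str, window: int, overlap: int = 0) -> list[str]:
    """Slide a fixed-size window over whitespace-separated words.

    Assumption: words stand in for model tokens to keep the sketch
    dependency-free; a real pipeline would use the model's tokenizer.
    """
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    tokens = text.split()
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


# Hypothetical two-clause contract snippet.
contract = (
    "4.1 Payment Terms. Provider shall be reimbursed at 80% of billed charges. "
    "4.2 Exceptions. Emergency services are reimbursed at 100%."
)
chunks = fixed_window_chunks(contract, window=8)
for chunk in chunks:
    print(repr(chunk))
```

With an 8-word window the first chunk stops at "reimbursed at", so the rate lands in the next chunk, which also mixes the tail of clause 4.1 with the start of clause 4.2. Nothing in the output records which section a chunk came from.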

Smart chunking adds structural awareness on top of these — the chunk boundaries are placed where the document's own structure suggests, not where the token window happens to land. The disclosed properties:

  • Hierarchy preservation — parent / child relationships (clause → sub-clause → exhibit → schedule) survive into chunk metadata.
  • One-to-many relationship preservation — a parent chunk that references multiple child chunks retains those references.
  • Semantic + keyword decomposition — both an embedding similarity signal and a keyword-match signal contribute to where chunk boundaries land.
  • Dynamic parameters — chunk size / overlap / boundary criteria are not fixed constants but adapt to the document being chunked.
  • Sequential identifiers + metadata-driven grouping — chunks carry IDs that let downstream consumers reconstruct the document's logical structure from a flat list of chunks.
  • Overlap detection + duplication removal — overlapping chunks (introduced by sliding-window approaches) are detected and collapsed.
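
Since the actual mechanism is undisclosed, the following is only a sketch of what chunks carrying sequential identifiers and parent / child references could look like. It assumes clauses are introduced by dotted numbers such as "4.1.2" and places boundaries there; the `Chunk` fields, the `HEADING` pattern, and `chunk_by_numbering` are hypothetical names, not Doczy.ai's scheme.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Chunk:
    seq_id: int                      # sequential identifier, in document order
    label: str                       # dotted clause number, e.g. "4.1.2"
    text: str
    parent_id: Optional[int] = None  # seq_id of the enclosing clause, if any
    child_ids: list[int] = field(default_factory=list)  # one-to-many links

    @property
    def depth(self) -> int:
        return self.label.count(".") + 1


# Assumption: a chunk starts at a line beginning with a dotted number.
# A continuation line that happens to start with a number would be misread.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\s+(.*)$")


def chunk_by_numbering(lines: list[str]) -> list[Chunk]:
    chunks: list[Chunk] = []
    by_label: dict[str, Chunk] = {}
    for raw in lines:
        line = raw.strip()
        match = HEADING.match(line)
        if match is None:
            if chunks and line:      # continuation of the current clause
                chunks[-1].text += " " + line
            continue
        label, text = match.groups()
        chunk = Chunk(seq_id=len(chunks), label=label, text=text)
        # "4.1.2" -> parent label "4.1"; a top-level "4" has no parent.
        parent = by_label.get(label.rpartition(".")[0])
        if parent is not None:
            chunk.parent_id = parent.seq_id
            parent.child_ids.append(chunk.seq_id)
        by_label[label] = chunk
        chunks.append(chunk)
    return chunks


lines = [
    "4 Reimbursement",
    "4.1 Rates",
    "4.1.1 Inpatient services are paid at the per-diem rate.",
    "4.1.2 Outpatient services are paid at the fee schedule.",
    "4.2 Exceptions apply to emergency services.",
]
chunks = chunk_by_numbering(lines)
for c in chunks:
    print(c.seq_id, c.label, "depth", c.depth, "parent", c.parent_id, "children", c.child_ids)
```

The result is still a flat list, but each chunk's `parent_id` and `child_ids` let a downstream consumer rebuild the clause tree, which is the property the disclosure attributes to sequential identifiers and metadata-driven grouping.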

Why structure-preservation matters for LLM extraction

A contract's meaning often depends on where a clause lives: "a three-nested-level exhibit carries fundamentally different implications than a straightforward attached schedule" (source). A flat-text chunker hands the LLM a clause without that nesting context, so the LLM has to reconstruct it from text cues alone, or get it wrong. A smart chunker hands the LLM the clause plus metadata stating "this lives at exhibit-level depth 3 under schedule X under section Y", so the LLM can ground its extraction in the document's actual structure.
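
As an illustration of handing the LLM "the clause plus metadata", the sketch below walks parent links to build a location header and prepends it to the clause text. The `Node` type, the header format, and the sample section / schedule / exhibit titles are assumptions made for this example; the source does not describe how Doczy.ai serialises structure into a prompt.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    node_id: str
    title: str
    text: str
    parent_id: Optional[str] = None


def structural_path(nodes: dict[str, Node], node_id: str) -> list[str]:
    """Titles from the document root down to the given node."""
    path: list[str] = []
    seen: set[str] = set()
    current: Optional[str] = node_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in parent links at {current!r}")
        seen.add(current)
        node = nodes[current]
        path.append(node.title)
        current = node.parent_id
    return list(reversed(path))


def render_for_prompt(nodes: dict[str, Node], node_id: str) -> str:
    """Prefix the clause text with its position in the document hierarchy."""
    path = structural_path(nodes, node_id)
    header = f"[location: {' > '.join(path)} | depth: {len(path)}]"
    return f"{header}\n{nodes[node_id].text}"


# Hypothetical three-level nesting: section -> schedule -> exhibit.
nodes = {
    "sec": Node("sec", "Section 7", "Compensation."),
    "sch": Node("sch", "Schedule B", "Fee schedule.", parent_id="sec"),
    "exh": Node("exh", "Exhibit B-2", "Rates in this exhibit override Schedule B.", parent_id="sch"),
}
prompt_chunk = render_for_prompt(nodes, "exh")
print(prompt_chunk)
```

The flat-text alternative would pass only the last line; with the header, the nesting is stated rather than inferred.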

This is the structural argument behind the accuracy step Doczy.ai discloses on the same contract corpus, from 55% with a rules-based approach to 99% with an LLM and smart chunking: rules can't reason about structural nesting, and a naive-chunked LLM has to guess at it, whereas a smart-chunked LLM receives it as input.

Sibling chunking strategies on the wiki

| Strategy | Source | What it preserves |
|---|---|---|
| Smart chunking | Doczy.ai (2026-06-02) | Hierarchy + one-to-many across chunks |
| concepts/metadata-only-embedding | Yelp CS Chatbot (2026-05-27) | Metadata as embedding signal; whole article as retrieval unit |
| concepts/whole-article-retrieval | Yelp (2026-05-27) | Article boundary as retrieval unit |
| concepts/intelligent-document-sampling | Databricks Unlocking Archives (2026-05-11) | Sample from large doc collection |
| concepts/multimodal-document-understanding | Databricks (2026-05-11) | Document as image, not text |

Smart chunking is input-side preprocessing; metadata-only embedding is embedding-signal selection; whole-article retrieval is retrieval-unit selection. The three are orthogonal axes of RAG-pipeline design.

Composition with dual clustering

In Doczy.ai, smart chunking output feeds into the dual clustering engine — chunks are then analysed both semantically (embed and group similar meanings) and structurally (pattern-recognise clause types, table layouts, hierarchical depth), and the two views are projected into a unified document model that grounds the LLM prompt. Smart chunking is the substrate that makes dual clustering meaningful — clusters are over chunks-with-metadata, not over flat text.
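
The source names the two views but not how they are computed, so the following is a toy sketch only. Word-set Jaccard similarity stands in for embeddings, a `(kind, depth)` key stands in for structural pattern recognition, and the "unified model" is simply each chunk annotated with both group ids. All function names, field names, and the threshold are assumptions.

```python
def tokens(text: str) -> set[str]:
    return {w.strip(".,;:()").lower() for w in text.split()} - {""}


def jaccard(a: set[str], b: set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def semantic_groups(chunks: list[dict], threshold: float = 0.5) -> dict[int, int]:
    """Greedy grouping: join the first group whose seed chunk is similar enough.

    Assumption: word overlap is a crude stand-in for embedding similarity.
    """
    seeds: list[set[str]] = []
    assignment: dict[int, int] = {}
    for chunk in chunks:
        toks = tokens(chunk["text"])
        for group_id, seed in enumerate(seeds):
            if jaccard(toks, seed) >= threshold:
                assignment[chunk["id"]] = group_id
                break
        else:  # no existing group matched: start a new one
            seeds.append(toks)
            assignment[chunk["id"]] = len(seeds) - 1
    return assignment


def structural_groups(chunks: list[dict]) -> dict[int, int]:
    """Group chunks that share the same (kind, depth) metadata."""
    keys: dict[tuple[str, int], int] = {}
    return {c["id"]: keys.setdefault((c["kind"], c["depth"]), len(keys)) for c in chunks}


def unified_model(chunks: list[dict]) -> list[dict]:
    sem, struct = semantic_groups(chunks), structural_groups(chunks)
    return [
        {**c, "semantic_group": sem[c["id"]], "structural_group": struct[c["id"]]}
        for c in chunks
    ]


# Hypothetical chunks-with-metadata, as a structure-aware chunker might emit them.
chunks = [
    {"id": 0, "kind": "clause", "depth": 2, "text": "Inpatient services are reimbursed at 80% of billed charges"},
    {"id": 1, "kind": "clause", "depth": 3, "text": "Outpatient services are reimbursed at 70% of billed charges"},
    {"id": 2, "kind": "table", "depth": 2, "text": "Fee schedule effective January 1"},
]
model = unified_model(chunks)
for row in model:
    print(row["id"], row["semantic_group"], row["structural_group"])
```

In this example the two reimbursement clauses fall into the same semantic group but different structural groups because they sit at different depths. The structural view has nothing to work with unless the chunker preserved `kind` and `depth`.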

Mitigates / sibling failure modes

  • concepts/embedding-signal-dilution: chunks that are too large dilute the embedding signal; smart chunking's dynamic parameters and semantic decomposition mitigate this on the embedding side.
  • Lost-context failure of fixed-window chunking — smart chunking's hierarchy preservation is the structural fix.
  • Overlap duplication of sliding-window chunking — smart chunking's overlap detection collapses these.
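
For the overlap case, one simple way to detect and collapse duplication between consecutive sliding-window chunks is to trim the longest suffix of the previous chunk that reappears as a prefix of the next. This is a generic technique offered as a sketch, not the disclosed Doczy.ai method; the function names and sample chunks are hypothetical.

```python
def overlap_length(prev: list[str], nxt: list[str]) -> int:
    """Length of the longest suffix of prev that is also a prefix of nxt."""
    for size in range(min(len(prev), len(nxt)), 0, -1):
        if prev[-size:] == nxt[:size]:
            return size
    return 0


def collapse_overlaps(chunks: list[str]) -> list[str]:
    """Drop text a chunk repeats from its predecessor; drop exact repeats entirely."""
    result: list[str] = []
    prev_tokens: list[str] = []
    for chunk in chunks:
        toks = chunk.split()
        trimmed = toks[overlap_length(prev_tokens, toks):]
        if trimmed:                  # an exact duplicate trims to nothing
            result.append(" ".join(trimmed))
        prev_tokens = toks           # compare against the untrimmed original
    return result


# Hypothetical overlapping windows, including one exact duplicate.
windows = [
    "Provider shall be reimbursed at",
    "reimbursed at 80% of billed",
    "reimbursed at 80% of billed",
    "of billed charges.",
]
deduped = collapse_overlaps(windows)
print(deduped)
print(" ".join(deduped))
```

A caveat with this approach: a phrase that legitimately repeats at a chunk boundary is indistinguishable from window overlap, so a production version would likely track source offsets rather than compare text.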

When to apply

  • Document-intelligence pipelines on contracts, regulatory filings, legal agreements, vendor agreements, healthcare-claims documents — anywhere structure is load-bearing for downstream extraction.
  • Production scale ≥ 100 K documents/period where the accuracy-vs-volume curve makes engineering investment in chunker quality worthwhile.

When not to apply

  • Short, flat documents (single-paragraph tickets, social posts) — hierarchy is absent or trivial.
  • Pure semantic-search use cases over flat text — naive chunking + metadata-only embedding may suffice.
  • Multimodal-direct pipelines that feed document images to a multimodal LLM (patterns/visual-first-document-extraction) — the LLM sees layout directly; pre-chunking is unnecessary.

Mechanism caveat

Smart chunking's mechanism (algorithm class, dynamic-parameter tuning, sequential-identifier scheme, projection to field groups) is AArete IP / patented and is not disclosed at any greater depth in the AWS Architecture Blog post. The wiki captures the disclosed properties; implementation internals are out of scope unless and until AArete (or another team replicating the approach) publishes more detail.

Seen in
