CONCEPT Cited by 1 source

Dual clustering for document intelligence¶

Dual clustering is a two-lens document analysis pattern in which every document is clustered simultaneously along two complementary axes — semantic (meaning, via embeddings) and structural (form, via pattern recognition over clause types, formatting, and hierarchy) — and the two clusterings are then projected against each other to produce a unified, enriched document model that grounds downstream LLM extraction.

Canonicalised on the wiki via AArete's Doczy.ai disclosure (sources/2026-06-02-aws-automating-contract-intelligence-with-doczyai-on-aws), which positions the dual-clustering engine as the load-bearing contributor to Doczy.ai's 99% extraction accuracy: "It's this convergence that drives the 99% accuracy rate of Doczy.ai. The system doesn't just read the words, it understands the contract."

The two lenses¶

Lens	Mechanism	What it captures	What it misses without the other
Semantic	Embeddings + similarity grouping	Meaning, paraphrase equivalence, conceptual relatedness across surface variation	Structural distinction (a paraphrased clause at depth 3 looks identical to the same paraphrase at depth 1)
Structural	Pattern-recognition algorithms over clause types, table layouts, hierarchical organisation	Form, layout, depth, nesting, formatting conventions	Semantic equivalence (two clauses with different wording but same meaning look distinct)
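The two lenses in the table can be illustrated with a minimal sketch. It assumes hand-written toy vectors in place of a real embedding model and a simple (kind, depth) signature in place of the undisclosed pattern-recognition algorithms. The function names, the greedy grouping strategy, and the 0.9 threshold are hypothetical and are not taken from the source.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_clusters(vectors, threshold=0.9):
    """Greedy grouping: join the first cluster whose seed vector is similar enough."""
    seeds, labels = [], []
    for vec in vectors:
        for cluster_id, seed in enumerate(seeds):
            if cosine(vec, seed) >= threshold:
                labels.append(cluster_id)
                break
        else:  # no existing cluster matched, so start a new one
            seeds.append(vec)
            labels.append(len(seeds) - 1)
    return labels


def structural_clusters(signatures):
    """Group chunks that share the same (kind, depth) signature."""
    ids = {}
    return [ids.setdefault(sig, len(ids)) for sig in signatures]


# Toy chunks: (text, assumed embedding, kind, nesting depth).
chunks = [
    ("Supplier shall pay liquidated damages.", [0.90, 0.10, 0.00], "clause", 1),
    ("Vendor owes pre-agreed damages for delay.", [0.88, 0.15, 0.02], "clause", 3),
    ("Fee schedule", [0.00, 0.20, 0.95], "table", 1),
]

semantic = semantic_clusters([c[1] for c in chunks])
structural = structural_clusters([(c[2], c[3]) for c in chunks])
print(semantic)    # the two damages clauses share a semantic cluster
print(structural)  # but they differ structurally because of depth
```

In this toy data the two paraphrased damages clauses land in the same semantic cluster but in different structural clusters, which is the distinction the "What it misses" column describes.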

Verbatim disclosure (sources/2026-06-02-aws-automating-contract-intelligence-with-doczyai-on-aws):

"On the semantic side, extracted text is converted into embeddings, numerical representations of meaning, and similar ideas are grouped together even when they're expressed in different words. On the structural side, pattern-recognition algorithms identify clause types, formatting conventions, table layouts, and hierarchical organization, understanding, for example, that a three-nested-level exhibit carries fundamentally different implications than a straightforward attached schedule."

The projection step¶

The two clusterings don't operate in isolation. Projection algorithms compare them side-by-side and synthesise them into a unified document model:

"These two analyses don't operate in isolation. Projection algorithms compare the semantic and structural clusters side by side, synthesizing them into a unified, enriched document model that captures both meaning and context." (source)

The exact projection mechanism (in what space, against what loss / scoring function, with what alignment) is AArete IP and not disclosed.
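Because the real mechanism is undisclosed, the following is only an illustrative stand-in, not AArete's method. It assumes that "comparing side by side" can be approximated by cross-tabulating the two label sets, so that each chunk receives a joint (semantic, structural) key. The function name `project` and its return shape are hypothetical.

```python
from collections import Counter, defaultdict


def project(semantic_labels, structural_labels):
    """Cross-tabulate two clusterings of the same chunks (hypothetical stand-in)."""
    if len(semantic_labels) != len(structural_labels):
        raise ValueError("both clusterings must label the same chunks")

    # Joint key per chunk: which meaning group and which form group it belongs to.
    joint = list(zip(semantic_labels, structural_labels))

    # Contingency counts show where the two lenses agree or diverge.
    contingency = Counter(joint)

    # Semantic clusters spread over several structural clusters are the cases
    # where position or form may change how the same meaning should be read.
    structures_per_meaning = defaultdict(set)
    for sem, struct in joint:
        structures_per_meaning[sem].add(struct)
    split = sorted(s for s, ts in structures_per_meaning.items() if len(ts) > 1)

    return joint, contingency, split


joint, contingency, split = project([0, 0, 1], [0, 1, 2])
print(joint)
print(split)
```

A semantic cluster listed in `split` holds chunks that mean the same thing but sit in structurally different places, which is exactly the information a single lens cannot surface.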

Why two views beat one¶

Each lens covers a failure mode of the other:

Semantic-only fails on documents where structure carries meaning — a clause about "liquidated damages" at exhibit-level depth 3 has different legal implications from the same clause at the top of the contract.
Structural-only fails on documents where the same legal meaning is expressed in different ways across versions or authors — paraphrase equivalence is invisible to pure layout analysis.

Together, they let the LLM reason over both what the clause means and where it lives in the contract. This is the structural argument for why dual clustering pushes accuracy from ~55% (rules-based, structural-only) to 99% (LLM with dual-clustered grounding).

Sibling multi-view fusion concepts¶

Dual clustering sits in a family of multi-view fusion concepts on the wiki that all share the shape "multiple complementary views, projected/fused at query/extraction time, where each view covers another's blind spots":

Concept	Source	Domain	Views fused
Dual clustering for document intelligence	Doczy.ai (2026-06-02)	Documents	Semantic + structural
concepts/multi-source-topology-fusion	Netflix Service Topology (2026-05-29)	Service dependency graphs	eBPF flows + IPC metrics + traces
concepts/index-as-model	Meta SilverTorch (2026-05-26)	Recsys retrieval	Index components fused into one model graph
patterns/three-layer-graph-merge-on-query	Netflix Service Topology	Graph storage	Three physically separate graphs merged at query time

The pattern is portable; the specific lens choice is domain-dependent.

Composition with smart chunking¶

In Doczy.ai, dual clustering operates on the output of smart chunking — the chunks (with hierarchy and one-to-many metadata preserved) are what get clustered. The semantic lens embeds the chunk text; the structural lens analyses the chunk's metadata + layout signals. Without smart chunking, the structural lens has nothing to look at (flat text loses structure); without dual clustering, smart chunking's metadata preservation isn't put to use.
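A minimal sketch of what such a chunk might look like, assuming a heading trail for hierarchy and a tuple of cross-references for the one-to-many metadata. The `Chunk` type, its field names, and the two helper functions are hypothetical; the source does not describe Doczy.ai's actual chunk schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical smart-chunking output: text plus preserved structure."""
    text: str
    path: tuple[str, ...]              # heading trail from the document root
    kind: str = "clause"               # e.g. "clause", "table", "schedule"
    applies_to: tuple[str, ...] = ()   # one-to-many links to other sections

    @property
    def depth(self) -> int:
        # Nesting depth is the length of the heading trail.
        return len(self.path)


def semantic_input(chunk: Chunk) -> str:
    """What the semantic lens sees: only the text to embed."""
    return chunk.text


def structural_input(chunk: Chunk) -> tuple[str, int, int]:
    """What the structural lens sees: form signals, never the wording."""
    return (chunk.kind, chunk.depth, len(chunk.applies_to))


rates = Chunk(
    text="Hourly rates apply to all change orders.",
    path=("Exhibit B", "Schedule 2", "Rates"),
    applies_to=("Section 4.1", "Section 7.3"),
)
print(semantic_input(rates))
print(structural_input(rates))
```

If the chunker emitted flat strings instead, `structural_input` would have nothing to compute, which is the dependency described above.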

Where the unified model goes¶

The output of the projection step is "a unified, enriched document model that captures both meaning and context." In Doczy.ai's pipeline this becomes the grounding context for the LLM prompt — the file class is detected, a domain-tuned prompt is built, and the LLM does structured-output extraction grounded in the dual-clustered representation.
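As a rough illustration of that last step, the sketch below assembles a grounding prompt from chunks that already carry both cluster labels. The prompt wording, the dictionary keys, and the function name are assumptions made for this example; the source does not publish Doczy.ai's prompt format, and no LLM call is shown.

```python
def build_grounded_prompt(file_class, enriched_chunks, fields):
    """Assemble a prompt whose context carries both lenses (hypothetical format)."""
    lines = [
        f"Document class: {file_class}",
        "Extract the following fields as JSON: " + ", ".join(fields),
        "Context (each chunk tagged with its semantic and structural cluster):",
    ]
    for chunk in enriched_chunks:
        lines.append(
            f"- [semantic={chunk['semantic']} structural={chunk['structural']} "
            f"depth={chunk['depth']}] {chunk['text']}"
        )
    return "\n".join(lines)


prompt = build_grounded_prompt(
    file_class="vendor agreement",
    enriched_chunks=[
        {"semantic": 0, "structural": 0, "depth": 1,
         "text": "Supplier shall pay liquidated damages."},
        {"semantic": 0, "structural": 1, "depth": 3,
         "text": "Vendor owes pre-agreed damages for delay."},
    ],
    fields=["liquidated_damages_clause", "governing_exhibit"],
)
print(prompt)
```

Tagging each chunk with both labels lets the model see that two similarly worded clauses sit at different depths, rather than treating them as duplicates.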

"Advanced large language models (LLMs) then generate structured output grounded in this dual-clustered intelligence." ( source)

When to apply¶

Documents where structure carries meaning independent of text content: contracts, legal agreements, regulatory filings, leases, bond indentures, insurance policies, vendor agreements, healthcare-claims documents.
LLM-based extraction targets where naive prompt-based extraction hits a quality ceiling; in Doczy.ai's case the dual-clustered grounding is credited with closing the gap to the reported 99% accuracy.

When not to apply¶

Flat-text documents where structure is absent (chat transcripts, social posts, simple forms).
Pure semantic-search use cases where retrieval quality is the goal, not extraction accuracy.
Cost-sensitive pipelines where the dual-clustering compute overhead isn't justified by the accuracy gain.

Risks and caveats¶

Mechanism opacity. AArete's projection algorithms, embedding model choice, and pattern-recognition algorithm class are not disclosed publicly.
Two failure modes to debug. When extraction fails, attribution between "semantic cluster missed it" and "structural cluster missed it" requires per-lens introspection.
Cost. Two clusterings plus a projection step cost more than single-clustering or no-clustering pipelines, so the overhead is only justified on production-scale workloads with accuracy-sensitive downstream consumers.
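The per-lens introspection mentioned under "Two failure modes to debug" could look something like the sketch below. It assumes a pipeline that records, for each extraction, which semantic and structural clusters were selected into the grounding context; that bookkeeping, the function name, and the four return values are hypothetical.

```python
def attribute_miss(chunk_id, semantic_labels, structural_labels,
                   selected_semantic, selected_structural):
    """Say which lens failed to surface a chunk the extraction needed."""
    semantic_hit = semantic_labels[chunk_id] in selected_semantic
    structural_hit = structural_labels[chunk_id] in selected_structural

    if semantic_hit and structural_hit:
        # Both lenses surfaced the chunk, so look at the prompt or the LLM instead.
        return "downstream"
    if semantic_hit:
        return "structural"
    if structural_hit:
        return "semantic"
    return "both"


# Chunk 1 was needed but missed; only clusters 0 of each lens were selected.
verdict = attribute_miss(
    chunk_id=1,
    semantic_labels=[0, 0, 1],
    structural_labels=[0, 1, 2],
    selected_semantic={0},
    selected_structural={0},
)
print(verdict)
```

Here the chunk's semantic cluster was selected but its structural cluster was not, so the miss is attributed to the structural lens.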

Seen in¶

sources/2026-06-02-aws-automating-contract-intelligence-with-doczyai-on-aws — canonical wiki disclosure as the load-bearing accuracy contributor for Doczy.ai's 99% production accuracy on contract extraction; named as the "two-lens methodology" with explicit projection-algorithms framing.

Related¶

concepts/smart-chunking — upstream substrate that produces the chunks dual clustering operates on
concepts/multi-source-topology-fusion — sibling multi-view fusion at the service-graph altitude
concepts/index-as-model — sibling multi-component fusion at the recsys-retrieval altitude
concepts/embedding-signal-dilution — failure mode the semantic lens helps avoid via projection-against-structural
systems/doczy-ai
patterns/managed-ai-document-intelligence-pipeline-on-aws