CONCEPT Cited by 1 source

Dual clustering for document intelligence

Dual clustering is a two-lens document analysis pattern in which every document is clustered simultaneously along two complementary axes — semantic (meaning, via embeddings) and structural (form, via pattern recognition over clause types, formatting, and hierarchy) — and the two clusterings are then projected against each other to produce a unified, enriched document model that grounds downstream LLM extraction.

Canonicalised on the wiki via AArete's Doczy.ai disclosure (sources/2026-06-02-aws-automating-contract-intelligence-with-doczyai-on-aws), which positions the dual-clustering engine as the load-bearing contributor to Doczy.ai's 99% extraction accuracy: "It's this convergence that drives the 99% accuracy rate of Doczy.ai. The system doesn't just read the words, it understands the contract."

The two lenses

| Lens | Mechanism | What it captures | What it misses without the other |
| --- | --- | --- | --- |
| Semantic | Embeddings + similarity grouping | Meaning, paraphrase equivalence, conceptual relatedness across surface variation | Structural distinction (a paraphrased clause at depth 3 looks identical to the same paraphrase at depth 1) |
| Structural | Pattern-recognition algorithms over clause types, table layouts, hierarchical organisation | Form, layout, depth, nesting, formatting conventions | Semantic equivalence (two clauses with different wording but same meaning look distinct) |
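
A minimal sketch of the two lenses, under loud assumptions: token-overlap (Jaccard) similarity stands in for real embeddings, so it will not catch true paraphrases, and an exact (clause_type, depth) signature stands in for the undisclosed pattern-recognition algorithms. The `Chunk` fields, the threshold, and the greedy grouping are all hypothetical choices for illustration, not Doczy.ai's method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    clause_type: str  # hypothetical structural tag, e.g. "clause"
    depth: int        # nesting level inside the document


def tokens(text):
    return frozenset(text.lower().split())


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def semantic_labels(chunks, threshold=0.5):
    """Greedy grouping: a chunk joins the first cluster whose seed is similar enough."""
    seeds, labels = [], []
    for chunk in chunks:
        toks = tokens(chunk.text)
        for label, seed in enumerate(seeds):
            if jaccard(toks, seed) >= threshold:
                labels.append(label)
                break
        else:
            # No existing cluster matched, so this chunk seeds a new one.
            seeds.append(toks)
            labels.append(len(seeds) - 1)
    return labels


def structural_labels(chunks):
    """Chunks share a label only if their (clause_type, depth) signature matches."""
    ids = {}
    return [ids.setdefault((c.clause_type, c.depth), len(ids)) for c in chunks]


chunks = [
    Chunk("seller shall pay liquidated damages for late delivery", "clause", 1),
    Chunk("seller shall pay liquidated damages for late delivery", "clause", 3),
    Chunk("payment is due within thirty days of invoice", "clause", 1),
]
print(semantic_labels(chunks))    # [0, 0, 1]
print(structural_labels(chunks))  # [0, 1, 0]
```

The first two chunks are indistinguishable to the semantic lens but separated by the structural lens, which is the blind spot described in the last column of the table.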

Verbatim disclosure (sources/2026-06-02-aws-automating-contract-intelligence-with-doczyai-on-aws):

"On the semantic side, extracted text is converted into embeddings, numerical representations of meaning, and similar ideas are grouped together even when they're expressed in different words. On the structural side, pattern-recognition algorithms identify clause types, formatting conventions, table layouts, and hierarchical organization, understanding, for example, that a three-nested-level exhibit carries fundamentally different implications than a straightforward attached schedule."

The projection step

The two clusterings don't operate in isolation. Projection algorithms compare them side-by-side and synthesise them into a unified document model:

"These two analyses don't operate in isolation. Projection algorithms compare the semantic and structural clusters side by side, synthesizing them into a unified, enriched document model that captures both meaning and context." (source)

The exact projection mechanism (in what space, against what loss / scoring function, with what alignment) is AArete IP and not disclosed.
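
Because the real mechanism is undisclosed, the following is only one plausible reading of "compare side by side": a cross-tabulation of the two labelings, where each chunk lands in a (semantic, structural) cell. The function names and the cell representation are assumptions made for illustration.

```python
from collections import defaultdict


def project(semantic, structural):
    """Cross-tabulate two labelings of the same chunks into (semantic, structural) cells."""
    if len(semantic) != len(structural):
        raise ValueError("both labelings must cover the same chunks")
    cells = defaultdict(list)
    for index, cell in enumerate(zip(semantic, structural)):
        cells[cell].append(index)
    return dict(cells)


def split_semantic_clusters(cells):
    """Semantic clusters whose members land in more than one structural cluster."""
    seen = defaultdict(set)
    for sem, struct in cells:
        seen[sem].add(struct)
    return sorted(sem for sem, structs in seen.items() if len(structs) > 1)


semantic = [0, 0, 1]    # chunks 0 and 1 mean the same thing
structural = [0, 1, 0]  # chunks 0 and 2 share a structural position
cells = project(semantic, structural)
print(cells)                           # {(0, 0): [0], (0, 1): [1], (1, 0): [2]}
print(split_semantic_clusters(cells))  # [0]
```

Cells where one semantic cluster splits across several structural clusters are the cases a single lens would have merged, so they are a natural place to attach extra context in the unified model.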

Why two views beat one

Each lens covers a failure mode of the other:

  • Semantic-only fails on documents where structure carries meaning — a clause about "liquidated damages" at exhibit-level depth 3 has different legal implications from the same clause at the top of the contract.
  • Structural-only fails on documents where the same legal meaning is expressed in different ways across versions or authors — paraphrase equivalence is invisible to pure layout analysis.

Together, they let the LLM reason over both what the clause means and where it lives in the contract. This is the structural argument for why dual clustering pushes accuracy from ~55% (rules-based, structural-only) to 99% (LLM with dual-clustered grounding).

Sibling multi-view fusion concepts

Dual clustering sits in a family of multi-view fusion concepts on the wiki that all share the shape "multiple complementary views, projected/fused at query/extraction time, where each view covers another's blind spots":

| Concept | Source | Domain | Views fused |
| --- | --- | --- | --- |
| Dual clustering for document intelligence | Doczy.ai (2026-06-02) | Documents | Semantic + structural |
| concepts/multi-source-topology-fusion | Netflix Service Topology (2026-05-29) | Service dependency graphs | eBPF flows + IPC metrics + traces |
| concepts/index-as-model | Meta SilverTorch (2026-05-26) | Recsys retrieval | Index components fused into one model graph |
| patterns/three-layer-graph-merge-on-query | Netflix Service Topology | Graph storage | Three physically separate graphs merged at query time |

The pattern is portable; the specific lens choice is domain-dependent.

Composition with smart chunking

In Doczy.ai, dual clustering operates on the output of smart chunking — the chunks (with hierarchy and one-to-many metadata preserved) are what get clustered. The semantic lens embeds the chunk text; the structural lens analyses the chunk's metadata + layout signals. Without smart chunking, the structural lens has nothing to look at (flat text loses structure); without dual clustering, smart chunking's metadata preservation isn't put to use.
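
The dependency can be illustrated with a small sketch. It assumes, purely for illustration, that the document arrives as an outline indented two spaces per level; the `Chunk` shape and the helper names are hypothetical and are not Doczy.ai's chunker.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    path: tuple  # ancestor headings, outermost first


def chunk_outline(lines, indent=2):
    """Turn a well-formed indented outline into chunks that remember their ancestors."""
    chunks, stack = [], []
    for line in lines:
        if not line.strip():
            continue
        depth = (len(line) - len(line.lstrip(" "))) // indent
        text = line.strip()
        del stack[depth:]  # forget headings at this depth or deeper
        chunks.append(Chunk(text=text, path=tuple(stack)))
        stack.append(text)
    return chunks


def semantic_input(chunk):
    """What the semantic lens would embed."""
    return chunk.text


def structural_input(chunk):
    """What the structural lens would analyse: depth plus ancestry."""
    return (len(chunk.path), chunk.path)


outline = [
    "Master Agreement",
    "  Liquidated damages apply to late delivery",
    "  Exhibit A",
    "    Schedule 1",
    "      Liquidated damages apply to late delivery",
]
chunks = chunk_outline(outline)
flat = chunk_outline([line.strip() for line in outline])  # structure discarded

print(structural_input(chunks[1]))  # (1, ('Master Agreement',))
print(structural_input(chunks[4]))  # (3, ('Master Agreement', 'Exhibit A', 'Schedule 1'))
print({structural_input(c) for c in flat})  # {(0, ())}
```

With the hierarchy preserved, the two identical clauses get different structural inputs; once the text is flattened, every chunk looks the same to the structural lens.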

Where the unified model goes

The output of the projection step is "a unified, enriched document model that captures both meaning and context." In Doczy.ai's pipeline this becomes the grounding context for the LLM prompt — the file class is detected, a domain-tuned prompt is built, and the LLM does structured-output extraction grounded in the dual-clustered representation.
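
As a rough sketch of that hand-off, the snippet below builds a prompt from cluster-annotated chunks. Everything in it is assumed: the templates, the keyword-based `detect_file_class`, and the dictionary fields are hypothetical stand-ins, since the source does not publish its prompts or its file-class detector, and no LLM is actually called.

```python
import json

# Hypothetical per-class instructions; the source does not publish its prompts.
TEMPLATES = {
    "contract": "Extract parties, effective date, and damages terms as JSON.",
    "generic": "Extract the key fields as JSON.",
}


def detect_file_class(chunks):
    """Toy classifier: a keyword check standing in for real file-class detection."""
    text = " ".join(c["text"].lower() for c in chunks)
    return "contract" if "agreement" in text or "damages" in text else "generic"


def build_prompt(chunks):
    """Combine the class-specific instruction with cluster-annotated chunks."""
    file_class = detect_file_class(chunks)
    context = json.dumps(chunks, indent=2)
    return (
        f"{TEMPLATES[file_class]}\n"
        "Each chunk carries a semantic and a structural cluster id.\n"
        f"Chunks:\n{context}"
    )


model = [
    {"text": "Liquidated damages apply to late delivery", "semantic": 0, "structural": 1},
    {"text": "Payment is due within thirty days", "semantic": 1, "structural": 0},
]
prompt = build_prompt(model)
print(prompt)
```

The design point is that the model sees both cluster ids next to each chunk, so it can condition on where a clause sits as well as on what it says.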

"Advanced large language models (LLMs) then generate structured output grounded in this dual-clustered intelligence." (source)

When to apply

  • Documents where structure carries meaning independent of text content: contracts, legal agreements, regulatory filings, leases, bond indentures, insurance policies, vendor agreements, healthcare-claims documents.
  • LLM-based extraction targets where naive prompt-based extraction hits a quality ceiling; dual-clustered grounding is what closes the gap to 95%+ accuracy (a reported 99% in Doczy.ai's case).

When not to apply

  • Flat-text documents where structure is absent (chat transcripts, social posts, simple forms).
  • Pure semantic-search use cases where retrieval quality is the goal, not extraction accuracy.
  • Cost-sensitive pipelines where the dual-clustering compute overhead isn't justified by the accuracy gain.

Risks and caveats

  • Mechanism opacity. AArete's projection algorithms, embedding model choice, and pattern-recognition algorithm class are not disclosed publicly.
  • Two failure modes to debug. When extraction fails, attribution between "semantic cluster missed it" and "structural cluster missed it" requires per-lens introspection.
  • Cost. Two clusterings + a projection step is more expensive than single-clustering or no-clustering pipelines; only justified on production-scale workloads with accuracy-sensitive downstream consumers.
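
The per-lens introspection mentioned above can be approximated with a very small helper. This is a sketch under the assumption that a reviewer supplies pairs of chunks that should have been treated alike and that both lenses expose per-chunk cluster labels; `attribute_miss` is a hypothetical name, not part of any disclosed tooling.

```python
def attribute_miss(i, j, semantic, structural):
    """Say which lens separated two chunks that were expected to be grouped together."""
    sem_split = semantic[i] != semantic[j]
    struct_split = structural[i] != structural[j]
    if sem_split and struct_split:
        return "both"
    if sem_split:
        return "semantic"
    if struct_split:
        return "structural"
    return "neither"


semantic = [0, 0, 1]
structural = [0, 1, 0]
print(attribute_miss(0, 1, semantic, structural))  # structural
print(attribute_miss(0, 2, semantic, structural))  # semantic
print(attribute_miss(1, 2, semantic, structural))  # both
```

A result of "neither" means both lenses grouped the pair, so the failure most likely sits downstream, in the projection step or the prompt.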
