PATTERN Cited by 1 source

Orchestrated multi-agent entity resolution

The pattern

Decompose the GenAI side of an Entity Resolution (ER) pipeline into role-specialised agents that collaborate, instead of relying on a single monolithic model that tries to do everything. The decomposition observed in the Claroty CPS Library disclosure has three roles:

  1. NLP Agent: "Parse complex, mixed-format data — including protocol-derived naming strings and obscure software markers that standard models often miss." Job: convert messy free-form input into structured candidate fields.

  2. Reasoning Agent: "Apply confidence scoring and statistical tests to weigh evidence, discriminating high-fidelity signals from noise to ensure data integrity." Job: assess whether the structured candidate is reliable enough to commit, and emit a calibrated confidence.

  3. Human-in-the-loop (HITL): "A critical feedback mechanism that flags low-confidence mappings for expert to review. The output from these sessions is fed back into the system, retraining the models for continuous accuracy gains." Job: handle the cases the Reasoning Agent flags as below-threshold, and produce labelled training data that feeds the next retrain.

(Source: sources/2026-05-13-databricks-the-rosetta-stone-of-cps-clarotys-ai-powered-library)

The orchestration part is the requirement that the three roles run as a "synchronized network where specialized AI agents collaborate to interpret complex signals," not as independent services. NLP output is the Reasoning Agent's input; low-confidence Reasoning output becomes the HITL queue; HITL labels feed the retrain that updates NLP and Reasoning behaviour, which closes the loop.
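
The hand-offs between the three roles can be sketched as a small Python loop. This is a minimal illustration under stated assumptions, not Claroty's implementation: the names `Candidate`, `PipelineState`, `parse`, `score`, `run_pipeline` and `record_review`, the 0.8 threshold, and the toy parsing and scoring rules are all hypothetical stand-ins for the model-backed agents.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """Structured output of the (hypothetical) NLP agent."""
    raw: str
    fields: dict[str, str]
    confidence: float = 0.0  # filled in later by the Reasoning agent


@dataclass
class PipelineState:
    committed: list[Candidate] = field(default_factory=list)
    hitl_queue: list[Candidate] = field(default_factory=list)
    labels: list[tuple[str, dict[str, str]]] = field(default_factory=list)


def parse(raw: str) -> Candidate:
    # Stand-in for the NLP agent: split "vendor/model" strings.
    vendor, _, model = raw.partition("/")
    return Candidate(raw=raw, fields={"vendor": vendor.strip(), "model": model.strip()})


def score(candidate: Candidate) -> Candidate:
    # Stand-in for the Reasoning agent: share of non-empty fields.
    filled = sum(1 for value in candidate.fields.values() if value)
    candidate.confidence = filled / len(candidate.fields)
    return candidate


def run_pipeline(inputs: list[str], state: PipelineState, threshold: float = 0.8) -> None:
    for raw in inputs:
        candidate = score(parse(raw))  # NLP output is the Reasoning input
        if candidate.confidence >= threshold:
            state.committed.append(candidate)
        else:
            state.hitl_queue.append(candidate)  # low confidence feeds the HITL queue


def record_review(state: PipelineState, candidate: Candidate, corrected: dict[str, str]) -> None:
    # An expert correction resolves the case and is kept as a training label.
    state.hitl_queue.remove(candidate)
    candidate.fields, candidate.confidence = corrected, 1.0
    state.committed.append(candidate)
    state.labels.append((candidate.raw, corrected))


state = PipelineState()
run_pipeline(["Acme/PLC-100", "unknown-device"], state)
record_review(state, state.hitl_queue[0], {"vendor": "Acme", "model": "RTU-7"})
```

In this sketch `state.labels` plays the part of the substrate the next retrain would read from; the retraining step itself is left out because the source does not describe it.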

Why role-decompose at all

The Claroty source frames the alternative explicitly:

"Rather than relying on a single monolithic model, Claroty engineered an Orchestrated Multi-Agent System..."

Two structural reasons the role-decomposed shape can outperform a single end-to-end LLM on ER:

  • Different cognitive shapes. Parsing messy strings ("1769-L36ERMS/B" → "Compact GuardLogix 5370 controller, firmware variant L3") is a pattern-recognition task. Confidence scoring is a calibration task. Both are LLM capabilities, but a single prompt that asks for both tends to optimise one at the expense of the other.
  • Specialisation supports targeted iteration. The NLP agent can be fine-tuned on protocol-string corpora without affecting the Reasoning agent's calibration; the Reasoning agent can be re-prompted as the confidence-threshold policy changes without re-training the NLP agent.
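
The targeted-iteration point can be illustrated with two narrow interfaces, so that one role is swapped while the other stays fixed. This is a hedged sketch: the `Parser` and `Scorer` protocols, the three toy classes, and the `ACME-PLC100/B` input are hypothetical and only stand in for the model-backed agents.

```python
from typing import Protocol


class Parser(Protocol):
    def parse(self, raw: str) -> dict[str, str]: ...


class Scorer(Protocol):
    def score(self, fields: dict[str, str]) -> float: ...


class WholeStringParser:
    """Baseline stand-in: treats the whole input as the model name."""

    def parse(self, raw: str) -> dict[str, str]:
        return {"model": raw.strip(), "suffix": ""}


class SuffixAwareParser:
    """Stand-in for a parser specialised on strings shaped like 'MODEL/SUFFIX'."""

    def parse(self, raw: str) -> dict[str, str]:
        model, _, suffix = raw.strip().partition("/")
        return {"model": model, "suffix": suffix}


class CompletenessScorer:
    """Stand-in scorer: share of fields that are non-empty."""

    def score(self, fields: dict[str, str]) -> float:
        return sum(1 for value in fields.values() if value) / len(fields)


def resolve(raw: str, parser: Parser, scorer: Scorer) -> tuple[dict[str, str], float]:
    fields = parser.parse(raw)
    return fields, scorer.score(fields)


scorer = CompletenessScorer()  # unchanged while the parser is swapped
before = resolve("ACME-PLC100/B", WholeStringParser(), scorer)
after = resolve("ACME-PLC100/B", SuffixAwareParser(), scorer)
```

Because `resolve` depends only on the two protocols, replacing the parser does not touch the scorer, which mirrors the claim that each role can be iterated on independently.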

Cousin patterns: patterns/specialized-agent-decomposition (the general shape from Vercel's v0 / multi-modal agents), patterns/llm-per-subagent-with-optimized-prompts (Genie's Multi-LLM design), concepts/multi-llm-sub-agent-routing (the general concept).

Composition

                ┌─────────────────────────────┐
                │  NLP Agent                   │
                │  - parse mixed-format input  │
                │  - emit structured candidate │
                │  - emit per-field confidence │
                └─────────────────────────────┘
                              │ structured candidate
                              ▼
                ┌─────────────────────────────┐
                │  Reasoning Agent             │
                │  - apply statistical tests   │
                │  - aggregate confidence      │
                │  - threshold decision        │
                └─────────────────────────────┘
                  │ above-threshold        │ below-threshold
                  ▼                        ▼
        ┌──────────────────┐   ┌──────────────────────────┐
        │ Classical ER     │   │ Human-in-the-loop (HITL) │
        │ deterministic    │   │ - SME review interface   │
        │ canonicalisation │   │ - corrections to Lakebase│
        └──────────────────┘   │ - feedback to retrain    │
                  │            └──────────────────────────┘
                  ▼                        │
        ┌──────────────────┐               │
        │ Canonical entity │ ◀─────────────┘
        │ (CPS-ID + audit) │   (closed loop: HITL labels
        └──────────────────┘    update NLP + Reasoning
                                 weights)
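
The "aggregate confidence" and "threshold decision" steps in the diagram can be sketched as follows. The source does not say how per-field confidences are combined, so the geometric mean used here is an assumption (chosen because one weak field pulls the aggregate down), and the `Route` enum, function names, and 0.85 threshold are hypothetical.

```python
import math
from enum import Enum


class Route(Enum):
    CLASSICAL_ER = "classical_er"
    HITL = "hitl"


def aggregate_confidence(per_field: dict[str, float]) -> float:
    """Geometric mean of per-field confidences (an assumed aggregation rule)."""
    if not per_field or any(p <= 0.0 for p in per_field.values()):
        return 0.0
    log_sum = sum(math.log(p) for p in per_field.values())
    return math.exp(log_sum / len(per_field))


def route(per_field: dict[str, float], threshold: float = 0.85) -> Route:
    """Above-threshold candidates go to classical ER, the rest to HITL review."""
    if aggregate_confidence(per_field) >= threshold:
        return Route.CLASSICAL_ER
    return Route.HITL


strong = {"vendor": 0.97, "model": 0.93, "firmware": 0.90}
weak = {"vendor": 0.97, "model": 0.93, "firmware": 0.40}
print(route(strong), route(weak))
```

With these illustrative numbers the first candidate clears the threshold and the second is routed to HITL because of its single low-confidence field.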

Required substrate properties

  • Calibrated confidence. The Reasoning agent must produce a confidence score that is well correlated with actual match accuracy. Uncalibrated confidence makes the HITL threshold arbitrary.
  • HITL UI tightly coupled to the canonical store. The expert's correction must land in the same operational store that the canonical record reads from, so corrections take effect immediately. Claroty uses systems/databricks-apps + systems/lakebase for this composition.
  • Feedback closure into training. HITL labels must persist to a substrate that the next training cycle reads from (Claroty: "the output from these sessions is fed back into the system, retraining the models"). Without closure, the system never learns the long tail and HITL becomes a permanent cost rather than a transient signal.
  • Production monitoring on agent quality. Drift in any of the three roles affects the others. Continuous LLM-as-judge evaluation against production traffic detects degradation before it becomes customer-visible.
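
One common way to check the calibrated-confidence property is expected calibration error (ECE) over cases whose true outcome is known, for example from HITL labels. The source does not name a metric, so treat this as an assumed choice; the function name, the five equal-width bins, and the sample data are illustrative.

```python
def expected_calibration_error(
    confidences: list[float], outcomes: list[bool], n_bins: int = 5
) -> float:
    """Weighted mean of |accuracy - mean confidence| over equal-width bins."""
    if len(confidences) != len(outcomes):
        raise ValueError("confidences and outcomes must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, correct in zip(confidences, outcomes):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, correct))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, correct in members if correct) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


# Five matches all scored 0.9: four correct is near-calibrated, two correct is not.
near_calibrated = expected_calibration_error([0.9] * 5, [True, True, True, True, False])
over_confident = expected_calibration_error([0.9] * 5, [True, True, False, False, False])
```

A lower value means stated confidence tracks observed accuracy more closely, which is what makes a fixed HITL threshold meaningful.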

When it applies / when it doesn't fit

Applies when

  • The ER problem has both a parsing component (messy free text) and a scoring component (which match is right).
  • Domain dialect is specialised enough that fine-tuning a parsing-specific model has measurable accuracy lift over a general-purpose model.
  • The cost of a wrong canonical identity is high enough to justify HITL routing on low-confidence cases.
  • A closed feedback loop from HITL to retraining is organisationally feasible (the SMEs are willing/able to label, and the ML platform can ingest labels into a retrain cycle).

Doesn't fit when

  • The parsing component is trivial (the inputs are already structured).
  • Confidence calibration is unimportant (advisory output, no audit chain).
  • HITL has no operational pathway (no SME availability, no feedback-loop substrate).
  • The domain is small enough that prompt engineering on a single LLM gets you 95% accuracy without the orchestration overhead.

Failure modes

  • Roles collapse into one prompt. When time pressure pushes the team to "just put it all in one prompt", the benefits of specialisation disappear and the system regresses to single-monolithic-model failure modes.
  • HITL becomes a permanent backlog. Without closure into retraining, every retraining cycle starts from the same baseline; the long tail never gets absorbed; SME reviewers burn out or the queue grows unbounded.
  • Reasoning Agent over-confident. Calibration drifts; the threshold no longer separates good from bad; the HITL queue empties for the wrong reason and bad canonicalisations leak into Gold.
  • Reasoning Agent under-confident. Calibration drifts the other way; nearly every case routes to HITL; humans are now the rate limit.
  • NLP and Reasoning trained on stale evidence. New device classes / new firmware variants arrive, NLP doesn't recognise them, Reasoning doesn't have priors for them; both fail gracefully but the system never automates the new population.
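
The over-confident and under-confident failure modes both show up as a shift in the share of cases routed to HITL, so a rolling routing-rate check is one cheap early warning. This is a sketch under assumptions: the `RoutingRateMonitor` class, the window size, and the 2% / 40% bounds are hypothetical, and an out-of-bounds rate is only a prompt to re-check calibration, not proof of drift.

```python
from collections import deque


class RoutingRateMonitor:
    """Tracks the share of recent decisions that were routed to HITL."""

    def __init__(self, window: int = 1000, low: float = 0.02, high: float = 0.40):
        self.decisions: deque[bool] = deque(maxlen=window)
        self.low = low
        self.high = high

    def record(self, routed_to_hitl: bool) -> None:
        self.decisions.append(routed_to_hitl)

    def hitl_rate(self) -> float:
        if not self.decisions:
            return 0.0
        return sum(self.decisions) / len(self.decisions)

    def status(self) -> str:
        if len(self.decisions) < (self.decisions.maxlen or 0):
            return "warming_up"  # not enough traffic to judge yet
        rate = self.hitl_rate()
        if rate < self.low:
            return "suspect_over_confident"  # queue emptying for the wrong reason
        if rate > self.high:
            return "suspect_under_confident"  # humans becoming the rate limit
        return "ok"


monitor = RoutingRateMonitor(window=10)
for i in range(10):
    monitor.record(routed_to_hitl=i < 7)  # 7 of the last 10 cases went to HITL
```

Here `monitor.status()` reports a suspected under-confident Reasoning agent, since 70% of the window was routed to human review.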

Composes with

Seen in

  • sources/2026-05-13-databricks-the-rosetta-stone-of-cps-clarotys-ai-powered-library: Canonical wiki source. Claroty's CPS Library uses the three-role decomposition (NLP Agent + Reasoning Agent + Human-in-the-loop) to handle ER on a 17M+-asset catalog of cyber-physical assets. The orchestration sits on Databricks: agents call MLflow-tracked Model Serving endpoints, structured outputs land in Delta tables, confidence-thresholded routing decides between classical matching and HITL review. HITL UI hosted on Databricks Apps + Lakebase. The closed feedback loop into model retraining is asserted, not measured (no cadence / accuracy-delta numbers disclosed).