PATTERN Cited by 1 source
Hybrid classical ER + GenAI¶
The pattern¶
Build an Entity Resolution (ER) pipeline as a two-track composition:
- Classical track — rule-based / statistical matching with blocking, similarity functions, threshold-based decisions, and deterministic mapping rules. Deterministic, auditable, fast.
- GenAI track — LLM-based agents that handle the cognitive parsing of mixed-format inputs (protocol strings, vendor catalogs, unstructured PDFs, obscure software markers). Flexible on the long tail of natural-language variation, but not deterministic.
Neither track is an alternative to the other. The classical track provides the deterministic-traceability backbone — when the input matches a rule, the output is identical every time, the rule is auditable, and the cost is low. The GenAI track provides the cognitive-parse capability — when the input is a free-form string the rule set under-specifies, the agent extracts structured signal that the classical track can then match deterministically.
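The classical track's blocking, similarity, and threshold steps can be sketched in a few lines. This is a minimal illustration, not Claroty's implementation: the catalog entries, the vendor-based blocking key, the `difflib` similarity measure, and the 0.85 threshold are all assumptions chosen for the example.

```python
from difflib import SequenceMatcher

# Hypothetical canonical catalog: (vendor, official product code).
CATALOG = [
    ("acme", "PX-100-A1"),
    ("acme", "PX-200-B2"),
    ("globex", "GX-7741"),
]

def normalise(code: str) -> str:
    """Deterministic cleanup rule: uppercase and drop separators."""
    return "".join(ch for ch in code.upper() if ch.isalnum())

def classical_match(vendor: str, observed_code: str, threshold: float = 0.85):
    """Return (canonical_code, score), or None if no candidate clears the threshold."""
    # Blocking: only compare against catalog entries from the same vendor.
    block = [code for v, code in CATALOG if v == vendor.lower()]
    best = None
    for candidate in block:
        score = SequenceMatcher(
            None, normalise(observed_code), normalise(candidate)
        ).ratio()
        if best is None or score > best[1]:
            best = (candidate, score)
    # Threshold-based decision: reject weak matches instead of guessing.
    if best is not None and best[1] >= threshold:
        return best
    return None
```

Because every step is a pure function of its input, calling `classical_match` twice with the same arguments yields the same result, which is the property the audit chain depends on.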
"To achieve high-fidelity deterministic traceability, we moved beyond standard matching algorithms, engineering a hybrid architecture that combines battle-tested, classic ER methods with the cognitive power of Generative AI." (Source: sources/2026-05-13-databricks-the-rosetta-stone-of-cps-clarotys-ai-powered-library)
Why hybrid, not pure-classical or pure-GenAI¶
Pure-classical fails on the long tail¶
Real-world device evidence is messy: 88% of CPS assets "do not transmit an exact product code," and 76% transmit codes "that differ from the vendor's official records." Classical rules can canonicalise the 12% with exact codes plus a fraction of the divergent set where the divergence pattern is predictable. The remaining tail — proprietary protocol encodings, OEM-specific abbreviations, mixed-format firmware-string variants — is where the rule set runs out.
Pure-GenAI loses determinism + auditability¶
A single end-to-end LLM that consumes raw evidence and emits a canonical identity has two structural problems for ER at scale:
- Non-determinism. The same input can produce different outputs across runs, breaking the "deterministic-traceability" requirement that makes ER useful for vulnerability attribution.
- Cognitive-vs-statistical conflict. "Rather than relying on a single monolithic model, Claroty engineered an Orchestrated Multi-Agent System..." — single models tend to either parse mixed-format strings well or apply confidence scoring + statistical tests well, but not both consistently. The roles benefit from specialisation.
Hybrid keeps the determinism and absorbs the long tail¶
The classical track owns the audit-chain commitment: every canonical record traces deterministically back to a rule firing on a structured input. The GenAI track sits upstream of the classical track: it reads unstructured inputs, extracts structured candidate fields, and hands them to the classical matcher. Below a confidence threshold, the GenAI track defers to human review (the HITL leg). Above the threshold, the structured output is fed to classical matching and the audit chain re-anchors on a deterministic rule.
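The routing described above can be sketched as a single function. The `ParsedCandidate` type, the `route` function, the stand-in `RULES` table, and the 0.8 threshold are hypothetical names and values for illustration. Sending cases where no rule fires to HITL is also an assumption; the source does not say how those are handled.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ParsedCandidate:
    """Structured fields a GenAI agent extracted from raw evidence (hypothetical shape)."""
    vendor: str
    model: str
    confidence: float  # agent-reported confidence in [0, 1]

# Stand-in for the classical track: a deterministic rule table.
RULES = {("acme", "px-100"): "CPS-0001"}

def classical_lookup(c: ParsedCandidate) -> Optional[str]:
    """Deterministic lookup on normalised fields; None if no rule fires."""
    return RULES.get((c.vendor.lower(), c.model.lower()))

def route(c: ParsedCandidate, threshold: float = 0.8) -> tuple[str, Optional[str]]:
    """Return (destination, canonical_id)."""
    if c.confidence < threshold:
        return ("hitl", None)        # low confidence: defer to human review
    canonical = classical_lookup(c)  # audit chain re-anchors on a rule
    if canonical is None:
        return ("hitl", None)        # assumption: no rule fired, so nothing is committed
    return ("canonical", canonical)
```

The GenAI output never reaches the canonical store directly: the only path to a `"canonical"` destination goes through `classical_lookup`.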
Composition¶
┌───────────────────────────────────────────────────────┐
│ Bronze (raw, append-only Delta tables)                │
│   - heterogeneous JSON payloads                       │
│   - protocol-derived strings                          │
│   - unstructured vendor PDFs                          │
└───────────────────────────────────────────────────────┘
                            │
                 Delta Change Data Feed
                            │
                            ▼
┌───────────────────────────────────────────────────────┐
│ GenAI track: orchestrated multi-agent system          │
│   - NLP Agent: parse mixed-format strings             │
│   - Reasoning Agent: confidence + statistical tests   │
│   - HITL: low-confidence → expert review → retrain    │
└───────────────────────────────────────────────────────┘
                            │
        structured candidate fields + confidence
                            │
                            ▼
┌───────────────────────────────────────────────────────┐
│ Classical ER track: deterministic mapping registry    │
│   - blocking + similarity                             │
│   - rule-based canonicalisation                       │
│   - versioned mapping registry (audit lineage)        │
└───────────────────────────────────────────────────────┘
                            │
                            ▼
┌───────────────────────────────────────────────────────┐
│ Silver / Gold canonical entity (CPS-ID)               │
│   - traceable to raw artifact (data lineage)          │
│   - traceable to mapping version (classifier lineage) │
│   - linked to CVE / CISA / NVD records                │
└───────────────────────────────────────────────────────┘
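One way to picture the two lineage links on the canonical entity is a record that carries both references explicitly. The field names, identifier formats, and the `explain` helper below are assumptions for illustration, not Claroty's schema; the advisory ID is a placeholder, not a real CVE.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CanonicalEntity:
    cps_id: str
    raw_artifact_ref: str   # data lineage: which Bronze record produced this entity
    mapping_version: int    # classifier lineage: which registry version fired
    rule_id: str            # the deterministic rule that made the decision
    advisories: tuple[str, ...] = ()

def explain(entity: CanonicalEntity) -> str:
    """Render the audit chain for one entity as a single line."""
    return (
        f"{entity.cps_id} <- rule '{entity.rule_id}' "
        f"(mapping v{entity.mapping_version}) <- {entity.raw_artifact_ref}"
    )

entity = CanonicalEntity(
    cps_id="CPS-0001",
    raw_artifact_ref="bronze/device_evidence/row-123",  # hypothetical reference format
    mapping_version=7,
    rule_id="vendor-code-exact",
    advisories=("CVE-0000-0000",),                      # placeholder ID
)
print(explain(entity))
```

Making the record immutable (`frozen=True`) reflects the idea that a canonicalisation is a fact about a specific rule version and a specific raw artifact; a correction produces a new record rather than an in-place edit.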
Required substrate properties¶
For the audit chain to actually work end-to-end, the substrate must provide:
- Append-only Bronze with row-level change feed. This lets the GenAI track stream new payloads without rescanning history. Claroty uses Delta + CDF.
- Versioned classifier logic. The mapping registry that the classical track consults must itself be a versioned artifact — a misclassification needs to be traceable to which version fired. Claroty uses Delta's schema evolution + time travel to anchor classifier lineage.
- Confidence scoring at every step. The GenAI agents must emit a confidence score, and the classical matcher must support threshold-based rejection. Without this, the HITL leg has no signal to route on.
- Production evaluation against concept drift. Hybrid systems silently degrade as the distribution of incoming evidence shifts. Continuous LLM-as-judge evaluation in production is the substrate Claroty uses to detect degradation before customer impact.
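The versioned-classifier requirement above can be illustrated with an append-only registry in which every edit creates a new version and lookups can be replayed at any earlier version. This is a pure-Python sketch of the idea only: `MappingRegistry` and its methods are hypothetical, and stand in for what Claroty anchors on Delta schema evolution and time travel.

```python
from typing import Optional

class MappingRegistry:
    """Append-only rule registry: edits never overwrite, they add a version."""

    def __init__(self) -> None:
        self._versions: list[dict[str, str]] = [{}]  # version 0 is empty

    @property
    def latest(self) -> int:
        return len(self._versions) - 1

    def publish(self, changes: dict[str, str]) -> int:
        """Create a new version (previous snapshot plus changes); return its number."""
        snapshot = dict(self._versions[-1])
        snapshot.update(changes)
        self._versions.append(snapshot)
        return self.latest

    def resolve(self, key: str, version: Optional[int] = None):
        """Return (canonical_id, version_used); canonical_id is None if no rule matches."""
        v = self.latest if version is None else version
        return self._versions[v].get(key), v

registry = MappingRegistry()
v1 = registry.publish({"px100": "CPS-0001"})
v2 = registry.publish({"px100": "CPS-0002"})  # a later correction to the rule

print(registry.resolve("px100"))              # current answer
print(registry.resolve("px100", version=v1))  # replay of the earlier decision
```

Returning the version alongside the result is the design point: a downstream record can store it, so a misclassification can later be traced to the exact registry state that produced it.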
When it applies / when it doesn't fit¶
Applies when¶
- The input population has a predictable head (rules cover it) and an unpredictable long tail (LLMs needed).
- Deterministic traceability is a hard requirement (regulated, security-critical, audit-driven).
- The output identity has downstream actionable consequences (vulnerability attribution, regulatory filing) so non-determinism on identity is unacceptable.
- The domain has specialised dialect that generic embeddings underperform on (industrial OT, healthcare devices, pharmaceutical product codes).
Doesn't fit when¶
- The input is uniformly clean (a primary-key join is enough; no ER needed).
- The output is advisory rather than authoritative (a search ranker tolerates non-determinism that an ER system does not).
- The volume is so low that two SMEs can resolve cases manually (the cost of the hybrid stack outweighs the benefit).
- The domain has no specialised dialect — generic embeddings + a small rule set are sufficient.
Failure modes¶
- GenAI track fed directly to canonical store. Skipping the classical-matcher re-anchoring step lets non-deterministic identities leak into the canonical records, and the whole audit-chain commitment collapses.
- Mapping registry not versioned. When the registry is edited in-place, prior canonicalisations cannot be replayed or audited. This breaks the "specific mapping version that classified it" property.
- Confidence threshold too lax. All cases route through classical matching even when the GenAI parse was uncertain — wrong canonicalisations propagate silently.
- Confidence threshold too tight. Most cases route to HITL — the human review backlog grows unbounded and the system is not actually doing automated ER.
- No LLM-as-judge monitoring in production. Concept drift in the agents goes undetected, and quality degrades silently.
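The tension between the two threshold failure modes above can be made concrete by sweeping a threshold over a labelled sample. The sample below is invented illustrative data, and `sweep` is a hypothetical helper: it reports the share of cases sent to HITL and the share of auto-accepted cases whose parse was wrong.

```python
# (agent confidence, was the parse actually correct?) -- invented sample data.
SAMPLE = [
    (0.95, True), (0.92, True), (0.88, True), (0.81, False),
    (0.77, True), (0.64, False), (0.55, True), (0.40, False),
]

def sweep(sample, threshold):
    """Return (hitl_rate, auto_error_rate) for a given confidence threshold."""
    auto = [ok for conf, ok in sample if conf >= threshold]
    hitl_rate = 1 - len(auto) / len(sample)
    # Error rate among auto-accepted cases; 0.0 when nothing is auto-accepted.
    auto_error_rate = (auto.count(False) / len(auto)) if auto else 0.0
    return hitl_rate, auto_error_rate

for t in (0.3, 0.7, 0.9):
    hitl, err = sweep(SAMPLE, t)
    print(f"threshold={t:.1f}  hitl_rate={hitl:.3f}  auto_error_rate={err:.3f}")
```

On this sample, the lax threshold (0.3) sends nothing to review but accepts three wrong parses out of eight, while the tight threshold (0.9) accepts no wrong parses but routes six of eight cases to humans. Where to sit between those extremes depends on review capacity and the cost of a wrong canonicalisation.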
Composes with¶
- patterns/orchestrated-multi-agent-entity-resolution — the typical GenAI-track decomposition: NLP Agent + Reasoning Agent + HITL.
- concepts/medallion-architecture + concepts/delta-change-data-feed — the data substrate for streaming Bronze raw → Silver canonical.
- concepts/schema-evolution + Delta time travel — the audit-chain substrate.
- concepts/llm-as-judge (production-monitoring face) — the continuous-eval substrate against concept drift.
- patterns/llm-judge-as-inline-pipeline-stage — adjacent pattern when the judge runs synchronously inside an ETL step.
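The production-monitoring use of an LLM judge listed above can be sketched as a rolling comparison against a baseline. The judge itself is out of scope here: the sketch assumes some evaluator already yields a score in [0, 1] per resolved case, and `DriftMonitor`, its tolerance, and its window size are hypothetical.

```python
from collections import deque
from statistics import mean

class DriftMonitor:
    """Flags suspected drift when the rolling mean judge score drops below baseline - tolerance."""

    def __init__(self, baseline: float, tolerance: float = 0.05, window: int = 50):
        self.baseline = baseline
        self.tolerance = tolerance
        self.scores = deque(maxlen=window)  # keeps only the most recent scores

    def observe(self, judge_score: float) -> bool:
        """Record one judge score; return True if drift is suspected."""
        self.scores.append(judge_score)
        if len(self.scores) < self.scores.maxlen:
            return False  # window not full yet: not enough evidence
        return mean(self.scores) < self.baseline - self.tolerance

monitor = DriftMonitor(baseline=0.9, tolerance=0.05, window=3)
for score in (0.9, 0.9, 0.9, 0.7):
    print(score, monitor.observe(score))
```

A fixed-threshold rolling mean is the simplest possible detector; a production system would likely want a proper statistical test and per-segment baselines, which this sketch does not attempt.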
Seen in¶
- sources/2026-05-13-databricks-the-rosetta-stone-of-cps-clarotys-ai-powered-library — Canonical wiki source. Claroty's AI-Powered CPS Library is built on this pattern: classical ER methods + orchestrated multi-agent GenAI, with the audit-chain substrate provided by Delta CDF + schema evolution + time travel + a versioned mapping registry. The hybrid composition is positioned as the architectural enabler for the reported 25% improvement in vulnerability identification accuracy and 56% of devices receiving new or updated security recommendations.
Related¶
- concepts/entity-resolution — the problem class.
- patterns/orchestrated-multi-agent-entity-resolution — the multi-agent decomposition typically used on the GenAI side of the hybrid.
- systems/claroty-cps-library — canonical instance.
- systems/delta-lake · concepts/delta-change-data-feed · concepts/schema-evolution · concepts/medallion-architecture · concepts/llm-as-judge — the substrate primitives.