PATTERN Cited by 1 source

Hybrid classical ER + GenAI

The pattern

Build an Entity Resolution (ER) pipeline as a two-track composition:

  • Classical track — rule-based / statistical matching with blocking, similarity functions, threshold-based decisions, and deterministic mapping rules. Deterministic, auditable, fast.
  • GenAI track — LLM-based agents that handle the cognitive parsing of mixed-format inputs (protocol strings, vendor catalogs, unstructured PDFs, obscure software markers). Flexible on the long tail of natural-language variation, but not deterministic.

Neither track is an alternative to the other. The classical track provides the deterministic-traceability backbone — when the input matches a rule, the output is identical every time, the rule is auditable, and the cost is low. The GenAI track provides the cognitive-parse capability — when the input is a free-form string the rule set under-specifies, the agent extracts structured signal that the classical track can then match deterministically.
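
As a rough illustration of the classical track, the sketch below combines blocking, a similarity function, and a threshold decision using only the Python standard library. The catalog entries, the choice of vendor as the blocking key, the use of `difflib.SequenceMatcher`, and the 0.85 threshold are all assumptions made for this example; the source does not say which similarity functions or thresholds Claroty uses.

```python
from difflib import SequenceMatcher

# Hypothetical catalog of canonical product records (illustrative only).
CATALOG = [
    {"vendor": "siemens", "model": "S7-1200"},
    {"vendor": "siemens", "model": "S7-1500"},
    {"vendor": "rockwell", "model": "1756-L71"},
]


def block_key(record):
    """Blocking: only records that share a normalised vendor are compared."""
    return record["vendor"].strip().lower()


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match(candidate, catalog=CATALOG, threshold=0.85):
    """Return (best record, score), or (None, best score) below the threshold."""
    block = [r for r in catalog if block_key(r) == block_key(candidate)]
    best, best_score = None, 0.0
    for record in block:
        score = similarity(candidate["model"], record["model"])
        if score > best_score:
            best, best_score = record, score
    if best_score >= threshold:
        return best, best_score
    return None, best_score


# Example call: an observed model string that differs from the catalog spelling.
result = match({"vendor": "Siemens", "model": "s7 1200"})
```

Every step here is a pure function of its input, so repeated calls with the same candidate return the same record and score, which is the property the audit chain relies on.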

"To achieve high-fidelity deterministic traceability, we moved beyond standard matching algorithms, engineering a hybrid architecture that combines battle-tested, classic ER methods with the cognitive power of Generative AI." (Source: sources/2026-05-13-databricks-the-rosetta-stone-of-cps-clarotys-ai-powered-library)

Why hybrid, not pure-classical or pure-GenAI

Pure-classical fails on the long tail

Real-world device evidence is messy: 88% of CPS assets "do not transmit an exact product code," and 76% transmit codes "that differ from the vendor's official records." Classical rules can canonicalise the 12% with exact codes plus a fraction of the divergent set where the divergence pattern is predictable. The remaining tail — proprietary protocol encodings, OEM-specific abbreviations, mixed-format firmware-string variants — is where the rule set runs out.

Pure-GenAI loses determinism + auditability

A single end-to-end LLM that consumes raw evidence and emits a canonical identity has two structural problems for ER at scale:

  1. Non-determinism. The same input can produce different outputs across runs, breaking the "deterministic-traceability" requirement that makes ER useful for vulnerability attribution.
  2. Cognitive-vs-statistical conflict. "Rather than relying on a single monolithic model, Claroty engineered an Orchestrated Multi-Agent System..." — single models tend to either parse mixed-format strings well or apply confidence scoring + statistical tests well, but not both consistently. The roles benefit from specialisation.

Hybrid keeps the determinism and absorbs the long tail

The classical track owns the audit-chain commitment: every canonical record traces deterministically back to a rule firing on a structured input. The GenAI track is upstream of the classical track: it reads unstructured inputs, extracts structured candidate fields, hands them to the classical matcher. Below a confidence threshold, the GenAI track defers to human review (the HITL leg). Above the threshold, the structured output is fed to classical matching and the audit chain re-anchors on a deterministic rule.
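
The routing described above can be sketched as follows. This is a minimal illustration, not Claroty's implementation: the `Parse` shape, the `RULES` table, the 0.8 threshold, and the `"unmatched"` outcome for a confident parse that no rule covers are all hypothetical, since the source does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Parse:
    """Structured candidate fields from the GenAI track (hypothetical shape)."""
    raw_id: str
    vendor: str
    model: str
    confidence: float


# Stand-in for the deterministic mapping rules:
# normalised (vendor, model) -> canonical ID. Values are made up.
RULES = {("siemens", "s7-1200"): "CPS-0001"}


def classical_match(parse):
    """Deterministic lookup: the same structured input always gives the same ID."""
    return RULES.get((parse.vendor.lower(), parse.model.lower()))


def route(parse, threshold=0.8):
    """Send one GenAI parse to human review or through the classical matcher."""
    if parse.confidence < threshold:
        # Below the threshold the GenAI track defers to expert review.
        return ("hitl", None)
    canonical_id = classical_match(parse)
    if canonical_id is None:
        # Assumption: a confident parse with no matching rule is flagged
        # rather than written to the canonical store.
        return ("unmatched", None)
    return ("canonical", canonical_id)
```

Note that the canonical ID is produced only by `classical_match`; the GenAI confidence decides whether the matcher is consulted, never what it returns.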

Composition

┌─────────────────────────────────────────────────────┐
│  Bronze (raw, append-only Delta tables)             │
│  - heterogeneous JSON payloads                      │
│  - protocol-derived strings                         │
│  - unstructured vendor PDFs                         │
└─────────────────────────────────────────────────────┘
            Delta Change Data Feed
┌─────────────────────────────────────────────────────┐
│  GenAI track — orchestrated multi-agent system      │
│  - NLP Agent: parse mixed-format strings            │
│  - Reasoning Agent: confidence + statistical tests  │
│  - HITL: low-confidence → expert review → retrain   │
└─────────────────────────────────────────────────────┘
        structured candidate fields + confidence
┌─────────────────────────────────────────────────────┐
│  Classical ER track: deterministic mapping registry│
│  - blocking + similarity                            │
│  - rule-based canonicalisation                      │
│  - versioned mapping registry (audit lineage)       │
└─────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│  Silver / Gold canonical entity (CPS-ID)            │
│  - traceable to raw artifact (data lineage)         │
│  - traceable to mapping version (classifier lineage)│
│  - linked to CVE / CISA / NVD records               │
└─────────────────────────────────────────────────────┘

Required substrate properties

For the audit chain to actually work end-to-end, the substrate must provide:

  1. Append-only Bronze with row-level change feed. This lets the GenAI track stream new payloads without rescanning history. Claroty uses Delta + CDF.
  2. Versioned classifier logic. The mapping registry that the classical track consults must itself be a versioned artifact, so that a misclassification can be traced to the registry version that fired. Claroty uses Delta's schema evolution + time travel to anchor classifier lineage.
  3. Confidence scoring at every step. The GenAI agents must emit a confidence; the classical matcher must accept threshold-based rejection. Without this, the HITL leg has no signal to route on.
  4. Production evaluation against concept drift. Hybrid systems silently degrade as the distribution of incoming evidence shifts. Continuous LLM-as-judge evaluation in production is the substrate Claroty uses to detect degradation before customer impact.
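
To make the versioned-classifier and lineage requirements concrete, here is a small in-memory sketch. It is a stand-in for illustration only: Claroty anchors lineage with Delta features, whereas `MappingRegistry`, `CanonicalRecord`, `canonicalise`, and the sample IDs below are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from types import MappingProxyType


class MappingRegistry:
    """Append-only registry: each publish adds a new read-only version."""

    def __init__(self):
        self._versions = []

    def publish(self, rules):
        """Store a snapshot of the rules and return its 1-based version number."""
        self._versions.append(MappingProxyType(dict(rules)))
        return len(self._versions)

    def latest_version(self):
        return len(self._versions)

    def resolve(self, key, version):
        """Look up a key under a specific registry version."""
        if not 1 <= version <= len(self._versions):
            raise KeyError(f"unknown registry version: {version}")
        return self._versions[version - 1].get(key)


@dataclass(frozen=True)
class CanonicalRecord:
    canonical_id: str
    raw_artifact_id: str   # data lineage: which raw row produced this record
    mapping_version: int   # classifier lineage: which registry version fired


def canonicalise(registry, raw_artifact_id, key):
    """Resolve a key with the latest rules and record both lineage fields."""
    version = registry.latest_version()
    canonical_id = registry.resolve(key, version)
    if canonical_id is None:
        return None
    return CanonicalRecord(canonical_id, raw_artifact_id, version)


registry = MappingRegistry()
registry.publish({"s7-1200": "CPS-0001"})
record = canonicalise(registry, "bronze-row-17", "s7-1200")
registry.publish({"s7-1200": "CPS-0042"})  # a later edit creates version 2

# Replaying under the recorded version reproduces the original decision.
replayed = registry.resolve("s7-1200", record.mapping_version)
```

Because earlier versions are never overwritten, an auditor can replay any past canonicalisation from the `mapping_version` stored on the record, even after the rules have changed.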

When it applies / when it doesn't fit

Applies when

  • The input population has a predictable head (rules cover it) and an unpredictable long tail (LLMs needed).
  • Deterministic traceability is a hard requirement (regulated, security-critical, audit-driven).
  • The output identity has downstream actionable consequences (vulnerability attribution, regulatory filing) so non-determinism on identity is unacceptable.
  • The domain has specialised dialect that generic embeddings underperform on (industrial OT, healthcare devices, pharmaceutical product codes).

Doesn't fit when

  • The input is uniformly clean (a primary-key join is enough; no ER needed).
  • The output is advisory rather than authoritative (a search ranker tolerates non-determinism that an ER system does not).
  • The volume is so low that two SMEs can resolve cases manually (the cost of the hybrid stack outweighs the benefit).
  • The domain has no specialised dialect — generic embeddings + a small rule set are sufficient.

Failure modes

  • GenAI track fed directly to canonical store. Skipping the classical-matcher reanchor means non-deterministic identities leak into the canonical records. The whole audit-chain commitment collapses.
  • Mapping registry not versioned. When the registry is edited in-place, prior canonicalisations cannot be replayed or audited. This breaks the "specific mapping version that classified it" property.
  • Confidence threshold too lax. All cases route through classical matching even when the GenAI parse was uncertain — wrong canonicalisations propagate silently.
  • Confidence threshold too strict. Most cases route to HITL, the human-review backlog grows without bound, and the system is no longer doing automated ER.
  • No production monitoring with a judge model. Concept drift goes undetected and agent quality degrades silently.
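
The two threshold failure modes above pull in opposite directions, and one way to see the trade-off is to sweep candidate thresholds over a reviewed sample. The sketch below is an assumption-laden illustration: the `sweep` helper and the `SAMPLE` data are made up, and the source does not describe how Claroty calibrates its threshold.

```python
def sweep(labelled, thresholds):
    """Report HITL load and auto-accepted error rate for each threshold.

    labelled: list of (confidence, parse_was_correct) pairs taken from a
    sample that experts have already reviewed.
    """
    if not labelled:
        raise ValueError("need at least one labelled example")
    rows = []
    for t in thresholds:
        # Parses at or above the threshold skip human review.
        auto = [ok for conf, ok in labelled if conf >= t]
        hitl_rate = 1 - len(auto) / len(labelled)
        error_rate = auto.count(False) / len(auto) if auto else 0.0
        rows.append(
            {"threshold": t, "hitl_rate": hitl_rate, "auto_error_rate": error_rate}
        )
    return rows


# Made-up reviewed sample, for illustration only.
SAMPLE = [
    (0.95, True), (0.90, True), (0.85, True), (0.70, False),
    (0.65, True), (0.55, False), (0.40, False), (0.30, False),
]

report = sweep(SAMPLE, [0.0, 0.8, 0.99])
```

A lax threshold drives `hitl_rate` toward zero while `auto_error_rate` climbs; a strict one does the reverse, so the operating point is a choice about how much review capacity is available.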

Seen in

  • sources/2026-05-13-databricks-the-rosetta-stone-of-cps-clarotys-ai-powered-library: canonical wiki source. Claroty's AI-Powered CPS Library is built on this pattern: classical ER methods + orchestrated multi-agent GenAI, with the audit-chain substrate provided by Delta CDF + schema evolution + time travel + a versioned mapping registry. The hybrid composition is positioned as the architectural enabler for the reported 25% improvement in vulnerability identification accuracy and 56% of devices receiving new or updated security recommendations.