CONCEPT Cited by 2 sources
Entity resolution¶
Definition¶
Entity resolution (ER) — also called record linkage, deduplication, or identity resolution — is the problem of deciding whether two or more records, observed across different sources or at different times, refer to the same real-world entity. The output is a canonical identifier that consolidates the references and a provenance chain explaining why those records were unified. ER is foundational for master data management, duplicate-customer detection, sanctions screening, asset inventories, supply-chain catalogs, and any system that must reconcile "noisy real-world data into a single source of truth."
Classical ER decomposes into three stages:
- Blocking — partition records into candidate groups so the quadratic match step is tractable (e.g., hash by normalised name prefix; bucket by geohash).
- Matching — apply a similarity function (string distance, numeric proximity, learned model) to pairs within a block; classify as match / non-match / review.
- Clustering / canonicalisation — collapse confirmed matches into a single canonical record with an ID and a provenance log.
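The three stages above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not any particular system's implementation: the blocking rule (first three characters of the normalised name), the `SequenceMatcher` similarity, the 0.85 threshold, and all function names are assumptions chosen for the example.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def normalise(name):
    # Lower-case and collapse whitespace so trivial variants compare equal.
    return " ".join(name.lower().split())


def block_key(record):
    # Hypothetical blocking rule: normalised name prefix.
    return normalise(record["name"])[:3]


def similarity(a, b):
    # Hypothetical matcher: character-level similarity ratio in [0, 1].
    return SequenceMatcher(None, normalise(a["name"]), normalise(b["name"])).ratio()


def resolve(records, match_threshold=0.85):
    # Stage 1: blocking. Only records sharing a key are ever compared.
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[block_key(rec)].append(idx)

    # Union-find over record indices; the root index acts as the canonical ID.
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    provenance = []  # (i, j, score) for every pair accepted as a match
    for members in blocks.values():
        # Stage 2: matching, pairwise within a block only.
        for i, j in combinations(members, 2):
            score = similarity(records[i], records[j])
            if score >= match_threshold:
                # Stage 3: clustering. Merge the two clusters, keeping the
                # smallest index as the root.
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[max(ri, rj)] = min(ri, rj)
                provenance.append((i, j, round(score, 3)))

    canonical = [find(i) for i in range(len(records))]
    return canonical, provenance
```

Calling `resolve` on a list of `{"name": ...}` dictionaries returns one canonical index per input record plus the list of accepted pairs, which is the smallest possible form of a provenance log. Note that a pair split across blocks is never compared, so the blocking rule bounds recall.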
Why it's hard¶
- Sparse, noisy evidence. Real-world observations are partial: a cyber-physical system (CPS) device may transmit a model number but not a firmware version; a customer record may have a misspelled name and a partial address. ER systems must triangulate identity from minimal signal.
- Vendor-vs-internal naming drift. The Claroty Team82 observation generalises: 88% of CPS assets "do not transmit an exact product code" and 76% transmit codes "that differ from the vendor's official records." The same drift exists in pharmaceutical product codes, automotive parts catalogs, and e-commerce SKUs.
- Long tail of obscure dialects. Most entities follow predictable patterns; the long tail is the failure mode. Generic embeddings tend to underperform on domain-specific jargon, abbreviations, and proprietary encoding schemes.
- Audit-chain requirements. In regulated domains, every identity decision must be explainable: which evidence was weighed, which mapping rule fired, when. The data record alone is not enough; the classifier version must also be auditable.
- Concept drift. New product variants, new firmware versions, and new advisories arrive continuously; an ER system tuned on yesterday's data degrades silently against today's traffic.
Architectural shapes seen in the wiki¶
Pure classical ER (statistical / rule-based)¶
The traditional shape: blocking + similarity functions + deterministic match rules. Strengths: predictable, auditable, fast. Weakness: brittle on the long tail of mixed-format inputs, where the rule set under-specifies the cases. Splink (UK Ministry of Justice, MIT-licensed) is the canonical open-source instance on this wiki: it implements Fellegi-Sunter probabilistic record linkage with EM-driven match-weight estimation, runs on Spark or DuckDB, and scales from millions of records on a laptop to billions on a cluster. See concepts/probabilistic-record-linkage for the formal model.
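To make the Fellegi-Sunter scoring concrete, the sketch below sums per-field log2 match weights and converts the total to a match probability. It does not use Splink's API, and it skips EM entirely: the m/u probabilities, the prior, and the field names are hypothetical values fixed by hand for illustration, whereas a real deployment would estimate them from data.

```python
import math

# Hypothetical parameters per field:
#   m = P(field agrees | records are a true match)
#   u = P(field agrees | records are not a match)
FIELDS = {
    "name": (0.95, 0.01),
    "postcode": (0.90, 0.05),
    "phone": (0.80, 0.001),
}


def match_weight(agreements, prior=0.001):
    # Start from the prior log-odds that a random pair is a match,
    # then add one log2 Bayes factor per compared field.
    weight = math.log2(prior / (1 - prior))
    for field, agrees in agreements.items():
        m, u = FIELDS[field]
        if agrees:
            weight += math.log2(m / u)
        else:
            weight += math.log2((1 - m) / (1 - u))
    return weight


def match_probability(weight):
    # Convert total log2 odds back to a probability.
    odds = 2.0 ** weight
    return odds / (1 + odds)
```

With these assumed parameters, agreement on a rarely coinciding field such as phone moves the score far more than agreement on postcode, which is the behaviour the model is designed to capture.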
Pure GenAI ER (single monolithic model)¶
Use a large language model end-to-end: feed two records into the prompt and let the model decide whether they refer to the same entity. Strength: handles the long tail of natural-language variation. Weakness, implied by the design choice the Claroty source describes ("Rather than relying on a single monolithic model, Claroty engineered an Orchestrated Multi-Agent System..."): a single model can struggle to "parse complex, mixed-format data including protocol-derived naming strings and obscure software markers" while also "applying confidence scoring and statistical tests to weigh evidence."
Hybrid classical + GenAI¶
The shape Claroty's CPS Library canonicalises (Source: sources/2026-05-13-databricks-the-rosetta-stone-of-cps-clarotys-ai-powered-library): "To achieve high-fidelity deterministic traceability, we moved beyond standard matching algorithms, engineering a hybrid architecture that combines battle-tested, classic ER methods with the cognitive power of Generative AI." See patterns/hybrid-classical-er-plus-genai. Classical methods provide the deterministic-traceability backbone; GenAI agents handle the cognitive parsing of mixed-format strings and unstructured documents.
Orchestrated multi-agent ER¶
A specialisation of the hybrid shape: the GenAI side is itself decomposed into role-specialised agents that collaborate. Claroty's three roles: NLP / Reasoning / Human-in-the-loop. See patterns/orchestrated-multi-agent-entity-resolution.
Properties of a healthy ER system¶
- Deterministic traceability — "every asset record is traceable back to its original raw artifact and the specific mapping version that classified it" (Source: sources/2026-05-13-databricks-the-rosetta-stone-of-cps-clarotys-ai-powered-library). This requires both the data lineage (which raw record contributed) and the classifier lineage (which mapping rule version fired) to be queryable. The Medallion pattern with Delta schema evolution + time travel is the substrate Claroty uses.
- Confidence-scored output, not binary match/no-match. Claroty's Reasoning Agents "apply confidence scoring and statistical tests to weigh evidence." Below a threshold, the decision routes to human-in-the-loop (HITL) review.
- Human-in-the-loop for low-confidence cases. The output of expert review must feed back into training; otherwise the system never learns the long tail.
- Continuous evaluation against concept drift. Production scoring with LLM-as-judge on real inputs catches degradation before it becomes a customer problem.
- Domain-specific representation. Generic embeddings are often "insufficient for the level of precision we require" in regulated / specialised domains; custom embeddings or fine-tuning is the explicit roadmap (Claroty: "fine-tuning these models as the next logical step to ensure our agents understand the most obscure industrial dialects with deterministic accuracy").
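The traceability and confidence-routing properties above can be combined in one small decision record. The sketch below is an assumption-laden illustration, not Claroty's schema: the field names, the `route_decision` helper, and the 0.9 review threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    raw_artifact_id: str   # data lineage: which raw record was classified
    canonical_id: str      # the resolved entity ID
    confidence: float      # score in [0, 1] from the matcher
    mapping_version: str   # classifier lineage: which rule/model version fired
    route: str             # "auto_accept" or "human_review"


def route_decision(raw_artifact_id, canonical_id, confidence,
                   mapping_version, review_threshold=0.9):
    # Hypothetical policy: anything below the threshold goes to expert review.
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    target = "auto_accept" if confidence >= review_threshold else "human_review"
    return Decision(raw_artifact_id, canonical_id, confidence,
                    mapping_version, target)
```

Storing `mapping_version` on every decision, rather than only on the classifier, is what makes both lineages queryable from the same record.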
Failure modes¶
- A single canonical ID per entity can be the wrong granularity. ER on devices may want canonical-at-firmware-version granularity; ER on people may want canonical-at-(name, date-of-birth) granularity. Picking the wrong grain collapses real distinctions or fragments single entities.
- Low-confidence rate too high to staff HITL. If the match rules miss 30% of inputs and the SME backlog grows without bound, the system isn't actually doing ER; it's routing all the hard cases to humans.
- No classifier lineage means no rollback. When a mapping-registry update misclassifies an entire batch, recovery requires identifying which records were classified by which version. Without time-travel on the classifier, this is detective work.
- Embedding drift after model retraining. If the embedding model used for similarity is updated without re-embedding the existing index, similarity scores become meaningless across the boundary.
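Two of the failure modes above have cheap structural guards, sketched here under stated assumptions: a decision log stored as dictionaries with `record_id` and `mapping_version` keys, and embeddings tagged with the model version that produced them. The names `Embedding`, `cosine`, and `affected_by_version` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Embedding:
    vector: tuple
    model_version: str  # version of the embedding model that produced it


def cosine(a, b):
    # Refuse to compare vectors from different embedding models: scores
    # across that boundary are not meaningful.
    if a.model_version != b.model_version:
        raise ValueError("embeddings come from different model versions")
    dot = sum(x * y for x, y in zip(a.vector, b.vector))
    norm = math.sqrt(sum(x * x for x in a.vector)) * \
        math.sqrt(sum(y * y for y in b.vector))
    if norm == 0:
        raise ValueError("cannot compare a zero vector")
    return dot / norm


def affected_by_version(decision_log, bad_version):
    # Classifier lineage makes rollback a filter rather than detective work:
    # select every record classified by the faulty mapping version.
    return [entry["record_id"] for entry in decision_log
            if entry["mapping_version"] == bad_version]
```

Raising on a version mismatch is a deliberately strict choice; a softer alternative would be to trigger re-embedding of the stale side before comparing.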
Adjacent concepts¶
- concepts/master-data-management — the broader discipline ER serves; ER is the matching engine, MDM is the organisational shape around it.
- concepts/golden-record — the canonical-record output of an ER consolidation step.
- concepts/agent-identity-resolution-gap — Stripe's observation that current identity infrastructure does not resolve agents; structurally, this is an ER problem on agent identity.
- concepts/source-of-truth-disambiguation — Genie's related problem: when the same business term resolves to multiple definitions across teams; ER's classifier-lineage property applies to disambiguation contexts too.
Seen in¶
- sources/2026-05-20-databricks-virtue-foundation-medical-volunteers-72-countries — Second canonical wiki source for ER, first canonical source on the wiki for Splink (open-source probabilistic record linkage framework). Virtue Foundation's VF Match Foundational Data Refresh applies Splink's Fellegi-Sunter probabilistic record linkage to thousands of healthcare facility / NGO records aggregated from Overture Maps geospatial authority data + Bright Data-scraped web evidence. Splink's pairwise comparison hit the canonical straggler problem (one Spark partition running 30 minutes vs a 52-second median); enabling Photon vectorisation cut the worst-case partition time roughly 15×, to ~2 minutes. Verbatim ER framing: "The same facility may appear across multiple data sources with name variations, inconsistent addresses, or missing contact details. Traditional deduplication breaks down in these scenarios due to messy data, so we use Splink, an open source probabilistic record linkage framework. ... The result is a unified key per facility, ensuring that end users see one authoritative record for each medical facility and NGO." This is the first open-source-ER instance on the wiki, complementary to Claroty's custom-built hybrid stack: Splink represents the commodity-classical-ER-via-open-source shape that makes ER feasible for non-profit-funded customers.
- sources/2026-05-13-databricks-the-rosetta-stone-of-cps-clarotys-ai-powered-library — Canonical wiki source for hybrid ER. Claroty's AI-Powered CPS Library is the asset-identity layer for xDome, framed explicitly as "an Entity Resolution (ER) challenge" at 17M+-asset scale. Names the hybrid pattern, the three-role multi-agent decomposition, and the audit-chain requirement (deterministic traceability from CPS-ID back to raw evidence + the mapping-registry version that classified it) explicitly. Reported outcomes attributed to the hybrid architecture: 25% improvement in vulnerability identification accuracy, 56% of devices receiving new or updated security recommendations for previously invisible outdated firmware. Quantifies the industry pain (88% of assets don't transmit an exact code, 76% transmit codes that diverge from vendor records).
Related¶
- concepts/master-data-management · concepts/golden-record · concepts/agent-identity-resolution-gap — conceptually adjacent.
- patterns/hybrid-classical-er-plus-genai — the architectural shape.
- patterns/orchestrated-multi-agent-entity-resolution — the multi-agent decomposition.
- concepts/medallion-architecture · concepts/delta-change-data-feed · concepts/schema-evolution — the audit-chain substrate Claroty uses.
- systems/claroty-cps-library — canonical wiki instance.