CONCEPT Cited by 1 source

Probabilistic record linkage

Definition

Probabilistic record linkage is the formal-statistical formulation of entity resolution: given two records across (or within) datasets, infer the probability that they refer to the same real-world entity, based on the agreement / disagreement of their fields. The output is a calibrated match probability — not a binary yes / no — and a per-field match weight quantifying each field's evidential contribution.

The canonical formulation is the Fellegi-Sunter model (1969):

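  • For each comparison field f (name, address, phone, …), define:
      • m_f = P(field values agree | records refer to the same entity)
      • u_f = P(field values agree | records refer to different entities)
  • The match weight for a field f that agrees is w_f = log₂(m_f / u_f), a log-Bayes-factor. For a field that disagrees, the weight is log₂((1 − m_f) / (1 − u_f)), which is normally negative.
  • The total match weight for a candidate pair is W = Σ_f w_f, taking the agreement weight for each field that agrees and the disagreement weight for each field that disagrees. Summing is valid because the model assumes fields are conditionally independent given the match status.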
  • Decision: classify the pair as match / non-match / review based on W thresholds.
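The scoring rule above can be sketched in a few lines of Python. The field names, the m / u values, and the two thresholds are hypothetical, chosen only to illustrate the arithmetic; they do not come from any real dataset.

```python
import math

# Hypothetical m/u probabilities per field (illustrative values only).
PARAMS = {
    "name":  {"m": 0.9, "u": 0.1},
    "phone": {"m": 0.8, "u": 0.001},
}

def field_weight(m, u, agrees):
    """log2 Bayes factor contributed by one field."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))

def total_weight(agreement, params=PARAMS):
    """Sum per-field weights; `agreement` maps field name -> True/False."""
    return sum(
        field_weight(params[f]["m"], params[f]["u"], agrees)
        for f, agrees in agreement.items()
    )

def classify(weight, lower=0.0, upper=8.0):
    """Three-way decision; the thresholds are assumptions for this sketch."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "review"

both = total_weight({"name": True, "phone": True})
name_only = total_weight({"name": True, "phone": False})
neither = total_weight({"name": False, "phone": False})
print(classify(both), classify(name_only), classify(neither))
```

With these illustrative parameters, agreement on name contributes log₂(0.9 / 0.1) ≈ 3.17, while agreement on phone contributes far more because a chance agreement on phone is rare.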

How m and u are estimated

m_f and u_f cannot be observed directly — labelling all pairs as same-entity or different-entity is the problem we're trying to solve. Two practical approaches:

  1. EM algorithm (unsupervised). Treat the latent same-entity / different-entity label as a hidden variable; iterate expectation-maximisation to estimate m_f and u_f from the data itself. This is the Splink default.
  2. Labelled training pairs (supervised). If a sample of ground-truth matches / non-matches is available, estimate m and u directly. Higher-quality estimates but expensive.
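A minimal sketch of the unsupervised route, assuming binary agree / disagree comparisons and conditional independence between fields. It runs plain EM over counts of agreement patterns; it is not Splink's implementation, and the function names, starting values, and synthetic data are assumptions made for illustration.

```python
from itertools import product

def clamp(p, eps=1e-6):
    """Keep probabilities strictly inside (0, 1) to avoid division by zero."""
    return min(max(p, eps), 1.0 - eps)

def em_estimate(counts, n_fields, iters=200, m0=0.8, u0=0.2, lam0=0.2):
    """Estimate m, u per field and the match prior lam.

    counts maps an agreement pattern (tuple of bools, one per field)
    to the number of candidate pairs showing that pattern.
    """
    m, u, lam = [m0] * n_fields, [u0] * n_fields, lam0
    n = sum(counts.values())
    for _ in range(iters):
        # E-step: posterior probability of "match" for each pattern.
        post = {}
        for g in counts:
            pm, pu = lam, 1.0 - lam
            for i, agrees in enumerate(g):
                pm *= m[i] if agrees else 1.0 - m[i]
                pu *= u[i] if agrees else 1.0 - u[i]
            post[g] = pm / (pm + pu)
        # M-step: re-estimate parameters from the soft labels.
        n_match = sum(c * post[g] for g, c in counts.items())
        n_non = n - n_match
        lam = clamp(n_match / n)
        for i in range(n_fields):
            agree_m = sum(c * post[g] for g, c in counts.items() if g[i])
            agree_u = sum(c * (1 - post[g]) for g, c in counts.items() if g[i])
            m[i] = clamp(agree_m / n_match)
            u[i] = clamp(agree_u / n_non)
    return m, u, lam

def synthetic_counts(n_fields, m_true, u_true, n_match, n_non):
    """Expected pattern counts for a known mixture (hypothetical test data)."""
    counts = {}
    for g in product([True, False], repeat=n_fields):
        pm = pu = 1.0
        for agrees in g:
            pm *= m_true if agrees else 1.0 - m_true
            pu *= u_true if agrees else 1.0 - u_true
        counts[g] = n_match * pm + n_non * pu
    return counts

# Three fields, true m = 0.9, true u = 0.1, 10% of pairs are matches.
data = synthetic_counts(3, 0.9, 0.1, n_match=1000, n_non=9000)
m_hat, u_hat, lam_hat = em_estimate(data, n_fields=3)
print(m_hat, u_hat, lam_hat)
```

The synthetic data is built so that the true parameters are known, which makes it possible to check that the estimates move toward them. On real data there is no such check, and EM can settle on a poor local optimum if the starting values are far off.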

Why it works at scale

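  • Calibrated output. A total weight of +5 means the observed agreement pattern is 2⁵ = 32× more likely for a true match than for a non-match; combined with the prior odds of a match, it yields the match probability. The log-Bayes-factor interpretation gives the operator a principled way to set match / review / non-match thresholds based on the cost of a false positive vs a false negative in the domain.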
  • Per-field decomposition. Different fields contribute different amounts of evidence (a matching phone number is worth more than a matching first-name). The decomposition makes it inspectable: an analyst can see why a pair was matched.
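  • Composable. Because W is a sum of per-field terms, adding a new comparison field is additive: estimate m_f and u_f for the new field and add its weight to W. Under the model's conditional-independence assumption, the existing fields need no retraining.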
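A match weight is a log₂ Bayes factor, so converting it to a match probability needs a prior: the share of candidate pairs expected to be true matches. The sketch below shows that conversion and its inverse, which gives the weight threshold for a target probability. The function names and the example priors are assumptions for illustration.

```python
import math

def match_probability(total_weight, prior):
    """Posterior P(match) from a log2 Bayes factor and a prior P(match)."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * 2 ** total_weight
    return posterior_odds / (1 + posterior_odds)

def weight_threshold(target_probability, prior):
    """Smallest total weight whose posterior reaches target_probability."""
    target_odds = target_probability / (1 - target_probability)
    prior_odds = prior / (1 - prior)
    return math.log2(target_odds / prior_odds)

even_prior = match_probability(5, prior=0.5)     # prior odds 1:1
rare_prior = match_probability(5, prior=0.001)   # matches are rare
needed = weight_threshold(0.99, prior=0.001)
print(even_prior, rare_prior, needed)
```

With even prior odds, a weight of +5 gives a probability of 32/33 ≈ 0.97. With a prior of 0.001, the same weight gives only about 0.03, which is why thresholds on W should be chosen with the prior in mind.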

Relationship to entity resolution

Probabilistic record linkage is the formal-statistical core of classical ER:

  • ER's three-stage decomposition (blocking → matching → canonicalisation) maps onto record linkage as: blocking (limit pairs considered) → record linkage scoring (Fellegi-Sunter) → cluster the matched pairs into canonical records.
  • Hybrid ER (patterns/hybrid-classical-er-plus-genai) uses probabilistic record linkage as the deterministic-traceability half: every match is explained by per-field match weights, fully auditable.
  • GenAI multi-agent ER (patterns/orchestrated-multi-agent-entity-resolution) uses probabilistic record linkage as a calibration substrate: the Reasoning agent's confidence score can be expressed as a match weight, comparable to the classical scoring.
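The final stage of the classical pipeline, grouping matched pairs into entities, can be sketched as connected components over the pairs that passed the match threshold. Connected components is one common choice and is assumed here; the source does not say which clustering method is used, and the record ids are made up.

```python
def cluster(record_ids, matched_pairs):
    """Group records into entities via union-find over matched pairs."""
    parent = {r: r for r in record_ids}

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for r in record_ids:
        groups.setdefault(find(r), []).append(r)
    return sorted(sorted(members) for members in groups.values())

# Hypothetical record ids; "a"-"b" and "b"-"c" were scored as matches.
entities = cluster(["a", "b", "c", "d", "e"], [("a", "b"), ("b", "c")])
print(entities)
```

Note that connected components is transitive: "a" and "c" end up in one entity even though that pair was never scored as a match, so a single false-positive pair can merge two otherwise separate clusters.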

Why probabilistic, not deterministic

Deterministic 1:1 matching (string equality, normalised hash) fails on real-world data because:

  • Sparse evidence. Records routinely have missing fields.
  • Field-level noise. Misspellings, abbreviations, format drift.
  • Vendor-vs-internal naming drift. The Claroty CPS observation generalises: 88% of CPS assets "do not transmit an exact product code", and the same drift exists in pharmaceutical SKUs, automotive parts, and e-commerce listings.

Probabilistic matching tolerates partial agreement; weights quantify how much evidence each partial agreement carries.

Operational characteristics

  • Pairwise comparison is quadratic. Naïvely, deduplicating N records means N(N − 1)/2 candidate pairs, which grows as N². Mitigation: blocking rules prune the pair space to a manageable subset.
  • Skewed pair-distribution. Some blocks are huge (common city / common surname / null phone), some are tiny. Skew at the pairwise-comparison stage produces the canonical curse-of-the-last-reducer straggler problem on Spark; the VF Match FDR pipeline observed 30 minutes worst-case partition vs 52 seconds median — a ~35× ratio.
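Both points can be seen in a small sketch: blocking on a key cuts the number of candidate pairs, and the per-block pair counts show where skew comes from. The records and the choice of city as the blocking key are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def block_pairs(records, key):
    """Return candidate pairs that share a blocking key, plus pairs per block."""
    blocks = defaultdict(list)
    for record_id, record in records.items():
        blocks[key(record)].append(record_id)
    pairs = [
        pair
        for ids in blocks.values()
        for pair in combinations(sorted(ids), 2)
    ]
    pairs_per_block = {k: len(v) * (len(v) - 1) // 2 for k, v in blocks.items()}
    return pairs, pairs_per_block

# Hypothetical records: one large block and one small block.
records = {
    1: {"city": "London"}, 2: {"city": "London"},
    3: {"city": "London"}, 4: {"city": "London"},
    5: {"city": "Leeds"},  6: {"city": "Leeds"},
}

pairs, pairs_per_block = block_pairs(records, key=lambda r: r["city"])
all_pairs = len(records) * (len(records) - 1) // 2
print(all_pairs, len(pairs), pairs_per_block)
```

Here blocking reduces 15 possible pairs to 7, but 6 of those 7 fall in a single block. Because the work in a block grows with the square of its size, one oversized block can dominate the runtime of a distributed job.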

Seen in
