CONCEPT Cited by 1 source

Probabilistic record linkage¶

Definition¶

Probabilistic record linkage is the formal-statistical formulation of entity resolution: given two records across (or within) datasets, infer the probability that they refer to the same real-world entity, based on the agreement / disagreement of their fields. The output is a calibrated match probability — not a binary yes / no — and a per-field match weight quantifying each field's evidential contribution.

The canonical formulation is the Fellegi-Sunter model (1969):

For each comparison field f (name, address, phone, …), define:
m_f = P(field f agrees | records refer to the same entity)
u_f = P(field f agrees | records refer to different entities)
The match weight for an agreeing field f is w_f = log₂(m_f / u_f), a log-Bayes-factor.
The total match weight W for a candidate pair sums w_f over the fields that agree and the disagreement weight log₂((1 − m_f) / (1 − u_f)), which is negative whenever m_f > u_f, over the fields that disagree.
Decision: compare W against two thresholds. Pairs above the upper threshold are classified as matches, pairs below the lower threshold as non-matches, and pairs in between are sent for review.
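The scoring rule above can be sketched in a few lines of Python. The m / u values, the field names, and the two thresholds are hypothetical numbers chosen for illustration; they do not come from the source or from any real dataset.

```python
import math

# Hypothetical m/u probabilities per comparison field (illustration only).
PARAMS = {
    "name":    {"m": 0.90, "u": 0.05},
    "address": {"m": 0.80, "u": 0.02},
    "phone":   {"m": 0.95, "u": 0.001},
}

def field_weight(m, u, agrees):
    """log2 Bayes factor contributed by one field."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))

def total_weight(agreement, params=PARAMS):
    """Sum per-field weights; `agreement` maps field name -> True/False."""
    return sum(
        field_weight(params[f]["m"], params[f]["u"], agrees)
        for f, agrees in agreement.items()
    )

def classify(w, lower=0.0, upper=8.0):
    """Two-threshold decision; the threshold values are assumptions."""
    if w >= upper:
        return "match"
    if w <= lower:
        return "non-match"
    return "review"

all_agree = {"name": True, "address": True, "phone": True}
phone_differs = {"name": True, "address": True, "phone": False}
only_name = {"name": True, "address": False, "phone": False}

for pair in (all_agree, phone_differs, only_name):
    w = total_weight(pair)
    print(f"{pair} -> W = {w:.2f} -> {classify(w)}")
```

Because each field's contribution is computed separately, the same `field_weight` values can be shown to an analyst as the explanation for a decision.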

How `m` and `u` are estimated¶

m_f and u_f cannot be observed directly — labelling all pairs as same-entity or different-entity is the problem we're trying to solve. Two practical approaches:

EM algorithm (unsupervised). Treat the latent same-entity / different-entity label as a hidden variable; iterate expectation-maximisation to estimate m_f and u_f from the data itself. This is the Splink default.
Labelled training pairs (supervised). If a sample of ground-truth matches / non-matches is available, estimate m and u directly. Higher-quality estimates but expensive.
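As a rough illustration of the unsupervised route, the sketch below runs expectation-maximisation on binary agreement vectors using only the standard library. It is not Splink's implementation: the function names, the starting values, the clamping constant, and the synthetic data generator are all assumptions made for this example.

```python
import random

EPS = 1e-6  # keeps estimated probabilities away from exactly 0 or 1

def clamp(p):
    return min(max(p, EPS), 1.0 - EPS)

def em_estimate(vectors, iters=50, lam=0.1, m_init=0.9, u_init=0.1):
    """Estimate m_f, u_f and the match prevalence lam from agreement vectors.

    Each vector is a tuple of booleans, one per comparison field.
    """
    n_fields = len(vectors[0])
    m = [m_init] * n_fields
    u = [u_init] * n_fields
    for _ in range(iters):
        # E-step: posterior probability that each pair is a true match.
        post = []
        for g in vectors:
            p_match, p_non = lam, 1.0 - lam
            for f, agrees in enumerate(g):
                p_match *= m[f] if agrees else 1.0 - m[f]
                p_non *= u[f] if agrees else 1.0 - u[f]
            post.append(p_match / (p_match + p_non))
        # M-step: re-estimate parameters from the soft labels.
        n_match = max(sum(post), EPS)
        n_non = max(len(vectors) - sum(post), EPS)
        lam = clamp(n_match / len(vectors))
        for f in range(n_fields):
            m[f] = clamp(sum(p for p, g in zip(post, vectors) if g[f]) / n_match)
            u[f] = clamp(sum(1.0 - p for p, g in zip(post, vectors) if g[f]) / n_non)
    return m, u, lam

def simulate(n_pairs, true_m, true_u, true_lam, seed=0):
    """Hypothetical generator: draws agreement vectors from known m/u."""
    rng = random.Random(seed)
    vectors = []
    for _ in range(n_pairs):
        probs = true_m if rng.random() < true_lam else true_u
        vectors.append(tuple(rng.random() < p for p in probs))
    return vectors

vectors = simulate(2000, true_m=[0.9, 0.8, 0.95], true_u=[0.05, 0.1, 0.01], true_lam=0.1)
m_hat, u_hat, lam_hat = em_estimate(vectors)
print("m:", [round(x, 3) for x in m_hat])
print("u:", [round(x, 3) for x in u_hat])
print("lam:", round(lam_hat, 3))
```

The sketch treats every field as a simple agree / disagree comparison and assumes fields are conditionally independent given match status, which is the standard Fellegi-Sunter simplification.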

Why it works at scale¶

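Calibrated output. A total weight of +5 corresponds to a Bayes factor of 2⁵ = 32: the field evidence multiplies the prior odds of a match by 32, so the pair is 32× more likely to be a match than not only when the prior odds are even. The log-Bayes-factor interpretation gives the operator a principled way to set match / review / non-match thresholds based on the cost of a false positive vs a false negative in the domain.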
Per-field decomposition. Different fields contribute different amounts of evidence (a matching phone number is worth more than a matching first-name). The decomposition makes it inspectable: an analyst can see why a pair was matched.
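Composable. Under the model's assumption that fields are conditionally independent given match status, adding a new comparison field is additive: estimate m_f and u_f for the new field and add its weight to W. The weights of the existing fields can be left as they are.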
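Turning a total match weight into a match probability requires a prior, because a weight is a log₂ Bayes factor rather than a probability. A minimal sketch, where the function name and the prior values are assumptions for illustration:

```python
def match_probability(total_weight, prior):
    """Posterior P(match) from a log2 Bayes factor and a prior P(match)."""
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * 2.0 ** total_weight
    return posterior_odds / (1.0 + posterior_odds)

# Even prior odds: a weight of +5 gives 32 / 33.
print(match_probability(5, prior=0.5))

# If only 1 in 1000 candidate pairs is a true match, the same weight
# gives 32 / 1031, roughly 3%.
print(match_probability(5, prior=0.001))
```

The same weight can therefore justify very different decisions depending on how rare true matches are among the candidate pairs, which is one reason thresholds are set per domain.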

Relationship to entity resolution¶

Probabilistic record linkage is the formal-statistical core of classical ER:

ER's three-stage decomposition (blocking → matching → canonicalisation) maps onto record linkage as: blocking (limit pairs considered) → record linkage scoring (Fellegi-Sunter) → cluster the matched pairs into canonical records.
Hybrid ER (patterns/hybrid-classical-er-plus-genai) uses probabilistic record linkage as the deterministic-traceability half: every match is explained by per-field match weights, fully auditable.
GenAI multi-agent ER (patterns/orchestrated-multi-agent-entity-resolution) uses probabilistic record linkage as a calibration substrate: the Reasoning agent's confidence score can be expressed as a match weight, comparable to the classical scoring.

Why probabilistic, not deterministic¶

Deterministic 1:1 matching (string equality, normalised hash) fails on real-world data because:

Sparse evidence. Records routinely have missing fields.
Field-level noise. Misspellings, abbreviations, format drift.
Vendor-vs-internal naming drift. The Claroty CPS observation generalises: 88% of CPS assets "do not transmit an exact product code" — same drift exists in pharmaceutical SKUs, automotive parts, e-commerce listings.

Probabilistic matching tolerates partial agreement; weights quantify how much evidence each partial agreement carries.

Operational characteristics¶

Pairwise comparison is quadratic. Naïvely, comparing N records against each other means N(N − 1)/2 pairs, which grows as N². Mitigation: blocking rules prune the pair space to a manageable subset.
Skewed pair-distribution. Some blocks are huge (common city / common surname / null phone), some are tiny. Skew at the pairwise-comparison stage produces the canonical curse-of-the-last-reducer straggler problem on Spark; the VF Match FDR pipeline observed 30 minutes worst-case partition vs 52 seconds median — a ~35× ratio.
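Both effects can be seen by counting pairs per block. The sketch below uses a hypothetical ten-record dataset blocked on city; the records, the blocking key, and the choice to skip null keys are assumptions for illustration and are not taken from the VF Match pipeline.

```python
from collections import Counter
from statistics import median

def n_pairs(n):
    """Number of unordered pairs among n records."""
    return n * (n - 1) // 2

def pairs_per_block(records, key):
    """Pairs generated inside each block; records with a None key are skipped."""
    sizes = Counter(k for k in map(key, records) if k is not None)
    return {block: n_pairs(size) for block, size in sizes.items()}

# Hypothetical records: one large block and two small ones.
records = (
    [{"id": i, "city": "Accra"} for i in range(6)]
    + [{"id": 6, "city": "Lagos"}, {"id": 7, "city": "Lagos"}]
    + [{"id": 8, "city": "Kumasi"}, {"id": 9, "city": "Kumasi"}]
)

per_block = pairs_per_block(records, key=lambda r: r["city"])
naive = n_pairs(len(records))            # all-against-all comparisons
blocked = sum(per_block.values())        # comparisons after blocking
skew = max(per_block.values()) / median(per_block.values())

print("naive pairs:", naive)
print("blocked pairs:", blocked, per_block)
print("largest block vs median block:", skew)
```

When each block maps to one partition, the largest block sets the job's wall-clock time, so a max-to-median ratio like the one computed here is a quick indicator of straggler risk.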

Seen in¶

sources/2026-05-20-databricks-virtue-foundation-medical-volunteers-72-countries: canonical wiki source. VF Match FDR uses Splink as the probabilistic-record-linkage framework, applied to ~thousands of facility / NGO records with weighted comparisons across phone / street address / name fields. Splink's pairwise comparison hit the curse-of-the-last-reducer; Photon cut the worst-case partition by 15×.

concepts/entity-resolution — the broader problem class.
concepts/golden-record — the canonical-record output of a successful ER run.
systems/splink — the canonical open-source probabilistic record linkage framework on the wiki.
concepts/curse-of-the-last-reducer — the operational failure mode pairwise comparison canonically exhibits.
patterns/hybrid-classical-er-plus-genai — composes probabilistic record linkage with GenAI agents.