CONCEPT Cited by 1 source
Probabilistic record linkage¶
Definition¶
Probabilistic record linkage is the formal-statistical formulation of entity resolution: given two records across (or within) datasets, infer the probability that they refer to the same real-world entity, based on the agreement / disagreement of their fields. The output is a calibrated match probability — not a binary yes / no — and a per-field match weight quantifying each field's evidential contribution.
The canonical formulation is the Fellegi-Sunter model (1969):
- For each comparison field f (name, address, phone, …), define:
  - m_f = P(field values match | records refer to same entity)
  - u_f = P(field values match | records refer to different entities)
- The match weight for field f is w_f = log₂(m_f / u_f), a log Bayes factor.
- The total match weight for a candidate pair is W = Σ_f w_f, where a field that matches contributes log₂(m_f / u_f) and a field that disagrees contributes log₂((1 − m_f) / (1 − u_f)).
- Decision: classify the pair as match / non-match / review based on thresholds on W.
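The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not any library's implementation: the m and u values in `PARAMS` and the two thresholds in `classify` are hypothetical numbers chosen only to make the arithmetic visible.

```python
import math

# Hypothetical m/u probabilities per field (illustrative assumptions, not estimates).
PARAMS = {
    "name":    {"m": 0.90, "u": 0.01},
    "address": {"m": 0.80, "u": 0.02},
    "phone":   {"m": 0.95, "u": 0.001},
}

def field_weight(m, u, agrees):
    """Log2 Bayes factor contributed by one field's agreement or disagreement."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))

def total_weight(agreement, params=PARAMS):
    """Sum per-field weights; `agreement` maps field name -> True/False."""
    return sum(
        field_weight(params[f]["m"], params[f]["u"], agrees)
        for f, agrees in agreement.items()
    )

def classify(weight, lower=0.0, upper=8.0):
    """Three-way decision with two assumed thresholds on the total weight W."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "review"

pair = {"name": True, "address": False, "phone": True}
w = total_weight(pair)
print(f"W = {w:.2f} -> {classify(w)}")
```

Because each field's contribution is computed separately, printing `field_weight` per field gives the audit trail that explains why a pair landed in a given class.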
How m and u are estimated¶
m_f and u_f cannot be observed directly: labelling all pairs as same-entity or different-entity is the problem we're trying to solve. Two practical approaches:
- EM algorithm (unsupervised). Treat the latent same-entity / different-entity label as a hidden variable; iterate expectation-maximisation to estimate m_f and u_f from the data itself. This is the Splink default.
- Labelled training pairs (supervised). If a sample of ground-truth matches / non-matches is available, estimate m and u directly. Higher-quality estimates but expensive.
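The unsupervised route can be sketched as a small EM loop over binary agreement vectors. This is a simplified illustration under stated assumptions (binary agree / disagree comparisons, fields conditionally independent given match status, synthetic data generated by the hypothetical `simulate_pairs` helper); it is not Splink's implementation, and all function names here are made up for the example.

```python
import random

def simulate_pairs(n, lam, m, u, seed=0):
    """Synthetic agreement vectors: each pair is a match with probability lam."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        probs = m if rng.random() < lam else u
        pairs.append(tuple(int(rng.random() < p) for p in probs))
    return pairs

def clamp(x, eps=1e-6):
    """Keep probabilities strictly inside (0, 1) to avoid division by zero."""
    return min(max(x, eps), 1 - eps)

def em_estimate(pairs, n_fields, iters=60, lam=0.5, m_init=0.9, u_init=0.1):
    """Estimate m_f, u_f and the match proportion lam without labels."""
    m = [m_init] * n_fields
    u = [u_init] * n_fields
    n = len(pairs)
    for _ in range(iters):
        # E-step: posterior probability that each pair is a match.
        post = []
        for g in pairs:
            pm, pu = lam, 1.0 - lam
            for f in range(n_fields):
                pm *= m[f] if g[f] else 1.0 - m[f]
                pu *= u[f] if g[f] else 1.0 - u[f]
            post.append(pm / (pm + pu))
        # M-step: re-estimate parameters from the soft labels.
        total = sum(post)
        lam = clamp(total / n)
        for f in range(n_fields):
            agree_match = sum(p for p, g in zip(post, pairs) if g[f])
            agree_all = sum(1 for g in pairs if g[f])
            m[f] = clamp(agree_match / total)
            u[f] = clamp((agree_all - agree_match) / (n - total))
    return m, u, lam

true_m = [0.90, 0.85, 0.95, 0.80]
true_u = [0.05, 0.10, 0.01, 0.15]
pairs = simulate_pairs(4000, 0.2, true_m, true_u)
m_hat, u_hat, lam_hat = em_estimate(pairs, n_fields=4)
print([round(x, 2) for x in m_hat], [round(x, 2) for x in u_hat], round(lam_hat, 2))
```

The initial values (m high, u low) break the symmetry between the two latent classes; with a poor starting point EM can converge to a solution where the labels are swapped.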
Why it works at scale¶
- Calibrated output. Match weights are log₂ Bayes factors, so a weight of +5 means the observed evidence is 2⁵ = 32× more likely under the match hypothesis than under the non-match hypothesis; it multiplies the prior odds of a match by 32. This interpretation gives the operator a principled way to set match / review / non-match thresholds based on the cost of a false positive vs a false negative in the domain.
- Per-field decomposition. Different fields contribute different amounts of evidence (a matching phone number is worth more than a matching first name). The decomposition makes the score inspectable: an analyst can see why a pair was matched.
- Composable. Adding a new comparison field is additive: estimate m_f and u_f for the new field and sum its weight into W. Under the model's assumption that fields are conditionally independent given match status, the existing fields need no retraining.
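Turning a total weight into a match probability requires a prior, since the weight only scales the prior odds. The sketch below shows that conversion; the function name `match_probability` and the example priors are assumptions made for illustration.

```python
def match_probability(total_weight, prior):
    """Posterior P(match) from a log2 match weight and a prior P(match)."""
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * 2.0 ** total_weight
    return posterior_odds / (1.0 + posterior_odds)

# The same +5 weight under two assumed priors.
even_prior = match_probability(5, prior=0.5)    # prior odds 1:1
rare_prior = match_probability(5, prior=0.001)  # 1 in 1,000 pairs is a match
print(f"{even_prior:.3f} vs {rare_prior:.3f}")
```

With even prior odds a +5 weight gives a high posterior, while with a rare-match prior the same weight leaves the pair far more likely to be a non-match, which is why thresholds on W should be chosen with the expected match rate in mind.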
Relationship to entity resolution¶
Probabilistic record linkage is the formal-statistical core of classical ER:
- ER's three-stage decomposition (blocking → matching → canonicalisation) maps onto record linkage as: blocking (limit pairs considered) → record linkage scoring (Fellegi-Sunter) → cluster the matched pairs into canonical records.
- Hybrid ER (patterns/hybrid-classical-er-plus-genai) uses probabilistic record linkage as the deterministic-traceability half: every match is explained by per-field match weights, fully auditable.
- GenAI multi-agent ER (patterns/orchestrated-multi-agent-entity-resolution) uses probabilistic record linkage as a calibration substrate: the Reasoning agent's confidence score can be expressed as a match weight, comparable to the classical scoring.
Why probabilistic, not deterministic¶
Deterministic 1:1 matching (string equality, normalised hash) fails on real-world data because:
- Sparse evidence. Records routinely have missing fields.
- Field-level noise. Misspellings, abbreviations, format drift.
- Vendor-vs-internal naming drift. The Claroty CPS observation generalises: 88% of CPS assets "do not transmit an exact product code" — same drift exists in pharmaceutical SKUs, automotive parts, e-commerce listings.
Probabilistic matching tolerates partial agreement; weights quantify how much evidence each partial agreement carries.
Operational characteristics¶
- Pairwise comparison is quadratic. Naïvely, deduplicating N records means comparing N(N−1)/2 pairs, which grows as N². Mitigation: blocking rules prune the pair space to a manageable subset.
- Skewed pair-distribution. Some blocks are huge (common city / common surname / null phone), some are tiny. Skew at the pairwise-comparison stage produces the canonical curse-of-the-last-reducer straggler problem on Spark; the VF Match FDR pipeline observed 30 minutes worst-case partition vs 52 seconds median — a ~35× ratio.
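Both points can be made concrete by counting pairs. The following sketch uses a hypothetical dataset of 1,000 records blocked on an assumed `city` field, with one large city and many small towns, and compares the naive pair count with the per-block counts.

```python
from collections import Counter

def pairs_within(n):
    """Number of unordered pairs among n records."""
    return n * (n - 1) // 2

def blocked_pair_counts(records, key):
    """Pairs generated per block when only records sharing `key` are compared."""
    sizes = Counter(key(r) for r in records)
    return {block: pairs_within(size) for block, size in sizes.items()}

# Hypothetical data: 400 records in one city, 600 spread over 60 towns of 10.
records = [
    {"id": i, "city": "metropolis" if i < 400 else f"town-{(i - 400) % 60}"}
    for i in range(1000)
]

naive = pairs_within(len(records))
counts = blocked_pair_counts(records, key=lambda r: r["city"])
blocked = sum(counts.values())
largest = max(counts.values())

print(f"naive pairs:   {naive}")
print(f"blocked pairs: {blocked}")
print(f"largest block share: {largest / blocked:.1%}")
```

Blocking cuts the total work substantially, but a single oversized block still holds most of the remaining pairs. On a distributed engine that block becomes the straggler partition unless the blocking key is refined or the block is split.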
Seen in¶
- sources/2026-05-20-databricks-virtue-foundation-medical-volunteers-72-countries: canonical wiki source. VF Match FDR uses Splink as the probabilistic-record-linkage framework, applied to ~thousands of facility / NGO records with weighted comparisons across phone / street address / name fields. Splink's pairwise comparison hit the curse-of-the-last-reducer; Photon cut the worst-case partition by 15×.
Related¶
- concepts/entity-resolution — the broader problem class.
- concepts/golden-record — the canonical-record output of a successful ER run.
- systems/splink — the canonical open-source probabilistic record linkage framework on the wiki.
- concepts/curse-of-the-last-reducer — the operational failure mode pairwise comparison canonically exhibits.
- patterns/hybrid-classical-er-plus-genai — composes probabilistic record linkage with GenAI agents.