Splink
Splink is an open-source Python library for probabilistic record linkage (entity resolution / deduplication) at scale. It was originally developed by the UK Ministry of Justice analytical services team and is released under the MIT licence. Splink implements the classical Fellegi-Sunter probabilistic matching model, estimates match weights with the EM algorithm, and executes over a pluggable SQL backend (DuckDB, Spark, Athena, SQLite, Postgres), so the same model definition runs from laptop scale (millions of records) to cluster scale (billions of records).
Stub page. First wiki disclosure 2026-05-20 via the Databricks + Virtue Foundation post.
Capabilities
- Probabilistic match weights via Fellegi-Sunter. For each comparison column (name, address, phone, email), Splink learns a pair of probabilities:
  - `m` = probability of matching values given the records are the same entity;
  - `u` = probability of matching values given the records are different entities.

  The log Bayes factor `log2(m/u)` is the match weight Splink sums across columns to score a candidate pair.
- EM-algorithm-driven training. Match weights are learned unsupervised from the data itself via expectation-maximisation; no labelled training pairs are required.
- Blocking rules to make pairwise tractable. Splink exposes declarative blocking rules (e.g. "compare pairs where surname matches") that prune the quadratic comparison space to manageable size. The user can layer multiple blocking rules.
- SQL-pluggable backend. The Splink model compiles to SQL; the same definition runs on DuckDB (single-machine, in-memory), Spark (cluster), or others.
- Interactive diagnostics. Built-in visualisations show match-weight distributions, comparison-vector frequencies, and decision-threshold ROC-style curves to tune precision / recall trade-offs.
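
The match-weight arithmetic in the first bullet can be sketched in plain Python. This is a minimal illustration of Fellegi-Sunter scoring and does not use Splink's API: the `PARAMS` values, the `prior` argument, and both function names are hypothetical, and the sketch assumes simple agree/disagree comparisons with base-2 logarithms.

```python
import math

# Hypothetical m/u probabilities per comparison column (illustrative values only).
# m = P(values agree | same entity), u = P(values agree | different entities).
PARAMS = {
    "name": {"m": 0.90, "u": 0.01},
    "phone": {"m": 0.80, "u": 0.001},
    "city": {"m": 0.95, "u": 0.20},
}


def match_weight(m: float, u: float, agrees: bool) -> float:
    """log2 Bayes factor for one column; uses (1-m)/(1-u) when the values disagree."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def score_pair(agreement: dict[str, bool], prior: float) -> tuple[float, float]:
    """Return (total match weight, match probability) for one candidate pair."""
    # Start from the prior odds that two random records match, expressed as a weight.
    total = math.log2(prior / (1 - prior))
    for column, agrees in agreement.items():
        p = PARAMS[column]
        total += match_weight(p["m"], p["u"], agrees)
    bayes_factor = 2 ** total
    return total, bayes_factor / (1 + bayes_factor)


weight, probability = score_pair(
    {"name": True, "phone": True, "city": True}, prior=0.001
)
```

With these made-up parameters, a pair that agrees on all three columns ends up with a match probability close to 0.997 despite the low prior, because the per-column weights add in log space.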
Architectural role on Databricks
In the VF Match Foundational Data Refresh (FDR) pipeline, Splink is the entity resolution stage between the multi-step LLM extraction and the unified canonical-facility table. After GPT-driven extraction emits structured records (per facility per source), Splink:
- Generates candidate match pairs via blocking rules over the extracted records.
- Scores each pair via weighted comparisons across phone / street address / name / etc. (per the figure-2 ruleset referenced in the post).
- Emits a unified key per facility that ties multi-source evidence into a single canonical record.
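
The last step above, turning scored pairs into one key per facility, amounts to finding connected components over the pairs that clear a threshold. The following is a minimal union-find sketch of that idea, not the FDR pipeline's code or Splink's clustering API; the function name, the record ids, and the 0.9 threshold are hypothetical.

```python
def cluster_pairs(scored_pairs, threshold):
    """Map each record id to a cluster key using pairs at or above the threshold."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for left, right, probability in scored_pairs:
        root_left, root_right = find(left), find(right)  # registers both ids
        if probability >= threshold and root_left != root_right:
            # Keep the smaller id as root so the cluster key is deterministic.
            parent[max(root_left, root_right)] = min(root_left, root_right)

    return {record_id: find(record_id) for record_id in parent}


# Hypothetical scored pairs: (record id, record id, match probability).
pairs = [
    ("src_a:1", "src_b:7", 0.98),
    ("src_b:7", "src_c:3", 0.96),
    ("src_a:2", "src_c:3", 0.40),
]
unified_keys = cluster_pairs(pairs, threshold=0.9)
```

Here `src_a:1`, `src_b:7`, and `src_c:3` share the key `src_a:1`, while `src_a:2` stays in its own cluster because its only pair falls below the threshold.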
Splink runs on Spark for the FDR workload — pairwise comparison is inherently skewed (common comparison patterns produce massive partitions), and the FDR team observed the canonical curse-of-the-last-reducer straggler problem: "one Spark partition running for 30 minutes while the median completed in 52 seconds." Enabling Photon (Databricks' vectorised query engine) reduced the worst-case partition from 30 minutes to ~2 minutes — a 15× improvement.
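
One way to see why pairwise comparison skews is to count comparisons per blocking-key value: a block of n records produces n(n-1)/2 pairs, so a single common value dominates the work. The sketch below uses invented key values and sizes and does not model Spark partitioning or the FDR data; it only illustrates the quadratic growth.

```python
from collections import Counter
from statistics import median


def pairs_per_block(blocking_keys):
    """Number of within-block comparisons, n*(n-1)/2, for each blocking-key value."""
    sizes = Counter(blocking_keys)
    return {key: n * (n - 1) // 2 for key, n in sizes.items()}


# Hypothetical data: one very common key value plus 500 rare ones of 4 records each.
keys = ["general hospital"] * 2000 + [
    f"clinic-{i}" for i in range(500) for _ in range(4)
]
work = pairs_per_block(keys)
largest = max(work.values())     # 1_999_000 comparisons in the big block
typical = median(work.values())  # 6 comparisons in a typical block
```

In this toy input the large block holds half the records but almost all of the comparisons, which is the shape that leaves one task running long after the median task has finished.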
Splink in the wiki's ER taxonomy
Splink is the classical-ER half of the hybrid classical + GenAI shape. The classical side provides:
- Deterministic traceability: every match is explained by a per-comparison-column match weight.
- Calibration: Fellegi-Sunter match weights are interpretable as base-2 log likelihood ratios (a match weight of 5 is a Bayes factor of 2^5 = 32, meaning the evidence multiplies the odds of a match by 32).
- Predictability: same input + same model = same output; no LLM prompt-sensitivity.
The GenAI side (NLP / Reasoning agents in patterns/orchestrated-multi-agent-entity-resolution) handles the cognitive parsing of mixed-format strings before they reach Splink for scoring. The two sides compose: agents produce structured candidate records; Splink computes match weights.
Operating envelope
Single named scale point: VF Match FDR runs Splink across thousands of healthcare facilities and non-profits. The post mentions terabyte-scale data volumes for the upstream extraction (25M+ web pages); the ER stage operates on the extracted-record subset.
The post explicitly names the dominant performance challenge: "Running probabilistic matching across thousands of healthcare facilities and non-profits revealed classic performance bottlenecks that emerge at terabyte scale. The core of record linkage is pairwise comparison, which creates inherently skewed workloads."
Why open-source matters here
The Databricks for Good context illustrates the value of open-source ER: a non-profit (Virtue Foundation) can ship a production ER pipeline on Splink without a per-record licence fee, with full control over match-weight tuning and blocking rules. Closed-source ER stacks (Informatica, IBM, etc.) would have created a per-record cost structure that does not match the non-profit's funding model.
Caveats
- Splink is the matcher, not the orchestrator. Blocking-rule selection, match-weight thresholds, and the post-match canonicalisation of clusters into golden records are still the user's responsibility.
- Performance shape depends heavily on blocking rules. Naïve blocking (e.g. "compare every pair where city matches") can blow up to billions of pairs; tight blocking risks missing matches. Tuning is an ER-specialist task.
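- Probabilistic, not deterministic 1:1 matching. Splink outputs a match probability for each candidate pair, and clusters are formed by applying a threshold to those scores; the threshold choice is an operational decision. In regulated domains, below-threshold cases need human-in-the-loop (HITL) routing.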
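
The threshold caveat above can be made concrete with a small triage function. This is a sketch under stated assumptions: the three-band design, the band names, and the default cut-offs of 0.95 and 0.50 are hypothetical and do not come from the post or from Splink.

```python
def route(match_probability: float,
          accept_at: float = 0.95,
          reject_below: float = 0.50) -> str:
    """Send a scored pair to auto-linking, human review, or rejection."""
    if not 0.0 <= match_probability <= 1.0:
        raise ValueError("match_probability must be between 0 and 1")
    if match_probability >= accept_at:
        return "auto_link"
    if match_probability < reject_below:
        return "no_link"
    # Ambiguous middle band: queue for a human reviewer.
    return "human_review"


decisions = [route(p) for p in (0.99, 0.72, 0.10)]
```

Widening the middle band increases the review workload but reduces the number of links made or discarded without a human check; where to set the two cut-offs remains the operational decision described above.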
Seen in
- sources/2026-05-20-databricks-virtue-foundation-medical-volunteers-72-countries — Canonical wiki source. VF Match's FDR pipeline uses Splink as the ER stage between GPT-driven extraction and the unified canonical-facility table; runs on Spark; Photon enabled to absorb partition skew (30 min → 2 min worst-case partition, 15× improvement).
Related
- concepts/entity-resolution — the problem class.
- concepts/probabilistic-record-linkage — the formal problem-class formulation Splink implements.
- systems/apache-spark — the distributed-execution backend Splink uses on Databricks.
- systems/photon — the vectorised query engine that collapsed Splink's straggler partition by 15× on the FDR workload.
- systems/duckdb — alternative single-machine backend for Splink.
- concepts/curse-of-the-last-reducer — the failure-mode Splink's pairwise-comparison core canonically exhibits.
- patterns/hybrid-classical-er-plus-genai — Splink is the classical-ER half of this pattern's reference instances.
- systems/vf-match — first wiki canonical instance of Splink in production.