Solution — splink (probabilistic)
uv sync --extra splink
uv run mercury splink <customers.json> <banks.json>

Ports the matching decision onto splink, the actively-maintained Fellegi–Sunter library from the UK Ministry of Justice. Splink implements the paper properly (likelihood-ratio scoring, EM-estimated m/u probabilities, blocking, all of it), backed by DuckDB / Spark / Postgres / SQLite.
Why this variant exists
Our match and triage solutions implement a hand-tuned approximation of the F&S decomposition (see matching-approach). Splink is what that decomposition looks like when it’s done right. Running both side-by-side on the same 9-link fixture is a cheap way to sanity-check the hand-tuned thresholds and to feature-gate the interesting future work (EM training, blocking, graduated probability output).
How it’s wired
The splink port is isolated from the rest of the package:
- splink is an optional dependency; install with uv sync --extra splink.
- The implementation lives in src/mercury/solution_splink.py; the top-level splink/pandas imports happen inside run(), not at module import. The rest of the package has no awareness that splink exists.
- The CLI subcommand lazy-imports mercury.solution_splink only when invoked. mercury match and mercury triage run on a plain uv sync without splink present.
- Missing-dependency failure mode is a clean error with install instructions, not a traceback.
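The lazy-import wiring described above can be sketched roughly as follows. This is a minimal illustration, not the code in src/mercury/solution_splink.py: the helper name `require` and the `run()` signature are assumptions made for the example.

```python
import importlib


def require(module_name: str, extra: str):
    """Import an optional module, or exit with install instructions."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Raising SystemExit with a string prints the message to stderr and
        # exits with status 1; "from None" suppresses the chained traceback.
        raise SystemExit(
            f"'{module_name}' is not installed. "
            f"Install it with: uv sync --extra {extra}"
        ) from None


def run(customers_path: str, links_path: str) -> None:
    # Hypothetical entry point: optional imports happen here, inside the
    # function, so importing this module never requires splink or pandas.
    splink = require("splink", extra="splink")
    pandas = require("pandas", extra="splink")
    # ... flatten inputs, build the linker, print verdicts ...
```

Because the import is deferred to call time, a plain install only hits the error path when the splink subcommand is actually invoked.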
Data shape
Splink compares records, so we flatten the two nested input documents into row tables:
- Customers: one row per (company × user) plus one business row per company. Each row has (unique_id, mercuryCompanyId, name, email, phone).
- Links: one row per link, with the first/primary entry from the names, emails, phoneNumbers arrays.
Linkage blocks on mercuryCompanyId, which reproduces our match/triage constraint that only within-company pairs are considered candidates.
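As a rough sketch of the link-side flattening, assuming each link is a dict carrying mercuryCompanyId plus the names, emails, and phoneNumbers arrays. The normalization rules here (lowercasing and trimming emails, dropping a leading US country code from phones) are assumptions for illustration, not necessarily what the package does.

```python
import re


def first(values):
    """Return the first (primary) entry of a list, or None if empty or missing."""
    return values[0] if values else None


def normalize_email(value):
    # Assumed normalization: trim whitespace and lowercase.
    return value.strip().lower() if value else None


def normalize_phone(value):
    # Keep digits only; assume a leading "1" on 11 digits is a US country code.
    digits = re.sub(r"\D", "", value or "")
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits if len(digits) == 10 else None


def flatten_link(link: dict, unique_id: str) -> dict:
    """One row per link, using the first entry of each array field."""
    return {
        "unique_id": unique_id,
        "mercuryCompanyId": link.get("mercuryCompanyId"),
        "name": first(link.get("names")),
        "email": normalize_email(first(link.get("emails"))),
        "phone": normalize_phone(first(link.get("phoneNumbers"))),
    }
```

The customer side would follow the same pattern, emitting one row per (company × user) plus the business row, with the same five columns so both tables line up.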
Comparisons
Using splink’s built-in comparison library:
- cl.JaroWinklerAtThresholds("name"): fuzzy name agreement at a few distance tiers.
- cl.ExactMatch("email"): exact string equality on normalized email.
- cl.ExactMatch("phone"): exact equality on the digit-only 10-digit form.
No EM training. With nine labeled links, EM estimation would be pure noise. Splink uses library-default m/u probabilities.
Backend
DuckDB in-memory (DuckDBAPI()). Two reasons:
- Every splink comparison function works on DuckDB; SQLiteAPI drops some of them.
- No file lifecycle to manage.
Swapping to file-backed SQLite or DuckDB is a one-line change — both support the same Linker API — if you want reproducibility-from-a-single-file for the handoff bundle.
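Putting the blocking rule, comparisons, and backend together, the setup might look something like the sketch below. It assumes a splink 4-style API (SettingsCreator, block_on, Linker, DuckDBAPI) and two pandas DataFrames with the flattened columns. The function names are hypothetical, and the 0.5 prior is the value explained under Prior calibration.

```python
def build_linker(customers_df, links_df):
    """Build a splink Linker over the two flattened row tables."""
    # Imported inside the function, mirroring the lazy-import wiring.
    import splink.comparison_library as cl
    from splink import DuckDBAPI, Linker, SettingsCreator, block_on

    settings = SettingsCreator(
        link_type="link_only",  # compare across the two tables only
        unique_id_column_name="unique_id",
        # Only within-company pairs become candidates.
        blocking_rules_to_generate_predictions=[block_on("mercuryCompanyId")],
        comparisons=[
            cl.JaroWinklerAtThresholds("name"),
            cl.ExactMatch("email"),
            cl.ExactMatch("phone"),
        ],
        # Raised from the library default because of the aggressive blocking.
        probability_two_random_records_match=0.5,
    )
    # DuckDBAPI() with no arguments uses an in-memory database.
    return Linker([customers_df, links_df], settings, db_api=DuckDBAPI())


def score_pairs(linker):
    """Score all candidate pairs; no EM training step is run beforehand."""
    return linker.inference.predict().as_pandas_dataframe()
```

Since no training call is made, splink falls back to its default m/u values, which matches the "No EM training" choice above.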
Output
Per link: the maximum match probability across that link’s candidate pairs (one per flattened customer view), banded the same way as solution-triage:
p >= 0.9 → Match
p >= 0.5 → Review
p < 0.5 → Mismatch
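A minimal sketch of the per-link reduction and banding, assuming the scored pairs arrive as (link_id, match_probability) tuples. The function names are illustrative, not the package's own.

```python
def band(p: float) -> str:
    """Map a match probability onto the triage-style verdict bands."""
    if p >= 0.9:
        return "Match"
    if p >= 0.5:
        return "Review"
    return "Mismatch"


def verdicts(pair_scores):
    """Keep the best-scoring candidate pair per link, then band it."""
    best = {}
    for link_id, p in pair_scores:
        best[link_id] = max(p, best.get(link_id, 0.0))
    return {link_id: (band(p), p) for link_id, p in best.items()}
```

A link with no candidate pairs at all never appears in the result of this sketch, so the caller would need to decide how to report it.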
Prior calibration
Splink’s default probability_two_random_records_match is 0.0001 — the global rate, which would be right if we were scoring every possible (customer, link) cross product. But we block on mercuryCompanyId, so our candidate pairs are all within-company; the prior inside each block is much higher. We set it to 0.5 in SettingsCreator. Without this, splink’s posterior stays pinned near the prior and only links with all three fields agreeing clear the p≥0.9 threshold (on the fixture, that’s only link 1).
This is the one piece of domain knowledge that has to come from us, not from splink — splink can’t know how aggressively the caller blocked.
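The effect of the prior can be seen with the standard odds form of Bayes' rule: posterior odds equal prior odds times the combined Bayes factor from the comparisons. The sketch below uses a hypothetical Bayes factor of 16 purely for illustration. It is not a value read out of splink.

```python
def posterior(prior: float, bayes_factor: float) -> float:
    """Posterior match probability from a prior and a combined Bayes factor."""
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1.0 + posterior_odds)


def bayes_factor_needed(prior: float, target: float) -> float:
    """Bayes factor required for the posterior to reach `target`."""
    return (target / (1.0 - target)) / (prior / (1.0 - prior))


# Same hypothetical evidence, two priors:
p_blocked = posterior(0.5, 16)     # about 0.941
p_global = posterior(0.0001, 16)   # about 0.0016
```

With the 0.0001 prior, reaching p≥0.9 needs a Bayes factor of roughly 90,000. With a prior of 0.5 it needs only 9, which is why the blocked prior lets partial agreement clear the Match band.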
Sample output on the 9-link fixture
Total matches: 5
Total reviews: 1
Total mismatches: 3
Link 1: Match (p=1.000)
Link 2: Review (p=0.500)
Link 3: Match (p=0.941)
Link 4: Mismatch (p=0.000)
Link 5: Match (p=0.996)
Link 6: Mismatch (p=0.000)
Link 7: Match (p=0.941)
Link 8: Match (p=0.941)
Link 9: Mismatch (p=0.000)
Comparison with triage
| Link | solution-triage | splink |
|---|---|---|
| 1 | Match | Match (1.000) |
| 2 | Review | Review (0.500) |
| 3 | Match | Match (0.941) |
| 4 | Mismatch | Mismatch (0.000) |
| 5 | Match | Match (0.996) |
| 6 | Review | Mismatch (0.000) |
| 7 | Match | Match (0.941) |
| 8 | Review | Match (0.941) |
| 9 | Mismatch | Mismatch (0.000) |
Two disagreements, both interpretable:
- Link 6: Cy/Cyril. Splink has no nickname data wired in. Jaro-Winkler on “Cy” vs “Cyril” is not high enough to offset the lack of phone or email match. Our scoring-nicknames module is what carries the day for triage. Fixing splink would mean either (a) pre-canonicalizing names with our nicknames map before feeding splink, or (b) writing a custom Comparison that expands nicknames internally.
- Link 8: “Cyril” alone + phone. Splink’s Jaro-Winkler treats the partial name as a meaningful agreement; combined with a phone match, p=0.941 clears 0.9. Our hand-tuned score_name gives partial-name a flat 0.5 regardless of which token agrees, so triage lands at total=2.75 and goes to Review. Which is right depends on whether you believe “Cyril” alone is specific enough to be decisive: splink effectively says yes, we hedge.
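Option (a) for link 6 could be as small as the sketch below, which rewrites nickname tokens before the rows reach splink. The one-entry map and the lowercasing step are stand-ins for illustration. The real nicknames data lives in the scoring-nicknames module and is not reproduced here.

```python
# Hypothetical stand-in for the project's nicknames map.
NICKNAMES = {"cy": "cyril"}


def canonicalize_name(name: str, nicknames: dict = NICKNAMES) -> str:
    """Lowercase a name and replace any nickname token with its canonical form."""
    tokens = name.lower().split()
    return " ".join(nicknames.get(token, token) for token in tokens)


# "Smith" is an invented surname for the example.
assert canonicalize_name("Cy Smith") == canonicalize_name("Cyril Smith")
```

Applied to both tables during flattening, this would let the existing Jaro-Winkler comparison see an exact first-name agreement without any custom Comparison.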
What this tells us
The splink variant is an honest second opinion, and on 7 of 9 links it agrees with our hand-tuned triage. The two disagreements both pinpoint places where the hand-tuned matcher carries domain knowledge splink doesn’t have yet: the nicknames file (link 6) and the deliberate coarse-grained partial-name score (link 8, where we chose caution).
The gap between splink’s output and ours is exactly the gap between “we picked reasonable weights” and “we trained the right weights.” Closing it requires some combination of:
- EM training on a larger labeled corpus — real Mercury production data, where EM has enough observations to estimate m/u per comparison level.
- Setting m/u explicitly in SettingsCreator from prior knowledge. Our hand-tuned NAME_WEIGHT=2.5 / PHONE_WEIGHT=1.5 / EMAIL_WEIGHT=1.5 is an unprincipled version of exactly this; the splink port is the invitation to replace those constants with justified ones.
- Feeding splink our nickname canonicalization so link 6 is not a pure handicap. Low-effort, high-value.
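To make the link between m/u and weights concrete: in the Fellegi–Sunter model, a field's agreement weight is the base-2 log of m/u and its disagreement weight is the base-2 log of (1−m)/(1−u). The sketch below shows that arithmetic with made-up m/u values. They are illustrative only, not estimates for this data.

```python
import math


def agreement_weight(m: float, u: float) -> float:
    """log2 Bayes factor when the field agrees."""
    return math.log2(m / u)


def disagreement_weight(m: float, u: float) -> float:
    """log2 Bayes factor when the field disagrees."""
    return math.log2((1.0 - m) / (1.0 - u))


# Hypothetical values: the field agrees on 80% of true matches (m)
# and on 10% of non-matches (u).
w_agree = agreement_weight(0.8, 0.1)        # log2(8), about 3.0
w_disagree = disagreement_weight(0.8, 0.1)  # log2(0.2 / 0.9), negative
```

Unlike the flat hand-tuned constants, weights derived this way are asymmetric: agreement and disagreement on the same field move the score by different amounts, depending on how common chance agreement is.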
Stretch: what a real port would add
- Train m with labels (ground truth from the fraud team) and u with random-sampling estimation.
- Expose the per-field agreement weights as a diagnostic output, so the splink solution can compete with solution-triage on the “why this verdict?” axis.
- Wire a SQLite file backend and include the db in the handoff bundle, so the whole pipeline is inspectable without re-running.
See interview/plan.md for the tracking stretch goal.