Edge Cases — Handled and Anticipated
A curated list of what the v1 matcher handles and what it deliberately doesn’t. The “anticipated” list is where the next iteration would invest, roughly in order of likely prevalence and impact.
Handled by the v1 matcher
These are covered by the field normalizers and scorers, and verified by the test suite.
Phone
- Variable formatting: `5557609870`, `(555)-760-9870`, `555 098 9870`, `555-010-9988`. Digit-only normalization reduces each to a bare digit string.
- `+1` and other country codes: the last-10-digits rule strips the prefix.
- Empty phone arrays on the link side.
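The phone rules above can be sketched as follows. This is a minimal illustration, not the v1 code: the function names `normalize_phone` and `phone_score` are hypothetical, and the 1.0 / 0.0 score values are assumed for illustration.

```python
import re


def normalize_phone(raw: str) -> str:
    """Reduce a phone string to at most its last 10 digits (NANP assumption)."""
    digits = re.sub(r"\D", "", raw)  # drop spaces, dashes, parentheses, '+'
    return digits[-10:]              # a leading country code such as 1 falls away


def phone_score(link_phones: list[str], user_phones: list[str]) -> float:
    """Assumed scoring: 1.0 if any link phone matches any user phone, else 0.0."""
    user_set = {normalize_phone(p) for p in user_phones}
    user_set.discard("")  # strings with no digits never match
    # any() stops at the first match; an empty link list yields 0.0.
    return 1.0 if any(normalize_phone(p) in user_set for p in link_phones) else 0.0
```

For example, `phone_score(["+1 (555) 760-9870"], ["555-760-9870"])` returns 1.0 under these assumptions, while an empty link-side array returns 0.0.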
Email
- Case and whitespace insensitivity (`John@Example.COM` ≡ `john@example.com`).
- `contactEmail` matches as well as `user.email` (link 5: `ram@example.fr` matches the company contact but not the user). The candidate set is the union.
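A small sketch of the email comparison, assuming normalization means trimming surrounding whitespace and lowercasing. The names `normalize_email` and `email_score` and the 1.0 / 0.0 values are illustrative assumptions, not the v1 implementation.

```python
from typing import Iterable, Optional


def normalize_email(raw: str) -> str:
    """Trim surrounding whitespace and lowercase the whole address."""
    return raw.strip().lower()


def email_score(
    link_emails: Iterable[str],
    user_email: Optional[str],
    contact_email: Optional[str],
) -> float:
    """Assumed scoring: 1.0 if any link email is in the candidate union, else 0.0."""
    # Candidate set is the union of the user's email and the company contact email.
    candidates = {normalize_email(e) for e in (user_email, contact_email) if e}
    return 1.0 if any(normalize_email(e) in candidates for e in link_emails) else 0.0
```

With this shape, a link carrying `ram@example.fr` matches when that address is only the company contact, which mirrors the link 5 case.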
Name (personal)
- Title prefixes: `Mr.`, `Mrs.`, `Ms.`, `Dr.`, `Sir`, `Prof.`, etc.
- Middle initials (`John B. Smith` → `John Smith`), including when followed by a period.
- Extra middle names (`Emmanuel Francisco Windal` covers `Emmanuel Windal`).
- Generational suffixes: `Jr.`, `Sr.`, `II`, `III`, `IV`.
- Possessives: `Ram's` ≡ `Rams`. Apostrophes are stripped in place rather than split on.
- Nicknames: `Cy` ↔ `Cyril` ↔ `Cyrus` via transitive equivalence classes from the supplied nicknames file (see scoring-nicknames).
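The personal-name rules could be approximated as below. Everything here is a sketch under stated assumptions: the token sets are abbreviated, tokenization is ASCII-only, the nicknames file is assumed to parse into rows of interchangeable names, and all function names are hypothetical.

```python
import re

TITLES = {"mr", "mrs", "ms", "dr", "sir", "prof"}
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}


def name_tokens(raw: str) -> list[str]:
    """Lowercase, strip apostrophes in place, drop titles, suffixes, middle initials."""
    cleaned = raw.lower().replace("'", "").replace("\u2019", "")  # Ram's -> rams
    tokens = re.findall(r"[a-z]+", cleaned)  # ASCII only; periods act as separators
    tokens = [t for t in tokens if t not in TITLES and t not in SUFFIXES]
    if len(tokens) > 2:
        # Keep the first and last token; drop single-letter tokens in between.
        tokens = [tokens[0], *(t for t in tokens[1:-1] if len(t) > 1), tokens[-1]]
    return tokens


def covers(longer: list[str], shorter: list[str]) -> bool:
    """True when every token of the shorter name appears in the longer one."""
    return bool(shorter) and set(shorter) <= set(longer)


def build_nickname_classes(rows: list[list[str]]) -> dict[str, str]:
    """Union-find over nickname rows; maps each name to a class representative."""
    parent: dict[str, str] = {}

    def find(name: str) -> str:
        parent.setdefault(name, name)
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for row in rows:
        names = [n.lower() for n in row]
        for other in names[1:]:
            parent[find(other)] = find(names[0])  # merge the two classes
    return {name: find(name) for name in parent}


def same_given_name(a: str, b: str, classes: dict[str, str]) -> bool:
    """Names agree if identical or if they fall in the same nickname class."""
    return classes.get(a, a) == classes.get(b, b)
```

Because the merge is transitive, rows `["Cy", "Cyril"]` and `["Cy", "Cyrus"]` place `cyril` and `cyrus` in one class even though no single row lists them together.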
Name (business)
- Corporate suffixes: `Inc.`, `LLC`, `Ltd.`, `Corp.`, `Co.`, `GmbH`. Stripped symmetrically on both sides.
- Company-vs-person conflation: link names are compared against the union of user names, `tradeName`, and `legalName`. Handles `InfoLinks Technologies` ↔ `InfoLinks Technologies, Inc.`.
- Spurious single-token overlap (e.g. `"media"` in link 9’s `IN MEDIA RES PUBLISHING` vs `LASSAD MEDIA INC`): credited at 0.5, filtered by the combiner threshold rather than by the scorer.
Structural
- Empty link field arrays (`names: []`, `emails: []`, `phoneNumbers: []`) yield 0.0 for that field.
- A link with many names/emails/phones: any single matching entry is sufficient (the scorers short-circuit).
- Pydantic schema validation catches malformed input records before scoring runs.
Anticipated and not yet handled
Cases we could imagine arising in production but deliberately left out of v1.
Higher-priority
- Unicode normalization and transliteration. `José` vs `Jose`, `Müller` vs `Mueller`, `Ünicode` vs `Unicode`. Would require NFKD + diacritic stripping before tokenization; `Müller` → `Mueller` additionally needs a transliteration table, since stripping alone yields `Muller`.
- Married / maiden names. `Jane Smith` vs `Jane Doe-Smith` vs `Jane Doe`. Not captured by token overlap alone; would need a names-lookup or explicit maiden-name field.
- OCR-style and typographic errors. `Ram` vs `Rarm`, `Smith` vs `Smlth`. Jaro–Winkler or Levenshtein on tokens would catch most.
- Frequency-aware token weighting (Fellegi–Sunter Corollary 2). Agreement on `Windhorst` should count for more than agreement on `Smith`. Would require a token-frequency model (e.g. US census surname frequencies).
- Foreign phone formats. Non-NANP numbers (e.g. `+33 1 23 45 67 89`) don’t have a meaningful “last 10 digits” canonical form.
- Missing ≡ disagreement. Currently conflated. F&S models “missing” as its own γ realization with its own weight; doing the same would likely make link 2 (phone-only agreement, all other fields empty on the link side) score slightly higher but still below threshold.
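As an illustration of the first item in this list, diacritic folding can be done with the standard-library `unicodedata` module. This is a sketch, not planned v1 code: the `TRANSLIT` table is a hypothetical, German-only example, and `fold_variants` is an assumed helper that yields every folded spelling so either convention can match.

```python
import unicodedata

# Hypothetical, deliberately tiny transliteration table (German conventions only).
TRANSLIT = {"ü": "ue", "ö": "oe", "ä": "ae", "ß": "ss"}


def strip_diacritics(text: str) -> str:
    """NFKD-decompose, then drop combining marks (accents, umlauts, cedillas)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold_variants(text: str) -> set[str]:
    """Return the plausible ASCII-ish spellings of a name token."""
    # NFC first so precomposed characters line up with the TRANSLIT keys.
    lowered = unicodedata.normalize("NFC", text).lower()
    transliterated = "".join(TRANSLIT.get(ch, ch) for ch in lowered)
    return {strip_diacritics(lowered), strip_diacritics(transliterated)}
```

Two tokens would then be treated as equal when their variant sets intersect, so `Müller` can agree with both `Muller` and `Mueller`.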
Middle-priority
- Shared family emails. `family@example.com` legitimately used by multiple relatives; a match against it is weaker evidence than a per-person email.
- Plus-addressing and dotted gmail. `john+bank@example.com` vs `john@example.com`; `j.o.h.n@gmail.com` vs `john@gmail.com`. Lightweight local-part canonicalization would fix both.
- Domain-only email credit. Currently 0.0; maybe worth a small partial score.
- Ambiguous nicknames. `Cy` legitimately maps to Cyrus, Cyril, and Cyrenius via our transitive merging, which means `Cy Jones` matching `Cyrenius Jones` scores 1.0. That’s probably fine, but worth noting.
- Company name aliases and historical names. `Meta` vs `Facebook`, `Google` vs `Alphabet`. No heuristic short of a corporate-history dataset.
- Soundex / Metaphone phonetic matching for first names (e.g. `Catherine` vs `Katherine`). Captures a subset of what nicknames do but for spelling variation rather than diminutives.
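The local-part canonicalization mentioned for plus-addressing and dotted Gmail might look like the sketch below. It assumes that stripping a `+tag` is safe for every provider, which is not universally true, and the name `canonical_email` and the `DOT_INSENSITIVE_DOMAINS` set are hypothetical.

```python
# Domains assumed to ignore dots in the local part.
DOT_INSENSITIVE_DOMAINS = {"gmail.com", "googlemail.com"}


def canonical_email(raw: str) -> str:
    """Lowercase, drop any +tag, and remove dots for dot-insensitive domains."""
    address = raw.strip().lower()
    local, sep, domain = address.rpartition("@")
    if not sep:
        return address  # not an address; leave untouched
    local = local.split("+", 1)[0]  # john+bank -> john
    if domain in DOT_INSENSITIVE_DOMAINS:
        local = local.replace(".", "")  # j.o.h.n -> john
    return f"{local}@{domain}"
```

Dots are only removed for the listed domains, so `j.smith@example.com` stays distinct from `jsmith@example.com`.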
Lower-priority / adversarial
- Deliberate near-misses. A fraudster using `John Smyth` to link John Smith’s account. Currently a full mismatch on name, but partially defended by corroborating phone/email.
- Cross-link signals. Same phone or email appearing on links for different Mercury companies: a strong ring-fraud signal we don’t currently compute.
- Temporal signals. A link submitted minutes after account creation is less trustworthy than one months later. Out of scope for the current record-linkage framing.
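For the cross-link signal, one possible shape is a simple inverted index from identifier to the companies it appears under. This is purely illustrative: the `(company_id, identifier)` input pairs and the function name `shared_identifiers` are assumptions, and identifiers are assumed to be normalized already.

```python
from collections import defaultdict
from typing import Iterable


def shared_identifiers(links: Iterable[tuple[str, str]]) -> dict[str, set[str]]:
    """Map each identifier (phone or email) to its companies, keeping only
    identifiers that appear under more than one company."""
    companies_by_identifier: dict[str, set[str]] = defaultdict(set)
    for company_id, identifier in links:
        companies_by_identifier[identifier].add(company_id)
    return {
        identifier: companies
        for identifier, companies in companies_by_identifier.items()
        if len(companies) > 1
    }
```

Anything returned here would be a candidate for closer review rather than an automatic rejection.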
Notes on graduated output
The prompt asks for binary Match / Mismatch, which is what the CLI emits. The combiner already produces a continuous total score, so routing values near the threshold (e.g. total in [2.0, 3.0]) to a “Review” tier would recover F&S’s A₂ “possible link” decision at essentially no cost. See matching-approach for the theoretical framing and interview/plan.md for the stretch list.
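A three-way decision of this kind could be sketched as follows, using the example band from the paragraph above. The constant names and the `decide` function are hypothetical, and the sketch assumes higher totals mean stronger evidence of a match.

```python
REVIEW_LOW = 2.0   # example lower bound of the review band
REVIEW_HIGH = 3.0  # example upper bound of the review band


def decide(total: float) -> str:
    """Route a combined score to Match, Review, or Mismatch."""
    if total > REVIEW_HIGH:
        return "Match"
    if total >= REVIEW_LOW:
        return "Review"  # the "possible link" tier
    return "Mismatch"
```

Setting `REVIEW_LOW` equal to `REVIEW_HIGH` leaves only that single score in the review tier, which is close to the current binary behavior.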