Edge Cases — Handled and Anticipated

A curated list of what the v1 matcher handles and what it deliberately doesn’t. The “anticipated” list is where the next iteration would invest, roughly in order of likely prevalence and impact.

Handled by the v1 matcher

These are covered by the field normalizers and scorers, and verified by the test suite.

Phone

  • Variable formatting: 5557609870, (555)-760-9870, 555 098 9870, 555-010-9988. Digit-only normalization reduces each to a bare digit string, so punctuation and spacing never affect comparison.
  • +1 and other country codes: last-10-digits rule strips the prefix.
  • Empty phone arrays on the link side.
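
The phone rules above can be sketched roughly as follows. This is a minimal illustration, not the v1 code: the function name normalize_phone is hypothetical, and the sketch assumes "last 10 digits" is applied after stripping every non-digit character.

```python
import re


def normalize_phone(raw: str) -> str:
    """Hypothetical normalizer: keep digits only, then keep the last 10.

    Taking the last 10 digits drops a leading +1 or other country code.
    An input with no digits normalizes to the empty string.
    """
    digits = re.sub(r"\D", "", raw)
    return digits[-10:]
```

Under these assumptions, (555)-760-9870 and +1 (555) 760-9870 both normalize to the same string as 5557609870.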

Email

  • Case and whitespace insensitivity (John@Example.COM, with or without surrounding whitespace, matches john@example.com).
  • contactEmail matches as well as user.email (link 5: ram@example.fr matches the company contact but not the user). Candidate set is the union.
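
A minimal sketch of the email rules, assuming exact comparison after trimming and lowercasing. The names normalize_email and email_score and the two candidate parameters are illustrative; only the idea of a union over user.email and contactEmail comes from the list above.

```python
from typing import Iterable, Optional


def normalize_email(raw: str) -> str:
    """Trim surrounding whitespace and lowercase the whole address."""
    return raw.strip().lower()


def email_score(
    link_emails: Iterable[str],
    user_email: Optional[str],
    contact_email: Optional[str],
) -> float:
    """Return 1.0 if any link email equals any candidate email, else 0.0.

    The candidate set is the union of the user email and the company
    contact email; missing candidates are skipped.
    """
    candidates = {normalize_email(e) for e in (user_email, contact_email) if e}
    return 1.0 if any(normalize_email(e) in candidates for e in link_emails) else 0.0
```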

Name (personal)

  • Title prefixes: Mr., Mrs., Ms., Dr., Sir, Prof., etc.
  • Middle initials (John B. Smith vs John Smith), including when followed by a period.
  • Extra middle names (Emmanuel Francisco Windal covers Emmanuel Windal).
  • Generational suffixes: Jr., Sr., II, III, IV.
  • Possessives: Ram's → Rams. Apostrophes are stripped in place rather than split on.
  • Nicknames: Cy, Cyril, and Cyrus are treated as equivalent via transitive equivalence classes from the supplied nicknames file (see scoring-nicknames).
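
The personal-name handling above might look something like the sketch below. Everything here is an assumption made for illustration: the token sets, the choice to drop every single-letter token as an initial, the union-find merge of nickname pairs, and the containment test in names_match. It also only covers the full-match case and does not reproduce any partial credit.

```python
import re
from typing import Dict, Iterable, List, Tuple

TITLES = {"mr", "mrs", "ms", "dr", "sir", "prof"}
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}


def name_tokens(raw: str) -> List[str]:
    """Lowercase, strip apostrophes in place, drop titles, suffixes, initials."""
    text = raw.lower().replace("'", "").replace("\u2019", "")
    tokens = re.findall(r"[a-z]+", text)
    return [t for t in tokens if t not in TITLES and t not in SUFFIXES and len(t) > 1]


def build_nickname_classes(pairs: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Merge nickname pairs transitively; map each name to a class representative."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)
    return {name: find(name) for name in list(parent)}


def names_match(a: str, b: str, classes: Dict[str, str]) -> bool:
    """True when one canonical token set contains the other (sketch only)."""
    ta = {classes.get(t, t) for t in name_tokens(a)}
    tb = {classes.get(t, t) for t in name_tokens(b)}
    return bool(ta) and bool(tb) and (ta <= tb or tb <= ta)
```

Because the merge is transitive, pairs (cy, cyril) and (cy, cyrus) put Cyril and Cyrus in the same class even though the file never lists them together.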

Name (business)

  • Corporate suffixes: Inc., LLC, Ltd., Corp., Co., GmbH. Stripped symmetrically on both sides.
  • Company-vs-person conflation: link names are compared against the union of user names, tradeName, and legalName. Handles InfoLinks Technologies vs InfoLinks Technologies, Inc.
  • Spurious single-token overlap (e.g. "media" in link 9’s IN MEDIA RES PUBLISHING vs LASSAD MEDIA INC): credited at 0.5, filtered by the combiner threshold rather than by the scorer.
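
As a rough sketch of the business-name rules: strip corporate suffixes on both sides, score 1.0 on an exact token match, and 0.5 on partial token overlap. The function names are hypothetical, and treating any shared token as worth 0.5 is an assumption; the list above only states that the single-token "media" overlap is credited at 0.5.

```python
import re
from typing import Iterable, List

CORPORATE_SUFFIXES = {"inc", "llc", "ltd", "corp", "co", "gmbh"}


def business_tokens(raw: str) -> List[str]:
    """Lowercase, split on non-alphanumerics, drop corporate suffixes."""
    tokens = re.findall(r"[a-z0-9]+", raw.lower())
    return [t for t in tokens if t not in CORPORATE_SUFFIXES]


def business_name_score(link_name: str, candidate_names: Iterable[str]) -> float:
    """Best score of the link name against the union of candidate names."""
    link = business_tokens(link_name)
    best = 0.0
    for candidate in candidate_names:
        cand = business_tokens(candidate)
        if link and link == cand:
            return 1.0  # exact match after symmetric suffix stripping
        if set(link) & set(cand):
            best = max(best, 0.5)  # partial overlap; left to the combiner threshold
    return best
```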

Structural

  • Empty link field arrays (names: [], emails: [], phoneNumbers: []) yield 0.0 for that field.
  • A link with many names/emails/phones: any single matching entry is sufficient (the scorers short-circuit).
  • Pydantic schema validation catches malformed input records before scoring runs.
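
The first two structural behaviours fall out naturally from an any-match scorer. The sketch below is a generic illustration under assumed names (any_match_score and its parameters are not taken from the codebase): an empty link array gives any() nothing to iterate, and any() stops at the first matching entry.

```python
from typing import Callable, Iterable


def any_match_score(
    link_values: Iterable[str],
    candidates: Iterable[str],
    normalize: Callable[[str], str],
) -> float:
    """Return 1.0 if any normalized link value equals a candidate, else 0.0."""
    candidate_keys = {normalize(c) for c in candidates}
    candidate_keys.discard("")  # an empty normalized value is never evidence
    # any() over a generator short-circuits on the first hit;
    # an empty link array yields 0.0 because there is nothing to test.
    return 1.0 if any(normalize(v) in candidate_keys for v in link_values) else 0.0
```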

Anticipated and not yet handled

Cases we could imagine arising in production but deliberately left out of v1.

Higher-priority

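  • Unicode normalization and transliteration. José vs Jose, Müller vs Mueller, Ünicode vs Unicode. Would require NFKD + diacritic stripping before tokenization; note that stripping alone turns Müller into Muller, so Müller vs Mueller also needs a transliteration rule (ü → ue).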
  • Married / maiden names. Jane Smith vs Jane Doe-Smith vs Jane Doe. Not captured by token overlap alone; would need a names-lookup or explicit maiden-name field.
  • OCR-style and typographic errors. Ram vs Rarm, Smith vs Smlth. Jaro–Winkler or Levenshtein on tokens would catch most.
  • Frequency-aware token weighting (Fellegi–Sunter Corollary 2). Agreement on Windhorst should count for more than agreement on Smith. Would require a token-frequency model (e.g. US census surname frequencies).
  • Foreign phone formats. Non-NANP numbers (e.g. +33 1 23 45 67 89) don’t have a meaningful “last 10 digits” canonical form.
  • Missing ≡ disagreement. The two are currently conflated. F&S models “missing” as its own γ realization with its own weight; doing the same would likely make link 2 (phone-only agreement, all other fields empty on the link side) score slightly higher but still below threshold.
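
For the Unicode item in the list above, a possible starting point is sketched here using only the standard library. The helper names and the small German digraph table are assumptions for illustration; a real transliteration step would need a broader, language-aware mapping.

```python
import unicodedata
from typing import Set

# Assumed, deliberately tiny transliteration table (German only).
GERMAN_DIGRAPHS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})


def fold_diacritics(text: str) -> str:
    """NFKD-decompose, then drop combining marks (José -> Jose)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def name_variants(text: str) -> Set[str]:
    """Return lowercase ASCII-ish variants: plain folding and digraph folding."""
    # NFC first so a decomposed u + diaeresis is seen as a single 'ü'.
    lowered = unicodedata.normalize("NFC", text).lower()
    return {
        fold_diacritics(lowered),
        fold_diacritics(lowered.translate(GERMAN_DIGRAPHS)),
    }
```

Comparing variant sets rather than a single folded form lets Müller agree with both Muller and Mueller without committing to one convention.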

Middle-priority

  • Shared family emails. family@example.com legitimately used by multiple relatives; a match against it is weaker evidence than a per-person email.
  • Plus-addressing and dotted gmail. john+bank@example.com vs john@example.com; j.o.h.n@gmail.com vs john@gmail.com. Lightweight local-part canonicalization would fix both.
  • Domain-only email credit. Currently 0.0; maybe worth a small partial score.
  • Ambiguous nicknames. Cy legitimately maps to Cyrus, Cyril, and Cyrenius via our transitive merging — which means Cy Jones matching Cyrenius Jones scores 1.0. That’s probably fine, but worth noting.
  • Company name aliases and historical names. Meta vs Facebook, Google vs Alphabet. No heuristic short of a corporate-history dataset.
  • Soundex / Metaphone phonetic matching for first names (e.g. Catherine vs Katherine). Captures a subset of what nicknames do but for spelling variation rather than diminutives.
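
The local-part canonicalization mentioned for plus-addressing and dotted gmail could be as small as the sketch below. It is an assumption-laden illustration: canonical_email is a hypothetical name, plus-suffix stripping is applied to every domain (not all providers treat + this way), and dot removal is limited to gmail.com.

```python
def canonical_email(raw: str) -> str:
    """Lowercase, drop any +suffix, and drop dots in gmail.com local parts."""
    email = raw.strip().lower()
    local, sep, domain = email.rpartition("@")
    if not sep:
        return email  # not an address we can split; leave as-is
    local = local.split("+", 1)[0]
    if domain == "gmail.com":
        local = local.replace(".", "")
    return f"{local}@{domain}"
```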

Lower-priority / adversarial

  • Deliberate near-misses. A fraudster using John Smyth to link John Smith’s account. Currently a full mismatch on name, but partially defended by corroborating phone/email.
  • Cross-link signals. Same phone or email appearing on links for different Mercury companies — a strong ring-fraud signal we don’t currently compute.
  • Temporal signals. A link submitted minutes after account creation is less trustworthy than one months later. Out of scope for the current record-linkage framing.
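
The cross-link signal above amounts to grouping identifiers by company. A minimal sketch, assuming the input is a list of (company_id, normalized_value) pairs; that shape and the function name are hypothetical, not part of the current pipeline.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def shared_identifiers(links: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each normalized phone/email to the companies whose links use it,
    keeping only values that appear under more than one company."""
    companies_by_value: Dict[str, Set[str]] = defaultdict(set)
    for company_id, value in links:
        companies_by_value[value].add(company_id)
    return {v: c for v, c in companies_by_value.items() if len(c) > 1}
```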

Notes on graduated output

The prompt asks for binary Match / Mismatch, which is what the CLI emits. The combiner already produces a continuous total score — routing values near the threshold (e.g. total in [2.0, 3.0]) to a “Review” tier would recover F&S’s A₂ possible link decision at essentially no cost. See matching-approach for the theoretical framing and interview/plan.md for the stretch list.
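
A three-tier decision on top of the existing total could be sketched like this. The function name and the default band are illustrative; the [2.0, 3.0] interval is the example range given above, treated here as inclusive at both ends.

```python
def decide(total: float, lower: float = 2.0, upper: float = 3.0) -> str:
    """Route a continuous score to Match, Review, or Mismatch.

    Scores inside [lower, upper] go to Review, which plays the role of
    the "possible link" decision.
    """
    if total > upper:
        return "Match"
    if total < lower:
        return "Mismatch"
    return "Review"
```

Setting lower equal to upper leaves a Review band of a single point, so the scheme stays close to the binary output.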