Interview Presentation

A guideline for walking through the solution in ~20 minutes. Bullet points, not a script: read along, and skip ahead wherever the conversation leads.

Opening (~1 min)

  • The problem. Mercury’s fraud team reviews third-party bank links against customer records. Three signals: name, email, phone. Each messy. Goal: Match vs Mismatch per link.
  • What I built. Three solution variants on a shared scoring engine. All three run end-to-end on the provided 9-link fixture.
  • Reference. Approach is inspired by Fellegi & Sunter (1969), A Theory for Record Linkage — the founding paper for probabilistic record linkage. I’ll come back to it.

Live walkthrough — run the tests first

uv run --extra dev pytest
  • 69 tests, all green. Parametrized per-field scorer tests, combiner boundary tests, nickname equivalence-class loading, three end-to-end anchor tests.
  • Key test: tests/test_solution_match.py::test_match_full_output. Spec test — asserts the full printed output of mercury match against the 9-link ground truth derived from the fraud team’s own comments. I wrote this first, before any scorer, so the whole implementation is TDD against a real-data fixture.

Live walkthrough — run solution 1, match

uv run mercury match interview/mercury-customers.json interview/third-party-banks.json
  • Binary: 6 Match, 3 Mismatch. Matches the fraud-team judgments on all 9 links.
  • Scoring: per-field agreement in [0,1], weighted sum, threshold ≥ 2.5.
  • Weights: name 2.5, email 1.5, phone 1.5. Name is weighted highest because full first+last agreement is two tokens of biographical evidence; at weight 2.5 it reaches the threshold on its own.
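
The scoring rule above fits in a few lines. The weights and threshold come from the bullets; the names `FieldScores`, `total_score`, and `is_match`, and the example inputs, are assumptions for illustration and not necessarily what src/mercury/match.py uses.

```python
from dataclasses import dataclass

# Weights and threshold as stated in the walkthrough.
NAME_WEIGHT = 2.5
EMAIL_WEIGHT = 1.5
PHONE_WEIGHT = 1.5
MATCH_THRESHOLD = 2.5


@dataclass(frozen=True)
class FieldScores:
    """Per-field agreement, each in [0, 1]. Hypothetical container."""
    name: float
    email: float
    phone: float


def total_score(s: FieldScores) -> float:
    """Weighted sum of the three per-field agreement scores."""
    return NAME_WEIGHT * s.name + EMAIL_WEIGHT * s.email + PHONE_WEIGHT * s.phone


def is_match(s: FieldScores) -> bool:
    """Binary decision: Match when the total reaches the threshold."""
    return total_score(s) >= MATCH_THRESHOLD


# Illustrative inputs (not taken from the fixture).
name_only = FieldScores(name=1.0, email=0.0, phone=0.0)
contact_only = FieldScores(name=0.0, email=1.0, phone=1.0)
phone_only = FieldScores(name=0.0, email=0.0, phone=1.0)
```

With these weights, full name agreement alone totals 2.5 and matches, email plus phone totals 3.0 and matches, and a single contact field at 1.5 does not.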

Live walkthrough — run solution 2, triage

uv run mercury triage interview/mercury-customers.json interview/third-party-banks.json
  • Three-tier: 4 Match, 3 Review, 2 Mismatch. Same scoring engine; different rendering of the scalar total.
  • Insight this variant exists for: the fraud team’s own language is three-tier. “Looks good!” vs “probably good” vs “going to call the customer” — they were never actually doing binary. Links 2, 6, 8 (where they hedged) are exactly the three links whose total score lands near the binary threshold. The ambiguity was already in the data; we were flattening it.
  • This is F&S’s A₂ possible link decision, recovered.
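
Since triage only re-renders the same total, it can be sketched as a function of that one number. The band edges below (`REVIEW_LOW`, `MATCH_HIGH`) are placeholders made up for illustration; these notes do not state the actual triage cutoffs.

```python
# Hypothetical band edges around the binary threshold of 2.5.
# The real triage cutoffs are not given in these notes.
REVIEW_LOW = 2.0   # below this: Mismatch
MATCH_HIGH = 3.0   # at or above this: Match


def triage(total: float) -> str:
    """Map a continuous total score to one of three tiers."""
    if total >= MATCH_HIGH:
        return "Match"
    if total >= REVIEW_LOW:
        return "Review"
    return "Mismatch"
```

Under these assumed bands, a total sitting right at the binary threshold (2.5) lands in Review instead of being forced to one side.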

Live walkthrough — run solution 3, splink

uv run --extra splink mercury splink interview/mercury-customers.json interview/third-party-banks.json
  • Probabilistic via the splink library — the production-grade F&S port from the UK Ministry of Justice. DuckDB in-memory backend.
  • Optional dependency; lazy-imported so mercury match / mercury triage run without it.
  • Agrees with triage on 7/9 links. Two disagreements are interpretable:
    • Link 6 (Cy/Cyril): splink has no nicknames file. Jaro-Winkler can’t close the gap.
    • Link 8 (Cyril + phone): splink’s Jaro-Winkler is less cautious than our hand-tuned 0.5 partial-name score.
  • This is the honest second opinion. The gap between splink and our hand-tuned matcher is the gap between “picked reasonable weights” and “trained the right weights.”
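
The Link 6 disagreement comes down to a lookup that string similarity cannot replace. A minimal sketch of a nickname equivalence-class check follows; the groups, the function names, and the in-memory format are assumptions, since the real nicknames file is not shown here.

```python
# Hypothetical nickname groups; the real nicknames file and its format
# are not shown in these notes.
NICKNAME_GROUPS = [
    {"cyril", "cy"},
    {"william", "will", "bill"},
]


def build_index(groups: list[set[str]]) -> dict[str, int]:
    """Map each name to the id of the equivalence class it belongs to."""
    index: dict[str, int] = {}
    for class_id, group in enumerate(groups):
        for name in group:
            index[name.lower()] = class_id
    return index


def same_person_name(a: str, b: str, index: dict[str, int]) -> bool:
    """True if the names are equal or share a nickname class."""
    a, b = a.strip().lower(), b.strip().lower()
    if a == b:
        return True
    return a in index and index[a] == index.get(b)


INDEX = build_index(NICKNAME_GROUPS)
```

With this index, "Cy" and "Cyril" count as the same given name regardless of how far apart they are as strings.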

Design discussion

The scoring engine is separate from the decision rule

  • One scoring call per link produces a LinkResult with per-field scores and a continuous total.
  • Each solution variant is a different threshold / tiering / backend strategy on that same scalar.
  • Cheap to add a fourth solution — just a new CLI subcommand and one reference card.
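
A sketch of that separation: score once, then treat each variant as a plain function over the result. The `LinkResult` field names, the `RULES` registry, and the `name_required` variant are hypothetical and only illustrate the shape.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class LinkResult:
    """One scored link: per-field scores plus the continuous total.

    Field names here are assumptions, not the project's actual schema.
    """
    name: float
    email: float
    phone: float
    total: float


# A decision rule is any function from a scored link to a label.
DecisionRule = Callable[[LinkResult], str]


def binary(r: LinkResult) -> str:
    """The match variant: one threshold on the total."""
    return "Match" if r.total >= 2.5 else "Mismatch"


def name_required(r: LinkResult) -> str:
    """A hypothetical fourth variant: also insist on some name evidence."""
    return "Match" if r.total >= 2.5 and r.name > 0 else "Mismatch"


RULES: dict[str, DecisionRule] = {"match": binary, "name-required": name_required}

# Score once, then render under every rule.
result = LinkResult(name=0.0, email=1.0, phone=1.0, total=3.0)
labels = {rule_name: rule(result) for rule_name, rule in RULES.items()}
```

Adding a variant touches only the registry; the scorers and `LinkResult` stay as they are.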

Fellegi–Sunter

  • F&S 1969 shows that under conditional independence, the log-likelihood ratio decomposes into a sum of per-field weights — which is exactly the NAME_WEIGHT * s_name + ... structure I use.
  • Our weights are hand-tuned. F&S’s weights are log(m_k / u_k), with m_k and u_k estimated from the data itself, typically by EM, which does not need labeled pairs. The splink variant is what that looks like done right.
  • F&S has three decisions: A₁ link, A₂ possible link, A₃ non-link. The prompt asks for binary, so match collapses A₂ into whichever side is closer. triage restores it.
  • The paper link and our deviations are documented in matching-approach; post-interview stretch in interview/plan.md is a full splink port with EM-trained weights.
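
For reference, the decomposition the first bullet relies on can be written out as follows. Here γ_k is the agreement outcome on field k, and m_k and u_k are the probabilities that field k agrees among true matches (M) and true non-matches (U). This is the standard statement of the result in my own notation, not a quotation from the paper.

```latex
\log \frac{P(\gamma \mid M)}{P(\gamma \mid U)}
  = \sum_{k} \log \frac{P(\gamma_k \mid M)}{P(\gamma_k \mid U)},
\qquad
w_k =
\begin{cases}
  \log \dfrac{m_k}{u_k} & \text{if field } k \text{ agrees},\\[6pt]
  \log \dfrac{1 - m_k}{1 - u_k} & \text{if field } k \text{ disagrees}.
\end{cases}
```

The hand-tuned score keeps the additive form but uses a fixed weight times an agreement score in [0,1] in place of w_k, so a disagreeing field contributes zero where F&S would give it a negative weight.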

What we glossed over

  • No probability model — no false-match / false-non-match rate guarantees.
  • No frequency-aware token weighting (F&S Corollary 2). Agreement on “Smith” scores the same as agreement on “Windhorst.” This would sharpen links 8 and 9.
  • Missing ≡ disagreement. F&S models missing as its own γ realization with its own weight.
  • No blocking. Our data is pre-joined by mercuryCompanyId, so we skip it.
  • Full anticipated-edge-case list in edge-cases.
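
To make the frequency point concrete, here is a rough sketch of value-specific agreement weights. The surname frequencies and the fallback are made up, and the formula is the usual simplification that ignores typos, so treat it as an illustration of the idea only.

```python
import math

# Hypothetical relative frequencies of surnames in a population.
# These numbers are made up for illustration.
SURNAME_FREQ = {"smith": 0.01, "windhorst": 0.00001}
DEFAULT_FREQ = 0.0001  # assumed fallback for unseen surnames


def agreement_weight(surname: str) -> float:
    """Weight for agreeing on this specific surname, in bits.

    Simplification: ignoring typos, a true-match pair agrees on value j
    with probability about p_j, and an unrelated pair agrees on it with
    probability about p_j ** 2, so log2(m / u) reduces to log2(1 / p_j).
    """
    p = SURNAME_FREQ.get(surname.lower(), DEFAULT_FREQ)
    return math.log2(1.0 / p)


common = agreement_weight("Smith")
rare = agreement_weight("Windhorst")
```

Agreement on the rarer surname earns the larger weight, which is the behavior the current flat name score lacks.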

How I collaborated with Claude

  • Docs-centric. Every concept has a Diataxis card — reference for the what, explanation for the why, how-to for the how. 34 docs, validated by four prek hooks (docs-check-filenames, docs-check-frontmatter, docs-check-index, docs-check-links). You can browse them by opening docs_html/ over HTTP — see README.md in the handoff bundle.
  • TDD with a real-data anchor. The end-to-end test codifying the fraud team’s ground truth was the first test I wrote; it drove every scorer, every threshold tune, every nickname decision. That’s why the scorer suite grew to 69 cases — each one was pulled from either the fixture or a concrete edge case I anticipated.
  • Guardrails the harness enforces, not Claude. The repo has:
    • prek hooks: ruff, ruff-format, ty (Astral’s type checker, wired during this session), pytest, trufflehog, yamllint, docs validation.
    • Forgejo Actions CI: build + test on Python 3.12/3.13/3.14 matrix, sdist/wheel build validation.
    • mise tasks: transcript, handoff, docs-preview, runner-logs, doc validators, etc.
  • Forgejo PR workflow. All work on a single feature branch interview, 13 commits, one PR for the whole session. That PR was the live-view window into the work as it happened.
  • Handoff bundle (mise run handoff --docs-html <tarball>) — flattens source, docs, docs_html, dist, transcript, README into a single .zip for this debrief.

Closing — what I’d do next

  1. Train splink’s weights on real data. Replace our hand-tuned constants with EM-estimated log(m/u); EM itself needs no labels, so real labels go toward validating the result. Gives us error-rate estimates instead of fixture-fit.
  2. Frequency-aware name weighting. Agreement on rare surnames should count for more. Would sharpen the “spurious shared token” cases (link 9).
  3. Port the whole pipeline to SQLite-backed splink. Reproducible from a single .db file that ships in the handoff bundle. Tracked on the plan.md stretch list.
  4. Cross-link signals. If the same phone or email appears on links for different companies, flag. Classic ring-fraud signal not present in the current fixture.
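
Item 4 can be sketched as a group-by over contact values. The records below are invented, the `phone` and `email` field names are assumptions, and values are assumed to be normalized already.

```python
from collections import defaultdict

# Illustrative records; field names other than mercuryCompanyId are assumed.
links = [
    {"mercuryCompanyId": "A", "phone": "5551230000", "email": "a@example.com"},
    {"mercuryCompanyId": "B", "phone": "5551230000", "email": "b@example.com"},
    {"mercuryCompanyId": "B", "phone": "5559990000", "email": "b@example.com"},
]


def shared_across_companies(links: list[dict], field: str) -> dict[str, set[str]]:
    """Return values of `field` that appear under more than one company."""
    companies_by_value: dict[str, set[str]] = defaultdict(set)
    for link in links:
        value = link.get(field)
        if value:  # skip missing or empty values
            companies_by_value[value].add(link["mercuryCompanyId"])
    return {v: c for v, c in companies_by_value.items() if len(c) > 1}


flagged_phones = shared_across_companies(links, "phone")
flagged_emails = shared_across_companies(links, "email")
```

Here the shared phone is flagged because it spans companies A and B, while the repeated email is not, since both of its links belong to company B.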

Q&A handouts

Quick pointers by topic:

  • How did you pick the threshold? → scoring-combiner
  • Why this scorer and not Jaro-Winkler? → scoring-name. v1 coarseness is deliberate; combiner filters false positives.
  • What about Unicode / transliteration / OCR typos? → edge-cases
  • Why three solutions? → alternative-solutions
  • What’s the relationship to the 1969 paper? → matching-approach
  • Can I see the tests? → tests/test_score.py, tests/test_match.py, tests/test_nicknames.py, tests/test_solution_*.py
  • Can I see the code? → src/mercury/, five small modules. match.py is the combiner, score.py the per-field scorers, normalize.py the tokenization primitives, nicknames.py the equivalence-class loader, cli.py the subcommand surface.