Interview Presentation
A guideline to walk through the solution in ~20 minutes. Bullet points, not a script — read along, skip where the conversation takes it.
Opening (~1 min)
- The problem. Mercury’s fraud team reviews third-party bank links against customer records. Three signals: name, email, phone. Each messy. Goal: Match vs Mismatch per link.
- What I built. Three solution variants on a shared scoring engine. All three run end-to-end on the provided 9-link fixture.
- Reference. Approach is inspired by Fellegi & Sunter (1969), A Theory for Record Linkage — the founding paper for probabilistic record linkage. I’ll come back to it.
Live walkthrough — run the tests first
uv run --extra dev pytest
- 69 tests, all green. Parametrized per-field scorer tests, combiner boundary tests, nickname equivalence-class loading, three end-to-end anchor tests.
- Key test: tests/test_solution_match.py::test_match_full_output. Spec test: asserts the full printed output of mercury match against the 9-link ground truth derived from the fraud team’s own comments. I wrote this first, before any scorer, so the whole implementation is TDD against a real-data fixture.
Live walkthrough — run solution 1, match
uv run mercury match interview/mercury-customers.json interview/third-party-banks.json
- Binary: 6 Match, 3 Mismatch. Matches the fraud-team judgments on all 9 links.
- Scoring: per-field agreement in [0,1], weighted sum, threshold ≥ 2.5.
- Weights: name 2.5, email 1.5, phone 1.5. Name carries the largest weight because full first+last agreement is two tokens of biographical evidence; at 2.5 it clears the threshold on its own.
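The scoring rule above can be sketched in a few lines. This is a minimal illustration, not the repo's code: the weights and threshold are the ones quoted above and NAME_WEIGHT appears elsewhere in these notes, but FieldScores, total_score, decide, and the other constant names are hypothetical.

```python
from dataclasses import dataclass

# Weights and threshold as quoted in the notes above.
NAME_WEIGHT = 2.5
EMAIL_WEIGHT = 1.5   # constant name is an assumption
PHONE_WEIGHT = 1.5   # constant name is an assumption
MATCH_THRESHOLD = 2.5  # constant name is an assumption


@dataclass(frozen=True)
class FieldScores:
    """Per-field agreement scores, each in [0, 1] (hypothetical container)."""
    name: float
    email: float
    phone: float


def total_score(s: FieldScores) -> float:
    """Weighted sum of the per-field agreement scores."""
    return NAME_WEIGHT * s.name + EMAIL_WEIGHT * s.email + PHONE_WEIGHT * s.phone


def decide(s: FieldScores) -> str:
    """Binary decision: Match when the total reaches the threshold."""
    return "Match" if total_score(s) >= MATCH_THRESHOLD else "Mismatch"
```

With these numbers, full name agreement alone reaches 2.5 and matches; email alone (1.5) does not; email plus phone (3.0) does.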
Live walkthrough — run solution 2, triage
uv run mercury triage interview/mercury-customers.json interview/third-party-banks.json
- Three-tier: 4 Match, 3 Review, 2 Mismatch. Same scoring engine; different rendering of the scalar total.
- Insight this variant exists for: the fraud team’s own language is three-tier. “Looks good!” vs “probably good” vs “going to call the customer”: they were never actually doing binary. Links 2, 6, 8 (where they hedged) are exactly the three links whose total score lands near the binary threshold. The ambiguity was already in the data; we were flattening it.
- This is F&S’s A₂ possible link decision, recovered.
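A three-tier rendering of the same scalar could look roughly like this. The band edges below (2.0 and 3.5) are placeholders chosen for illustration; the notes do not state the actual triage cut-offs, and the function name is hypothetical.

```python
def triage(total: float, review_low: float = 2.0, match_high: float = 3.5) -> str:
    """Map a continuous total score onto Match / Review / Mismatch.

    review_low and match_high are assumed values, not the repo's thresholds.
    Scores inside [review_low, match_high) land in the Review band.
    """
    if review_low > match_high:
        raise ValueError("review_low must not exceed match_high")
    if total >= match_high:
        return "Match"
    if total >= review_low:
        return "Review"
    return "Mismatch"
```

The point of the design is that only this mapping changes between the binary and triage variants; the score feeding it is the same.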
Live walkthrough — run solution 3, splink
uv run --extra splink mercury splink interview/mercury-customers.json interview/third-party-banks.json
- Probabilistic via the splink library, the production-grade F&S implementation from the UK Ministry of Justice. DuckDB in-memory backend.
- Optional dependency; lazy-imported so mercury match / mercury triage run without it.
- Agrees with triage on 7/9 links. Two disagreements are interpretable:
  - Link 6 (Cy/Cyril): splink has no nicknames file. Jaro-Winkler can’t close the gap.
  - Link 8 (Cyril + phone): splink’s Jaro-Winkler is less cautious than our hand-tuned 0.5 partial-name score.
- This is the honest second opinion. The gap between splink and our hand-tuned matcher is the gap between “picked reasonable weights” and “trained the right weights.”
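The lazy-import pattern for an optional dependency can be sketched as follows. This is a generic illustration, assuming the import is deferred until the subcommand actually runs; load_optional is a hypothetical helper, not a function from the repo.

```python
import importlib
from types import ModuleType


def load_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency only when it is actually needed.

    Raises a RuntimeError naming the extra to install if the module is missing,
    so commands that never call this keep working without the dependency.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"'{module_name}' is not installed; re-run with: uv run --extra {extra} ..."
        ) from exc
```

A splink subcommand would call load_optional("splink", "splink") inside its handler, while the match and triage handlers never touch it.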
Design discussion
The scoring engine is separate from the decision rule
- One scoring call per link produces a LinkResult with per-field scores and a continuous total.
- Each solution variant is a different threshold / tiering / backend strategy on that same scalar.
- Cheap to add a fourth solution — just a new CLI subcommand and one reference card.
Fellegi–Sunter
- F&S 1969 shows that under conditional independence, the log-likelihood ratio decomposes into a sum of per-field weights, which is exactly the NAME_WEIGHT * s_name + ... structure I use.
- Our weights are hand-tuned. F&S’s weights are log(m_k / u_k), with m_k and u_k estimated from data (from labeled pairs, or by EM when labels are unavailable). The splink variant is what that looks like done right.
- F&S has three decisions: A₁ link, A₂ possible link, A₃ non-link. The prompt asks for binary, so match collapses A₂ into whichever side is closer. triage restores it.
- The paper link and our deviations are documented in matching-approach; post-interview stretch in interview/plan.md is a full splink port with EM-trained weights.
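For reference, the decomposition mentioned above can be written out as follows. This is the standard textbook form with binary agree/disagree comparisons; the symbols K, R, T_mu and T_lambda are notation chosen here, not taken from the repo docs.

```latex
% Agreement vector over K fields (here K = 3: name, email, phone),
% with M the set of true matches and U the set of true non-matches.
\begin{align*}
\gamma &= (\gamma_1, \dots, \gamma_K), \qquad
m(\gamma) = P(\gamma \mid M), \qquad
u(\gamma) = P(\gamma \mid U) \\[4pt]
% Conditional independence across fields factorises both probabilities:
m(\gamma) &= \prod_{k=1}^{K} m_k(\gamma_k), \qquad
u(\gamma) = \prod_{k=1}^{K} u_k(\gamma_k) \\[4pt]
% so the log-likelihood ratio is a sum of per-field weights:
R(\gamma) = \log \frac{m(\gamma)}{u(\gamma)}
  &= \sum_{k=1}^{K} w_k(\gamma_k), \qquad
w_k(\gamma_k) =
  \begin{cases}
    \log \dfrac{m_k}{u_k} & \text{field } k \text{ agrees} \\[8pt]
    \log \dfrac{1 - m_k}{1 - u_k} & \text{field } k \text{ disagrees}
  \end{cases} \\[4pt]
% Three-way decision with thresholds T_\lambda < T_\mu:
\text{decide } A_1 &\text{ if } R \ge T_\mu, \qquad
A_2 \text{ if } T_\lambda < R < T_\mu, \qquad
A_3 \text{ if } R \le T_\lambda
\end{align*}
```

In this notation the hand-tuned score gives a disagreeing field a contribution of zero, whereas the disagreement weight above is negative whenever m_k > u_k.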
What we glossed over
- No probability model — no false-match / false-non-match rate guarantees.
- No frequency-aware token weighting (F&S Corollary 2). Agreement on “Smith” scores the same as agreement on “Windhorst.” This would sharpen links 8 and 9.
- Missing ≡ disagreement. F&S models missing as its own γ realization with its own weight.
- No blocking. Our data is pre-joined by mercuryCompanyId, so we skip it.
- Full anticipated-edge-case list in edge-cases.
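To make the frequency-aware gap concrete, here is one common simplification: weight agreement on a token by the log of its inverse relative frequency, so rare surnames earn more than common ones. The function name and the toy surname list are made up for illustration; a real version would need a population-scale name frequency table.

```python
import math
from collections import Counter


def token_weights(surnames: list[str]) -> dict[str, float]:
    """Agreement weight per token: log2(1 / relative frequency).

    Assumption: relative frequency in the supplied list stands in for the
    chance that two unrelated records share the token by coincidence.
    """
    counts = Counter(s.lower() for s in surnames)
    total = sum(counts.values())
    return {token: math.log2(total / n) for token, n in counts.items()}


# Toy corpus (hypothetical): "Smith" is common, "Windhorst" is rare.
weights = token_weights(["Smith", "Smith", "Smith", "Windhorst"])
```

Here agreement on "windhorst" gets weight 2.0 while agreement on "smith" gets about 0.42, which is the asymmetry the flat scorer currently ignores.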
How I collaborated with Claude
- Docs-centric. Every concept has a Diataxis card: reference for the what, explanation for the why, how-to for the how. 34 docs, validated by four prek hooks (docs-check-filenames, docs-check-frontmatter, docs-check-index, docs-check-links). You can browse them by opening docs_html/ over HTTP; see README.md in the handoff bundle.
- TDD with a real-data anchor. The end-to-end test codifying the fraud team’s ground truth was the first test I wrote; it drove every scorer, every threshold tune, every nickname decision. That’s why the scorer suite grew to 69 cases: each one was pulled from either the fixture or a concrete edge case I anticipated.
- Guardrails the harness enforces, not Claude. The repo has:
  - prek hooks: ruff, ruff-format, ty (Astral’s type checker, wired during this session), pytest, trufflehog, yamllint, docs validation.
  - Forgejo Actions CI: build + test on Python 3.12/3.13/3.14 matrix, sdist/wheel build validation.
  - mise tasks: transcript, handoff, docs-preview, runner-logs, doc validators, etc.
- Forgejo PR workflow. All work on a single feature branch interview, 13 commits, one PR for the whole session. That PR was the live-view window into the work as it happened.
- Handoff bundle (mise run handoff --docs-html <tarball>): flattens source, docs, docs_html, dist, transcript, README into a single .zip for this debrief.
Closing — what I’d do next
- Train splink’s weights on real labels. Replace our hand-tuned constants with EM-estimated log(m/u). Gives us error-rate guarantees instead of fixture-fit.
- Frequency-aware name weighting. Agreement on rare surnames should count for more. Would sharpen the “spurious shared token” cases (link 9).
- Port the whole pipeline to SQLite-backed splink. Reproducible from a single .db file, include in the handoff bundle. Tracked on the plan.md stretch list.
- Cross-link signals. If the same phone or email appears on links for different companies, flag. Classic ring-fraud signal not present in the current fixture.
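The cross-link check in the last item could be prototyped along these lines. This is a sketch under assumptions: each link is a dict with a mercuryCompanyId plus already-normalized email and phone values, and the email and phone key names and the shared_contacts function are hypothetical.

```python
from collections import defaultdict


def shared_contacts(links: list[dict]) -> dict[tuple[str, str], set[str]]:
    """Return contact values that appear on links for more than one company.

    Keys are (field, value) pairs; values are the sets of company ids seen.
    Missing or empty contact values are skipped rather than grouped together.
    """
    seen: dict[tuple[str, str], set[str]] = defaultdict(set)
    for link in links:
        for field in ("email", "phone"):
            value = link.get(field)
            if value:
                seen[(field, value)].add(link["mercuryCompanyId"])
    return {key: ids for key, ids in seen.items() if len(ids) > 1}


# Made-up records: two different companies share one phone number.
example_links = [
    {"mercuryCompanyId": "c1", "email": "a@example.com", "phone": "5550100"},
    {"mercuryCompanyId": "c2", "email": "b@example.com", "phone": "5550100"},
    {"mercuryCompanyId": "c1", "email": "a@example.com", "phone": None},
]
flagged = shared_contacts(example_links)
```

Repeats within a single company are ignored; only values spanning two or more companies are flagged.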
Q&A handouts
Quick pointers by topic:
- How did you pick the threshold? → scoring-combiner
- Why this scorer and not Jaro-Winkler? → scoring-name — v1 coarseness is deliberate; combiner filters false positives.
- What about Unicode / transliteration / OCR typos? → edge-cases
- Why three solutions? → alternative-solutions
- What’s the relationship to the 1969 paper? → matching-approach
- Can I see the tests? → tests/test_score.py, tests/test_match.py, tests/test_nicknames.py, tests/test_solution_*.py
- Can I see the code? → src/mercury/: five small modules. match.py is the combiner, score.py the per-field scorers, normalize.py the tokenization primitives, nicknames.py the equivalence-class loader, cli.py the subcommand surface.