kingfisher/docs/CONTEXT_VERIFICATION.md
2026-04-15 14:37:26 -07:00

50 lines
2.7 KiB
Markdown

# Parser-Based Context Verification
[← Back to README](../README.md)
Kingfisher starts with a fast regex pass powered by Vectorscan/Hyperscan. For rules classified as `ContextDependent`, it can then run a lightweight parser-based verification pass that extracts likely assignment-style snippets such as `api_key = secret`.
> **Why not a full AST parser?** Earlier implementations used statically linked
> grammar crates for this step. That added roughly 20 MB to the binary and
> required a full AST parse just to extract `key = value` pairs. The current
> approach — handwritten regex-based lexers with comment-aware stripping —
> produces the same (or better) extraction quality at a fraction of the binary
> and runtime cost.
## Where It Runs
1. `BlobProcessor::run` decides whether to compute a language hint.
2. `Matcher::scan_blob` performs the primary regex scan and other filtering.
3. `maybe_apply_context_verification` streams parser candidates near the end of `scan_blob`.
4. Only context-dependent, non-Base64 matches are checked.
5. Candidates whose match profile strictly requires parser confirmation are removed if they cannot be verified.
## Gates
Context verification runs only when all of these are true:
- Blob length is between `0 KiB` and `2 MiB` (`should_attempt_context_verification`).
- Turbo mode is disabled.
- A supported language hint is available.
If any gate fails, only strict contextual matches are suppressed. Assignment-style contextual rules may still fall back to their raw regex hit when the parser cannot run.
## Backends
Kingfisher uses lightweight language-specific extractors instead of a full AST layer:
- Handwritten lexers for Bash, C, C#, C++, Go, Java, JavaScript, PHP, Python, Ruby, Rust, TOML, TypeScript, and YAML
- `tl` for HTML attributes, element text, and embedded `<script>` / `<style>` blocks
- `cssparser` for CSS declarations and function-style values
Each lexer runs a comment-aware stripping pass (tracking string boundaries to avoid false comment detection) followed by a small set of regex patterns that extract assignment-style pairs.
## Verification Model
- Rule profiling decides which matches are `ContextDependent`.
- A narrower subset of those profiles are treated as parser-mandatory (`strict_contextual_shape`).
- The parser streams candidate text snippets like `secret_key = abcd1234`.
- Kingfisher re-runs the rule's anchored regex against each candidate snippet.
- Verification succeeds only when the regex secret capture exactly matches the original hit.
This keeps the fast regex engine on the hot path while still filtering noisy generic keyword+token matches with language-aware context, without dropping clear assignment-style secrets from raw text files just because no parser backend is available.