kingfisher/docs/PARSING.md
Mick Grove ab811c8bcf v1.87.0
2026-03-09 20:11:58 -07:00

43 lines
3.6 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Kingfisher Source Code Parsing
[← Back to README](../README.md)
Kingfisher leverages tree-sitter as an extra layer of analysis when scanning source files written in supported programming languages. In practice, after its initial regex-based scan (powered by Vectorscan/Hyperscan), Kingfisher can run a targeted verification pass for context-dependent rules.
If so, it creates a Checker (see below) that uses treesitter to parse the file and run languagespecific queries. This additional pass refines the detection by capturing more structured patterns—such as secret-like tokens—that might be obscured or spread over code constructs.
## How Its Called
In the scanning phase (in the Matcher's implementation), Kingfisher does the following:
- **Primary Regex Pass:** Kingfisher always scans the full blob with Vectorscan/Hyperscan first.
- **Candidate Selection:** Findings from rules classified as context-dependent become tree-sitter verification candidates.
- **Language Detection:** If a language string is provided (for example from metadata or extension), the code calls a helper (such as `get_language_and_queries`) to retrieve the corresponding tree-sitter language and queries.
- **Checker Creation:** With those values, a `Checker` is instantiated with the target language and query map.
- **Parsing and Querying:** The Checker retrieves a thread-local parser (to avoid recreating it on every call), sets language, parses source, and runs queries to extract structured snippets (for example `key = value` pairs).
- **Verification Decision:** Candidate findings are kept only if parser-extracted context verifies the matched secret. If tree-sitter is unavailable, fallback behavior is profile-driven (for strict generic keyword+token rules, findings are suppressed).
*(See the implementation details in the parser module for example, the `modify_regex` function in the Checker, and the conditional treesitter call in Matcher::scan_blob)*
## Supported Languages
The design supports many common source code languages. The Language enum (defined in the parser module) includes variants for:
- **Scripting:** Bash, Python, Ruby, PHP
- **Compiled languages:** C, C++, C#, Rust, Java
- **Web-related languages:** CSS, HTML, JavaScript, TypeScript, YAML, Toml
- **Others:** Go, and even a generic “Regex” mode
Each variant maps to its corresponding treesitter language through the `get_ts_language()` method.
## When Treesitter Is Not Called
Treesitter wont be invoked in certain cases:
- **No Language Identified:** If the file isnt recognized as belonging to one of the supported languages or no language hint is provided, the Checker isnt even constructed.
- **Non-source Files:** Binary files or files that arent expected to contain code (or arent extracted from archives) bypass treesitter parsing.
- **Fallback on Errors:** If treesitter parsing fails (e.g. due to malformed code or other errors), Kingfisher will fall back on its regex/Vectorscan matches without the additional treesitter insights.
## Summary
In essence, Kingfishers use of treesitter is conditional and complementary. It is called only when the scanned file is a source code file written in a supported language, and its role is to enrich the scanning results by leveraging the syntax tree and language-specific queries. When files are non-source, binary, or if no language is provided, treesitter is not invoked, and Kingfisher relies solely on its regex-based detection.
This layered approach helps improve the accuracy of secret detection while maintaining high performance.