# Kingfisher Source Code Parsing
[← Back to README](../README.md)

Kingfisher uses tree-sitter as an extra layer of analysis when scanning source files written in supported programming languages. In practice, after its initial regex-based scan (powered by Vectorscan/Hyperscan), Kingfisher can run a targeted verification pass for context-dependent rules.

When a context-dependent rule produces a finding and the file's language is supported, Kingfisher creates a `Checker` (see below) that uses tree-sitter to parse the file and run language-specific queries. This additional pass refines detection by capturing more structured patterns, such as secret-like tokens, that might be obscured or spread across code constructs.
## How It's Called
In the scanning phase (in the Matcher's implementation), Kingfisher does the following:
- **Primary Regex Pass:** Kingfisher always scans the full blob with Vectorscan/Hyperscan first.
- **Candidate Selection:** Findings from rules classified as context-dependent become tree-sitter verification candidates.
- **Language Detection:** If a language string is provided (for example from metadata or extension), the code calls a helper (such as `get_language_and_queries`) to retrieve the corresponding tree-sitter language and queries.
- **Checker Creation:** With those values, a `Checker` is instantiated with the target language and query map.
- **Parsing and Querying:** The Checker retrieves a thread-local parser (to avoid recreating it on every call), sets language, parses source, and runs queries to extract structured snippets (for example `key = value` pairs).
- **Verification Decision:** Candidate findings are kept only if parser-extracted context verifies the matched secret. If tree-sitter is unavailable, fallback behavior is profile-driven (for strict generic keyword+token rules, findings are suppressed).

*(See the implementation details in the parser module: for example, the `modify_regex` function in the `Checker` and the conditional tree-sitter call in `Matcher::scan_blob`.)*
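
The steps above can be sketched roughly as follows. This is a minimal, dependency-free illustration and not Kingfisher's actual code: `Finding`, `Fallback`, `MockParser`, and `verify_candidates` are hypothetical names, and a naive `key = value` line splitter stands in for the real tree-sitter parse and queries. It only aims to show the shape of the flow: a thread-local parser that is reused across calls, and a verification decision that falls back to a profile when no language is available.

```rust
use std::cell::RefCell;

/// Hypothetical finding produced by the primary regex pass.
#[derive(Debug, Clone, PartialEq)]
pub struct Finding {
    pub rule_id: String,
    pub secret: String,
    /// True when the rule is classified as context-dependent.
    pub context_dependent: bool,
}

/// Hypothetical fallback profile used when no parse is available.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Fallback {
    Keep,
    Suppress,
}

/// Hypothetical stand-in for a tree-sitter parser; no real grammar is loaded.
#[derive(Default)]
struct MockParser {
    language: Option<String>,
}

impl MockParser {
    fn set_language(&mut self, language: &str) {
        self.language = Some(language.to_string());
    }

    /// Stand-in for "parse and run queries": returns the right-hand side of
    /// every `key = value` line, with quotes and semicolons trimmed.
    fn assigned_values(&self, source: &str) -> Vec<String> {
        if self.language.is_none() {
            return Vec::new();
        }
        source
            .lines()
            .filter_map(|line| line.split_once('='))
            .map(|(_, value)| {
                value
                    .trim()
                    .trim_matches(|c: char| matches!(c, '"' | '\'' | ';'))
                    .to_string()
            })
            .collect()
    }
}

thread_local! {
    // One parser per thread, created lazily and reused across calls.
    static PARSER: RefCell<MockParser> = RefCell::new(MockParser::default());
}

/// Keeps context-free findings as-is. Context-dependent findings are kept
/// only if the extracted context contains the matched secret; without a
/// language hint, the fallback profile decides.
pub fn verify_candidates(
    findings: Vec<Finding>,
    source: &str,
    language: Option<&str>,
    fallback: Fallback,
) -> Vec<Finding> {
    let context: Option<Vec<String>> = language.map(|lang| {
        PARSER.with(|cell| {
            let mut parser = cell.borrow_mut();
            parser.set_language(lang);
            parser.assigned_values(source)
        })
    });

    findings
        .into_iter()
        .filter(|finding| {
            if !finding.context_dependent {
                return true;
            }
            match &context {
                Some(values) => values.iter().any(|v| v == &finding.secret),
                None => fallback == Fallback::Keep,
            }
        })
        .collect()
}
```

Calling `verify_candidates(findings, source, Some("rust"), Fallback::Suppress)` in this sketch drops any context-dependent finding whose secret does not appear as an assigned value, while findings from context-free rules pass through untouched.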
## Supported Languages
The design supports many common source code languages. The Language enum (defined in the parser module) includes variants for:
- **Scripting:** Bash, Python, Ruby, PHP
- **Compiled languages:** C, C++, C#, Rust, Java
- **Web-related languages:** CSS, HTML, JavaScript, TypeScript, YAML, TOML
- **Others:** Go, and even a generic “Regex” mode

Each variant maps to its corresponding tree-sitter language through the `get_ts_language()` method.
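
As an illustration of how a language hint might be resolved to one of these variants, here is a small sketch. The variant spellings, the `from_hint` helper, and the accepted hint strings are assumptions made for this example; the real enum and its `get_ts_language()` method live in the parser module and return a tree-sitter grammar, which is left out here to keep the example dependency-free.

```rust
/// Illustrative version of the language list above; the variant spellings
/// are assumptions, not copied from the parser module.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Language {
    Bash,
    Python,
    Ruby,
    Php,
    C,
    Cpp,
    CSharp,
    Rust,
    Java,
    Css,
    Html,
    JavaScript,
    TypeScript,
    Yaml,
    Toml,
    Go,
    Regex,
}

impl Language {
    /// Hypothetical helper: maps a language name or file extension to a
    /// variant, ignoring case, surrounding whitespace, and a leading dot.
    pub fn from_hint(hint: &str) -> Option<Self> {
        let hint = hint.trim().trim_start_matches('.').to_ascii_lowercase();
        let language = match hint.as_str() {
            "bash" | "sh" => Language::Bash,
            "python" | "py" => Language::Python,
            "ruby" | "rb" => Language::Ruby,
            "php" => Language::Php,
            "c" | "h" => Language::C,
            "cpp" | "c++" | "cc" | "hpp" => Language::Cpp,
            "csharp" | "c#" | "cs" => Language::CSharp,
            "rust" | "rs" => Language::Rust,
            "java" => Language::Java,
            "css" => Language::Css,
            "html" | "htm" => Language::Html,
            "javascript" | "js" => Language::JavaScript,
            "typescript" | "ts" => Language::TypeScript,
            "yaml" | "yml" => Language::Yaml,
            "toml" => Language::Toml,
            "go" => Language::Go,
            "regex" => Language::Regex,
            // Unknown hints yield no language, so no Checker would be built.
            _ => return None,
        };
        Some(language)
    }
}
```

Returning `Option<Self>` keeps the "no language identified" case explicit, so a caller can skip tree-sitter entirely when the hint is not recognized.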
## When Tree-sitter Is Not Called

Tree-sitter won't be invoked in certain cases:
- **No Language Identified:** If the file isn't recognized as belonging to one of the supported languages or no language hint is provided, the Checker isn't even constructed.
- **Non-source Files:** Binary files or files that aren't expected to contain code (or aren't extracted from archives) bypass tree-sitter parsing.
- **Fallback on Errors:** If tree-sitter parsing fails (e.g. due to malformed code or other errors), Kingfisher will fall back on its regex/Vectorscan matches without the additional tree-sitter insights.
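
The three cases above could be expressed along these lines. Everything in this sketch is hypothetical: the `Decision` enum, the `decide` and `refine` functions, and in particular the binary check (a NUL byte within the first 8 KiB), which is an assumed heuristic chosen for illustration rather than Kingfisher's actual detection logic.

```rust
/// Hypothetical outcome of the "should tree-sitter run?" check.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Decision {
    Run,
    SkipNoLanguage,
    SkipNonSource,
}

/// Assumed heuristic: a NUL byte in the first 8 KiB marks the blob as binary.
fn looks_binary(blob: &[u8]) -> bool {
    blob.iter().take(8192).any(|&byte| byte == 0)
}

/// Decides whether the tree-sitter pass applies to this blob.
/// Only the absence of a hint is checked here; mapping the hint to a
/// supported language is assumed to have happened already.
pub fn decide(language: Option<&str>, blob: &[u8]) -> Decision {
    if looks_binary(blob) {
        Decision::SkipNonSource
    } else if language.is_none() {
        Decision::SkipNoLanguage
    } else {
        Decision::Run
    }
}

/// Narrows regex findings to those confirmed by the parse. If the parse
/// failed, the regex findings are returned unchanged.
pub fn refine(
    regex_findings: Vec<String>,
    parsed: Result<Vec<String>, String>,
) -> Vec<String> {
    match parsed {
        Ok(verified) => regex_findings
            .into_iter()
            .filter(|finding| verified.contains(finding))
            .collect(),
        Err(_) => regex_findings,
    }
}
```

Modeling the parse result as a `Result` makes the error fallback a single `match` arm: a failed parse never removes findings, it only withholds the extra verification.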
## Summary
In essence, Kingfisher's use of tree-sitter is conditional and complementary. It is called only when the scanned file is a source code file written in a supported language, and its role is to enrich the scanning results by leveraging the syntax tree and language-specific queries. When files are non-source or binary, or when no language is provided, tree-sitter is not invoked, and Kingfisher relies solely on its regex-based detection.
This layered approach helps improve the accuracy of secret detection while maintaining high performance.