# Kingfisher Source Code Parsing
[← Back to README](../README.md)
Kingfisher uses a parser-based context verifier as a second pass on supported source files. After its initial regex scan (powered by Vectorscan/Hyperscan), it extracts assignment-style snippets from code and configuration files to confirm that generic keyword+token matches appear in plausible contexts.
The implementation favors lightweight extractors over full AST parsing:
- **Handwritten lexers** for common programming and config languages: comment-aware stripping followed by regex-based `key = value` extraction
- **`tl`** for HTML: attribute values, element text, and delegation of embedded `<script>` / `<style>` content
- **`cssparser`** for CSS: declaration parsing via Mozilla's CSS tokenizer
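
To make the lexer idea concrete, here is a minimal sketch of comment-aware stripping followed by `key = value` extraction. It is not Kingfisher's actual code: the function names `strip_line_comment` and `extract_assignment` are hypothetical, it handles only `#` and `//` line comments, and it swaps the regex-based extraction described above for plain standard-library string operations so that it has no dependencies.

```rust
/// Hypothetical helper (not Kingfisher's API): remove a trailing `#` or `//`
/// line comment, ignoring comment markers that appear inside quotes.
fn strip_line_comment(line: &str) -> &str {
    let bytes = line.as_bytes();
    let mut quote: Option<u8> = None;
    let mut i = 0;
    while i < bytes.len() {
        let b = bytes[i];
        match quote {
            Some(q) => {
                if b == b'\\' {
                    i += 1; // skip the escaped character
                } else if b == q {
                    quote = None; // closing quote
                }
            }
            None => {
                if b == b'"' || b == b'\'' {
                    quote = Some(b); // opening quote
                } else if b == b'#' || (b == b'/' && bytes.get(i + 1) == Some(&b'/')) {
                    return &line[..i]; // comment starts here
                }
            }
        }
        i += 1;
    }
    line
}

/// Hypothetical helper: turn one line into a normalized `(key, value)` pair.
/// Accepts `key = value`, `key: value`, and `key := value`.
fn extract_assignment(line: &str) -> Option<(String, String)> {
    let code = strip_line_comment(line).trim();
    // The first `=` or `:` is treated as the separator.
    let idx = code.find(|c: char| c == '=' || c == ':')?;
    // The key is the last token before the separator (drops `let`, `export`, ...).
    let key = code[..idx]
        .split_whitespace()
        .last()?
        .trim_matches(|c: char| c == '"' || c == '\'');
    let value = code[idx + 1..]
        .trim_start_matches('=') // covers `:=`
        .trim()
        .trim_end_matches(|c: char| c == ';' || c == ',')
        .trim_matches(|c: char| c == '"' || c == '\'');
    if key.is_empty() || value.is_empty() {
        return None;
    }
    Some((key.to_string(), value.to_string()))
}
```

Calling `extract_assignment` on each line of a blob yields a stream of pairs without building a syntax tree. A real extractor would also need block comments, multi-line strings, and language-specific rules, which this sketch leaves out.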
> **History:** Earlier parser implementations relied on 17 statically-linked
> grammar crates. This added ~20 MB to the binary and required building a
> full syntax tree just to extract assignment pairs. The current lexer-based
> approach achieves the same extraction quality with near-zero binary overhead
> and no external grammar dependencies.
## How It's Called
In the scanning phase (in the Matchers implementation), Kingfisher does the following:
- **Primary Regex Pass:** Kingfisher always scans the full blob with Vectorscan/Hyperscan first.
- **Candidate Selection:** Findings from rules classified as context-dependent become parser-verification candidates.
- **Language Detection:** If a language string is provided (for example from metadata or extension), the code maps it to a supported parser backend.
- **Parsing and Querying:** The parser streams normalized snippets such as `key = value` without materializing a full syntax tree.
- **Verification Decision:** Strict contextual candidates are kept only if parser-extracted context verifies the matched secret. More explicit assignment-style rules can still survive on raw regex evidence when parser verification is unavailable.
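
The verification decision in the last step can be sketched as a small truth table. The type and function names below (`MatchProfile`, `ParserOutcome`, `keep_finding`) are hypothetical and chosen for illustration; they are not taken from the Kingfisher source. One case is an explicit assumption: the text above does not say what happens to an assignment-style finding when the parser runs but does not confirm it, so this sketch drops it.

```rust
/// Hypothetical classification of a context-dependent rule.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MatchProfile {
    /// Must be confirmed by parser-extracted context.
    StrictContext,
    /// Explicit assignment-style rule; may fall back to the raw regex hit.
    AssignmentStyle,
}

/// Hypothetical result of the parser pass for one finding.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ParserOutcome {
    /// Extracted context contains the matched secret.
    Verified,
    /// The parser ran but did not confirm the match.
    NotVerified,
    /// No parser ran (unsupported language, skipped blob, extraction error).
    Unavailable,
}

/// Decide whether a finding survives the second pass.
fn keep_finding(profile: MatchProfile, outcome: ParserOutcome) -> bool {
    match (profile, outcome) {
        // Parser confirmation always keeps the finding.
        (_, ParserOutcome::Verified) => true,
        // Strict rules need confirmation; anything else suppresses them.
        (MatchProfile::StrictContext, _) => false,
        // Assignment-style rules survive on raw regex evidence.
        (MatchProfile::AssignmentStyle, ParserOutcome::Unavailable) => true,
        // ASSUMPTION: not specified in the text above; dropped in this sketch.
        (MatchProfile::AssignmentStyle, ParserOutcome::NotVerified) => false,
    }
}
```

Keeping the decision in one pure function makes the fallback policy easy to test independently of the lexers.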
## Supported Languages
The design supports many common source code languages. The `Language` enum (defined in the parser module) includes variants for:
- **Scripting:** Bash, Python, Ruby, PHP
- **Compiled languages:** C, C++, C#, Rust, Java
- **Web and configuration languages:** CSS, HTML, JavaScript, TypeScript, YAML, TOML
- **Others:** Go
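
As an illustration of how a language hint might map onto these variants, the sketch below defines a stand-in `Language` enum and a hypothetical `from_hint` function. The variant spellings, the function name, and the accepted hint strings are all assumptions; the real enum in the parser module may differ.

```rust
/// Stand-in for the parser module's enum; variant names are assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Language {
    Bash, C, Cpp, CSharp, Css, Go, Html, Java,
    JavaScript, Php, Python, Ruby, Rust, Toml, TypeScript, Yaml,
}

impl Language {
    /// Hypothetical mapping from a language name or file extension to a
    /// variant. Returns `None` for anything unsupported.
    fn from_hint(hint: &str) -> Option<Language> {
        // Normalize ".RS", "Python", " yml " and similar inputs.
        let normalized = hint.trim().trim_start_matches('.').to_ascii_lowercase();
        let lang = match normalized.as_str() {
            "bash" | "sh" | "shell" => Language::Bash,
            "c" | "h" => Language::C,
            "c++" | "cpp" | "cc" | "hpp" => Language::Cpp,
            "c#" | "cs" | "csharp" => Language::CSharp,
            "css" => Language::Css,
            "go" => Language::Go,
            "html" | "htm" => Language::Html,
            "java" => Language::Java,
            "javascript" | "js" => Language::JavaScript,
            "php" => Language::Php,
            "python" | "py" => Language::Python,
            "ruby" | "rb" => Language::Ruby,
            "rust" | "rs" => Language::Rust,
            "toml" => Language::Toml,
            "typescript" | "ts" => Language::TypeScript,
            "yaml" | "yml" => Language::Yaml,
            _ => return None,
        };
        Some(lang)
    }
}
```

Returning `Option<Language>` lets the caller treat an unrecognized hint the same way as a missing one.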
## When Context Verification Is Not Called
Context verification is skipped in certain cases:
- **No Language Identified:** If the file isn't recognized as belonging to one of the supported languages, or no language hint is provided, the context verifier isn't even constructed.
- **Non-source Files:** Binary files or files that aren't expected to contain code (or aren't extracted from archives) bypass parser-based context verification.
- **Large Blobs:** Files larger than 2 MiB skip context verification to avoid spending time on generated or minified content.
- **Verification Errors:** If extraction fails, rules whose match profile strictly requires parser confirmation are suppressed. Assignment-style contextual rules can still fall back to their raw regex hit.
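
The skip conditions above could be collected into a single gate, sketched below. Only the 2 MiB threshold comes from this document. The names `MAX_VERIFY_BYTES` and `should_run_context_verification` are hypothetical, and the NUL-byte check is an assumed stand-in for however Kingfisher actually detects binary content.

```rust
/// Size limit stated in this document: blobs larger than 2 MiB are skipped.
const MAX_VERIFY_BYTES: usize = 2 * 1024 * 1024;

/// Hypothetical gate deciding whether the parser pass should run at all.
fn should_run_context_verification(language_supported: bool, blob: &[u8]) -> bool {
    // No supported language: the verifier is never constructed.
    if !language_supported {
        return false;
    }
    // Large blobs are likely generated or minified content.
    if blob.len() > MAX_VERIFY_BYTES {
        return false;
    }
    // ASSUMPTION: treat a NUL byte in the first 8 KiB as a sign of binary data.
    let head = &blob[..blob.len().min(8192)];
    !head.contains(&0)
}
```

A blob of exactly 2 MiB still passes the size check in this sketch, since the text says only that files larger than 2 MiB are skipped.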
## Summary
Parser-based context verification is conditional and complementary. It runs only when the scanned file is a supported source or config file. Its role is to reduce noisy strict-context findings by checking them against extracted code and config structure, without dropping clear assignment-style secrets found in raw text inputs.
This layered approach helps improve the accuracy of secret detection while maintaining high performance.