kingfisher/docs/PARSING.md
2025-06-24 17:17:16 -07:00

Kingfisher Source Code Parsing

Kingfisher uses tree-sitter as an extra layer of analysis when scanning source files written in supported programming languages. In practice, after its initial regex-based scan (powered by Vectorscan), Kingfisher checks whether the file's language is known.

If so, it creates a Checker (see below) that uses tree-sitter to parse the file and run language-specific queries. This additional pass refines detection by capturing more structured patterns, such as secret-like tokens, that might be obscured or spread across several code constructs.

How It's Called

In the scanning phase (in the Matcher implementation), Kingfisher does the following:

  • Language Detection: When processing a blob, if a language string is provided (e.g. inferred from file metadata or extension), the code calls a helper (via a function like get_language_and_queries) to retrieve the corresponding tree-sitter language and a set of queries.
  • Checker Creation: With these values, a Checker struct is instantiated. This struct holds both the target language (as defined in its Language enum) and a map of tree-sitter queries to run.
  • Parsing and Querying: The Checker's key method (e.g. check, or indirectly via modify_regex) retrieves a thread-local tree-sitter parser (to avoid recreating the parser on every call), sets the appropriate language, and parses the source code into a syntax tree. It then executes the queries over that tree, extracting the ranges and text of nodes that might represent secrets.
    (See the implementation details in the parser module: for example, the modify_regex function in the Checker and the conditional tree-sitter call in Matcher::scan_blob.)
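
The steps above can be sketched roughly as follows. This is a minimal illustration using only the Rust standard library, not Kingfisher's actual code: StubParser stands in for the tree-sitter parser, each "query" is a plain substring rather than a real tree-sitter query, and the fields and return type of Checker are assumptions made for the example.

```rust
use std::cell::RefCell;
use std::collections::BTreeMap;
use std::ops::Range;

/// Hypothetical stand-in for a tree-sitter parser (not the `tree_sitter` crate).
struct StubParser {
    language: Option<String>,
    parse_count: usize,
}

impl StubParser {
    fn set_language(&mut self, language: &str) {
        self.language = Some(language.to_string());
    }

    /// Pretend to parse: the "tree" is just a copy of the source text.
    fn parse(&mut self, source: &str) -> Option<String> {
        self.language.as_ref()?; // no language set means no parse
        self.parse_count += 1;
        Some(source.to_string())
    }
}

thread_local! {
    // One parser per thread, created lazily and reused across calls.
    static PARSER: RefCell<StubParser> =
        RefCell::new(StubParser { language: None, parse_count: 0 });
}

/// Simplified, assumed shape of a Checker: a language plus named queries.
struct Checker {
    language: String,
    queries: BTreeMap<String, String>,
}

impl Checker {
    /// Returns (query name, byte range, matched text) for every hit,
    /// or None if the source could not be parsed.
    fn check(&self, source: &str) -> Option<Vec<(String, Range<usize>, String)>> {
        PARSER.with(|cell| {
            let mut parser = cell.borrow_mut();
            parser.set_language(&self.language);
            let tree = parser.parse(source)?;
            let mut hits = Vec::new();
            for (name, needle) in &self.queries {
                for (start, text) in tree.match_indices(needle.as_str()) {
                    hits.push((name.clone(), start..start + text.len(), text.to_string()));
                }
            }
            Some(hits)
        })
    }
}

/// Number of parses performed by this thread's shared parser.
fn parses_on_this_thread() -> usize {
    PARSER.with(|cell| cell.borrow().parse_count)
}
```

Because the parser lives in a thread_local! slot, repeated check calls on the same thread reuse a single parser instance and only reset its language, which is the cost the thread-local design is meant to avoid paying per blob.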

Supported Languages

The design supports many common source code languages. The Language enum (defined in the parser module) includes variants for:

  • Scripting: Bash, Python, Ruby, PHP
  • Compiled languages: C, C++, C#, Rust, Java
  • Web-related languages: CSS, HTML, JavaScript, TypeScript, YAML, TOML
  • Others: Go, and even a generic “Regex” mode

Each variant maps to its corresponding tree-sitter language through the get_ts_language() method.
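
As an illustration of this kind of mapping, the sketch below shows a cut-down enum with a lookup from file extension to variant and from variant to a grammar label. It is only an assumption about the general shape: the variant subset, from_extension, and grammar_name are hypothetical, and grammar_name returns a plain string where the real get_ts_language() returns a tree-sitter language value.

```rust
/// Subset of the variants listed above, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Language {
    Bash,
    Python,
    Rust,
    Go,
    TypeScript,
    Toml,
}

impl Language {
    /// Hypothetical helper: guess a language from a file extension.
    /// Unknown extensions yield None, so no syntax pass would run.
    fn from_extension(ext: &str) -> Option<Language> {
        match ext.to_ascii_lowercase().as_str() {
            "sh" | "bash" => Some(Language::Bash),
            "py" => Some(Language::Python),
            "rs" => Some(Language::Rust),
            "go" => Some(Language::Go),
            "ts" => Some(Language::TypeScript),
            "toml" => Some(Language::Toml),
            _ => None,
        }
    }

    /// Stand-in for get_ts_language(): returns a grammar label
    /// instead of a real tree-sitter language object.
    fn grammar_name(self) -> &'static str {
        match self {
            Language::Bash => "bash",
            Language::Python => "python",
            Language::Rust => "rust",
            Language::Go => "go",
            Language::TypeScript => "typescript",
            Language::Toml => "toml",
        }
    }
}
```

An exhaustive match keeps the mapping honest: adding a variant without giving it a grammar is a compile error rather than a silent gap.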

When Tree-sitter Is Not Called

Tree-sitter won't be invoked in certain cases:

  • No Language Identified: If the file isn't recognized as belonging to one of the supported languages, or no language hint is provided, the Checker isn't even constructed.
  • Non-source Files: Binary files or files that aren't expected to contain code (or aren't extracted from archives) bypass tree-sitter parsing.
  • Fallback on Errors: If tree-sitter parsing fails (e.g. due to malformed code or other errors), Kingfisher falls back on its regex/Vectorscan matches without the additional tree-sitter insights.
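
The conditional behaviour described in these cases might look something like the sketch below. It is an assumption-laden illustration, not Kingfisher's implementation: Finding, syntax_pass, and scan are hypothetical names, the "syntax pass" just searches for the word "token", and invalid UTF-8 is used as a simple stand-in for a parse failure.

```rust
/// Hypothetical finding record; `pass` names the stage that produced it.
#[derive(Debug, Clone, PartialEq)]
struct Finding {
    pass: &'static str,
    start: usize,
    end: usize,
}

/// Stand-in for the tree-sitter pass. It fails on input that is not valid
/// UTF-8 (simulating a parse error) and otherwise reports each "token" hit.
fn syntax_pass(source: &[u8]) -> Result<Vec<Finding>, String> {
    let text = std::str::from_utf8(source)
        .map_err(|e| format!("not valid UTF-8: {}", e))?;
    Ok(text
        .match_indices("token")
        .map(|(start, m)| Finding { pass: "syntax", start, end: start + m.len() })
        .collect())
}

/// Combine the regex findings with the optional syntax pass.
fn scan(language: Option<&str>, source: &[u8], regex_findings: Vec<Finding>) -> Vec<Finding> {
    match language {
        // No language identified: the syntax pass is skipped entirely.
        None => regex_findings,
        Some(_) => match syntax_pass(source) {
            // Syntax pass succeeded: add its findings to the regex ones.
            Ok(extra) => {
                let mut all = regex_findings;
                all.extend(extra);
                all
            }
            // Syntax pass failed: keep only the regex findings.
            Err(_) => regex_findings,
        },
    }
}
```

In every branch the regex findings survive unchanged, so a missing language or a failed parse can only reduce the extra detail, never the baseline results.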

Summary

In essence, Kingfisher's use of tree-sitter is conditional and complementary. It is called only when the scanned file is a source code file written in a supported language, and its role is to enrich the scan results by using the syntax tree and language-specific queries. When a file is non-source or binary, or when no language is provided, tree-sitter is not invoked and Kingfisher relies solely on its regex-based detection.

This layered approach helps improve the accuracy of secret detection while maintaining high performance.