Tree-sitter in Kingfisher Scanning
This document explains how Tree-sitter is used during scanning, and when it is intentionally skipped.
What Tree-sitter Is Used For
Kingfisher always starts with a fast regex pass (Vectorscan/Hyperscan). Tree-sitter is a secondary verification layer used only for context-dependent findings.
The goal is to confirm that a regex hit appears in a plausible code assignment/config context (for example api_key = "...") before keeping the finding.
Where It Runs in the Scan Pipeline
- BlobProcessor::run decides whether to compute a language hint.
  - It skips language hinting in turbo_mode.
  - It also skips when blob size is outside the Tree-sitter window.
- Matcher::scan_blob performs the primary regex scan and other filtering.
- maybe_apply_tree_sitter_verification runs near the end of scan_blob.
  - Only candidate matches are checked against Tree-sitter extracted text.
  - Matches that fail verification are dropped for context-dependent rules.
Size and Mode Gates
Tree-sitter is attempted only when all of these are true:
- Blob length is between 0 KiB and 128 KiB (should_attempt_tree_sitter).
- turbo_mode is disabled.
- A language hint is available.
- The language maps to a supported Tree-sitter grammar + query set.
If any of these conditions fails, Tree-sitter verification is considered unavailable for that blob.
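The four gates can be read as a single boolean check. The following is a minimal sketch, not Kingfisher's actual code: the signature of should_attempt_tree_sitter, the inclusive upper bound, and the helper names has_supported_grammar and tree_sitter_available are assumptions made for illustration.

```rust
/// Upper bound of the Tree-sitter window described above (128 KiB).
const TREE_SITTER_MAX_LEN: usize = 128 * 1024;

/// Assumed shape of the size gate. The real `should_attempt_tree_sitter`
/// may take different arguments or treat the boundaries differently.
fn should_attempt_tree_sitter(blob_len: usize) -> bool {
    blob_len <= TREE_SITTER_MAX_LEN
}

/// Hypothetical helper: reports whether a language hint maps to a grammar
/// and query set. Only a few languages are listed to keep the sketch short.
fn has_supported_grammar(hint: &str) -> bool {
    matches!(hint, "python" | "rust" | "go" | "yaml")
}

/// Hypothetical helper combining all four gates. Verification is treated
/// as available only when every gate passes.
fn tree_sitter_available(
    blob_len: usize,
    turbo_mode: bool,
    language_hint: Option<&str>,
) -> bool {
    should_attempt_tree_sitter(blob_len)
        && !turbo_mode
        && language_hint.map_or(false, has_supported_grammar)
}
```

Under these assumptions, a 4 KiB Python blob scanned without turbo_mode passes, while the same blob with turbo_mode enabled, with no language hint, or at 256 KiB does not.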
Candidate Selection (Not Every Match)
Tree-sitter verification is only applied to matches that are:
- Classified as ContextDependent by rule profiling.
- Not base64-derived findings (is_base64 == false).
Classification comes from rule profiles in kingfisher-rules:
- SelfIdentifying: keep raw regex result.
- ContextDependent: may require Tree-sitter confirmation.
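The selection rule amounts to a two-part predicate. In the sketch below, the variant names SelfIdentifying and ContextDependent and the is_base64 flag come from the text above; the RuleProfile and Candidate types and the needs_tree_sitter_check function are hypothetical names introduced for illustration.

```rust
/// Rule classification, using the two profile names from the text above.
/// The enum name itself is an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RuleProfile {
    SelfIdentifying,
    ContextDependent,
}

/// Hypothetical, simplified view of a regex match awaiting verification.
struct Candidate {
    profile: RuleProfile,
    is_base64: bool,
}

/// A match is sent to Tree-sitter verification only when its rule is
/// context-dependent and the finding was not derived from base64 data.
fn needs_tree_sitter_check(candidate: &Candidate) -> bool {
    candidate.profile == RuleProfile::ContextDependent && !candidate.is_base64
}
```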
How Verification Works
When Tree-sitter is available:
- load_tree_sitter_results builds a Checker with:
  - Language enum value
  - language-specific queries from src/parser/queries.rs
- Checker::check:
  - Reuses a thread-local parser cache (PARSER_CACHE)
  - Parses source into a syntax tree
  - Runs language query patterns capturing @key and @val
  - Produces normalized strings like key = value
  - Attempts base64 decode of value and keeps decoded ASCII form when valid
- For each candidate finding, Kingfisher re-runs that rule's anchored regex on each extracted Tree-sitter text fragment.
- Verification succeeds only when the rule's secret capture equals the original matched secret bytes.
If no extracted fragment verifies the secret, that candidate finding is removed.
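The final comparison step can be sketched with the standard library alone. This is an illustration under stated assumptions, not the actual implementation: the rule's anchored regex is replaced by a plain function (example_rule), and SecretExtractor and verify_candidate are hypothetical names. The fragments are assumed to be the normalized key = value strings that Checker::check produces.

```rust
/// Hypothetical stand-in for a rule's anchored regex: given one extracted
/// fragment, return the secret capture if the fragment matches the rule.
type SecretExtractor = fn(&str) -> Option<&str>;

/// Example rule (an assumption for this sketch): the fragment must be
/// exactly `api_key = <at least 8 ASCII alphanumeric characters>`.
fn example_rule(fragment: &str) -> Option<&str> {
    let value = fragment.strip_prefix("api_key = ")?;
    let plausible = value.len() >= 8 && value.bytes().all(|b| b.is_ascii_alphanumeric());
    if plausible {
        Some(value)
    } else {
        None
    }
}

/// Returns true when at least one fragment yields a secret capture that is
/// byte-for-byte equal to the secret found by the original regex pass.
fn verify_candidate(
    matched_secret: &[u8],
    fragments: &[String],
    extract: SecretExtractor,
) -> bool {
    fragments
        .iter()
        .filter_map(|fragment| extract(fragment.as_str()))
        .any(|secret| secret.as_bytes() == matched_secret)
}
```

Comparing the capture against the original secret bytes, rather than accepting any fragment that matches the rule, ties the verification to the specific finding instead of to some other assignment in the same file.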
Behavior When Tree-sitter Is Unavailable
If Tree-sitter cannot run (size/mode/language/parse errors), Kingfisher keeps the original regex finding.
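This fail-open behavior can be summarized as a small decision function. The Verification enum and keep_finding function below are hypothetical names used only to make the three outcomes explicit.

```rust
/// Possible outcomes of the Tree-sitter step for one candidate finding.
/// Hypothetical type introduced for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Verification {
    /// Size, mode, language, or parse problems prevented verification.
    Unavailable,
    /// An extracted fragment reproduced the matched secret.
    Confirmed,
    /// Verification ran, but no fragment reproduced the secret.
    Rejected,
}

/// Decide whether a regex finding survives. Findings that are not
/// verification candidates are always kept, as are findings for which
/// verification could not run.
fn keep_finding(is_candidate: bool, outcome: Verification) -> bool {
    if !is_candidate {
        return true;
    }
    match outcome {
        Verification::Unavailable | Verification::Confirmed => true,
        Verification::Rejected => false,
    }
}
```

Only an explicit rejection removes a finding, so a blob that cannot be parsed never loses findings from the regex pass.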
Supported Languages in This Path
Language mapping for verification currently includes:
- bash / shell
- c
- c# / csharp
- c++ / cpp
- css
- go
- html
- java
- javascript / js
- php
- python / py / starlark
- ruby
- rust
- toml
- typescript / ts
- yaml
The Tree-sitter query definitions for these languages live in src/parser/queries.rs.
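The alias list above suggests a lookup from hint string to grammar. A minimal sketch follows, assuming a Language enum with one variant per grammar and a hypothetical language_from_hint function. The variant names and the case-insensitive matching are assumptions; the alias strings are taken from the list.

```rust
/// One variant per supported grammar. Variant names are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Language {
    Bash, C, CSharp, Cpp, Css, Go, Html, Java,
    JavaScript, Php, Python, Ruby, Rust, Toml, TypeScript, Yaml,
}

/// Hypothetical lookup from a language hint to a grammar. Returns `None`
/// for unsupported hints, in which case verification is unavailable.
fn language_from_hint(hint: &str) -> Option<Language> {
    // Lowercasing the hint is an assumption of this sketch.
    let language = match hint.to_ascii_lowercase().as_str() {
        "bash" | "shell" => Language::Bash,
        "c" => Language::C,
        "c#" | "csharp" => Language::CSharp,
        "c++" | "cpp" => Language::Cpp,
        "css" => Language::Css,
        "go" => Language::Go,
        "html" => Language::Html,
        "java" => Language::Java,
        "javascript" | "js" => Language::JavaScript,
        "php" => Language::Php,
        "python" | "py" | "starlark" => Language::Python,
        "ruby" => Language::Ruby,
        "rust" => Language::Rust,
        "toml" => Language::Toml,
        "typescript" | "ts" => Language::TypeScript,
        "yaml" => Language::Yaml,
        _ => return None,
    };
    Some(language)
}
```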
Operational Summary
Tree-sitter in Kingfisher is a conditional verifier, not the primary detector:
- Regex finds candidates quickly.
- Rule profiling decides which candidates need context verification.
- Tree-sitter confirms contextual plausibility from parsed syntax.
- If verification cannot run, scan results fall back to the regex pass.
This keeps scanning fast while reducing noisy matches for context-dependent secret patterns.