kingfisher/docs/PARSING.md

# Kingfisher Source Code Parsing

[← Back to README](../README.md)

Kingfisher uses a parser-based context verifier as a second pass on supported source files. After its initial regex scan (powered by Vectorscan/Hyperscan), it extracts assignment-style snippets from code and configuration files to confirm that generic keyword+token matches appear in plausible contexts.

The implementation favors lightweight extractors over full AST parsing:

- **Handwritten lexers** for common programming and config languages — comment-aware stripping followed by regex-based `key = value` extraction
- **`tl`** for HTML — attribute values, element text, and embedded `<script>` / `<style>` delegation
- **`cssparser`** for CSS — declaration parsing via Mozilla’s CSS tokenizer
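
As a rough illustration of the handwritten-lexer approach, the sketch below strips a trailing `#` comment while respecting quotes, then pulls out a `key = value` pair. It is a minimal, std-only sketch: the function names `strip_hash_comment` and `extract_assignment` are hypothetical and are not taken from Kingfisher's source, and it uses plain string searching where the text above describes regex-based extraction.

```rust
/// Remove a trailing `#` comment from one line, ignoring `#` inside quotes.
/// Hypothetical helper: a real lexer would also handle escapes and block comments.
fn strip_hash_comment(line: &str) -> &str {
    let mut quote: Option<char> = None;
    for (idx, ch) in line.char_indices() {
        match (quote, ch) {
            // Closing quote matches the one that opened the string.
            (Some(q), c) if c == q => quote = None,
            // Opening quote outside of any string.
            (None, '"') | (None, '\'') => quote = Some(ch),
            // Comment marker outside of any string: cut the line here.
            (None, '#') => return &line[..idx],
            _ => {}
        }
    }
    line
}

/// Split `key = value` (or `key: value`) into a trimmed, unquoted pair.
/// Returns `None` when the line holds no assignment after comment stripping.
fn extract_assignment(line: &str) -> Option<(String, String)> {
    let code = strip_hash_comment(line);
    let pos = code.find(|c: char| c == '=' || c == ':')?;
    let key = code[..pos].trim();
    let value = code[pos + 1..]
        .trim()
        .trim_matches(|c: char| c == '"' || c == '\'');
    if key.is_empty() || value.is_empty() {
        return None;
    }
    Some((key.to_string(), value.to_string()))
}
```

Stripping comments before extraction keeps a commented-out assignment from being reported as context, and the quote tracking keeps a `#` inside a string literal from truncating the value.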

> **History:** Earlier parser implementations relied on 17 statically linked
> tree-sitter grammar crates. This added ~20 MB to the binary and required building a
> full syntax tree just to extract assignment pairs. The current lexer-based
> approach achieves the same extraction quality with near-zero binary overhead
> and no external grammar dependencies.

## How It’s Called

In the scanning phase (in the Matcher’s implementation), Kingfisher does the following:

- **Primary Regex Pass:** Kingfisher always scans the full blob with Vectorscan/Hyperscan first.
- **Candidate Selection:** Findings from rules classified as context-dependent become parser-verification candidates.
- **Language Detection:** If a language string is provided (for example from metadata or extension), the code maps it to a supported parser backend.
- **Parsing and Querying:** The parser streams normalized snippets such as `key = value` without materializing a full syntax tree.
- **Verification Decision:** Strict contextual candidates are kept only if parser-extracted context verifies the matched secret. More explicit assignment-style rules can still survive on raw regex evidence when parser verification is unavailable.
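
The verification decision in the last step can be sketched as a small truth table. The type and function names here (`MatchProfile`, `ParserOutcome`, `keep_finding`) are hypothetical, and one case is an assumption: the text does not say what happens to an assignment-style rule when the parser runs but does not confirm the match, so this sketch drops it.

```rust
/// How strongly a rule depends on parser context (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq)]
enum MatchProfile {
    /// Kept only when parser-extracted context confirms the secret.
    StrictContext,
    /// Explicit assignment-style rule; may fall back to the raw regex hit.
    AssignmentStyle,
}

/// Outcome of the parser pass for one candidate finding.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ParserOutcome {
    /// Extracted context confirmed the matched secret.
    Verified,
    /// The parser ran but found no confirming context.
    NotVerified,
    /// Verification was skipped or extraction failed.
    Unavailable,
}

/// Decide whether a context-dependent candidate survives.
fn keep_finding(profile: MatchProfile, outcome: ParserOutcome) -> bool {
    match (profile, outcome) {
        // Parser confirmation keeps any candidate.
        (_, ParserOutcome::Verified) => true,
        // Assignment-style rules survive on raw regex evidence
        // when parser verification is unavailable.
        (MatchProfile::AssignmentStyle, ParserOutcome::Unavailable) => true,
        // Strict rules without confirmation are suppressed. Dropping
        // AssignmentStyle + NotVerified is an assumption of this sketch.
        _ => false,
    }
}
```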

## Supported Languages

Context verification covers many common source and configuration languages. The `Language` enum (defined in the parser module) includes variants for:

- **Scripting:** Bash, Python, Ruby, PHP
- **Compiled languages:** C, C++, C#, Go, Java, Rust
- **Web languages:** CSS, HTML, JavaScript, TypeScript
- **Configuration formats:** YAML, TOML
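
To make the mapping from a language string to a backend concrete, here is a minimal sketch of such an enum and a lookup function. The variant names, the `language_from_hint` function, and the accepted aliases are all assumptions for illustration. Only the set of languages comes from the list above.

```rust
/// Languages with a context-verification backend.
/// Variant names are hypothetical; the set mirrors the list above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Language {
    Bash, C, CSharp, Cpp, Css, Go, Html, Java,
    JavaScript, Php, Python, Ruby, Rust, Toml, TypeScript, Yaml,
}

/// Map a language name or file extension to a backend, if one exists.
/// The aliases accepted here are assumptions, not Kingfisher's actual table.
fn language_from_hint(hint: &str) -> Option<Language> {
    // Normalize ".PY", "py", and " Py " to the same lookup key.
    let normalized = hint.trim().trim_start_matches('.').to_ascii_lowercase();
    let lang = match normalized.as_str() {
        "bash" | "sh" => Language::Bash,
        "c" | "h" => Language::C,
        "c#" | "csharp" | "cs" => Language::CSharp,
        "c++" | "cpp" | "cc" | "hpp" => Language::Cpp,
        "css" => Language::Css,
        "go" => Language::Go,
        "html" | "htm" => Language::Html,
        "java" => Language::Java,
        "javascript" | "js" => Language::JavaScript,
        "php" => Language::Php,
        "python" | "py" => Language::Python,
        "ruby" | "rb" => Language::Ruby,
        "rust" | "rs" => Language::Rust,
        "toml" => Language::Toml,
        "typescript" | "ts" => Language::TypeScript,
        "yaml" | "yml" => Language::Yaml,
        // Unknown hint: no verifier is constructed for this blob.
        _ => return None,
    };
    Some(lang)
}
```

Returning `Option<Language>` lets the caller treat an unsupported or missing hint the same way: no backend, so no context verification.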

## When Context Verification Is Not Called

Context verification is skipped in certain cases:

- **No Language Identified:** If the file isn’t recognized as belonging to one of the supported languages or no language hint is provided, the context verifier isn’t even constructed.
- **Non-source Files:** Binary files or files that aren’t expected to contain code (or aren’t extracted from archives) bypass parser-based context verification.
- **Large Blobs:** Files larger than 2 MiB skip context verification to avoid spending time on generated or minified content.
- **Verification Errors:** If extraction fails, rules whose match profile strictly requires parser confirmation are suppressed. Assignment-style contextual rules can still fall back to their raw regex hit.
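
The skip conditions above could be checked up front by a small gate function, sketched below. The 2 MiB limit comes from the text. The names `MAX_CONTEXT_BLOB_BYTES`, `SkipReason`, and `context_verification_gate` are hypothetical, and the NUL-byte test for binary content is an assumed heuristic rather than Kingfisher's actual detection. The sketch only checks that a hint is present, not whether the language is supported.

```rust
/// Size limit from the text; the constant name is an assumption.
const MAX_CONTEXT_BLOB_BYTES: usize = 2 * 1024 * 1024; // 2 MiB

/// Why context verification was skipped (hypothetical type).
#[derive(Debug, PartialEq)]
enum SkipReason {
    NoLanguage,
    TooLarge,
    Binary,
}

/// Return `Ok(())` when the blob should go through context verification.
fn context_verification_gate(
    language_hint: Option<&str>,
    blob: &[u8],
) -> Result<(), SkipReason> {
    // No hint, or an empty one: the verifier is never constructed.
    if language_hint.map_or(true, |hint| hint.trim().is_empty()) {
        return Err(SkipReason::NoLanguage);
    }
    // Cheap length check first, so huge blobs are not scanned for NUL bytes.
    if blob.len() > MAX_CONTEXT_BLOB_BYTES {
        return Err(SkipReason::TooLarge);
    }
    // Assumed heuristic: a NUL byte marks the blob as binary.
    if blob.contains(&0) {
        return Err(SkipReason::Binary);
    }
    Ok(())
}
```

A blob that fails the gate still keeps its primary regex findings. Only the second-pass context check is skipped.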

## Summary

Parser-based context verification is conditional and complementary. It runs only when the scanned file is a supported source or config file. Its role is to reduce noisy strict-context findings by checking them against extracted code and config structure, without unnecessarily dropping clear assignment-style secrets found in raw text inputs.

This layered approach helps improve the accuracy of secret detection while maintaining high performance.