Added checksum comparisons to pattern_requirements, new suffix, crc32, and base62 Liquid filters, and verbose logging so mismatched checksums are skipped with context rather than reported as findings.

Mick Grove 2025-11-07 16:31:24 -08:00
commit ccbbbad5bc
16 changed files with 2355 additions and 122 deletions


@ -7,6 +7,8 @@ All notable changes to this project will be documented in this file.
- Added `pattern_requirements` checks to rules, providing lightweight post-regex character-class validation without lookarounds.
- Added an `ignore_if_contains` option to `PatternRequirements` to drop matches containing case-insensitive placeholder words, with tests covering the new behavior.
- Updated rules to adopt the new `pattern_requirements` support.
- Added checksum comparisons to `pattern_requirements`, new `suffix`, `crc32`, and `base62` Liquid filters, and verbose logging so mismatched checksums are skipped with context rather than reported as findings.
- Split GitHub token detections into fine-grained/fixed-format variants and enforce checksum validation for modern GitHub token families (PAT, OAuth, App, refresh) while preserving legacy coverage.
- Automatically enable `--no-dedup` when `--manage-baseline` is supplied so baseline management keeps every finding.

CHANGELOG.md.orig (new file, 288 lines)

@ -0,0 +1,288 @@
# Changelog
All notable changes to this project will be documented in this file.
## [v1.62.0]
- This release is focused on further improving detection accuracy, before even attempting to validate findings.
- Added `pattern_requirements` checks to rules, providing lightweight post-regex character-class validation without lookarounds.
- Added an `ignore_if_contains` option to `PatternRequirements` to drop matches containing case-insensitive placeholder words, with tests covering the new behavior.
- Updated rules to adopt the new `pattern_requirements` support.
- Automatically enable `--no-dedup` when `--manage-baseline` is supplied so baseline management keeps every finding.
## [v1.61.0]
- Fixed local filesystem scans to keep `open_path_as_is` enabled when opening Git repositories and only disable it for diff-based scans.
- Created Linux and Windows specific installer script
- Updated diff-focused scanning so `--branch-root-commit` can be provided alongside `--branch`, letting you diff from a chosen commit while targeting a specific branch tip (still defaulting back to the `--branch` ref when the commit is omitted).
- Updated rules
## [v1.60.0]
- Removed the `--bitbucket-username`, `--bitbucket-token`, and `--bitbucket-oauth-token` flags in favour of `KF_BITBUCKET_*` environment variables when authenticating to Bitbucket.
- Added provider-specific `kingfisher scan` subcommands (for example `kingfisher scan github …`) that translate into the legacy flags under the hood. The new layout keeps backwards compatibility while removing the wall of provider options from `kingfisher scan --help`.
- Updated the README so every provider example (GitHub, GitLab, Bitbucket, Azure Repos, Gitea, Hugging Face, Slack, Jira, Confluence, S3, GCS, Docker) uses the new subcommand style.
- Legacy provider flags (for example `--github-user`, `--gitlab-group`, `--bitbucket-workspace`, `--s3-bucket`) still work but now emit a deprecation warning to encourage migration to the new `kingfisher scan <provider>` flow.
- Kept the direct `kingfisher scan /path/to/dir` flow for local filesystem / local git repo scans while adding a `--list-only` switch to each provider subcommand so repository enumeration no longer requires the standalone `github repos`, `gitlab repos`, etc. commands.
- Removed the legacy top-level provider commands (`kingfisher github`, `kingfisher gitlab`, `kingfisher gitea`, `kingfisher bitbucket`, `kingfisher azure`, `kingfisher huggingface`) now that enumeration lives under `kingfisher scan <provider> --list-only`.
## [v1.59.0]
- Fixed `kingfisher scan github …` (and other provider-specific subcommands) so they no longer demand placeholder path arguments before the CLI accepts the request.
- Fixed `kingfisher scan` so that providing `--branch` without `--since-commit` now diffs the branch against the empty tree and scans every commit reachable from that branch.
- Added rules for meraki, duffel, finnhub, frameio, freshbooks, gitter, infracost, launchdarkly, lob, maxmind, messagebird, nytimes, prefect, scalingo, sendinblue, sentry, shippo, twitch, typeform
## [v1.58.0]
- Added first-class Hugging Face scanning support, including CLI enumeration, token authentication, and integration with remote scans.
- Condensed GitError formatting to report the exit status and the first informative lines from stdout/stderr, producing concise git clone failure logs.
- Added support for scanning Google Cloud Storage buckets via `--gcs-bucket`, including optional prefixes and service-account authentication.
- Added `--skip-aws-account` (now accepting comma-separated values) and `--skip-aws-account-file` to bypass live AWS validation for known canary/honey-token account IDs without triggering alerts. Kingfisher now ships with several canary AWS account IDs pre-seeded in the skip list and now reports matching findings as "Not Attempted" with the "Response" containing "(skip list entry)" so it's clear that validation was intentionally skipped and why.
## [v1.57.0]
- Added inline ignore directive detection to treat suppression tokens anywhere on surrounding lines, including multi-line handling
- Added a `--no-ignore` CLI flag to disable inline directives when you need every potential secret reported
- Added: repeatable `--ignore-comment <TOKEN>` flag to reuse inline directives from other scanners (for example `NOSONAR`, `kics-scan ignore`, `gitleaks:allow`, etc)
- Respect user color settings in update messages by using the same color helper as the main reporter, ensuring consistent output and no ANSI codes on update check, when color is disabled
## [v1.56.0]
- Fixed tree-sitter scanning bug where passing --no-base64 caused errors to be printed when the file type couldn't be determined
## [v1.55.0]
- Added first-class Azure Repos support, including CLI commands, enumeration, and documentation updates
- Improved performance of tree-sitter parsing
- Updated Windows build script to ensure static binary is produced
## [v1.54.0]
- Added first-class Gitea support, including CLI commands, environment-based authentication, documentation, and integration with scans and repository enumeration.
- Populate the finding path from git blob metadata so history-derived secrets display their file location instead of an empty path
- Replaced Match::finding_ids SHA1-based hashing with a fast xxh3_64 digest that keeps IDs deterministic while eliminating a hot-path SHA1 dependency
## [v1.53.0]
- Added first-class Bitbucket support, including CLI commands, authentication helpers, documentation, and integration testing.
## [v1.52.0]
- Enabled ANSI formatting in the tracing formatter whenever stderr is attached to a terminal so colorized updater messages render correctly instead of showing escape sequences.
- Added a new CLI flag, `--user-agent-suffix` to allow developers to append additional information to the user-agent
- Removed the unused --rlimit-nofile flag
## [1.51.0]
- Added diff-only Git scanning via `--since-commit` and `--branch`, including remote-aware ref resolution so CI jobs can pair `--git-url` clones with pull request branches
## [1.50.0]
- Added `--github-exclude` and `--gitlab-exclude` options to skip specific repositories when scanning or listing GitHub and GitLab sources, including support for gitignore-style glob patterns
## [1.49.0]
- Enabled MongoDB URI validation
- AWS + GCP validators now respect HTTPS_PROXY and share a consistent user agent across AWS, GCP, and HTTP validation
- Increase max-file-size default to 256 mb (up from 64 mb)
- Improved AWS rule
## [1.48.0]
- Improved error message when self-update cannot find the current binary
- Optimized memory usage via string interning and extensive data sharing
- Replaced quadratic match filtering with a per-rule span map, fixing missed secrets in extremely large files and improving scan performance
- Support scanning extremely large files by chunking input into 1 GiB segments with small overlaps, avoiding vectorscan buffer limits while preserving match offsets
- Always use chunked vectorscan, eliminating the slow regex fallback for blobs over 4 GiB
- Skip Base64 scanning for blobs over 64 MB to avoid a second pass over massive files
- Increased max-file-size default to 64 MB (up from 25 MB)
## [1.47.0]
- MongoDB validator now validates `mongodb+srv://` URIs with a fast timeout instead of skipping them
- Improved rules: github oauth2, diffbot, mailchimp, aws
- Added validation to SauceLabs rule
- Added rules: shodan, bitly, flickr
- Decode Base64 blobs and scan their contents for secrets while skipping short strings for performance. This has a small performance impact and can be disabled with `--no-base64`
## [1.46.0]
- Improved rules: AWS, pem
- Added rule for Ollama, Weights and Biases, Cerebras, Friendli, Fireworks.ai, NVIDIA NIM, together.ai, zhipu
- Added `self-update` command to update the binary independently. Now supports updating over homebrew managed binary
- MongoDB validator now checks `mongodb+srv://` URIs with fast-fail timeouts
## [1.45.0]
- Added `--repo-artifacts` flag to scan repository issues, gists/snippets, and wikis when cloning via `--git-url`
- Added rules for sendbird, mattermost, langchain, notion
- JWT validation hardened to reject alg:none by default (only allowed if explicitly configured), require iss for OIDC/JWKS verification, ensuring "Active Credential" means cryptographically verified and time-valid, not just unexpired
- Updated the Git cloning logic to include all refs and minimize clone output, allowing Kingfisher to analyze pull request and deleted branch history
## [1.44.0]
- Fixed issue with self-update on Linux
- Reverted the change to json and jsonl outputs by rule
- Added `--skip-regex` and `--skip-word` flags to ignore secrets matching custom patterns or skipwords
## [1.43.0]
- Added rules for clearbit, kickbox, azure container registry, improved Azure Storage key
- Grouped JSON and JSONL outputs by rule, restoring `matches` arrays in reports
## [1.42.0]
- Fixed pagination issue when calling gitlab api
- Expanded directory exclusion handling to interpret plain patterns as prefixes, ensuring options like --exclude .git also skip all nested paths
- Updated baseline management to track encountered findings and remove entries that are no longer present, saving the baseline file whenever entries are pruned or new matches are added
- Added rules for authress, clickhouse, codecov, contentful, curl, dropbox, fly.io, hubspot, firecrawl
- Internal refactoring of rule loader, git enumerator, and filetype guesser
- Improved language detection
## [1.41.0]
- Added support for scanning gitlab subgroups, with `kingfisher scan --gitlab-group my-group --gitlab-include-subgroups`
- Added rule for Vercel
## [1.40.0]
- Dropped the “prevalidated” flag from rule definitions and validation logic so every finding now flows through the standard active/inactive/unknown pipeline, simplifying rule configuration and preventing special-case bypasses
- Improved Tailscale api key detectors
## [1.39.0]
- Added support for scanning Confluence pages via `--confluence-url` and `--cql`
## [1.38.0]
- `--quiet` now suppresses scan summaries and rule statistics unless `--rule-stats` is explicitly provided
- Added X Consumer key detection and validation
## [1.37.0]
- GitLab: Matched GitLab group repository listings to glab by only enumerating projects that belong directly to each group, without automatically traversing subgroups
## [1.36.0]
- Fixed GitHub organization and GitLab group scans when using `--git-history=none`
- JWT tokens without both `iss` and `aud` are no longer reported as active credentials
## [1.35.0]
- Remote scans with `--git-history=none` now clone repositories with a working tree and scan the current files instead of erroring with "No inputs to scan".
- Fixed issue where `--redact` did not function properly
- Fixed validation logic for clarifai rule
## [1.34.0]
- Use system TLS root certificates to support self-hosted GitLab instances with internal CAs
- Added new rule: Coze personal access token
- Updated Supabase rule to detect project URLs and validate their corresponding tokens
## [1.33.0]
- Fixed header precedence so custom HTTP validation headers like `Accept` are preserved
- Added new Heroku rule
## [1.32.0]
- Added support for scanning AWS S3 buckets via `--s3-bucket` and optional `--s3-prefix`
- Added `--role-arn` and `--aws-local-profile` flags for S3 authentication alongside `KF_AWS_KEY`/`KF_AWS_SECRET`
- Added progress bar for scanning s3 buckets
- Refactored output reporting and formatting logic
## [1.31.0]
- New rules: Telegram bot token, OpenWeatherMap, Apify, Groq
- New OpenAI detectors added (@joshlarsen)
- Fixed bug that broke validation when using unnamed group captures
## [1.30.0]
- Fixed validation caching for HTTP validators to include rendered headers so inactive secrets no longer appear active.
- Removed pre-commit installation hook, due to bugs
## [1.29.0]
- Fixed issue when more than 1 named capture group is used in a rule variable
- Added a new Liquid template filter: `b64dec`
- Added custom validator for Coinbase, and a Coinbase rule that uses it
## [1.28.0]
- Added support for scanning Slack
## [1.27.0]
- Added Buildkite rule
- Added support for scanning Docker images via `--docker-image`
## [1.26.0]
- Added rule for ElevenLabs
- Added support for scanning Jira issues via a given JQL (Jira Query Language)
## [1.25.0]
- Fixed GitLab authentication bug
- Added pre-commit and pre-receive installation hooks
- MongoDB validator now skips `mongodb+srv://` URIs and returns a message that validation was skipped
- Fixed noisy Baseten rule
## [1.24.0]
- Now generating DEB and RPM packages
- Now releasing Docker images, and updated README
- Added rule for Scale, Deepgram, AssemblyAI
## [1.23.0]
- Updating GitHub Action to generate Docker image
- Added rules for Diffbot, ai21, baseten
- Fixed supabase rule
- Added 'alg' to JWT validation output
## [1.22.0]
- Added rules for Google Gemini AI, Cohere, Stability.ai, Replicate, Runway, Clarifai
- Upgraded dependencies
## [1.21.0]
- Improved Azure Storage rule
- Added rule to detect TravisCI encrypted values
- Added baseline feature with `--baseline-file` and `--manage-baseline` flags
- Introduced `--exclude` option for skipping paths
- Added tests covering baseline and exclude workflow
- Added validation for JWT tokens that checks `exp` and `nbf` claims
- JWT validation performs OpenID Connect discovery using the `iss` claim and verifies signatures via JWKS
- Removed `--ignore-tests` argument, because the `--exclude` flag provides more granular functionality
- DigitalOcean rule update
- Adafruit rule update
## [1.20.0]
- Removed confirmation prompt when user provides --self-update flag
- Added support for HTTP request bodies in rule validation
- Added new liquid-rs filters: HmacSha1, IsoTimestampNoFracFilter, Replace
- Added rules for mistral, perplexity
- Added validation for Alibaba rule
- Set GIT_TERMINAL_PROMPT=0 when cloning git repos
## [1.19.0]
- JSON output was missing committer name and email
- Fixed Gitlab rule which was incorrectly identifying certain tokens as valid
## [1.18.1]
- Restored --version cli argument
- Added test for the argument
## [1.18.0]
- Added rules for DeepSeek, xAI
- Removed branding
- Added NOTICE file
## [1.17.1]
- Fixed broken sourcegraph rule
- Added test to prevent this and similar issues
## [1.17.0]
- Updated README to give proper attribution to Nosey Parker!
- Added rules for sonarcloud, sonarqube, sourcegraph, shopify, truenas, square, sendgrid, nasa, teamcity, truenas, shopify
- Introduced `--ignore-tests` flag to skip files/dirs whose path resembles tests (`test`, `spec`, `fixture`, `example`, `sample`), reducing noise.
## [1.16.0]
- Fix: HTML detection now requires both HTML content-type and "<html" tag, fixing webhook false negatives
- Removed cargo-nextest installation during test running
- Added rules for 1password, droneci
## [1.15.0]
- Ensuring temp files are cleaned up
- Applying visual style to the update check output
- Fixed bug in --self-update where it was looking for the incorrect binary name on GitHub releases
- Rule cleanup
## [1.14.0]
- Fixed several malformed rules
- Now validating that response_matcher is present in validation section of all rules
## [1.13.0]
- Added new rules for Planetscale, Postman, Openweather, opsgenie, pagerduty, pastebin, paypal, netlify, netrc, newrelic, ngrok, npm, nuget, mandrill, mapbox, microsoft teams, stripe, linkedin, mailchimp, mailgun, linear, line, huggingface, ibm cloud, intercom, ipstack, heroku, gradle, grafana
- Added `--rule-stats` command-line flag that will display rule performance statistics during a scan. Useful when creating or debugging rules
## [1.12.0]
- Added automatic update checks using GitHub releases.
- New `--self-update` flag installs updates when available
- New `--no-update-check` flag disables update checks
- Updated rules
## [1.11.0] 2025-06-21
- Increased default value for number of scanning jobs to improve validation speed
- Fixed issue where some API responses (e.g. GitHub's `/user` endpoint) include required fields like `"name"` beyond the first 512 bytes. Truncating earlier causes `WordMatch` checks to fail even for active credentials. Increased the limit to keep a larger slice of the body while still bounding memory usage.
## [1.10.0] 2025-06-20
- Updated de-dupe fingerprint to include the content of the match
- Updated Makefile
- Adding GitHub Actions
## [1.9.0] 2025-06-16
- Initial public release of Kingfisher

CHANGELOG.md.rej (new file, 36 lines)

@ -0,0 +1,36 @@
@@ -1,33 +1,35 @@
# Changelog
All notable changes to this project will be documented in this file.
## [Unreleased]
- Added `pattern_requirements` for rules. Enables post-regex character-class checks (digits, uppercase, lowercase, specials) to reduce false positives without lookarounds. Provides lightweight, in-memory validation after matches, keeping patterns fast and readable.
- Added an optional `ignore_if_contains` list to `PatternRequirements` within the Rules structure, so matches containing case-insensitive placeholder words are filtered out, with accompanying tests to cover the new behavior.
- Updated many rules with `pattern_requirements`
+- Added checksum comparisons to `pattern_requirements`, new `suffix`, `crc32`, and `base62` Liquid filters, and verbose logging so mismatched checksums are skipped with context rather than reported as findings.
+- Split GitHub token detections into fine-grained/fixed-format variants and enforce checksum validation for modern GitHub token families (PAT, OAuth, App, refresh) while preserving legacy coverage.
- Automatically set `--no-dedup` whenever `--manage-baseline` is supplied so baseline management retains every occurrence of a finding
## [v1.61.0]
- Fixed local filesystem scans to keep `open_path_as_is` enabled when opening Git repositories and only disable it for diff-based scans.
- Created Linux and Windows specific installer script
- Updated diff-focused scanning so `--branch-root-commit` can be provided alongside `--branch`, letting you diff from a chosen commit while targeting a specific branch tip (still defaulting back to the `--branch` ref when the commit is omitted).
- Updated rules
## [v1.60.0]
- Removed the `--bitbucket-username`, `--bitbucket-token`, and `--bitbucket-oauth-token` flags in favour of `KF_BITBUCKET_*` environment variables when authenticating to Bitbucket.
- Added provider-specific `kingfisher scan` subcommands (for example `kingfisher scan github …`) that translate into the legacy flags under the hood. The new layout keeps backwards compatibility while removing the wall of provider options from `kingfisher scan --help`.
- Updated the README so every provider example (GitHub, GitLab, Bitbucket, Azure Repos, Gitea, Hugging Face, Slack, Jira, Confluence, S3, GCS, Docker) uses the new subcommand style.
- Legacy provider flags (for example `--github-user`, `--gitlab-group`, `--bitbucket-workspace`, `--s3-bucket`) still work but now emit a deprecation warning to encourage migration to the new `kingfisher scan <provider>` flow.
- Kept the direct `kingfisher scan /path/to/dir` flow for local filesystem / local git repo scans while adding a `--list-only` switch to each provider subcommand so repository enumeration no longer requires the standalone `github repos`, `gitlab repos`, etc. commands.
- Removed the legacy top-level provider commands (`kingfisher github`, `kingfisher gitlab`, `kingfisher gitea`, `kingfisher bitbucket`, `kingfisher azure`, `kingfisher huggingface`) now that enumeration lives under `kingfisher scan <provider> --list-only`.
## [v1.59.0]
- Fixed `kingfisher scan github …` (and other provider-specific subcommands) so they no longer demand placeholder path arguments before the CLI accepts the request.
- Fixed `kingfisher scan` so that providing `--branch` without `--since-commit` now diffs the branch against the empty tree and scans every commit reachable from that branch.
- Added rules for meraki, duffel, finnhub, frameio, freshbooks, gitter, infracost, launchdarkly, lob, maxmind, messagebird, nytimes, prefect, scalingo, sendinblue, sentry, shippo, twitch, typeform
- ## [v1.58.0]
- Added first-class Hugging Face scanning support, including CLI enumeration, token authentication, and integration with remote scans.
- Condensed GitError formatting to report the exit status and the first informative lines from stdout/stderr, producing concise git clone failure logs.
- Added support for scanning Google Cloud Storage buckets via `--gcs-bucket`, including optional prefixes and service-account authentication.


@ -75,9 +75,9 @@ include_dir = { version = "0.7", features = ["glob"] }
strum = { version = "0.26", features = ["derive"] }
sysinfo = "0.31.4"
reqwest = { version = "0.12", default-features = false, features = [
"json",
"gzip",
"brotli",
"json",
"gzip",
"brotli",
"deflate",
"stream",
"rustls-tls",
@ -196,6 +196,7 @@ gcloud-storage = { version = "1.1.1", default-features = false, features = [
"auth",
] }
tokei = "12.1.2"
crc32fast = "1.4.0"
[target.'cfg(not(windows))'.dependencies]
sha1 = { version = "0.10.6", features = ["asm"] }


@ -333,10 +333,13 @@ is independent:
- `special_chars` lets you override the set of characters counted as "special" when `min_special_chars` is used.
- `ignore_if_contains` lists case-insensitive substrings that should cause a match to be discarded (for example, to drop
`test`, `demo`, or `localhost` values).
- `checksum` lets you compare an extracted portion of the match against a Liquid-rendered expectation. Provide `actual.template`
and `expected` Liquid snippets (with access to `{{ MATCH }}`, `{{ FULL_MATCH }}`, and any named capture as both its original
case and uppercase alias) and Kingfisher will skip the finding when the rendered values differ. Optional keys such as
`requires_capture` and `skip_if_missing` help you guard against legacy formats while onboarding the checksum-aware variant.
When a match is skipped because of `ignore_if_contains`, Kingfisher logs the event at the `DEBUG` level alongside the rule that
was evaluated. If you need to keep those matches for a particular scan, pass `--no-ignore-if-contains` to `kingfisher scan` to
disable the substring filter without editing any rule files.
When a match is skipped because of `ignore_if_contains` or a checksum mismatch, Kingfisher logs the event at the `DEBUG` level alongside the rule that was evaluated. If you need to keep those matches for a particular scan, pass `--no-ignore-if-contains` to `kingfisher scan` to disable the substring filter without editing any rule files. Verbose mode (`-v`) will also show you the
checksum mismatch lengths so you can confirm why a finding was suppressed.
Once you've done that, you can provide your custom rules (defined in a YAML file) and provide it to Kingfisher at runtime --- no recompiling required!

README.md.orig (1336 lines changed; file diff suppressed because it is too large)

README.md.rej (new file, 68 lines)

@ -0,0 +1,68 @@
@@ -311,54 +311,63 @@
| **Dev & CI/CD** | GitHub/GitLab tokens, CircleCI, TravisCI, TeamCity, Docker Hub, npm, PyPI, and more |
| **Messaging & Comms** | Slack, Discord, Microsoft Teams, Twilio, Mailgun, SendGrid, Mailchimp, and more |
| **Databases & Data Ops** | MongoDB Atlas, PlanetScale, Postgres DSNs, Grafana Cloud, Datadog, Dynatrace, and more |
| **Payments & Billing** | Stripe, PayPal, Square, GoCardless, and more |
| **Security & DevSecOps** | Snyk, Dependency-Track, CodeClimate, Codacy, OpsGenie, PagerDuty, and more |
| **Misc. SaaS & Tools** | 1Password, Adobe, Atlassian/Jira, Asana, Netlify, Baremetrics, and more |
## 📝 Write Custom Rules!
Kingfisher ships with hundreds of rules with HTTP and service-specific validation checks (AWS, Azure, GCP, etc.) to confirm if a detected string is a live credential.
However, you may want to add your own custom rules, or modify a detection to better suit your needs / environment.
First, review [docs/RULES.md](/docs/RULES.md) to learn how to create custom Kingfisher rules.
### Pattern requirements and placeholder filtering
Every rule can declare optional `pattern_requirements` to enforce additional character checks after a regex matches. Each field
is independent:
- `min_digits`, `min_uppercase`, `min_lowercase`, and `min_special_chars` enforce complexity thresholds.
- `special_chars` lets you override the set of characters counted as "special" when `min_special_chars` is used.
- `ignore_if_contains` lists case-insensitive substrings that should cause a match to be discarded (for example, to drop
`test`, `demo`, or `localhost` values). Kingfisher still accepts the legacy `exclude_words` key as an alias when loading
existing rule files.
-
-When a match is skipped because of `ignore_if_contains`, Kingfisher logs the event at the `DEBUG` level alongside the rule that
-was evaluated. If you need to keep those matches for a particular scan, pass `--no-ignore-if-contains` to `kingfisher scan` to
-disable the substring filter without editing any rule files.
+- `checksum` lets you compare an extracted portion of the match against a Liquid-rendered expectation. Provide `actual.template`
+ and `expected` Liquid snippets (with access to `{{ MATCH }}`, `{{ FULL_MATCH }}`, and any named capture as both its original
+ case and uppercase alias) and Kingfisher will skip the finding when the rendered values differ. Optional keys such as
+ `requires_capture` and `skip_if_missing` help you guard against legacy formats while onboarding the checksum-aware variant.
+
+When a match is skipped because of `ignore_if_contains` or a checksum mismatch, Kingfisher logs the event at the `DEBUG` level
+alongside the rule that was evaluated. If you need to keep those matches for a particular scan, pass `--no-ignore-if-contains`
+to `kingfisher scan` to disable the substring filter without editing any rule files. Verbose mode (`-v`) will also show you the
+checksum mismatch lengths so you can confirm why a finding was suppressed.
+
+To support checksum workflows, Kingfisher now includes Liquid helpers such as `suffix` (to slice characters from a match),
+`crc32` (to hash the body), and `base62` (to encode and pad the checksum). You can mix these filters with your own templates to
+mirror provider-specific checksum implementations.
Once you've done that, you can provide your custom rules (defined in a YAML file) and provide it to Kingfisher at runtime --- no recompiling required!
# 🎉 Usage
## Basic Examples
> **Note**  `kingfisher scan` detects whether the input is a Git repository or a plain directory, no extra flags required.
### Scan with secret validation
```bash
kingfisher scan /path/to/code
## NOTE: This path can refer to:
# 1. a local git repo
# 2. a directory with many git repos
# 3. or just a folder with files and subdirectories
## To explicitly prevent scanning git commit history add:
# `--git-history=none`
```
### Scan a directory containing multiple Git repositories
```bash


@ -40,13 +40,18 @@ rules:
pattern: |
(?xi)
\b
(
ghp_
[A-Z0-9]{36}
(
ghp_(?P<body>[A-Z0-9]{30})(?P<checksum>[A-Z0-9]{6})
)
pattern_requirements:
min_digits: 2
min_lowercase: 2
checksum:
actual:
template: "{{ MATCH | suffix: 6 }}"
requires_capture: checksum
expected: "{{ BODY | crc32 | base62: 6 }}"
skip_if_missing: true
min_entropy: 3.5
examples:
- "GITHUB_KEY=ghp_XIxB7KMNdAr3zqWtQqhE94qglHqOzn1D1stg"
@ -82,11 +87,16 @@ rules:
(?xi)
\b
(
gho_
[A-Z0-9]{36}
gho_(?P<body>[A-Z0-9]{30})(?P<checksum>[A-Z0-9]{6})
)
pattern_requirements:
min_digits: 2
min_digits: 2
checksum:
actual:
template: "{{ MATCH | suffix: 6 }}"
requires_capture: checksum
expected: "{{ BODY | crc32 | base62: 6 }}"
skip_if_missing: true
min_entropy: 3.5
confidence: medium
examples:
@ -119,7 +129,7 @@ rules:
pattern: |
(?xi)
(
ghu_[A-Z0-9]{36}
ghu_(?P<body>[A-Z0-9]{30})(?P<checksum>[A-Z0-9]{6})
)
examples:
- ' "token": "ghu_16C7e42F292c69C2E7C10c838347Ae178B4a",'
@ -153,7 +163,7 @@ rules:
pattern: |
(?xi)
(
ghs_[A-Z0-9]{36}
ghs_(?P<body>[A-Z0-9]{30})(?P<checksum>[A-Z0-9]{6})
)
examples:
- ' "token": "ghs_16C7e42F292c69C2E7C10c838347Ae178B4a",'
@ -187,7 +197,7 @@ rules:
pattern: |
(?xi)
(
ghr_[A-Z0-9]{76}
ghr_(?P<body>[A-Z0-9]{30})(?P<checksum>[A-Z0-9]{6})
)
examples:
- ' "refresh_token": "ghr_1B4a2e77838347a7E420ce178F2E7c6912E169246c3CE1ccbF66C46812d16D5B1A9Dc86A1498",'


@ -117,12 +117,15 @@ Below is the complete list of Liquid filters available in Kingfisher, along with
| --------------------- | -------------------------------------------- | -------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------- |
| `b64enc` | | Base64-encodes the input using the standard alphabet. | `{{ TOKEN \| b64enc }}` |
| `b64url_enc` | | URL-safe Base64 (no padding). Useful for JWT headers & payloads. | `{{ TOKEN \| b64url_enc }}` |
| `b64dec` | | Decodes a Base64 string. | `{{ "aGVsbG8=" \| b64dec }}` |
| `b64dec` | | Decodes a Base64 string. | `{{ "aGVsbG8=" \| b64dec }}` |
| `sha256` | | Computes the SHA-256 hex digest of the input. | `{{ TOKEN \| sha256 }}` |
| `crc32` | | Computes the CRC32 checksum of the input and returns a decimal value. | `{{ TOKEN \| crc32 }}` |
| `hmac_sha1` | `key` (string) | Computes HMAC-SHA1 over the input, returns Base64-encoded result. | `{{ TOKEN \| hmac_sha1: "secret-key" }}` |
| `hmac_sha256` | `key` (string) | Computes HMAC-SHA256 over the input, returns Base64-encoded result. | `{{ TOKEN \| hmac_sha256: "secret-key" }}` |
| `hmac_sha384` | `key` (string) | Computes HMAC-SHA384 over the input, returns Base64-encoded result. | `{{ TOKEN \| hmac_sha384: "secret-key" }}` |
| `random_string` | `len` (integer, optional) | Generates a cryptographically-secure random alphanumeric string of the specified length (default: 32). | `{{ "" \| random_string: 16 }}` |
| `suffix` | `len` (integer, optional) | Returns the last `len` characters from the string (default: full). | `{{ TOKEN \| suffix: 6 }}` |
| `base62` | `width` (integer, optional) | Encodes the input number as Base62, left-padding with zeros as needed. | `{{ TOKEN \| crc32 \| base62: 6 }}` |
| `url_encode` | | Percent-encodes the input according to RFC 3986. | `{{ TOKEN \| url_encode }}` |
| `json_escape` | | Escapes special characters so a string can be safely injected into JSON contexts. | `{{ TOKEN \| json_escape }}` |
| `unix_timestamp` | | Returns the current Unix epoch time in seconds (UTC). | `{{ "" \| unix_timestamp }}` |
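The three checksum-related filters in the table (`suffix`, `crc32`, `base62`) can be approximated outside Kingfisher to sanity-check a template before putting it in a rule. The following is a minimal Python sketch, not Kingfisher's Rust implementation. The Base62 alphabet order (digits, then uppercase, then lowercase) and the function names are assumptions made for illustration.

```python
import zlib
from typing import Optional

# ASSUMPTION: digits, then uppercase, then lowercase. Kingfisher's `base62`
# filter may use a different alphabet order.
BASE62_ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"


def suffix(value: str, length: Optional[int] = None) -> str:
    """Return the last `length` characters (the whole string if omitted)."""
    if length is None or length >= len(value):
        return value
    return value[-length:] if length > 0 else ""


def crc32(value: str) -> int:
    """Return the CRC32 of the UTF-8 bytes as an unsigned decimal integer."""
    return zlib.crc32(value.encode("utf-8")) & 0xFFFFFFFF


def base62(number: int, width: int = 0) -> str:
    """Encode a non-negative integer as Base62, left-padded with zeros."""
    if number < 0:
        raise ValueError("base62 expects a non-negative integer")
    digits = []
    while number > 0:
        number, remainder = divmod(number, 62)
        digits.append(BASE62_ALPHABET[remainder])
    encoded = "".join(reversed(digits)) or "0"
    return encoded.rjust(width, "0")


if __name__ == "__main__":
    token = "example-body"  # hypothetical input, not a real credential
    # Roughly equivalent to: {{ TOKEN | crc32 | base62: 6 }}
    print(base62(crc32(token), 6))
    # Roughly equivalent to: {{ TOKEN | suffix: 6 }}
    print(suffix(token, 6))
```

A 32-bit CRC never needs more than six Base62 digits, which is why a width of 6 is enough to hold any `crc32` result.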
@ -269,13 +272,21 @@ pattern_requirements:
ignore_if_contains: # Optional: reject matches containing any of these (case-insensitive)
- test
- demo
checksum: # Optional: compare rendered values to drop mismatched formats
actual:
template: "{{ MATCH | suffix: 6 }}" # Liquid template for the observed checksum
requires_capture: checksum # (optional) skip unless this capture is present
expected: "{{ BODY | crc32 | base62: 6 }}" # Liquid template to render the expected checksum
skip_if_missing: true # (optional) treat missing captures as legacy tokens
```
All fields are optional. If `special_chars` is not specified, the default set includes: `!@#$%^&*()_+-=[]{}|;:'",.<>?/\`~`
`ignore_if_contains` performs a case-insensitive substring check. If any entry (after trimming whitespace) appears within the match, the match is discarded. This is helpful for dropping known dummy tokens such as "test" or "demo" that otherwise satisfy the regex.
When this filter removes a match it is logged at the `DEBUG` level so you can see exactly which substring caused the skip. If you need to keep every match even when one of these substrings appears, pass `--no-ignore-if-contains` to `kingfisher scan`. The flag disables this post-processing step without changing the rule definitions.
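The substring check described above can be sketched in a few lines of Rust. This is a minimal illustration, not Kingfisher's implementation: the helper name `first_ignored_term` is hypothetical, and lowercasing both sides with `to_lowercase` is an assumption about how the case-insensitive comparison is done.

```rust
/// Hypothetical helper: returns the first ignore term (after trimming) that
/// occurs in `candidate`, compared case-insensitively. Blank entries are skipped.
fn first_ignored_term<'a>(candidate: &str, terms: &[&'a str]) -> Option<&'a str> {
    // Assumption: case-insensitivity is modeled by lowercasing both sides.
    let haystack = candidate.to_lowercase();
    terms
        .iter()
        .copied()
        .map(str::trim)
        .filter(|term| !term.is_empty())
        .find(|term| haystack.contains(&term.to_lowercase()))
}

fn main() {
    let terms = ["test", " Demo ", ""];
    // A placeholder word anywhere in the match causes the match to be dropped.
    assert_eq!(first_ignored_term("MyTestToken", &terms), Some("test"));
    assert_eq!(first_ignored_term("example-demo-value", &terms), Some("Demo"));
    // Empty or whitespace-only entries never match anything.
    assert_eq!(first_ignored_term("example-value", &terms), None);
}
```

Returning the matching term, rather than a plain boolean, is what makes it possible to log which substring caused the skip.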
The optional `checksum` block renders two Liquid templates against the match and compares the results: `actual.template` extracts the checksum observed in the match, and `expected` computes what the checksum should be. Both templates have access to `{{ MATCH }}`, `{{ FULL_MATCH }}`, and every named capture in two forms: the original capture name and its uppercase alias (e.g. `{{ body }}` and `{{ BODY }}`). Use helper filters like `suffix`, `crc32`, and `base62` to mirror provider-specific checksum pipelines. If the rendered values differ, Kingfisher skips the finding and logs the reason, including checksum lengths, at the `DEBUG` level. If the capture named in `requires_capture` is missing, the finding is also skipped unless `skip_if_missing` is `true`, in which case the match is treated as a legacy token and kept.
When any of these filters removes a match, the skip is logged at the `DEBUG` level so you can see exactly why it occurred. To keep every match even when an `ignore_if_contains` substring appears, pass `--no-ignore-if-contains` to `kingfisher scan`. The flag disables that substring check without changing the rule definitions.
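To make the `{{ MATCH | suffix: 6 }}` versus `{{ BODY | crc32 | base62: 6 }}` comparison concrete, here is a standalone Rust sketch of the same pipeline. It is an illustration under stated assumptions, not the project's code: the function names are hypothetical, the CRC-32 is a plain bitwise implementation of the common IEEE variant (the project uses the `crc32fast` crate), and the Base62 alphabet is taken from the `encode_base62` helper in this commit.

```rust
// Alphabet copied from the commit's `encode_base62` helper: digits, upper, lower.
const ALPHABET: &[u8; 62] = b"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

/// Bitwise CRC-32 (IEEE, reflected polynomial 0xEDB88320). Slow but dependency-free.
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            // All-ones mask when the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Base62-encode `value`, left-padding with '0' to at least `width` characters.
fn base62(mut value: u64, width: usize) -> String {
    let mut digits = Vec::new();
    if value == 0 {
        digits.push(b'0');
    }
    while value > 0 {
        digits.push(ALPHABET[(value % 62) as usize]);
        value /= 62;
    }
    while digits.len() < width {
        digits.push(b'0');
    }
    digits.reverse();
    String::from_utf8(digits).expect("alphabet is ASCII")
}

/// Last `len` characters of `text` (the whole string if it is shorter).
fn suffix(text: &str, len: usize) -> String {
    let chars: Vec<char> = text.chars().collect();
    let keep = len.min(chars.len());
    chars[chars.len() - keep..].iter().collect()
}

/// Hypothetical check: does the trailing 6-character checksum match the body?
fn checksum_matches(full_match: &str, body: &str) -> bool {
    suffix(full_match, 6) == base62(u64::from(crc32(body.as_bytes())), 6)
}

fn main() {
    // Values below agree with the filter tests in this commit.
    assert_eq!(crc32(b"hello"), 907_060_870);
    assert_eq!(base62(907_060_870, 6), "0zNvy2");
    assert!(checksum_matches("hello0zNvy2", "hello"));
    assert!(!checksum_matches("hello0zNvy3", "hello"));
}
```

Comparing the two rendered strings, rather than decoding the observed checksum, keeps the rule definition declarative: a provider with a different pipeline only needs different templates.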
### Example: Secure API Key


@ -147,12 +147,23 @@ impl FindingsStore {
1. Optional duplicate filter (unchanged)
*/
if dedup {
// Prefer the full unnamed match (index 0). Fall back to a named TOKEN capture
// before using whatever capture is available.
let snippet = m
.groups
.captures
.get(1)
.or_else(|| m.groups.captures.get(0))
.map_or("", |c| c.value);
.iter()
.find(|c| c.name.is_none() && c.match_number == 0)
.map(|c| c.value)
.or_else(|| {
m.groups
.captures
.iter()
.find(|c| matches!(c.name.as_deref(), Some("TOKEN")))
.map(|c| c.value)
})
.or_else(|| m.groups.captures.get(0).map(|c| c.value))
.unwrap_or("");
let origin_kind = match origin.first() {
Origin::GitRepo(_) => "git",


@ -1,6 +1,7 @@
//! Collection of small Liquid filters that make HTTP validations & API-signing templates easy
use base64::{engine::general_purpose, Engine};
use crc32fast::Hasher;
use hmac::{Hmac, Mac};
use liquid_core::{
Display_filter, Error as LiquidError, Expression, Filter, FilterParameters, FilterReflection,
@ -223,22 +224,90 @@ impl Filter for HmacSha384Filter {
}
// ── random_string ────────────────────────────────
static_filter!(
/// Random alphanumeric string (default 32 chars).
RandomStringFilter { len: Option<usize> },
"random_string",
|s: &RandomStringFilter, input: &dyn ValueView| -> String {
let n = s.len // explicit argument?
.or_else(|| input.to_kstr().parse().ok()) // else parse input
.unwrap_or(32); // else default
#[derive(Debug, FilterParameters)]
struct RandomStringArgs {
#[parameter(description = "Desired output length", arg_type = "integer")]
len: Option<Expression>,
}
rand::rng()
.sample_iter(&Alphanumeric)
.take(n)
.map(char::from)
.collect()
#[derive(Clone, ParseFilter, FilterReflection, Default)]
#[filter(
name = "random_string",
description = "Random alphanumeric string (default 32 chars).",
parameters(RandomStringArgs),
parsed(RandomString)
)]
pub struct RandomStringFilter;
#[derive(Debug, FromFilterParameters, Display_filter)]
#[name = "random_string"]
struct RandomString {
#[parameters]
args: RandomStringArgs,
}
impl Filter for RandomString {
fn evaluate(&self, input: &dyn ValueView, runtime: &dyn Runtime) -> Result<Value> {
let args = self.args.evaluate(runtime)?;
let n = args
.len
.and_then(|value| {
let scalar = Value::scalar(value);
value_to_usize(&scalar)
})
.or_else(|| input.to_kstr().parse().ok())
.unwrap_or(32);
let value: String =
rand::rng().sample_iter(&Alphanumeric).take(n).map(char::from).collect();
Ok(Value::scalar(value))
}
);
}
#[derive(Debug, FilterParameters)]
struct SuffixArgs {
#[parameter(description = "Number of trailing characters to keep", arg_type = "integer")]
len: Option<Expression>,
}
#[derive(Clone, ParseFilter, FilterReflection, Default)]
#[filter(
name = "suffix",
description = "Return the suffix (last N characters) of the provided string.",
parameters(SuffixArgs),
parsed(Suffix)
)]
pub struct SuffixFilter;
#[derive(Debug, FromFilterParameters, Display_filter)]
#[name = "suffix"]
struct Suffix {
#[parameters]
args: SuffixArgs,
}
impl Filter for Suffix {
fn evaluate(&self, input: &dyn ValueView, runtime: &dyn Runtime) -> Result<Value> {
let args = self.args.evaluate(runtime)?;
let text = input.to_kstr();
let requested = args
.len
.and_then(|value| {
let scalar = Value::scalar(value);
value_to_usize(&scalar)
})
.unwrap_or_else(|| text.len());
if requested == 0 {
return Ok(Value::scalar(String::new()));
}
let mut chars: Vec<char> = text.chars().collect();
let keep = requested.min(chars.len());
chars.drain(0..chars.len().saturating_sub(keep));
Ok(Value::scalar(chars.into_iter().collect::<String>()))
}
}
#[derive(Debug, Clone, Default, FilterReflection, ParseFilter)]
#[filter(
@ -307,6 +376,111 @@ static_filter!(
}
);
static_filter!(
/// Compute the CRC32 of the input and return it as a decimal number.
Crc32Filter,
"crc32",
|input: &dyn ValueView| -> i64 {
let mut hasher = Hasher::new();
hasher.update(input.to_kstr().as_bytes());
i64::from(hasher.finalize())
}
);
#[derive(Debug, FilterParameters)]
struct Base62Args {
#[parameter(
description = "Pad the encoded value to at least this width",
arg_type = "integer"
)]
width: Option<Expression>,
}
#[derive(Clone, ParseFilter, FilterReflection, Default)]
#[filter(
name = "base62",
description = "Encode the provided integer value using Base62.",
parameters(Base62Args),
parsed(Base62)
)]
pub struct Base62Filter;
#[derive(Debug, FromFilterParameters, Display_filter)]
#[name = "base62"]
struct Base62 {
#[parameters]
args: Base62Args,
}
impl Filter for Base62 {
fn evaluate(&self, input: &dyn ValueView, runtime: &dyn Runtime) -> Result<Value> {
let args = self.args.evaluate(runtime)?;
let value = input
.as_scalar()
.and_then(|scalar| {
if let Some(int) = scalar.to_integer() {
Some(if int < 0 { 0 } else { int as u64 })
} else if let Some(float) = scalar.to_float() {
Some(if float.is_sign_negative() { 0 } else { float.floor() as u64 })
} else if let Some(boolean) = scalar.to_bool() {
Some(u64::from(boolean))
} else {
scalar.to_kstr().to_string().parse::<u64>().ok()
}
})
.or_else(|| input.to_kstr().to_string().parse::<u64>().ok())
.unwrap_or(0);
let mut encoded = encode_base62(value);
if let Some(width) = args.width.and_then(|value| {
let scalar = Value::scalar(value);
value_to_usize(&scalar)
}) {
if encoded.len() < width {
let mut padded = String::with_capacity(width);
for _ in 0..(width - encoded.len()) {
padded.push('0');
}
padded.push_str(&encoded);
encoded = padded;
}
}
Ok(Value::scalar(encoded))
}
}
fn encode_base62(mut value: u64) -> String {
const ALPHABET: &[u8; 62] = b"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
if value == 0 {
return "0".to_string();
}
let mut buf = Vec::new();
while value > 0 {
let rem = (value % 62) as usize;
buf.push(ALPHABET[rem] as char);
value /= 62;
}
buf.iter().rev().collect()
}
fn value_to_usize(value: &Value) -> Option<usize> {
let view = value.as_view();
view.as_scalar()
.and_then(|scalar| {
if let Some(int) = scalar.to_integer() {
Some(if int < 0 { 0 } else { int as usize })
} else if let Some(float) = scalar.to_float() {
Some(if float.is_sign_negative() { 0 } else { float.floor() as usize })
} else if let Some(boolean) = scalar.to_bool() {
Some(if boolean { 1 } else { 0 })
} else {
scalar.to_kstr().parse::<usize>().ok()
}
})
.or_else(|| view.to_kstr().parse::<usize>().ok())
}
// {{ value | b64url_enc }} URL-safe base64 w/o padding
static_filter!(
/// Base64 URL-safe (no = padding).
@ -415,6 +589,9 @@ pub fn register_all(builder: liquid::ParserBuilder) -> liquid::ParserBuilder {
.filter(B64EncFilter::default())
.filter(B64DecFilter::default())
.filter(RandomStringFilter::default())
.filter(SuffixFilter::default())
.filter(Crc32Filter::default())
.filter(Base62Filter::default())
.filter(HmacSha256::default())
.filter(HmacSha1::default())
.filter(HmacSha384::default())
@ -461,6 +638,20 @@ mod tests {
assert_eq!(render(r#"{{ "hello" | sha256 }}"#), expect);
}
#[test]
fn suffix_filter() {
assert_eq!(render(r#"{{ "abcdef" | suffix: 3 }}"#), "def");
assert_eq!(render(r#"{{ "short" | suffix: 10 }}"#), "short");
assert_eq!(render(r#"{{ "value" | suffix: 0 }}"#), "");
}
#[test]
fn crc32_and_base62_filters() {
assert_eq!(render(r#"{{ "hello" | crc32 }}"#), "907060870");
assert_eq!(render(r#"{{ "hello" | crc32 | base62 }}"#), "zNvy2");
assert_eq!(render(r#"{{ "hello" | crc32 | base62: 6 }}"#), "0zNvy2");
}
#[test]
fn hmac_sha1_filter() {
let key = b"key1";


@ -5,27 +5,27 @@
// * Fallback - system allocator (`system-alloc` feature)
// ────────────────────────────────────────────────────────────
// --- jemalloc (opt-in) ---
#[cfg(feature = "use-jemalloc")]
#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;
// // --- jemalloc (opt-in) ---
// #[cfg(feature = "use-jemalloc")]
// #[global_allocator]
// static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;
// --- mimalloc (default) ---
#[cfg(all(not(feature = "use-jemalloc"), not(feature = "system-alloc")))]
#[global_allocator]
static GLOBAL: mimalloc::MiMalloc = mimalloc::MiMalloc;
// --- system allocator (explicit opt-out) ---
#[cfg(feature = "system-alloc")]
use std::alloc::System;
#[cfg(feature = "system-alloc")]
#[global_allocator]
static GLOBAL: System = System;
// // --- mimalloc (default) ---
// #[cfg(all(not(feature = "use-jemalloc"), not(feature = "system-alloc")))]
// #[global_allocator]
// static GLOBAL: mimalloc::MiMalloc = mimalloc::MiMalloc;
// // --- system allocator (explicit opt-out) ---
// #[cfg(feature = "system-alloc")]
// use std::alloc::System;
// #[cfg(feature = "system-alloc")]
// #[global_allocator]
// static GLOBAL: System = System;
use std::alloc::System;
#[global_allocator]
static GLOBAL: System = System;
use std::{
io::{IsTerminal, Read},
sync::{Arc, Mutex},


@ -29,7 +29,7 @@ use crate::{
parser,
parser::{Checker, Language},
rule_profiling::{ConcurrentRuleProfiler, RuleStats, RuleTimer},
rules::rule::{PatternValidationResult, Rule},
rules::rule::{PatternRequirementContext, PatternValidationResult, Rule},
rules_database::RulesDatabase,
safe_list::{is_safe_match, is_user_match},
scanner_pool::ScannerPool,
@ -614,7 +614,12 @@ fn filter_match<'b>(
// Check character requirements if specified
if let Some(char_reqs) = rule.pattern_requirements() {
match char_reqs.validate(mi_bytes, respect_ignore_if_contains) {
let context = PatternRequirementContext {
regex: re,
captures: &captures,
full_match: full_bytes,
};
match char_reqs.validate(mi_bytes, Some(context), respect_ignore_if_contains) {
PatternValidationResult::Passed => {}
PatternValidationResult::Failed => {
debug!(
@ -623,6 +628,15 @@ fn filter_match<'b>(
);
continue;
}
PatternValidationResult::FailedChecksum { actual_len, expected_len } => {
debug!(
"Skipping match for rule {} due to checksum mismatch (actual_len={}, expected_len={})",
rule.id(),
actual_len,
expected_len
);
continue;
}
PatternValidationResult::IgnoredBySubstring { matched_term } => {
debug!(
"Skipping match for rule {} because it contains ignored term {matched_term}",
@ -790,40 +804,31 @@ impl SerializableCaptures {
redact: bool,
) -> Self {
let mut serialized_captures: SmallVec<[SerializableCapture; 2]> = SmallVec::new();
// Process named captures
for name in re.capture_names().flatten() {
if let Some(capture) = captures.name(name) {
let value = if redact {
redact_value(&String::from_utf8_lossy(capture.as_bytes()))
} else {
String::from_utf8_lossy(capture.as_bytes()).to_string()
};
serialized_captures.push(SerializableCapture {
name: Some(name.to_string()),
match_number: -1,
start: capture.start(),
end: capture.end(),
value: intern(&value),
});
}
}
// Process unnamed captures (numbered groups)
let capture_names: SmallVec<[Option<String>; 4]> =
re.capture_names().map(|name| name.map(str::to_string)).collect();
for i in 0..captures.len() {
if let Some(capture) = captures.get(i) {
if let Some(cap) = captures.get(i) {
let value = if redact {
redact_value(&String::from_utf8_lossy(capture.as_bytes()))
redact_value(&String::from_utf8_lossy(cap.as_bytes()))
} else {
String::from_utf8_lossy(capture.as_bytes()).to_string()
String::from_utf8_lossy(cap.as_bytes()).to_string()
};
let interned = intern(&value);
let name = capture_names.get(i).and_then(|opt| opt.as_ref()).cloned();
serialized_captures.push(SerializableCapture {
name: None,
name,
match_number: i32::try_from(i).unwrap_or(0),
start: capture.start(),
end: capture.end(),
value: intern(&value),
start: cap.start(),
end: cap.end(),
value: interned,
});
}
}
SerializableCaptures { captures: serialized_captures }
}
}
@ -1182,6 +1187,7 @@ mod test {
min_special_chars: None,
special_chars: None,
ignore_if_contains: Some(vec!["TEST".to_string()]),
checksum: None,
}),
})];
@ -1244,6 +1250,7 @@ mod test {
min_special_chars: None,
special_chars: None,
ignore_if_contains: Some(vec!["TEST".to_string()]),
checksum: None,
}),
})];
@ -1500,4 +1507,24 @@ line2
Ok(())
}
#[test]
fn serializes_captures_in_numeric_order() {
let re =
Regex::new(r"(?xi)\b(ghp_(?P<body>[A-Z0-9]{3})(?P<checksum>[A-Z0-9]{2}))").unwrap();
let caps = re.captures(b"ghp_ABC12").expect("expected captures");
let serialized = SerializableCaptures::from_captures(&caps, b"", &re, false);
let entries: Vec<(Option<&str>, i32, &str)> = serialized
.captures
.iter()
.map(|cap| (cap.name.as_deref(), cap.match_number, cap.value))
.collect();
assert_eq!(entries.len(), 4);
assert_eq!(entries[0], (None, 0, "ghp_ABC12"));
assert_eq!(entries[1], (None, 1, "ghp_ABC12"));
assert_eq!(entries[2], (Some("body"), 2, "ABC"));
assert_eq!(entries[3], (Some("checksum"), 3, "12"));
}
}


@ -10,6 +10,10 @@ use std::{
use anyhow::{anyhow, Context, Result};
use lazy_static::lazy_static;
use liquid::{
model::{KString, Value},
object, Parser, ParserBuilder,
};
use regex::Regex;
use schemars::{
gen::SchemaGenerator,
@ -17,9 +21,12 @@ use schemars::{
JsonSchema,
};
use serde::{Deserialize, Serialize};
use tracing::debug;
// use sha1::{Digest, Sha1};
use xxhash_rust::xxh3::xxh3_64;
use crate::liquid_filters;
/// Returns false as the default value.
fn default_false() -> bool {
false
@ -73,6 +80,42 @@ pub struct PatternRequirements {
/// Words that should cause the match to be excluded when present (case-insensitive)
#[serde(default)]
pub ignore_if_contains: Option<Vec<String>>,
/// Optional checksum validation configuration.
#[serde(default)]
pub checksum: Option<ChecksumRequirement>,
}
/// Defines a checksum validation strategy for a matched pattern.
#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, PartialOrd, Ord, Hash, Clone)]
pub struct ChecksumRequirement {
/// Template describing how to extract the checksum from the match.
pub actual: ChecksumActual,
/// Template describing how to compute the expected checksum.
pub expected: String,
/// When true, checksum evaluation is skipped if the required capture is missing.
#[serde(default)]
pub skip_if_missing: bool,
}
/// Describes how to extract the checksum value from a match.
#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, PartialOrd, Ord, Hash, Clone)]
pub struct ChecksumActual {
/// Liquid template used to compute the checksum from the match.
pub template: String,
/// Optional capture group that must be present before evaluating the checksum.
#[serde(default)]
pub requires_capture: Option<String>,
}
/// Contextual information available when validating pattern requirements.
#[derive(Clone, Copy)]
pub struct PatternRequirementContext<'a> {
/// Compiled regex associated with the rule.
pub regex: &'a regex::bytes::Regex,
/// Captures for the current match.
pub captures: &'a regex::bytes::Captures<'a>,
/// Full bytes matched by the rule (capture group 0).
pub full_match: &'a [u8],
}
impl PatternRequirements {
@ -85,6 +128,7 @@ impl PatternRequirements {
pub fn validate(
&self,
input: &[u8],
context: Option<PatternRequirementContext<'_>>,
respect_ignore_if_contains: bool,
) -> PatternValidationResult {
// Convert to string (lossy for non-UTF8)
@ -151,10 +195,84 @@ impl PatternRequirements {
}
}
if let Some(checksum) = &self.checksum {
let Some(ctx) = context else {
return if checksum.skip_if_missing {
PatternValidationResult::Passed
} else {
PatternValidationResult::Failed
};
};
if let Some(required) = checksum.actual.requires_capture.as_deref() {
if ctx.captures.name(required).is_none() {
return if checksum.skip_if_missing {
PatternValidationResult::Passed
} else {
PatternValidationResult::Failed
};
}
}
let mut globals = object!({
"MATCH": s.to_string(),
"FULL_MATCH": String::from_utf8_lossy(ctx.full_match).to_string(),
});
for name in ctx.regex.capture_names().flatten() {
if let Some(capture) = ctx.captures.name(name) {
let value = String::from_utf8_lossy(capture.as_bytes()).to_string();
globals.insert(KString::from_ref(name), Value::scalar(value.clone()));
globals.insert(
KString::from_string(name.to_ascii_uppercase()),
Value::scalar(value),
);
}
}
let actual =
match render_pattern_requirement_template(&checksum.actual.template, &globals) {
Ok(rendered) => rendered,
Err(err) => {
debug!(
"Failed to render checksum actual template '{}': {}",
checksum.actual.template, err
);
return PatternValidationResult::Failed;
}
};
let expected = match render_pattern_requirement_template(&checksum.expected, &globals) {
Ok(rendered) => rendered,
Err(err) => {
debug!(
"Failed to render checksum expected template '{}': {}",
checksum.expected, err
);
return PatternValidationResult::Failed;
}
};
if actual != expected {
let actual_len = actual.chars().count();
let expected_len = expected.chars().count();
return PatternValidationResult::FailedChecksum { actual_len, expected_len };
}
}
PatternValidationResult::Passed
}
}
fn render_pattern_requirement_template(
template: &str,
globals: &liquid::Object,
) -> Result<String, String> {
PATTERN_REQUIREMENTS_TEMPLATE_PARSER
.parse(template)
.map_err(|e| e.to_string())
.and_then(|parsed| parsed.render(globals).map_err(|e| e.to_string()))
}
/// Result of validating [`PatternRequirements`] against a potential match.
#[derive(Debug, PartialEq, Eq)]
pub enum PatternValidationResult {
@ -162,6 +280,8 @@ pub enum PatternValidationResult {
Passed,
/// Requirements were not satisfied.
Failed,
/// Checksum requirements were not satisfied; captures basic mismatch details for debugging.
FailedChecksum { actual_len: usize, expected_len: usize },
/// The match contains one of the `ignore_if_contains` substrings and should be skipped.
IgnoredBySubstring { matched_term: String },
}
@ -407,6 +527,10 @@ lazy_static! {
pub static ref RULE_COMMENTS_PATTERN: Regex = Regex::new(
r"(?m)(\(\?#[^)]*\))|(\s\#[\sa-zA-Z]*$)"
).expect("comment-stripping regex should compile");
static ref PATTERN_REQUIREMENTS_TEMPLATE_PARSER: liquid::Parser =
liquid_filters::register_all(ParserBuilder::with_stdlib())
.build()
.expect("pattern requirement template parser should compile");
}
impl RuleSyntax {
@ -564,6 +688,7 @@ impl Rule {
#[cfg(test)]
mod tests {
use super::*;
use regex::bytes::Regex as BytesRegex;
#[test]
fn test_pattern_requirements_digits() {
@ -574,16 +699,75 @@ mod tests {
min_special_chars: None,
special_chars: None,
ignore_if_contains: None,
checksum: None,
};
// Should pass: has 3 digits
assert!(matches!(reqs.validate(b"abc123def", true), PatternValidationResult::Passed));
assert!(matches!(reqs.validate(b"abc123def", None, true), PatternValidationResult::Passed));
// Should fail: only 1 digit
assert!(matches!(reqs.validate(b"abc1def", true), PatternValidationResult::Failed));
assert!(matches!(reqs.validate(b"abc1def", None, true), PatternValidationResult::Failed));
// Should fail: no digits
assert!(matches!(reqs.validate(b"abcdef", true), PatternValidationResult::Failed));
assert!(matches!(reqs.validate(b"abcdef", None, true), PatternValidationResult::Failed));
}
#[test]
fn test_pattern_requirements_checksum() {
let reqs = PatternRequirements {
min_digits: None,
min_uppercase: None,
min_lowercase: None,
min_special_chars: None,
special_chars: None,
ignore_if_contains: None,
checksum: Some(ChecksumRequirement {
actual: ChecksumActual {
template: "{{ MATCH | suffix: 6 }}".to_string(),
requires_capture: Some("checksum".to_string()),
},
expected: "{{ BODY | crc32 | base62: 6 }}".to_string(),
skip_if_missing: true,
}),
};
let token = b"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";
let regex =
BytesRegex::new(r"(?x) ghp_(?P<body>[A-Za-z0-9]{30})(?P<checksum>[A-Za-z0-9]{6})")
.unwrap();
let captures = regex.captures(token).expect("token should match");
assert!(matches!(
reqs.validate(
token,
Some(PatternRequirementContext {
regex: &regex,
captures: &captures,
full_match: token
}),
true
),
PatternValidationResult::Passed
));
let mut invalid = token.to_vec();
*invalid.last_mut().unwrap() = b'0';
let captures_invalid =
regex.captures(&invalid).expect("invalid token should still match pattern");
assert!(matches!(
reqs.validate(
&invalid,
Some(PatternRequirementContext {
regex: &regex,
captures: &captures_invalid,
full_match: &invalid,
}),
true
),
PatternValidationResult::FailedChecksum { .. }
));
let legacy = b"ghp_legacy_token";
assert!(matches!(reqs.validate(legacy, None, true), PatternValidationResult::Passed));
}
#[test]
@ -595,16 +779,17 @@ mod tests {
min_special_chars: None,
special_chars: None,
ignore_if_contains: None,
checksum: None,
};
// Should pass: has 3 uppercase
assert!(matches!(reqs.validate(b"ABCdef", true), PatternValidationResult::Passed));
assert!(matches!(reqs.validate(b"ABCdef", None, true), PatternValidationResult::Passed));
// Should fail: only 1 uppercase
assert!(matches!(reqs.validate(b"Adef", true), PatternValidationResult::Failed));
assert!(matches!(reqs.validate(b"Adef", None, true), PatternValidationResult::Failed));
// Should fail: no uppercase
assert!(matches!(reqs.validate(b"abcdef", true), PatternValidationResult::Failed));
assert!(matches!(reqs.validate(b"abcdef", None, true), PatternValidationResult::Failed));
}
#[test]
@ -616,16 +801,17 @@ mod tests {
min_special_chars: None,
special_chars: None,
ignore_if_contains: None,
checksum: None,
};
// Should pass: has 3 lowercase
assert!(matches!(reqs.validate(b"ABCdef", true), PatternValidationResult::Passed));
assert!(matches!(reqs.validate(b"ABCdef", None, true), PatternValidationResult::Passed));
// Should fail: only 1 lowercase
assert!(matches!(reqs.validate(b"ABCd", true), PatternValidationResult::Failed));
assert!(matches!(reqs.validate(b"ABCd", None, true), PatternValidationResult::Failed));
// Should fail: no lowercase
assert!(matches!(reqs.validate(b"ABC123", true), PatternValidationResult::Failed));
assert!(matches!(reqs.validate(b"ABC123", None, true), PatternValidationResult::Failed));
}
#[test]
@ -637,16 +823,17 @@ mod tests {
min_special_chars: Some(2),
special_chars: None, // uses default
ignore_if_contains: None,
checksum: None,
};
// Should pass: has 2 special chars
assert!(matches!(reqs.validate(b"abc!@def", true), PatternValidationResult::Passed));
assert!(matches!(reqs.validate(b"abc!@def", None, true), PatternValidationResult::Passed));
// Should fail: only 1 special char
assert!(matches!(reqs.validate(b"abc!def", true), PatternValidationResult::Failed));
assert!(matches!(reqs.validate(b"abc!def", None, true), PatternValidationResult::Failed));
// Should fail: no special chars
assert!(matches!(reqs.validate(b"abcdef", true), PatternValidationResult::Failed));
assert!(matches!(reqs.validate(b"abcdef", None, true), PatternValidationResult::Failed));
}
#[test]
@ -658,16 +845,17 @@ mod tests {
min_special_chars: Some(2),
special_chars: Some("$%^".to_string()),
ignore_if_contains: None,
checksum: None,
};
// Should pass: has 2 custom special chars
assert!(matches!(reqs.validate(b"abc$%def", true), PatternValidationResult::Passed));
assert!(matches!(reqs.validate(b"abc$%def", None, true), PatternValidationResult::Passed));
// Should fail: has special chars but not the custom ones
assert!(matches!(reqs.validate(b"abc!@def", true), PatternValidationResult::Failed));
assert!(matches!(reqs.validate(b"abc!@def", None, true), PatternValidationResult::Failed));
// Should fail: only 1 custom special char
assert!(matches!(reqs.validate(b"abc$def", true), PatternValidationResult::Failed));
assert!(matches!(reqs.validate(b"abc$def", None, true), PatternValidationResult::Failed));
}
#[test]
@ -679,22 +867,23 @@ mod tests {
min_special_chars: Some(1),
special_chars: None,
ignore_if_contains: None,
checksum: None,
};
// Should pass: has all requirements
assert!(matches!(reqs.validate(b"Abc1!", true), PatternValidationResult::Passed));
assert!(matches!(reqs.validate(b"Abc1!", None, true), PatternValidationResult::Passed));
// Should fail: missing digit
assert!(matches!(reqs.validate(b"Abc!", true), PatternValidationResult::Failed));
assert!(matches!(reqs.validate(b"Abc!", None, true), PatternValidationResult::Failed));
// Should fail: missing uppercase
assert!(matches!(reqs.validate(b"abc1!", true), PatternValidationResult::Failed));
assert!(matches!(reqs.validate(b"abc1!", None, true), PatternValidationResult::Failed));
// Should fail: missing lowercase
assert!(matches!(reqs.validate(b"ABC1!", true), PatternValidationResult::Failed));
assert!(matches!(reqs.validate(b"ABC1!", None, true), PatternValidationResult::Failed));
// Should fail: missing special
assert!(matches!(reqs.validate(b"Abc1", true), PatternValidationResult::Failed));
assert!(matches!(reqs.validate(b"Abc1", None, true), PatternValidationResult::Failed));
}
#[test]
@ -706,22 +895,26 @@ mod tests {
min_special_chars: None,
special_chars: None,
ignore_if_contains: Some(vec!["test".to_string(), "Demo".to_string()]),
checksum: None,
};
// Should fail: contains "test" (case-insensitive)
assert!(matches!(
reqs.validate(b"MyTestToken", true),
reqs.validate(b"MyTestToken", None, true),
PatternValidationResult::IgnoredBySubstring { .. }
));
// Should fail: contains "demo" (case-insensitive)
assert!(matches!(
reqs.validate(b"example-demo-value", true),
reqs.validate(b"example-demo-value", None, true),
PatternValidationResult::IgnoredBySubstring { .. }
));
// Should pass: does not contain excluded words
assert!(matches!(reqs.validate(b"example-value", true), PatternValidationResult::Passed));
assert!(matches!(
reqs.validate(b"example-value", None, true),
PatternValidationResult::Passed
));
}
#[test]
@ -733,14 +926,15 @@ mod tests {
min_special_chars: None,
special_chars: None,
ignore_if_contains: Some(vec![" ".to_string(), "".to_string(), "BLOCK".to_string()]),
checksum: None,
};
// Should fail only when non-empty exclusion matches
assert!(matches!(
reqs.validate(b"needs-blocking", true),
reqs.validate(b"needs-blocking", None, true),
PatternValidationResult::IgnoredBySubstring { .. }
));
assert!(matches!(reqs.validate(b"allowed", true), PatternValidationResult::Passed));
assert!(matches!(reqs.validate(b"allowed", None, true), PatternValidationResult::Passed));
}
#[test]
@ -752,16 +946,20 @@ mod tests {
min_special_chars: None,
special_chars: None,
ignore_if_contains: Some(vec!["ignoreme".to_string()]),
checksum: None,
};
// With ignoring enabled, the match is skipped
assert!(matches!(
reqs.validate(b"value-ignoreme", true),
reqs.validate(b"value-ignoreme", None, true),
PatternValidationResult::IgnoredBySubstring { .. }
));
// With ignoring disabled, the same input passes requirements
assert!(matches!(reqs.validate(b"value-ignoreme", false), PatternValidationResult::Passed));
assert!(matches!(
reqs.validate(b"value-ignoreme", None, false),
PatternValidationResult::Passed
));
}
#[test]
@ -773,11 +971,12 @@ mod tests {
min_special_chars: None,
special_chars: None,
ignore_if_contains: None,
checksum: None,
};
// Should pass: no requirements
assert!(matches!(reqs.validate(b"anything", true), PatternValidationResult::Passed));
assert!(matches!(reqs.validate(b"123", true), PatternValidationResult::Passed));
assert!(matches!(reqs.validate(b"!@#", true), PatternValidationResult::Passed));
assert!(matches!(reqs.validate(b"anything", None, true), PatternValidationResult::Passed));
assert!(matches!(reqs.validate(b"123", None, true), PatternValidationResult::Passed));
assert!(matches!(reqs.validate(b"!@#", None, true), PatternValidationResult::Passed));
}
}


@ -332,9 +332,7 @@ async fn timed_validate_single_match<'a>(
}
let mut globals = Object::new();
for (k, v, ..) in &captured_values {
globals.insert(k.to_uppercase().into(), Value::scalar(v.clone()));
}
populate_globals_from_captures(&mut globals, &captured_values);
let rule_syntax = m.rule.syntax();
@ -961,6 +959,59 @@ async fn timed_validate_single_match<'a>(
commit_and_return(m);
}
fn populate_globals_from_captures(
globals: &mut Object,
captured_values: &[(String, String, usize, usize)],
) {
let mut best_token: Option<(usize, String)> = None;
for (k, v, ..) in captured_values {
let key = k.to_uppercase();
if key == "TOKEN" {
if best_token.as_ref().map_or(true, |(len, _)| v.len() >= *len) {
best_token = Some((v.len(), v.clone()));
}
} else {
globals.insert(key.into(), Value::scalar(v.clone()));
}
}
if let Some((_, token)) = best_token {
globals.insert("TOKEN".into(), Value::scalar(token));
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn populate_globals_prefers_longest_token() {
let captured_values = vec![
("TOKEN".to_string(), "short".to_string(), 0usize, 5usize),
("BODY".to_string(), "body".to_string(), 0usize, 4usize),
("TOKEN".to_string(), "longervalue".to_string(), 0usize, 11usize),
];
let mut globals = Object::new();
populate_globals_from_captures(&mut globals, &captured_values);
assert_eq!(globals.get("TOKEN").map(|v| v.to_string()), Some("longervalue".to_string()));
assert_eq!(globals.get("BODY").map(|v| v.to_string()), Some("body".to_string()));
}
#[test]
fn populate_globals_handles_missing_token() {
let captured_values = vec![("CHECKSUM".to_string(), "123456".to_string(), 0usize, 6usize)];
let mut globals = Object::new();
populate_globals_from_captures(&mut globals, &captured_values);
assert!(globals.get("TOKEN").is_none());
assert_eq!(globals.get("CHECKSUM").map(|v| v.to_string()), Some("123456".to_string()));
}
}
// #[cfg(test)]
// mod tests {
// use std::sync::Arc;


@ -6,19 +6,11 @@ use crate::validation::SerializableCaptures;
/// Return (NAME, value, start, end) for every capture we care about.
///
/// * If a capture has a name, use that (upper-cased)
/// * If it's unnamed, fall back to `"TOKEN"`
/// * Skip the unnamed “whole-match” capture **only when** there are
/// additional captures to return.
/// * If it's unnamed, fall back to `"TOKEN"`
pub fn process_captures(captures: &SerializableCaptures) -> Vec<(String, String, usize, usize)> {
let multiple = captures.captures.len() > 1;
captures
.captures
.iter()
// Skip the whole-match capture (match_number == 0) only when there
// are additional captures. All other captures, named or unnamed,
// should be preserved.
.filter(|cap| !multiple || cap.match_number != 0)
.map(|cap| {
let name =
cap.name.as_ref().map(|n| n.to_uppercase()).unwrap_or_else(|| "TOKEN".to_string());
@ -140,7 +132,7 @@ mod tests {
}
#[test]
fn skips_whole_match_when_multiple() {
fn includes_whole_match_when_multiple() {
let captures = SerializableCaptures {
captures: smallvec![
SerializableCapture {
@ -160,11 +152,17 @@ mod tests {
],
};
let result = process_captures(&captures);
assert_eq!(result, vec![("FOO".to_string(), "bcd".to_string(), 1usize, 4usize)]);
assert_eq!(
result,
vec![
("TOKEN".to_string(), "abcde".to_string(), 0usize, 5usize),
("FOO".to_string(), "bcd".to_string(), 1usize, 4usize),
]
);
}
#[test]
fn includes_unnamed_groups_but_skips_whole_match() {
fn includes_whole_match_and_unnamed_groups() {
let captures = SerializableCaptures {
captures: smallvec![
SerializableCapture {
@ -188,6 +186,7 @@ mod tests {
assert_eq!(
result,
vec![
("TOKEN".to_string(), "aabbcc".to_string(), 0usize, 6usize),
("FOO".to_string(), "aa".to_string(), 0usize, 2usize),
("TOKEN".to_string(), "cc".to_string(), 4usize, 6usize),
]