C0: black-hole /mirrors/* at Fly edge + name-and-shame scrapers
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 35s
A $29.60 Fly bill traced to ~1.25 TB/30d egress on forge.eblu.me (99.95% of all proxy egress), ~71% of it AI scrapers (Meta meta-externalagent, OpenAI GPTBot, Amazonbot, Bytespider) crawling the public mirror repos' infinite git-history URL space and timing out Forgejo.

robots.txt already disallowed /mirrors/ but those agents ignore it, so enforce at the edge: return 403 (^~ to beat the regex asset locations), served as a roll-of-dishonour page with an X-Naughty-Scrapers header. Mirrors stay reachable on the tailnet via forge.ops.eblu.me.

Tier 2 (UA denylist + Anubis) and the Cloudflare rejection are documented in docs/explanation/ai-scraper-mitigation.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
parent e0064de83d
commit a36a18aaa6
7 changed files with 302 additions and 0 deletions
@@ -215,6 +215,33 @@ http {
     return 403 "API documentation is only available at forge.ops.eblu.me (tailnet).\n";
 }

+# Black-hole the mirror repositories on WAN. These are mirrors of
+# already-public upstreams (tailscale, prometheus, mealie, …) kept
+# for supply-chain control; CI, gilbert, and tailnet clients consume
+# them via forge.ops.eblu.me. Their web UI served no public purpose
+# but AI scrapers, which crawled the near-infinite git-history URL
+# space (src/commit, commits, blame, raw) and drove ~70% of Fly
+# egress (1.24 TB/30d → a surprise bill) plus enough upstream load to
+# time out Forgejo. robots.txt already Disallows /mirrors/, but
+# meta-externalagent and GPTBot ignore it — so enforce at the edge.
+# `^~` makes this win over the regex locations below (e.g. *.css), so
+# static assets under /mirrors/ can't leak through. We also name and
+# shame: blocked requests get a "roll of dishonour" page (403 status
+# preserved) and an X-Naughty-Scrapers header. See
+# docs/explanation/ai-scraper-mitigation.md.
+location ^~ /mirrors/ {
+    error_page 403 /naughty.html;
+    return 403;
+}
+
+# Roll of dishonour — served on the /mirrors/ 403, status kept at 403.
+location = /naughty.html {
+    internal;
+    root /usr/share/nginx/html;
+    add_header X-Naughty-Scrapers "OpenAI/GPTBot, Meta/meta-externalagent, Amazonbot, ByteDance/Bytespider — robots.txt ignorers" always;
+    add_header X-Clacks-Overhead "GNU Terry Pratchett" always;
+}
+
 # Redirect archive endpoints to tailnet — archive requests generate full
 # git bundles on demand. Unauthenticated crawlers hitting unique commit
 # SHAs cause unbounded CPU and disk usage (DoS vector). Legitimate users
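For anyone wanting to reproduce the behaviour locally, the following is a minimal standalone sketch of the same pattern, not the deployed config. The port `8080`, the stand-in `\.css$` regex location, the catch-all `location /`, and the shortened header value are assumptions made for illustration; only the `/mirrors/` and `/naughty.html` blocks follow the diff above. It also assumes a `naughty.html` file exists under `/usr/share/nginx/html`.

```nginx
# Standalone sketch for local testing (hypothetical port and stand-in locations).
events {}

http {
    server {
        listen 8080;  # assumed port, not taken from the real proxy

        # Stand-in for the proxy's regex asset locations (assumption).
        location ~* \.css$ {
            return 200 "asset served\n";
        }

        # Prefix match marked ^~: when this is the longest matching prefix,
        # nginx skips the regex locations, so /mirrors/x/style.css is
        # handled here and never reaches the \.css$ block above.
        location ^~ /mirrors/ {
            # No "=code" on error_page, so the client still sees 403.
            error_page 403 /naughty.html;
            # A bare status with no body text lets error_page take over.
            return 403;
        }

        # Reachable only through the internal redirect from error_page;
        # a direct client request for /naughty.html gets 404.
        location = /naughty.html {
            internal;
            root /usr/share/nginx/html;
            # "always" is required: add_header skips 4xx responses otherwise.
            add_header X-Naughty-Scrapers "example value" always;
        }

        # Stand-in for everything else the proxy serves (assumption).
        location / {
            return 200 "ok\n";
        }
    }
}
```

With this sketch loaded, a request for `/mirrors/foo/style.css` should receive the 403 page and the `X-Naughty-Scrapers` header rather than the regex location's response. Dropping `^~` from the `/mirrors/` block would let the `\.css$` location win for that URL, which is the leak the commit guards against.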