diff --git a/docs/changelog.d/+ai-scraper-mitigation-doc.doc.md b/docs/changelog.d/+ai-scraper-mitigation-doc.doc.md new file mode 100644 index 0000000..246fedb --- /dev/null +++ b/docs/changelog.d/+ai-scraper-mitigation-doc.doc.md @@ -0,0 +1 @@ +Add `docs/explanation/ai-scraper-mitigation.md` — the egress-cost / AI-crawler threat model for the public Fly proxy, the tiered mitigation plan (Tier 1: mirror black-hole, shipped; Tier 2: user-agent denylist + Anubis; Tier 3: Cloudflare, rejected on principle), and the data behind it. diff --git a/docs/changelog.d/+forge-mirrors-blackhole.infra.md b/docs/changelog.d/+forge-mirrors-blackhole.infra.md new file mode 100644 index 0000000..29a5e6a --- /dev/null +++ b/docs/changelog.d/+forge-mirrors-blackhole.infra.md @@ -0,0 +1 @@ +Black-hole the `/mirrors/*` repositories at the Fly proxy edge (`return 403` → `forge.ops.eblu.me`). A surprise $29.60 Fly bill traced to ~1.24 TB/30d of egress on `forge.eblu.me`, 99.95% of all proxy egress — of which ~71% was AI scrapers (Meta `meta-externalagent`, OpenAI `GPTBot`, Amazonbot) crawling the near-infinite git-history URL space of the public mirror repos and timing out Forgejo in the process. Mirrors exist for supply-chain control and are consumed over the tailnet, so their public web UI had no legitimate audience. `robots.txt` already disallowed `/mirrors/`, but the offending agents ignore it. Tier-2 mitigations (user-agent denylist, Anubis proof-of-work gateway) are documented in `docs/explanation/ai-scraper-mitigation.md`. 
diff --git a/docs/explanation/ai-scraper-mitigation.md b/docs/explanation/ai-scraper-mitigation.md new file mode 100644 index 0000000..fe4ba3d --- /dev/null +++ b/docs/explanation/ai-scraper-mitigation.md @@ -0,0 +1,201 @@ +--- +title: AI Scraper Mitigation +modified: 2026-06-01 +last-reviewed: 2026-06-01 +tags: + - explanation + - fly-io + - forgejo + - security + - networking +--- + +# AI Scraper Mitigation on the Public Proxy + +> **Note:** This article was drafted by AI and reviewed by Erich. I plan to rewrite all explanatory content in my own words — these serve as placeholders to establish the documentation structure. + +How BlumeOps keeps AI crawlers from running up the [[expose-service-publicly|Fly.io proxy]] egress bill and DoS-ing [[forgejo|Forgejo]] on [[indri]]. + +## The incident + +A $29.60 Fly.io invoice arrived, nearly all of it a single line: + +``` +Bandwidth: Egress (iad) — 958,524,714,138 bytes — $19.17 +``` + +The `iad` (Ashburn) region is a red herring: the proxy machine runs in `sjc`, +but Fly bills egress at the edge PoP nearest the *client*, so `iad` just means +"the traffic went to clients on the US East Coast." + +Tracing it through the nginx access logs (shipped to Loki via [[alloy|Alloy]]): + +| Signal | Value | +|--------|-------| +| Total proxy egress (30d) | ~1.25 TB | +| Share that was `forge.eblu.me` | **99.95%** | +| Share of forge egress that was `/mirrors/*` | **~71%** | +| Share that was declared AI bots | **~85%+** | +| Top offenders | Meta `meta-externalagent` (66% of bytes), OpenAI `GPTBot` (16%), Amazonbot, Bytespider | +| Forgejo `5xx` (upstream timeouts) | tens of thousands/day, spiking to 112k | + +The crawlers were walking [[forgejo|Forgejo]]'s git-history browse endpoints — +`src/commit/`, `commits/`, `blame/`, `raw/commit/`, plus `.patch`/`.diff` +and `?page=N` pagination. That URL space is effectively **infinite**: every +file × every commit × every page, multiplied across every mirrored repo. 
A +crawler that follows links never finishes, and every page is a cache `MISS` +that both tunnels to indri *and* bills as egress. + +Two distinct harms, not one: + +1. **Cost** — ~1.25 TB/mo of egress on a free-tier-ish proxy. +2. **Availability** — the crawl alone generates ~400–530k requests/day, + enough to time out Forgejo regardless of how much RAM [[indri]] has. Moving + egress elsewhere would *not* fix this; the crawl has to be throttled at the + source. + +`robots.txt` already `Disallow`s `/mirrors/`, `/user/`, and archive/download +paths — but **`meta-externalagent` and `GPTBot` ignore it.** For these agents, +`robots.txt` is a dead letter, which is why edge enforcement is required. + +## The tiered plan + +### Tier 1 — Black-hole `/mirrors/*` (shipped) + +The mirror repositories (`tailscale`, `prometheus`, `mealie`, `paperless-ngx`, +…) are mirrors of *already-public upstreams*, kept for supply-chain control +(see [[spork-strategy]] and the container/mirror story in [[why-gitops]]). They +are consumed by CI, gilbert, and other tailnet clients over +`forge.ops.eblu.me`. Their web UI on the public internet served **no +legitimate audience** — only scrapers. So the proxy now returns `403` for +anything under `/mirrors/`, pointing humans at the tailnet host: + +```nginx +location ^~ /mirrors/ { + return 403 "Mirror repositories are tailnet-only — use forge.ops.eblu.me.\n"; +} +``` + +The `^~` modifier matters: nginx checks regex `location` blocks after prefix +matching, so without it the regex blocks for static assets (`*.css`, `*.js`, +release downloads) would take precedence over a plain prefix match and leak content +under `/mirrors/`. `^~` tells nginx to stop at the longest prefix match and skip the +regex round. + +This is config, not bot-fighting — we simply stopped serving an infinite +tarpit to the world. It removes ~71% of forge egress and a large share of the +upstream timeouts, with zero impact on any human or tailnet consumer. It +mirrors the existing tailnet-only blocks for `/api/packages/` and `/swagger`. 
+ +The `403` is also a small act of public shaming. Blocked requests are served a +"roll of dishonour" page (`fly/naughty.html`, status kept at `403` via +`error_page 403 /naughty.html`) that names the offending operators and their +share of the stolen bytes, and every response carries an `X-Naughty-Scrapers` +header: + +``` +X-Naughty-Scrapers: OpenAI/GPTBot, Meta/meta-externalagent, Amazonbot, ByteDance/Bytespider — robots.txt ignorers +``` + +Petty? A little. But it costs nothing, documents *why* the block exists for the +next person who hits it, and the page is a few KB versus the megabytes of git +HTML the crawlers were taking. + +**Trade-off accepted:** mirror release-artifact downloads over WAN now also +`403`. Legitimate consumers already pull these over the tailnet, and the public +exposure was the same crawl liability, so this is intentional. + +### Tier 2 — Defend the repos that *stay* public (planned) + +`/eblume/*` is intentionally public (a public profile is a feature). But the +same git-history endpoints are still a tarpit there, just lower-volume. Two +layers, in increasing order of effort and effectiveness: + +#### 2a. User-agent denylist (cheap, evadable) + +Block the declared AI crawlers at the edge regardless of path: + +```nginx +# Illustrative — not yet deployed. +map $http_user_agent $is_ai_bot { + default 0; + "~*meta-externalagent" 1; + "~*GPTBot" 1; + "~*ClaudeBot" 1; + "~*Amazonbot" 1; + "~*Bytespider" 1; + "~*SemrushBot" 1; +} +# in the forge.eblu.me server block: +if ($is_ai_bot) { return 403; } +``` + +This catches ~85% of *current* traffic for a few lines of config. It is +trivially evadable — a scraper need only spoof a browser UA — so it is a +speed-bump, not a wall. Keep `robots.txt` too: well-behaved crawlers +(Googlebot, Bingbot) do honor it, and it documents intent. + +#### 2b. 
Anubis proof-of-work gateway (the real wall) + +[Anubis](https://github.com/TecharoHQ/anubis) is a Go reverse proxy that +weighs each request with a browser-based proof-of-work challenge before passing +it upstream. It was written for *exactly this scenario* — its author built it +after Amazon's scraper took down their Git server — and is widely deployed in +front of Forgejo/Gitea (Codeberg, the UN, etc.). Headless scrapers that can't +run the challenge JS never reach the application; humans clear it once and +proceed. + +Why it fits BlumeOps better than the alternatives: + +- **It attacks cost *and* availability at once.** Bots receive a few-KB + challenge page instead of MB of git HTML (egress collapses) and never reach + Forgejo (timeouts collapse). No other single lever does both. +- **It stays in-house.** No third party terminates our TLS or sees our + traffic. + +Placement options: + +| Where | Pros | Cons | +|-------|------|------| +| On [[indri]], between [[caddy|Caddy]] and Forgejo | Protects every path and every entry (WAN *and* tailnet); one config | Adds a hop and a service to the indri critical path; the challenge page still tunnels back through Fly for WAN clients (small egress) | +| On the Fly proxy machine, in front of nginx | Challenge served at the edge — bots never even tunnel to indri | Fly VM is small (512 MB); another moving part in the boot sequence alongside `tailscaled`/nginx/`fail2ban`/Alloy | + +Leaning toward Caddy-side on indri for simplicity and uniform coverage, but +this is the open design question for Tier 2. Anubis is MIT-licensed and the +author has signalled a future move to an `equi-x`-based challenge, so pin a +version and track upstream. 
+ +### Tier 3 — Move egress off Fly entirely (rejected) + +A Cloudflare Tunnel (`cloudflared` on indri → Cloudflare +edge) would make this a non-problem on the cost axis: Cloudflare does not meter +proxied bandwidth, and it bundles free AI-bot mitigation (Bot Fight Mode, the +"block AI scrapers" toggle, Managed Challenge, AI Labyrinth). One move would +zero the egress bill and add bot defense. + +**We are not doing this, on principle.** Cloudflare is a solid platform and a +defensible engineering choice — but it already sits in front of an enormous +fraction of the modern web, and routing BlumeOps through it would add one more +site to the pile of the internet that one company can see and gate. BlumeOps +deliberately keeps its own backbone ([[expose-service-publicly|Fly + Tailscale ++ Caddy]], DNS at [[gandi|Gandi]] — see the "no Cloudflare dependency" line in +that doc). This is a values decision, not a technical one: we would rather pay +a few dollars and run our own mitigation than centralize on Cloudflare. + +It is also worth noting that **Tier 3 would not, by itself, fix the upstream +timeouts** — free egress just means we'd stop *caring* that bots crawl, while +they continue to hammer Forgejo. Crawl mitigation (Tier 1 + Tier 2) is +required regardless of where egress is billed. 
+ +## Summary + +| Tier | Lever | Cost | Availability | Status | +|------|-------|------|--------------|--------| +| 1 | Black-hole `/mirrors/*` at edge | −~71% | big drop | **shipped** | +| 2a | UA denylist on remaining repos | −most of the rest | further drop | planned | +| 2b | Anubis PoW gateway | −near-total | near-total | planned | +| 3 | Cloudflare Tunnel | −total | needs 2b anyway | **rejected (principle)** | + +The guiding insight: the cheapest, lowest-risk mitigation is to **not serve an +infinite-URL surface that has no human audience.** Everything past Tier 1 is +about defending the surface we *do* want public, in-house, without ceding +control of our traffic to a third party. diff --git a/docs/tutorials/expose-service-publicly.md b/docs/tutorials/expose-service-publicly.md index 886cad4..65af611 100644 --- a/docs/tutorials/expose-service-publicly.md +++ b/docs/tutorials/expose-service-publicly.md @@ -376,6 +376,13 @@ Mitigations for dynamic services: - fail2ban on indri (see below) can block IPs showing abuse patterns - The break-glass shutoff remains the last resort +The most acute version of this in practice has been **AI scrapers**, which +ignore `robots.txt` and crawl dynamic services (notably [[forgejo|Forgejo]]'s +infinite git-history URL space) into both a surprise egress bill and an +effective L7 DoS. See [[ai-scraper-mitigation]] for the incident, the tiered +defense (mirror black-hole, user-agent denylist, Anubis proof-of-work), and +why a Cloudflare Tunnel is *not* the chosen answer here. 
+ If a publicly exposed dynamic service attracts targeted attacks or the home network bandwidth is impacted, consider migrating to Cloudflare Tunnel for enterprise-grade DDoS protection (requires DNS migration; diff --git a/fly/Dockerfile b/fly/Dockerfile index d4e7a18..406c849 100644 --- a/fly/Dockerfile +++ b/fly/Dockerfile @@ -25,6 +25,7 @@ COPY fail2ban/action.d/nginx-deny.conf /etc/fail2ban/action.d/nginx-deny.conf COPY nginx.conf /etc/nginx/nginx.conf COPY error.html /usr/share/nginx/html/error.html +COPY naughty.html /usr/share/nginx/html/naughty.html COPY alloy.river /etc/alloy/config.alloy COPY start.sh /start.sh RUN chmod +x /start.sh diff --git a/fly/naughty.html b/fly/naughty.html new file mode 100644 index 0000000..d899171 --- /dev/null +++ b/fly/naughty.html @@ -0,0 +1,64 @@ + + + + + + + 403 · Roll of Dishonour + + + +
+

🪤 403 — you walked into the scraper trap

+

These are mirror repositories. They are tailnet-only.

+ +

+ This path used to serve the web UI for mirrors of public upstream + projects. It exists for supply-chain control, not for crawling. A + robots.txt politely disallowed /mirrors/. + A pack of AI scrapers ignored it, walked the infinite git-history URL + space, and ran up ~1.25 TB of egress and a real + money bill in a single month — while timing out the server for everyone + else. +

+ +

So /mirrors/ is closed at the edge now. Roll of dishonour, + by share of the bytes they stole:

+ + + + + + + + + +
Operator | User-Agent
Meta | meta-externalagent
OpenAI | GPTBot
Amazon | Amazonbot
ByteDance | Bytespider
+ +

+ If you are a human who actually wanted these mirrors, they are reachable + from the tailnet at forge.ops.eblu.me. If you are a crawler: + read the robots.txt next time. We left you a header, too. +

+ + +
+ + diff --git a/fly/nginx.conf b/fly/nginx.conf index 570e6c9..ec35774 100644 --- a/fly/nginx.conf +++ b/fly/nginx.conf @@ -215,6 +215,33 @@ http { return 403 "API documentation is only available at forge.ops.eblu.me (tailnet).\n"; } + # Black-hole the mirror repositories on WAN. These are mirrors of + # already-public upstreams (tailscale, prometheus, mealie, …) kept + # for supply-chain control; CI, gilbert, and tailnet clients consume + # them via forge.ops.eblu.me. Their web UI served no public purpose + # but AI scrapers, which crawled the near-infinite git-history URL + # space (src/commit, commits, blame, raw) and drove ~70% of Fly + # egress (1.24 TB/30d → a surprise bill) plus enough upstream load to + # time out Forgejo. robots.txt already Disallows /mirrors/, but + # meta-externalagent and GPTBot ignore it — so enforce at the edge. + # `^~` makes this win over the regex locations below (e.g. *.css), so + # static assets under /mirrors/ can't leak through. We also name and + # shame: blocked requests get a "roll of dishonour" page (403 status + # preserved) and an X-Naughty-Scrapers header. See + # docs/explanation/ai-scraper-mitigation.md. + location ^~ /mirrors/ { + error_page 403 /naughty.html; + return 403; + } + + # Roll of dishonour — served on the /mirrors/ 403, status kept at 403. + location = /naughty.html { + internal; + root /usr/share/nginx/html; + add_header X-Naughty-Scrapers "OpenAI/GPTBot, Meta/meta-externalagent, Amazonbot, ByteDance/Bytespider — robots.txt ignorers" always; + add_header X-Clacks-Overhead "GNU Terry Pratchett" always; + } + # Redirect archive endpoints to tailnet — archive requests generate full # git bundles on demand. Unauthenticated crawlers hitting unique commit # SHAs cause unbounded CPU and disk usage (DoS vector). Legitimate users