C0: black-hole /mirrors/* at Fly edge + name-and-shame scrapers

A $29.60 Fly bill traced to ~1.25 TB/30d egress on forge.eblu.me (99.95% of all proxy egress), ~71% of it AI scrapers (Meta meta-externalagent, OpenAI GPTBot, Amazonbot, Bytespider) crawling the public mirror repos' infinite git-history URL space and timing out Forgejo.

robots.txt already disallowed /mirrors/ but those agents ignore it, so enforce at the edge: return 403 (^~ to beat the regex asset locations), served as a roll-of-dishonour page with an X-Naughty-Scrapers header. Mirrors stay reachable on the tailnet via forge.ops.eblu.me.

Tier 2 (UA denylist + Anubis) and the Cloudflare rejection are documented in docs/explanation/ai-scraper-mitigation.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-01 20:52:20 -07:00 · 2026-06-01 20:52:20 -07:00 · a36a18aaa6
commit a36a18aaa6
parent e0064de83d
7 changed files with 302 additions and 0 deletions
--- a/docs/changelog.d/+ai-scraper-mitigation-doc.doc.md
+++ b/docs/changelog.d/+ai-scraper-mitigation-doc.doc.md
@@ -0,0 +1 @@
+Add `docs/explanation/ai-scraper-mitigation.md` — the egress-cost / AI-crawler threat model for the public Fly proxy, the tiered mitigation plan (Tier 1: mirror black-hole, shipped; Tier 2: user-agent denylist + Anubis; Tier 3: Cloudflare, rejected on principle), and the data behind it.
--- a/docs/changelog.d/+forge-mirrors-blackhole.infra.md
+++ b/docs/changelog.d/+forge-mirrors-blackhole.infra.md
@@ -0,0 +1 @@
+Black-hole the `/mirrors/*` repositories at the Fly proxy edge (`return 403`, pointing humans at `forge.ops.eblu.me`). A surprise $29.60 Fly bill traced to ~1.25 TB/30d of egress on `forge.eblu.me` (99.95% of all proxy egress), of which ~71% was AI scrapers (Meta `meta-externalagent`, OpenAI `GPTBot`, Amazonbot, Bytespider) crawling the near-infinite git-history URL space of the public mirror repos and timing out Forgejo in the process. Mirrors exist for supply-chain control and are consumed over the tailnet, so their public web UI had no legitimate audience. `robots.txt` already disallowed `/mirrors/`, but the offending agents ignore it. Tier-2 mitigations (user-agent denylist, Anubis proof-of-work gateway) are documented in `docs/explanation/ai-scraper-mitigation.md`.
--- a/docs/explanation/ai-scraper-mitigation.md
+++ b/docs/explanation/ai-scraper-mitigation.md
@@ -0,0 +1,201 @@
+---
+title: AI Scraper Mitigation
+modified: 2026-06-01
+last-reviewed: 2026-06-01
+tags:
+  - explanation
+  - fly-io
+  - forgejo
+  - security
+  - networking
+---
+
+# AI Scraper Mitigation on the Public Proxy
+
+> **Note:** This article was drafted by AI and reviewed by Erich. I plan to rewrite all explanatory content in my own words — these serve as placeholders to establish the documentation structure.
+
+How BlumeOps keeps AI crawlers from running up the [[expose-service-publicly|Fly.io proxy]] egress bill and DoS-ing [[forgejo|Forgejo]] on [[indri]].
+
+## The incident
+
+A $29.60 Fly.io invoice arrived, roughly two-thirds of it a single line:
+
+```
+Bandwidth: Egress (iad) — 958,524,714,138 bytes — $19.17
+```
+
+The `iad` (Ashburn) region is a red herring: the proxy machine runs in `sjc`,
+but Fly bills egress at the edge PoP nearest the *client*, so `iad` just means
+"the traffic went to clients on the US East Coast."
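+
+As a rough check on that line item (the per-GB rate below is not quoted on the
+invoice; it is simply what the two figures imply, assuming decimal gigabytes):
+
+$$
+\frac{958{,}524{,}714{,}138\ \text{bytes}}{10^{9}\ \text{bytes/GB}} \approx 958.5\ \text{GB},
+\qquad
+\frac{\$19.17}{958.5\ \text{GB}} \approx \$0.02\ \text{per GB}
+$$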
+
+Tracing it through the nginx access logs (shipped to Loki via [[alloy|Alloy]]):
+
+| Signal | Value |
+|--------|-------|
+| Total proxy egress (30d) | ~1.25 TB |
+| Share that was `forge.eblu.me` | **99.95%** |
+| Share of forge egress that was `/mirrors/*` | **~71%** |
+| Share of egress bytes from self-declared AI bots | **~85%+** |
+| Top offenders | Meta `meta-externalagent` (66% of bytes), OpenAI `GPTBot` (16%), Amazonbot, Bytespider |
+| Forgejo `5xx` (upstream timeouts) | tens of thousands/day, spiking to 112k |
+
+The crawlers were walking [[forgejo|Forgejo]]'s git-history browse endpoints —
+`src/commit/<sha>`, `commits/`, `blame/`, `raw/commit/`, plus `.patch`/`.diff`
+and `?page=N` pagination. That URL space is effectively **infinite**: every
+file × every commit × every page, multiplied across every mirrored repo. A
+crawler that follows links never finishes, and every page is a cache `MISS`
+that both tunnels to indri *and* bills as egress.
+
+Two distinct harms, not one:
+
+1. **Cost** — ~1.25 TB/mo of egress on a free-tier-ish proxy.
+2. **Availability** — the crawl alone generates ~400–530k requests/day,
+   enough to time out Forgejo regardless of how much RAM [[indri]] has. Moving
+   egress elsewhere would *not* fix this; the crawl has to be throttled at the
+   source.
+
+`robots.txt` already `Disallow`s `/mirrors/`, `/user/`, and archive/download
+paths — but **`meta-externalagent` and `GPTBot` ignore it.** For these agents,
+`robots.txt` is a dead letter, which is why edge enforcement is required.
+
+## The tiered plan
+
+### Tier 1 — Black-hole `/mirrors/*` (shipped)
+
+The mirror repositories (`tailscale`, `prometheus`, `mealie`, `paperless-ngx`,
+…) are mirrors of *already-public upstreams*, kept for supply-chain control
+(see [[spork-strategy]] and the container/mirror story in [[why-gitops]]). They
+are consumed by CI, gilbert, and other tailnet clients over
+`forge.ops.eblu.me`. Their web UI on the public internet served **no
+legitimate audience** — only scrapers. So the proxy now returns `403` for
+anything under `/mirrors/`, pointing humans at the tailnet host:
+
+```nginx
+location ^~ /mirrors/ {
+    return 403 "Mirror repositories are tailnet-only — use forge.ops.eblu.me.\n";
+}
+```
+
+The `^~` modifier matters: without it, the regex `location` blocks for static
+assets (`*.css`, `*.js`, release downloads) would match first and leak content
+under `/mirrors/`. `^~` tells nginx to stop at the prefix match and skip the
+regex round.
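+
+A minimal sketch of the precedence rule, assuming a hypothetical regex asset
+block (the real asset locations live in the proxy config and are not
+reproduced here; `forge_upstream` and the example path are assumed names):
+
+```nginx
+# Illustrative only; the regex location and upstream name are assumptions.
+
+# Prefix match with ^~: when this is the longest matching prefix, nginx
+# skips the regex locations entirely.
+location ^~ /mirrors/ {
+    return 403;
+}
+
+# Regex match for static assets. Without ^~ above, a request such as
+# /mirrors/tailscale/assets/app.css would be handled here instead.
+location ~* \.(css|js)$ {
+    proxy_pass http://forge_upstream;
+}
+```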
+
+This is config, not bot-fighting — we simply stopped serving an infinite
+tarpit to the world. It removes ~71% of forge egress and a large share of the
+upstream timeouts, with zero impact on any human or tailnet consumer. It
+mirrors the existing tailnet-only blocks for `/api/packages/` and `/swagger`.
+
+The `403` is also a small act of public shaming. Blocked requests are served a
+"roll of dishonour" page (`fly/naughty.html`, status kept at `403` via
+`error_page 403 /naughty.html`) that names the offending operators and their
+share of the stolen bytes, and every response carries an `X-Naughty-Scrapers`
+header:
+
+```
+X-Naughty-Scrapers: OpenAI/GPTBot, Meta/meta-externalagent, Amazonbot, ByteDance/Bytespider — robots.txt ignorers
+```
+
+Petty? A little. But it costs nothing, documents *why* the block exists for the
+next person who hits it, and the page is a few KB versus the megabytes of git
+HTML the crawlers were taking.
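+
+One way the page and header could be wired together, as a sketch rather than
+the deployed config: the document root is an assumption, and the header value
+is shortened to ASCII here. The sketch uses a bare `return 403;` because, as
+far as I can tell, nginx sends the body of `return 403 "text"` directly
+instead of consulting `error_page`.
+
+```nginx
+# Illustrative only; the root path is an assumption.
+error_page 403 /naughty.html;      # no "=code", so the status stays 403
+
+location ^~ /mirrors/ {
+    return 403;                    # bare code, so error_page takes over
+}
+
+location = /naughty.html {
+    internal;                      # reachable only via error_page
+    root /usr/share/nginx/html;    # assumed location of fly/naughty.html
+    # "always" is needed: without it add_header skips 4xx responses.
+    add_header X-Naughty-Scrapers "OpenAI/GPTBot, Meta/meta-externalagent, Amazonbot, ByteDance/Bytespider" always;
+}
+```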
+
+**Trade-off accepted:** mirror release-artifact downloads over WAN now also
+`403`. Legitimate consumers already pull these over the tailnet, and the public
+exposure was the same crawl liability, so this is intentional.
+
+### Tier 2 — Defend the repos that *stay* public (planned)
+
+`/eblume/*` is intentionally public (a public profile is a feature). But the
+same git-history endpoints are still a tarpit there, just lower-volume. Two
+layers, in increasing order of effort and effectiveness:
+
+#### 2a. User-agent denylist (cheap, evadable)
+
+Block the declared AI crawlers at the edge regardless of path:
+
+```nginx
+# Illustrative — not yet deployed.
+map $http_user_agent $is_ai_bot {
+    default                 0;
+    "~*meta-externalagent"  1;
+    "~*GPTBot"              1;
+    "~*ClaudeBot"           1;
+    "~*Amazonbot"           1;
+    "~*Bytespider"          1;
+    "~*SemrushBot"          1;
+}
+# in the forge.eblu.me server block:
+if ($is_ai_bot) { return 403; }
+```
+
+This catches ~85% of *current* egress bytes for a few lines of config. It is
+trivially evadable (a scraper need only spoof a browser UA), so it is a
+speed-bump, not a wall. Keep `robots.txt` too: well-behaved crawlers
+(Googlebot, Bingbot) do honor it, and it documents intent.
+
+#### 2b. Anubis proof-of-work gateway (the real wall)
+
+[Anubis](https://github.com/TecharoHQ/anubis) is a Go reverse proxy that
+weighs each request with a browser-based proof-of-work challenge before passing
+it upstream. It was written for *exactly this scenario* — its author built it
+after Amazon's scraper took down their Git server — and is widely deployed in
+front of Forgejo/Gitea (Codeberg, the UN, etc.). Headless scrapers that can't
+run the challenge JS never reach the application; humans clear it once and
+proceed.
+
+Why it fits BlumeOps better than the alternatives:
+
+- **It attacks cost *and* availability at once.** Bots receive a few-KB
+  challenge page instead of MB of git HTML (egress collapses) and never reach
+  Forgejo (timeouts collapse). No other single lever does both.
+- **It stays in-house.** No third party terminates our TLS or sees our
+  traffic.
+
+Placement options:
+
+| Where | Pros | Cons |
+|-------|------|------|
+| On [[indri]], between [[caddy|Caddy]] and Forgejo | Protects every path and every entry (WAN *and* tailnet); one config | Adds a hop and a service to the indri critical path; the challenge page still tunnels back through Fly for WAN clients (small egress) |
+| On the Fly proxy machine, in front of nginx | Challenge served at the edge — bots never even tunnel to indri | Fly VM is small (512 MB); another moving part in the boot sequence alongside `tailscaled`/nginx/`fail2ban`/Alloy |
+
+Leaning toward Caddy-side on indri for simplicity and uniform coverage, but
+this is the open design question for Tier 2. Anubis is MIT-licensed and the
+author has signalled a future move to an `equi-x`-based challenge, so pin a
+version and track upstream.
+
+### Tier 3 — Move egress off Fly entirely (rejected)
+
+A Cloudflare Tunnel (`cloudflared` on indri → Cloudflare
+edge) would make this a non-problem on the cost axis: Cloudflare does not meter
+proxied bandwidth, and it bundles free AI-bot mitigation (Bot Fight Mode, the
+"block AI scrapers" toggle, Managed Challenge, AI Labyrinth). One move would
+zero the egress bill and add bot defense.
+
+**We are not doing this, on principle.** Cloudflare is a solid platform and a
+defensible engineering choice — but it already sits in front of an enormous
+fraction of the modern web, and routing BlumeOps through it would add one more
+site to the pile of the internet that one company can see and gate. BlumeOps
+deliberately keeps its own backbone ([[expose-service-publicly|Fly + Tailscale + Caddy]],
+DNS at [[gandi|Gandi]]; see the "no Cloudflare dependency" line in
+that doc). This is a values decision, not a technical one: we would rather pay
+a few dollars and run our own mitigation than centralize on Cloudflare.
+
+It is also worth noting that **Tier 3 would not, by itself, fix the upstream
+timeouts** — free egress just means we'd stop *caring* that bots crawl, while
+they continued to hammer Forgejo. Crawl mitigation (Tier 1 + Tier 2) is
+required regardless of where egress is billed.
+
+## Summary
+
+| Tier | Lever | Effect on egress cost | Effect on availability | Status |
+|------|-------|-----------------------|------------------------|--------|
+| 1 | Black-hole `/mirrors/*` at edge | ~71% lower | big drop in timeouts | **shipped** |
+| 2a | UA denylist on remaining repos | removes most of the rest | further drop in timeouts | planned |
+| 2b | Anubis PoW gateway | near-total reduction | near-total fix | planned |
+| 3 | Cloudflare Tunnel | bill goes to zero | none by itself; needs 2b anyway | **rejected (principle)** |
+
+The guiding insight: the cheapest, lowest-risk mitigation is to **not serve an
+infinite-URL surface that has no human audience.** Everything past Tier 1 is
+about defending the surface we *do* want public, in-house, without ceding
+control of our traffic to a third party.
--- a/docs/tutorials/expose-service-publicly.md
+++ b/docs/tutorials/expose-service-publicly.md
@@ -376,6 +376,13 @@ Mitigations for dynamic services:
 - fail2ban on indri (see below) can block IPs showing abuse patterns
 - The break-glass shutoff remains the last resort

+The most acute version of this in practice has been **AI scrapers**, which
+ignore `robots.txt` and crawl dynamic services (notably [[forgejo|Forgejo]]'s
+infinite git-history URL space) into both a surprise egress bill and an
+effective L7 DoS. See [[ai-scraper-mitigation]] for the incident, the tiered
+defense (mirror black-hole, user-agent denylist, Anubis proof-of-work), and
+why a Cloudflare Tunnel is *not* the chosen answer here.
+
 If a publicly exposed dynamic service attracts targeted attacks or the
 home network bandwidth is impacted, consider migrating to Cloudflare
 Tunnel for enterprise-grade DDoS protection (requires DNS migration;