C0: black-hole /mirrors/* at Fly edge + name-and-shame scrapers

A $29.60 Fly bill traced to ~1.25 TB/30d egress on forge.eblu.me (99.95% of
all proxy egress), ~71% of it AI scrapers (Meta meta-externalagent, OpenAI
GPTBot, Amazonbot, Bytespider) crawling the public mirror repos' infinite
git-history URL space and timing out Forgejo. robots.txt already disallowed
/mirrors/ but those agents ignore it, so enforce at the edge: return 403 (^~
to beat the regex asset locations), served as a roll-of-dishonour page with an
X-Naughty-Scrapers header. Mirrors stay reachable on the tailnet via
forge.ops.eblu.me. Tier 2 (UA denylist + Anubis) and the Cloudflare rejection
are documented in docs/explanation/ai-scraper-mitigation.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Erich Blume 2026-06-01 20:52:20 -07:00
commit a36a18aaa6
7 changed files with 302 additions and 0 deletions

@@ -0,0 +1 @@
Add `docs/explanation/ai-scraper-mitigation.md` — the egress-cost / AI-crawler threat model for the public Fly proxy, the tiered mitigation plan (Tier 1: mirror black-hole, shipped; Tier 2: user-agent denylist + Anubis; Tier 3: Cloudflare, rejected on principle), and the data behind it.

@@ -0,0 +1 @@
Black-hole the `/mirrors/*` repositories at the Fly proxy edge (`return 403`; mirrors stay reachable on the tailnet via `forge.ops.eblu.me`). A surprise $29.60 Fly bill traced to ~1.25 TB/30d of egress on `forge.eblu.me` (99.95% of all proxy egress), of which ~71% was AI scrapers (Meta `meta-externalagent`, OpenAI `GPTBot`, Amazonbot) crawling the near-infinite git-history URL space of the public mirror repos and timing out Forgejo in the process. Mirrors exist for supply-chain control and are consumed over the tailnet, so their public web UI had no legitimate audience. `robots.txt` already disallowed `/mirrors/`, but the offending agents ignore it. Tier-2 mitigations (user-agent denylist, Anubis proof-of-work gateway) are documented in `docs/explanation/ai-scraper-mitigation.md`.

@@ -0,0 +1,201 @@
---
title: AI Scraper Mitigation
modified: 2026-06-01
last-reviewed: 2026-06-01
tags:
- explanation
- fly-io
- forgejo
- security
- networking
---
# AI Scraper Mitigation on the Public Proxy
> **Note:** This article was drafted by AI and reviewed by Erich. I plan to rewrite all explanatory content in my own words — these serve as placeholders to establish the documentation structure.
How BlumeOps keeps AI crawlers from running up the [[expose-service-publicly|Fly.io proxy]] egress bill and DoS-ing [[forgejo|Forgejo]] on [[indri]].
## The incident
A $29.60 Fly.io invoice arrived, most of it a single line:
```
Bandwidth: Egress (iad) — 958,524,714,138 bytes — $19.17
```
The `iad` (Ashburn) region is a red herring: the proxy machine runs in `sjc`,
but Fly bills egress at the edge PoP nearest the *client*, so `iad` just means
"the traffic went to clients on the US East Coast."
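As a quick sanity check on that line item, the byte count and the charge together imply a per-GB rate. The sketch below uses only the two figures from the invoice line; treating a GB as 10^9 bytes is an assumption about how the bill is denominated, and the function name is made up for illustration.

```python
# Figures copied from the invoice line above.
EGRESS_BYTES = 958_524_714_138
EGRESS_CHARGE_USD = 19.17

# Assumption: the bill counts decimal gigabytes (10**9 bytes), not GiB.
BYTES_PER_GB = 10**9


def implied_rate_per_gb(total_bytes: int, charge_usd: float) -> float:
    """Return the per-GB price implied by a byte count and a charge."""
    return charge_usd / (total_bytes / BYTES_PER_GB)


egress_gb = EGRESS_BYTES / BYTES_PER_GB
rate = implied_rate_per_gb(EGRESS_BYTES, EGRESS_CHARGE_USD)
print(f"{egress_gb:,.1f} GB billed, implied rate ${rate:.3f}/GB")
```

Under that assumption the line works out to roughly $0.02 per GB, so every additional terabyte a crawler pulls costs on the order of $20.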
Tracing it through the nginx access logs (shipped to Loki via [[alloy|Alloy]]):
| Signal | Value |
|--------|-------|
| Total proxy egress (30d) | ~1.25 TB |
| Share that was `forge.eblu.me` | **99.95%** |
| Share of forge egress that was `/mirrors/*` | **~71%** |
| Share that was declared AI bots | **~85%+** |
| Top offenders | Meta `meta-externalagent` (66% of bytes), OpenAI `GPTBot` (16%), Amazonbot, Bytespider |
| Forgejo `5xx` (upstream timeouts) | tens of thousands/day, spiking to 112k |
The crawlers were walking [[forgejo|Forgejo]]'s git-history browse endpoints —
`src/commit/<sha>`, `commits/`, `blame/`, `raw/commit/`, plus `.patch`/`.diff`
and `?page=N` pagination. That URL space is effectively **infinite**: every
file × every commit × every page, multiplied across every mirrored repo. A
crawler that follows links never finishes, and every page is a cache `MISS`
that both tunnels to indri *and* bills as egress.
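To make the "effectively infinite" claim concrete, the sketch below counts history-browse URLs for a single repository. The file and commit counts are hypothetical round numbers, not measurements from the mirrors, and only three of the per-file views named above are counted, so this is a deliberately low estimate.

```python
from dataclasses import dataclass

# Per-(file, commit) browse views named above. The .patch/.diff variants and
# ?page=N pagination add more on top and are left out to keep the bound low.
PER_FILE_VIEWS = ("src/commit", "blame", "raw/commit")


@dataclass(frozen=True)
class Repo:
    files: int    # hypothetical file count
    commits: int  # hypothetical commit count


def browse_url_count(repo: Repo) -> int:
    """Lower bound on distinct history URLs a link-following crawler can reach."""
    return repo.files * repo.commits * len(PER_FILE_VIEWS)


example = Repo(files=2_000, commits=10_000)  # illustrative only
print(f"{browse_url_count(example):,} URLs for one repository")
```

Even with these modest made-up inputs a single repository yields tens of millions of distinct URLs, and the total grows with every mirrored repo and every new upstream commit.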
Two distinct harms, not one:
1. **Cost**: ~1.25 TB/mo of egress on a free-tier-ish proxy.
2. **Availability**: the crawl alone generates roughly 400k to 530k
requests/day, enough to time out Forgejo regardless of how much RAM
[[indri]] has. Moving egress elsewhere would *not* fix this; the crawl has
to be throttled at the source.
`robots.txt` already `Disallow`s `/mirrors/`, `/user/`, and archive/download
paths — but **`meta-externalagent` and `GPTBot` ignore it.** For these agents,
`robots.txt` is a dead letter, which is why edge enforcement is required.
## The tiered plan
### Tier 1 — Black-hole `/mirrors/*` (shipped)
The mirror repositories (`tailscale`, `prometheus`, `mealie`, `paperless-ngx`,
…) are mirrors of *already-public upstreams*, kept for supply-chain control
(see [[spork-strategy]] and the container/mirror story in [[why-gitops]]). They
are consumed by CI, gilbert, and other tailnet clients over
`forge.ops.eblu.me`. Their web UI on the public internet served **no
legitimate audience** — only scrapers. So the proxy now returns `403` for
anything under `/mirrors/`, pointing humans at the tailnet host:
```nginx
location ^~ /mirrors/ {
return 403 "Mirror repositories are tailnet-only — use forge.ops.eblu.me.\n";
}
```
The `^~` modifier matters: without it, nginx would go on to evaluate the regex
`location` blocks for static assets (`*.css`, `*.js`, release downloads), and
any regex that matched would take precedence over the plain prefix match and
leak content under `/mirrors/`. `^~` tells nginx to stop at the prefix match
and skip the regex round.
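The selection order can be modelled in a few lines. This is a simplified sketch of nginx's documented location matching (exact `=` matches are omitted), not the proxy's real config: the asset regex, the location names, and the sample path are all hypothetical.

```python
from __future__ import annotations

import re
from typing import NamedTuple, Optional


class Location(NamedTuple):
    modifier: str  # "" (plain prefix), "^~" (prefix, skip regex), or "~*" (regex)
    pattern: str
    name: str


def select_location(path: str, locations: list[Location]) -> Optional[str]:
    """Simplified model of nginx location selection (exact "=" matches omitted)."""
    # Step 1: remember the longest matching prefix location.
    prefixes = [
        loc for loc in locations
        if loc.modifier in ("", "^~") and path.startswith(loc.pattern)
    ]
    best = max(prefixes, key=lambda loc: len(loc.pattern), default=None)
    # Step 2: "^~" on the best prefix ends the search before any regex runs.
    if best is not None and best.modifier == "^~":
        return best.name
    # Step 3: otherwise regex locations are tried in config order; first match wins.
    for loc in locations:
        if loc.modifier == "~*" and re.search(loc.pattern, path, re.IGNORECASE):
            return loc.name
    # Step 4: no regex matched, so fall back to the remembered prefix.
    return best.name if best is not None else None


ASSETS = Location("~*", r"\.(css|js)$", "static-assets")  # hypothetical asset rule
path = "/mirrors/prometheus/assets/app.css"               # hypothetical URL

with_modifier = [Location("^~", "/mirrors/", "mirrors-403"), Location("", "/", "forgejo"), ASSETS]
without_modifier = [Location("", "/mirrors/", "mirrors-403"), Location("", "/", "forgejo"), ASSETS]

print(select_location(path, with_modifier))     # mirrors-403
print(select_location(path, without_modifier))  # static-assets
```

The only difference between the two lists is the modifier on `/mirrors/`, which is enough to flip the result for an asset path under a mirror.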
This is config, not bot-fighting — we simply stopped serving an infinite
tarpit to the world. It removes ~71% of forge egress and a large share of the
upstream timeouts, with zero impact on any human or tailnet consumer. It
mirrors the existing tailnet-only blocks for `/api/packages/` and `/swagger`.
The `403` is also a small act of public shaming. Blocked requests are served a
"roll of dishonour" page (`fly/naughty.html`, status kept at `403` via
`error_page 403 /naughty.html`) that names the offending operators and their
share of the stolen bytes, and every response carries an `X-Naughty-Scrapers`
header:
```
X-Naughty-Scrapers: OpenAI/GPTBot, Meta/meta-externalagent, Amazonbot, ByteDance/Bytespider — robots.txt ignorers
```
Petty? A little. But it costs nothing, documents *why* the block exists for the
next person who hits it, and the page is a few KB versus the megabytes of git
HTML the crawlers were taking.
**Trade-off accepted:** mirror release-artifact downloads over WAN now also
`403`. Legitimate consumers already pull these over the tailnet, and the public
exposure was the same crawl liability, so this is intentional.
### Tier 2 — Defend the repos that *stay* public (planned)
`/eblume/*` is intentionally public (a public profile is a feature). But the
same git-history endpoints are still a tarpit there, just lower-volume. Two
layers, in increasing order of effort and effectiveness:
#### 2a. User-agent denylist (cheap, evadable)
Block the declared AI crawlers at the edge regardless of path:
```nginx
# Illustrative — not yet deployed.
map $http_user_agent $is_ai_bot {
    default                 0;
    "~*meta-externalagent"  1;
    "~*GPTBot"              1;
    "~*ClaudeBot"           1;
    "~*Amazonbot"           1;
    "~*Bytespider"          1;
    "~*SemrushBot"          1;
}
# in the forge.eblu.me server block:
if ($is_ai_bot) { return 403; }
```
This catches ~85% of *current* traffic for a few lines of config. It is
trivially evadable — a scraper need only spoof a browser UA — so it is a
speed-bump, not a wall. Keep `robots.txt` too: well-behaved crawlers
(Googlebot, Bingbot) do honor it, and it documents intent.
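The matching logic, and why it is only a speed-bump, can be sketched outside nginx. The token list is copied from the illustrative `map` above; the two sample header values are made up for illustration rather than taken from the access logs.

```python
import re

# Same tokens as the illustrative nginx map above. "~*" there means a
# case-insensitive regex match, hence re.IGNORECASE here.
AI_BOT_TOKENS = (
    "meta-externalagent",
    "GPTBot",
    "ClaudeBot",
    "Amazonbot",
    "Bytespider",
    "SemrushBot",
)
AI_BOT_RE = re.compile("|".join(re.escape(t) for t in AI_BOT_TOKENS), re.IGNORECASE)


def is_ai_bot(user_agent: str) -> bool:
    """Return True when the User-Agent declares one of the listed crawlers."""
    return AI_BOT_RE.search(user_agent) is not None


# Made-up sample values: one declared crawler, one browser-looking string.
declared = "Mozilla/5.0 (compatible; gptbot/1.0)"
spoofed = "Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0"

print(is_ai_bot(declared))  # True
print(is_ai_bot(spoofed))   # False
```

The second case is the evasion path: a crawler that sends a browser-like string passes untouched, because the check can only see what the client chooses to declare.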
#### 2b. Anubis proof-of-work gateway (the real wall)
[Anubis](https://github.com/TecharoHQ/anubis) is a Go reverse proxy that
weighs each request with a browser-based proof-of-work challenge before passing
it upstream. It was written for *exactly this scenario* — its author built it
after Amazon's scraper took down their Git server — and is widely deployed in
front of Forgejo/Gitea (Codeberg, the UN, etc.). Headless scrapers that can't
run the challenge JS never reach the application; humans clear it once and
proceed.
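The asymmetry that makes proof-of-work useful here is that solving is expensive while checking is a single hash. The following is a generic hashcash-style sketch of that idea, not Anubis's actual protocol, parameters, or code; the challenge string, the difficulty value, and the function names are all hypothetical.

```python
import hashlib
from itertools import count


def meets_difficulty(challenge: str, nonce: int, difficulty: int) -> bool:
    """Cheap server-side check: one hash, then compare the leading hex digits."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)


def solve(challenge: str, difficulty: int) -> int:
    """Expensive client-side search: try nonces until one passes the check."""
    for nonce in count():
        if meets_difficulty(challenge, nonce, difficulty):
            return nonce


# Hypothetical challenge; a real server would issue a random per-client value.
challenge = "example-challenge"
nonce = solve(challenge, difficulty=3)
print(meets_difficulty(challenge, nonce, 3))  # True
```

Each extra hex digit of difficulty multiplies the expected client work by 16 while the server's check stays at one hash, which is what lets a small gateway price out a crawler issuing hundreds of thousands of requests a day.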
Why it fits BlumeOps better than the alternatives:
- **It attacks cost *and* availability at once.** Bots receive a few-KB
challenge page instead of MB of git HTML (egress collapses) and never reach
Forgejo (timeouts collapse). No other single lever does both.
- **It stays in-house.** No third party terminates our TLS or sees our
traffic.
Placement options:
| Where | Pros | Cons |
|-------|------|------|
| On [[indri]], between [[caddy|Caddy]] and Forgejo | Protects every path and every entry (WAN *and* tailnet); one config | Adds a hop and a service to the indri critical path; the challenge page still tunnels back through Fly for WAN clients (small egress) |
| On the Fly proxy machine, in front of nginx | Challenge served at the edge — bots never even tunnel to indri | Fly VM is small (512 MB); another moving part in the boot sequence alongside `tailscaled`/nginx/`fail2ban`/Alloy |
Leaning toward Caddy-side on indri for simplicity and uniform coverage, but
this is the open design question for Tier 2. Anubis is MIT-licensed and the
author has signalled a future move to an `equi-x`-based challenge, so pin a
version and track upstream.
### Tier 3 — Move egress off Fly entirely (rejected)
A Cloudflare Tunnel (`cloudflared` on indri → Cloudflare
edge) would make this a non-problem on the cost axis: Cloudflare does not meter
proxied bandwidth, and it bundles free AI-bot mitigation (Bot Fight Mode, the
"block AI scrapers" toggle, Managed Challenge, AI Labyrinth). One move would
zero the egress bill and add bot defense.
**We are not doing this, on principle.** Cloudflare is a solid platform and a
defensible engineering choice — but it already sits in front of an enormous
fraction of the modern web, and routing BlumeOps through it would add one more
site to the pile of the internet that one company can see and gate. BlumeOps
deliberately keeps its own backbone ([[expose-service-publicly|Fly + Tailscale
+ Caddy]], DNS at [[gandi|Gandi]] — see the "no Cloudflare dependency" line in
that doc). This is a values decision, not a technical one: we would rather pay
a few dollars and run our own mitigation than centralize on Cloudflare.
It is also worth noting that **Tier 3 would not, by itself, fix the upstream
timeouts** — free egress just means we'd stop *caring* that bots crawl, while
they continued to hammer Forgejo. Crawl mitigation (Tier 1 + Tier 2) is
required regardless of where egress is billed.
## Summary
| Tier | Lever | Egress cost removed | Availability improvement | Status |
|------|-------|------|--------------|--------|
| 1 | Black-hole `/mirrors/*` at edge | ~71% | big drop | **shipped** |
| 2a | UA denylist on remaining repos | most of the rest | further drop | planned |
| 2b | Anubis PoW gateway | near-total | near-total | planned |
| 3 | Cloudflare Tunnel | total | needs 2b anyway | **rejected (principle)** |
The guiding insight: the cheapest, lowest-risk mitigation is to **not serve an
infinite-URL surface that has no human audience.** Everything past Tier 1 is
about defending the surface we *do* want public, in-house, without ceding
control of our traffic to a third party.

@@ -376,6 +376,13 @@ Mitigations for dynamic services:
- fail2ban on indri (see below) can block IPs showing abuse patterns
- The break-glass shutoff remains the last resort
The most acute version of this in practice has been **AI scrapers**, which
ignore `robots.txt` and crawl dynamic services (notably [[forgejo|Forgejo]]'s
infinite git-history URL space) into both a surprise egress bill and an
effective L7 DoS. See [[ai-scraper-mitigation]] for the incident, the tiered
defense (mirror black-hole, user-agent denylist, Anubis proof-of-work), and
why a Cloudflare Tunnel is *not* the chosen answer here.
If a publicly exposed dynamic service attracts targeted attacks or the
home network bandwidth is impacted, consider migrating to Cloudflare
Tunnel for enterprise-grade DDoS protection (requires DNS migration;