C0: black-hole /mirrors/* at Fly edge + name-and-shame scrapers
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 35s
A $29.60 Fly bill traced to ~1.25 TB/30d egress on forge.eblu.me (99.95% of all proxy egress), ~71% of it AI scrapers (Meta meta-externalagent, OpenAI GPTBot, Amazonbot, Bytespider) crawling the public mirror repos' infinite git-history URL space and timing out Forgejo.

robots.txt already disallowed /mirrors/ but those agents ignore it, so enforce at the edge: return 403 (^~ to beat the regex asset locations), served as a roll-of-dishonour page with an X-Naughty-Scrapers header. Mirrors stay reachable on the tailnet via forge.ops.eblu.me.

Tier 2 (UA denylist + Anubis) and the Cloudflare rejection are documented in docs/explanation/ai-scraper-mitigation.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
parent
e0064de83d
commit
a36a18aaa6
7 changed files with 302 additions and 0 deletions
1
docs/changelog.d/+ai-scraper-mitigation-doc.doc.md
Normal file
|
|
@@ -0,0 +1 @@
|
|||
Add `docs/explanation/ai-scraper-mitigation.md` — the egress-cost / AI-crawler threat model for the public Fly proxy, the tiered mitigation plan (Tier 1: mirror black-hole, shipped; Tier 2: user-agent denylist + Anubis; Tier 3: Cloudflare, rejected on principle), and the data behind it.
|
||||
1
docs/changelog.d/+forge-mirrors-blackhole.infra.md
Normal file
|
|
@@ -0,0 +1 @@
|
|||
Black-hole the `/mirrors/*` repositories at the Fly proxy edge (`return 403` → `forge.ops.eblu.me`). A surprise $29.60 Fly bill traced to ~1.25 TB/30d of egress on `forge.eblu.me`, 99.95% of all proxy egress — of which ~71% was AI scrapers (Meta `meta-externalagent`, OpenAI `GPTBot`, Amazonbot, Bytespider) crawling the near-infinite git-history URL space of the public mirror repos and timing out Forgejo in the process. Mirrors exist for supply-chain control and are consumed over the tailnet, so their public web UI had no legitimate audience. `robots.txt` already disallowed `/mirrors/`, but the offending agents ignore it. Tier-2 mitigations (user-agent denylist, Anubis proof-of-work gateway) are documented in `docs/explanation/ai-scraper-mitigation.md`.
|
||||
201
docs/explanation/ai-scraper-mitigation.md
Normal file
|
|
@@ -0,0 +1,201 @@
|
|||
---
|
||||
title: AI Scraper Mitigation
|
||||
modified: 2026-06-01
|
||||
last-reviewed: 2026-06-01
|
||||
tags:
|
||||
- explanation
|
||||
- fly-io
|
||||
- forgejo
|
||||
- security
|
||||
- networking
|
||||
---
|
||||
|
||||
# AI Scraper Mitigation on the Public Proxy
|
||||
|
||||
> **Note:** This article was drafted by AI and reviewed by Erich. I plan to rewrite all explanatory content in my own words — these serve as placeholders to establish the documentation structure.
|
||||
|
||||
How BlumeOps keeps AI crawlers from running up the [[expose-service-publicly|Fly.io proxy]] egress bill and DoS-ing [[forgejo|Forgejo]] on [[indri]].
|
||||
|
||||
## The incident
|
||||
|
||||
A $29.60 Fly.io invoice arrived, about two-thirds of it a single line:
|
||||
|
||||
```
|
||||
Bandwidth: Egress (iad) — 958,524,714,138 bytes — $19.17
|
||||
```
|
||||
|
||||
The `iad` (Ashburn) region is a red herring: the proxy machine runs in `sjc`,
|
||||
but Fly bills egress at the edge PoP nearest the *client*, so `iad` just means
|
||||
"the traffic went to clients on the US East Coast."
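As a sanity check on that invoice line, the arithmetic works out if the byte count is read in decimal gigabytes and the rate is taken to be $0.02/GB. That rate is inferred from the two numbers on the line, not quoted from Fly's price list, so treat it as an assumption:

$$
\frac{958{,}524{,}714{,}138\ \text{bytes}}{10^{9}\ \text{bytes/GB}} \approx 958.52\ \text{GB},
\qquad
958.52\ \text{GB} \times \$0.02/\text{GB} \approx \$19.17
$$

Read as binary GiB, the same bytes come to about 892.7 GiB, which would give roughly $17.85 at the same rate, so the decimal reading is the one that matches the invoice.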
|
||||
|
||||
Tracing it through the nginx access logs (shipped to Loki via [[alloy|Alloy]]):
|
||||
|
||||
| Signal | Value |
|
||||
|--------|-------|
|
||||
| Total proxy egress (30d) | ~1.25 TB |
|
||||
| Share that was `forge.eblu.me` | **99.95%** |
|
||||
| Share of forge egress that was `/mirrors/*` | **~71%** |
|
||||
| Share that was declared AI bots | **~85%+** |
|
||||
| Top offenders | Meta `meta-externalagent` (66% of bytes), OpenAI `GPTBot` (16%), Amazonbot, Bytespider |
|
||||
| Forgejo `5xx` (upstream timeouts) | tens of thousands/day, spiking to 112k |
|
||||
|
||||
The crawlers were walking [[forgejo|Forgejo]]'s git-history browse endpoints —
|
||||
`src/commit/<sha>`, `commits/`, `blame/`, `raw/commit/`, plus `.patch`/`.diff`
|
||||
and `?page=N` pagination. That URL space is effectively **infinite**: every
|
||||
file × every commit × every page, multiplied across every mirrored repo. A
|
||||
crawler that follows links never finishes, and every page is a cache `MISS`
|
||||
that both tunnels to indri *and* bills as egress.
|
||||
|
||||
Two distinct harms, not one:
|
||||
|
||||
1. **Cost** — ~1.25 TB/mo of egress on a free-tier-ish proxy.
|
||||
2. **Availability** — the crawl alone generates ~400–530k requests/day,
|
||||
enough to time out Forgejo regardless of how much RAM [[indri]] has. Moving
|
||||
egress elsewhere would *not* fix this; the crawl has to be throttled at the
|
||||
source.
|
||||
|
||||
`robots.txt` already `Disallow`s `/mirrors/`, `/user/`, and archive/download
|
||||
paths — but **`meta-externalagent` and `GPTBot` ignore it.** For these agents,
|
||||
`robots.txt` is a dead letter, which is why edge enforcement is required.
|
||||
|
||||
## The tiered plan
|
||||
|
||||
### Tier 1 — Black-hole `/mirrors/*` (shipped)
|
||||
|
||||
The mirror repositories (`tailscale`, `prometheus`, `mealie`, `paperless-ngx`,
|
||||
…) are mirrors of *already-public upstreams*, kept for supply-chain control
|
||||
(see [[spork-strategy]] and the container/mirror story in [[why-gitops]]). They
|
||||
are consumed by CI, gilbert, and other tailnet clients over
|
||||
`forge.ops.eblu.me`. Their web UI on the public internet served **no
|
||||
legitimate audience** — only scrapers. So the proxy now returns `403` for
|
||||
anything under `/mirrors/`, pointing humans at the tailnet host:
|
||||
|
||||
```nginx
|
||||
location ^~ /mirrors/ {
|
||||
return 403 "Mirror repositories are tailnet-only — use forge.ops.eblu.me.\n";
|
||||
}
|
||||
```
|
||||
|
||||
The `^~` modifier matters: without it, the regex `location` blocks for static
|
||||
assets (`*.css`, `*.js`, release downloads) would match first and leak content
|
||||
under `/mirrors/`. `^~` tells nginx to stop at the prefix match and skip the
|
||||
regex round.
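To make the matching order concrete, here is a minimal sketch of the two kinds of `location` side by side. The asset regex and the upstream name are hypothetical stand-ins; the actual patterns in the proxy config may differ.

```nginx
# Fragment for the forge.eblu.me server block (sketch only).

# Hypothetical static-asset location; the real pattern may differ.
location ~* \.(css|js|png|svg|woff2)$ {
    proxy_pass http://forgejo_upstream;   # assumed upstream name
}

# nginx first finds the longest matching prefix location. If that prefix
# is a plain one, it still evaluates the regex locations afterwards, so a
# request for /mirrors/foo/assets/app.css would land in the block above.
# With ^~ on the winning prefix, the regex check is skipped entirely.
location ^~ /mirrors/ {
    return 403;
}
```

With a plain `location /mirrors/`, only the non-asset URLs under `/mirrors/` would be blocked; with `^~`, everything under the prefix is.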
|
||||
|
||||
This is config, not bot-fighting — we simply stopped serving an infinite
|
||||
tarpit to the world. It removes ~71% of forge egress and a large share of the
|
||||
upstream timeouts, with zero impact on any human or tailnet consumer. It
|
||||
mirrors the existing tailnet-only blocks for `/api/packages/` and `/swagger`.
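For a rough sense of scale, the ~71% figure can be applied to the 30-day total from the incident table. The $0.02/GB rate below is an assumption inferred from the invoice line (Fly's rates vary by region), so this is an order-of-magnitude estimate only:

$$
1250\ \text{GB} \times 0.71 \approx 888\ \text{GB removed},
\qquad
(1250 - 888)\ \text{GB} \times \$0.02/\text{GB} \approx \$7.24\ \text{per 30 days remaining}
$$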
|
||||
|
||||
The `403` is also a small act of public shaming. Blocked requests are served a
|
||||
"roll of dishonour" page (`fly/naughty.html`, status kept at `403` via
|
||||
`error_page 403 /naughty.html`) that names the offending operators and their
|
||||
share of the stolen bytes, and every response carries an `X-Naughty-Scrapers`
|
||||
header:
|
||||
|
||||
```
|
||||
X-Naughty-Scrapers: OpenAI/GPTBot, Meta/meta-externalagent, Amazonbot, ByteDance/Bytespider — robots.txt ignorers
|
||||
```
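The page-plus-header arrangement described above could be wired roughly as follows. This is a sketch, not the deployed config: the `root` path and the shortened header value are assumptions.

```nginx
# Sketch for the forge.eblu.me server block. Paths and header text are assumptions.

# No "=code" argument, so the client still sees 403 rather than 200.
error_page 403 /naughty.html;

location = /naughty.html {
    internal;                      # reachable only through error_page
    root /usr/share/nginx/html;    # assumed install path for fly/naughty.html
    # "always" is needed: without it add_header skips 4xx responses.
    add_header X-Naughty-Scrapers "OpenAI/GPTBot, Meta/meta-externalagent, Amazonbot, ByteDance/Bytespider" always;
}

location ^~ /mirrors/ {
    return 403;                    # bare status, so error_page supplies the body
}
```

A bare `return 403;` is used in this sketch because, as far as I know, a `return` with an inline text body is sent as-is and does not go through `error_page`; the deployed config may combine the two differently.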
|
||||
|
||||
Petty? A little. But it costs nothing, documents *why* the block exists for the
|
||||
next person who hits it, and the page is a few KB versus the megabytes of git
|
||||
HTML the crawlers were taking.
|
||||
|
||||
**Trade-off accepted:** mirror release-artifact downloads over WAN now also
|
||||
`403`. Legitimate consumers already pull these over the tailnet, and the public
|
||||
exposure was the same crawl liability, so this is intentional.
|
||||
|
||||
### Tier 2 — Defend the repos that *stay* public (planned)
|
||||
|
||||
`/eblume/*` is intentionally public (a public profile is a feature). But the
|
||||
same git-history endpoints are still a tarpit there, just lower-volume. Two
|
||||
layers, in increasing order of effort and effectiveness:
|
||||
|
||||
#### 2a. User-agent denylist (cheap, evadable)
|
||||
|
||||
Block the declared AI crawlers at the edge regardless of path:
|
||||
|
||||
```nginx
|
||||
# Illustrative — not yet deployed.
|
||||
map $http_user_agent $is_ai_bot {
|
||||
default 0;
|
||||
"~*meta-externalagent" 1;
|
||||
"~*GPTBot" 1;
|
||||
"~*ClaudeBot" 1;
|
||||
"~*Amazonbot" 1;
|
||||
"~*Bytespider" 1;
|
||||
"~*SemrushBot" 1;
|
||||
}
|
||||
# in the forge.eblu.me server block:
|
||||
if ($is_ai_bot) { return 403; }
|
||||
```
|
||||
|
||||
This catches ~85% of *current* traffic for a few lines of config. It is
|
||||
trivially evadable — a scraper need only spoof a browser UA — so it is a
|
||||
speed-bump, not a wall. Keep `robots.txt` too: well-behaved crawlers
|
||||
(Googlebot, Bingbot) do honor it, and it documents intent.
|
||||
|
||||
#### 2b. Anubis proof-of-work gateway (the real wall)
|
||||
|
||||
[Anubis](https://github.com/TecharoHQ/anubis) is a Go reverse proxy that
|
||||
weighs each request with a browser-based proof-of-work challenge before passing
|
||||
it upstream. It was written for *exactly this scenario* — its author built it
|
||||
after Amazon's scraper took down their Git server — and is widely deployed in
|
||||
front of Forgejo/Gitea (Codeberg, the UN, etc.). Headless scrapers that can't
|
||||
run the challenge JS never reach the application; humans clear it once and
|
||||
proceed.
|
||||
|
||||
Why it fits BlumeOps better than the alternatives:
|
||||
|
||||
- **It attacks cost *and* availability at once.** Bots receive a few-KB
|
||||
challenge page instead of MB of git HTML (egress collapses) and never reach
|
||||
Forgejo (timeouts collapse). No other single lever does both.
|
||||
- **It stays in-house.** No third party terminates our TLS or sees our
|
||||
traffic.
|
||||
|
||||
Placement options:
|
||||
|
||||
| Where | Pros | Cons |
|
||||
|-------|------|------|
|
||||
| On [[indri]], between [[caddy|Caddy]] and Forgejo | Protects every path and every entry (WAN *and* tailnet); one config | Adds a hop and a service to the indri critical path; the challenge page still tunnels back through Fly for WAN clients (small egress) |
|
||||
| On the Fly proxy machine, in front of nginx | Challenge served at the edge — bots never even tunnel to indri | Fly VM is small (512 MB); another moving part in the boot sequence alongside `tailscaled`/nginx/`fail2ban`/Alloy |
|
||||
|
||||
Leaning toward Caddy-side on indri for simplicity and uniform coverage, but
|
||||
this is the open design question for Tier 2. Anubis is MIT-licensed and the
|
||||
author has signalled a future move to an `equi-x`-based challenge, so pin a
|
||||
version and track upstream.
|
||||
|
||||
### Tier 3 — Move egress off Fly entirely (rejected)
|
||||
|
||||
A Cloudflare Tunnel (`cloudflared` on indri → Cloudflare
|
||||
edge) would make this a non-problem on the cost axis: Cloudflare does not meter
|
||||
proxied bandwidth, and it bundles free AI-bot mitigation (Bot Fight Mode, the
|
||||
"block AI scrapers" toggle, Managed Challenge, AI Labyrinth). One move would
|
||||
zero the egress bill and add bot defense.
|
||||
|
||||
**We are not doing this, on principle.** Cloudflare is a solid platform and a
|
||||
defensible engineering choice — but it already sits in front of an enormous
|
||||
fraction of the modern web, and routing BlumeOps through it would add one more
|
||||
site to the pile of the internet that one company can see and gate. BlumeOps
|
||||
deliberately keeps its own backbone ([[expose-service-publicly|Fly + Tailscale
|
||||
+ Caddy]], DNS at [[gandi|Gandi]] — see the "no Cloudflare dependency" line in
|
||||
that doc). This is a values decision, not a technical one: we would rather pay
|
||||
a few dollars and run our own mitigation than centralize on Cloudflare.
|
||||
|
||||
It is also worth noting that **Tier 3 would not, by itself, fix the upstream
|
||||
timeouts** — free egress just means we'd stop *caring* that bots crawl, while
|
||||
they continued to hammer Forgejo. Crawl mitigation (Tier 1 + Tier 2) is
|
||||
required regardless of where egress is billed.
|
||||
|
||||
## Summary
|
||||
|
||||
| Tier | Lever | Cost | Availability | Status |
|
||||
|------|-------|------|--------------|--------|
|
||||
| 1 | Black-hole `/mirrors/*` at edge | −~71% | big drop | **shipped** |
|
||||
| 2a | UA denylist on remaining repos | −most of the rest | further drop | planned |
|
||||
| 2b | Anubis PoW gateway | −near-total | near-total | planned |
|
||||
| 3 | Cloudflare Tunnel | −total | needs 2b anyway | **rejected (principle)** |
|
||||
|
||||
The guiding insight: the cheapest, lowest-risk mitigation is to **not serve an
|
||||
infinite-URL surface that has no human audience.** Everything past Tier 1 is
|
||||
about defending the surface we *do* want public, in-house, without ceding
|
||||
control of our traffic to a third party.
|
||||
|
|
@@ -376,6 +376,13 @@ Mitigations for dynamic services:
|
|||
- fail2ban on indri (see below) can block IPs showing abuse patterns
|
||||
- The break-glass shutoff remains the last resort
|
||||
|
||||
The most acute version of this in practice has been **AI scrapers**, which
|
||||
ignore `robots.txt` and crawl dynamic services (notably [[forgejo|Forgejo]]'s
|
||||
infinite git-history URL space) into both a surprise egress bill and an
|
||||
effective L7 DoS. See [[ai-scraper-mitigation]] for the incident, the tiered
|
||||
defense (mirror black-hole, user-agent denylist, Anubis proof-of-work), and
|
||||
why a Cloudflare Tunnel is *not* the chosen answer here.
|
||||
|
||||
If a publicly exposed dynamic service attracts targeted attacks or the
|
||||
home network bandwidth is impacted, consider migrating to Cloudflare
|
||||
Tunnel for enterprise-grade DDoS protection (requires DNS migration;
|
||||
|
|
|
|||