blumeops/docs/explanation/ai-scraper-mitigation.md
Erich Blume a36a18aaa6
C0: black-hole /mirrors/* at Fly edge + name-and-shame scrapers
A $29.60 Fly bill traced to ~1.25 TB/30d egress on forge.eblu.me (99.95% of
all proxy egress), ~71% of it AI scrapers (Meta meta-externalagent, OpenAI
GPTBot, Amazonbot, Bytespider) crawling the public mirror repos' infinite
git-history URL space and timing out Forgejo. robots.txt already disallowed
/mirrors/ but those agents ignore it, so enforce at the edge: return 403 (^~
to beat the regex asset locations), served as a roll-of-dishonour page with an
X-Naughty-Scrapers header. Mirrors stay reachable on the tailnet via
forge.ops.eblu.me. Tier 2 (UA denylist + Anubis) and the Cloudflare rejection
are documented in docs/explanation/ai-scraper-mitigation.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-01 20:52:20 -07:00


title: AI Scraper Mitigation
modified: 2026-06-01
last-reviewed: 2026-06-01
tags: explanation, fly-io, forgejo, security, networking

AI Scraper Mitigation on the Public Proxy

Note: This article was drafted by AI and reviewed by Erich. I plan to rewrite all explanatory content in my own words — these serve as placeholders to establish the documentation structure.

How BlumeOps keeps AI crawlers from running up the expose-service-publicly egress bill and DoS-ing forgejo on indri.

The incident

A $29.60 Fly.io invoice arrived, nearly all of it a single line:

Bandwidth: Egress (iad) — 958,524,714,138 bytes — $19.17

The iad (Ashburn) region is a red herring: the proxy machine runs in sjc, but Fly bills egress at the edge PoP nearest the client, so iad just means "the traffic went to clients on the US East Coast."

Tracing it through the nginx access logs (shipped to Loki via alloy):

| Signal | Value |
| --- | --- |
| Total proxy egress (30d) | ~1.25 TB |
| Share that was forge.eblu.me | 99.95% |
| Share of forge egress that was /mirrors/* | ~71% |
| Share that was declared AI bots | ~85%+ |
| Top offenders | Meta meta-externalagent (66% of bytes), OpenAI GPTBot (16%), Amazonbot, Bytespider |
| Forgejo 5xx (upstream timeouts) | tens of thousands/day, spiking to 112k |

The crawlers were walking forgejo's git-history browse endpoints — src/commit/<sha>, commits/, blame/, raw/commit/, plus .patch/.diff and ?page=N pagination. That URL space is effectively infinite: every file × every commit × every page, multiplied across every mirrored repo. A crawler that follows links never finishes, and every page is a cache MISS that both tunnels to indri and bills as egress.

Two distinct harms, not one:

  1. Cost: ~1.25 TB/mo of egress on a free-tier-ish proxy.
  2. Availability: the crawl alone generates ~400k to 530k requests/day, enough to time out Forgejo regardless of how much RAM indri has. Moving egress elsewhere would not fix this; the crawl has to be throttled at the source.

robots.txt already Disallows /mirrors/, /user/, and archive/download paths — but meta-externalagent and GPTBot ignore it. For these agents, robots.txt is a dead letter, which is why edge enforcement is required.

The tiered plan

Tier 1 — Black-hole /mirrors/* (shipped)

The mirror repositories (tailscale, prometheus, mealie, paperless-ngx, …) are mirrors of already-public upstreams, kept for supply-chain control (see spork-strategy and the container/mirror story in why-gitops). They are consumed by CI, gilbert, and other tailnet clients over forge.ops.eblu.me. Their web UI on the public internet served no legitimate audience — only scrapers. So the proxy now returns 403 for anything under /mirrors/, pointing humans at the tailnet host:

location ^~ /mirrors/ {
    return 403 "Mirror repositories are tailnet-only; use forge.ops.eblu.me.\n";
}

The ^~ modifier matters. nginx normally lets a matching regex location override a plain prefix match, so without ^~ the regex location blocks for static assets (*.css, *.js, release downloads) would win and leak content under /mirrors/. ^~ tells nginx to stop at the prefix match and skip the regex round.
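To make the precedence concrete, here is a minimal sketch of the two kinds of location side by side. The asset regex and the upstream name `forgejo_upstream` are hypothetical stand-ins; the real proxy config is not reproduced in this doc.

```nginx
# Hypothetical stand-in for the real static-asset regex locations.
location ~* \.(css|js)$ {
    proxy_pass http://forgejo_upstream;   # assumed upstream name
}

# With a plain prefix, a request such as /mirrors/<repo>/assets/app.css
# would still be handed to the regex location above, because a matching
# regex outranks an ordinary prefix match:
#
#   location /mirrors/ { return 403; }

# With ^~, once /mirrors/ is the longest matching prefix nginx skips the
# regex round entirely, so everything under /mirrors/ gets the 403.
location ^~ /mirrors/ {
    return 403;
}
```

One caveat with this approach: a longer plain-prefix location nested under /mirrors/ would become the longest match and re-open the regex round, so the block should stay the only location covering that path.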

This is config, not bot-fighting — we simply stopped serving an infinite tarpit to the world. It removes ~71% of forge egress and a large share of the upstream timeouts, with zero impact on any human or tailnet consumer. It mirrors the existing tailnet-only blocks for /api/packages/ and /swagger.

The 403 is also a small act of public shaming. Blocked requests are served a "roll of dishonour" page (fly/naughty.html, status kept at 403 via error_page 403 /naughty.html) that names the offending operators and their share of the stolen bytes, and every response carries an X-Naughty-Scrapers header:

X-Naughty-Scrapers: OpenAI/GPTBot, Meta/meta-externalagent, Amazonbot, ByteDance/Bytespider — robots.txt ignorers

Petty? A little. But it costs nothing, documents why the block exists for the next person who hits it, and the page is a few KB versus the megabytes of git HTML the crawlers were taking.
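The shaming page and header described above could be wired roughly as follows. This is a sketch under assumptions, not the deployed config: the `root` path is invented, the header value is shortened, and the `/mirrors/` block uses a bare `return 403;` because, to my understanding, `error_page` only substitutes a page when no inline body is given to `return`.

```nginx
# No "=code" after 403, so the substituted page keeps the 403 status.
error_page 403 /naughty.html;

location ^~ /mirrors/ {
    return 403;                      # bare status, so error_page takes over
}

location = /naughty.html {
    internal;                        # only reachable via the error_page redirect
    root /etc/nginx/html;            # assumed location of fly/naughty.html
    # "always" is needed for the header to be added to a 403 response.
    add_header X-Naughty-Scrapers "OpenAI/GPTBot, Meta/meta-externalagent, Amazonbot, ByteDance/Bytespider" always;
}
```

Marking the page `internal` keeps it from being fetched directly, so it is only ever delivered as the body of a blocked request.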

Trade-off accepted: mirror release-artifact downloads over WAN now also 403. Legitimate consumers already pull these over the tailnet, and the public exposure was the same crawl liability, so this is intentional.

Tier 2 — Defend the repos that stay public (planned)

/eblume/* is intentionally public (a public profile is a feature). But the same git-history endpoints are still a tarpit there, just lower-volume. Two layers, in increasing order of effort and effectiveness:

2a. User-agent denylist (cheap, evadable)

Block the declared AI crawlers at the edge regardless of path:

# Illustrative — not yet deployed.
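# Illustrative; not yet deployed.
# map must be declared at the http level, outside any server block.
map $http_user_agent $is_ai_bot {
    default                 0;
    "~*meta-externalagent"  1;   # ~* means case-insensitive regex match
    "~*GPTBot"              1;
    "~*ClaudeBot"           1;
    "~*Amazonbot"           1;
    "~*Bytespider"          1;
    "~*SemrushBot"          1;
}
# in the forge.eblu.me server block:
if ($is_ai_bot) { return 403; }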

This catches ~85% of current traffic for a few lines of config. It is trivially evadable — a scraper need only spoof a browser UA — so it is a speed-bump, not a wall. Keep robots.txt too: well-behaved crawlers (Googlebot, Bingbot) do honor it, and it documents intent.

2b. Anubis proof-of-work gateway (the real wall)

Anubis is a Go reverse proxy that weighs each request with a browser-based proof-of-work challenge before passing it upstream. It was written for exactly this scenario — its author built it after Amazon's scraper took down their Git server — and is widely deployed in front of Forgejo/Gitea (Codeberg, the UN, etc.). Headless scrapers that can't run the challenge JS never reach the application; humans clear it once and proceed.

Why it fits BlumeOps better than the alternatives:

  • It attacks cost and availability at once. Bots receive a few-KB challenge page instead of MB of git HTML (egress collapses) and never reach Forgejo (timeouts collapse). No other single lever does both.
  • It stays in-house. No third party terminates our TLS or sees our traffic.

Placement options:

| Where | Pros | Cons |
| --- | --- | --- |
| On indri, between Caddy and Forgejo | Protects every path and every entry (WAN and tailnet); one config | |
| On the Fly proxy machine, in front of nginx | Challenge served at the edge, so bots never even tunnel to indri | Fly VM is small (512 MB); another moving part in the boot sequence alongside tailscaled/nginx/fail2ban/Alloy |

Leaning toward Caddy-side on indri for simplicity and uniform coverage, but this is the open design question for Tier 2. Anubis is MIT-licensed and the author has signalled a future move to an equi-x-based challenge, so pin a version and track upstream.

Tier 3 — Move egress off Fly entirely (rejected)

A Cloudflare Tunnel (cloudflared on indri → Cloudflare edge) would make this a non-problem on the cost axis: Cloudflare does not meter proxied bandwidth, and it bundles free AI-bot mitigation (Bot Fight Mode, the "block AI scrapers" toggle, Managed Challenge, AI Labyrinth). One move would zero the egress bill and add bot defense.

We are not doing this, on principle. Cloudflare is a solid platform and a defensible engineering choice, but it already sits in front of an enormous fraction of the modern web, and routing BlumeOps through it would add one more site to the pile of the internet that one company can see and gate. BlumeOps deliberately keeps its own backbone ([[expose-service-publicly|Fly + Tailscale + Caddy]], DNS at gandi; see the "no Cloudflare dependency" line in that doc). This is a values decision, not a technical one: we would rather pay a few dollars and run our own mitigation than centralize on Cloudflare.

It is also worth noting that Tier 3 would not, by itself, fix the upstream timeouts — free egress just means we'd stop caring that bots crawl, while they continued to hammer Forgejo. Crawl mitigation (Tier 1 + Tier 2) is required regardless of where egress is billed.

Summary

| Tier | Lever | Egress cost removed | Availability effect | Status |
| --- | --- | --- | --- | --- |
| 1 | Black-hole /mirrors/* at edge | ~71% | big drop | shipped |
| 2a | UA denylist on remaining repos | most of the rest | further drop | planned |
| 2b | Anubis PoW gateway | near-total | near-total | planned |
| 3 | Cloudflare Tunnel | total | needs 2b anyway | rejected (principle) |

The guiding insight: the cheapest, lowest-risk mitigation is to not serve an infinite-URL surface that has no human audience. Everything past Tier 1 is about defending the surface we do want public, in-house, without ceding control of our traffic to a third party.