diff --git a/docs/changelog.d/+ai-scraper-mitigation-doc.doc.md b/docs/changelog.d/+ai-scraper-mitigation-doc.doc.md
new file mode 100644
index 0000000..246fedb
--- /dev/null
+++ b/docs/changelog.d/+ai-scraper-mitigation-doc.doc.md
@@ -0,0 +1 @@
+Add `docs/explanation/ai-scraper-mitigation.md` — the egress-cost / AI-crawler threat model for the public Fly proxy, the tiered mitigation plan (Tier 1: mirror black-hole, shipped; Tier 2: user-agent denylist + Anubis; Tier 3: Cloudflare, rejected on principle), and the data behind it.
diff --git a/docs/changelog.d/+forge-mirrors-blackhole.infra.md b/docs/changelog.d/+forge-mirrors-blackhole.infra.md
new file mode 100644
index 0000000..29a5e6a
--- /dev/null
+++ b/docs/changelog.d/+forge-mirrors-blackhole.infra.md
@@ -0,0 +1 @@
+Black-hole the `/mirrors/*` repositories at the Fly proxy edge: requests now get a `403` that points humans at `forge.ops.eblu.me`. A surprise $29.60 Fly bill was traced to ~1.24 TB/30d of egress on `forge.eblu.me` (99.95% of all proxy egress). About 71% of that was under `/mirrors/*`, where AI scrapers (Meta `meta-externalagent`, OpenAI `GPTBot`, Amazonbot) were crawling the near-infinite git-history URL space of the public mirror repos and timing out Forgejo in the process. Mirrors exist for supply-chain control and are consumed over the tailnet, so their public web UI had no legitimate audience. `robots.txt` already disallowed `/mirrors/`, but the offending agents ignore it. Tier-2 mitigations (user-agent denylist, Anubis proof-of-work gateway) are documented in `docs/explanation/ai-scraper-mitigation.md`.
diff --git a/docs/explanation/ai-scraper-mitigation.md b/docs/explanation/ai-scraper-mitigation.md
new file mode 100644
index 0000000..fe4ba3d
--- /dev/null
+++ b/docs/explanation/ai-scraper-mitigation.md
@@ -0,0 +1,201 @@
+---
+title: AI Scraper Mitigation
+modified: 2026-06-01
+last-reviewed: 2026-06-01
+tags:
+  - explanation
+  - fly-io
+  - forgejo
+  - security
+  - networking
+---
+
+# AI Scraper Mitigation on the Public Proxy
+
+> **Note:** This article was drafted by AI and reviewed by Erich. I plan to rewrite all explanatory content in my own words — these serve as placeholders to establish the documentation structure.
+
+How BlumeOps keeps AI crawlers from running up the [[expose-service-publicly|Fly.io proxy]] egress bill and DoS-ing [[forgejo|Forgejo]] on [[indri]].
+
+## The incident
+
+A $29.60 Fly.io invoice arrived, roughly two-thirds of it a single line:
+
+```
+Bandwidth: Egress (iad) — 958,524,714,138 bytes — $19.17
+```
+
+The `iad` (Ashburn) region is a red herring: the proxy machine runs in `sjc`,
+but Fly bills egress at the edge PoP nearest the *client*, so `iad` just means
+"the traffic went to clients on the US East Coast."
+
+Tracing it through the nginx access logs (shipped to Loki via [[alloy|Alloy]]):
+
+| Signal | Value |
+|--------|-------|
+| Total proxy egress (30d) | ~1.25 TB |
+| Share that was `forge.eblu.me` | **99.95%** |
+| Share of forge egress that was `/mirrors/*` | **~71%** |
+| Share that was declared AI bots | **~85% or more** |
+| Top offenders | Meta `meta-externalagent` (66% of bytes), OpenAI `GPTBot` (16%), Amazonbot, Bytespider |
+| Forgejo `5xx` (upstream timeouts) | tens of thousands/day, spiking to 112k |
+
+The crawlers were walking [[forgejo|Forgejo]]'s git-history browse endpoints —
+`src/commit/<sha>`, `commits/`, `blame/`, `raw/commit/`, plus `.patch`/`.diff`
+and `?page=N` pagination. That URL space is effectively **infinite**: every
+file × every commit × every page, multiplied across every mirrored repo. A
+crawler that follows links never finishes, and every page is a cache `MISS`
+that both tunnels to indri *and* bills as egress.
+
+Two distinct harms, not one:
+
+1. **Cost** — ~1.25 TB/mo of egress on a free-tier-ish proxy.
+2. **Availability** — the crawl alone generates ~400–530k requests/day,
+   enough to time out Forgejo regardless of how much RAM [[indri]] has. Moving
+   egress elsewhere would *not* fix this; the crawl has to be throttled at the
+   source.
+
+`robots.txt` already `Disallow`s `/mirrors/`, `/user/`, and archive/download
+paths — but **`meta-externalagent` and `GPTBot` ignore it.** For these agents,
+`robots.txt` is a dead letter, which is why edge enforcement is required.
+
+## The tiered plan
+
+### Tier 1 — Black-hole `/mirrors/*` (shipped)
+
+The mirror repositories (`tailscale`, `prometheus`, `mealie`, `paperless-ngx`,
+…) are mirrors of *already-public upstreams*, kept for supply-chain control
+(see [[spork-strategy]] and the container/mirror story in [[why-gitops]]). They
+are consumed by CI, gilbert, and other tailnet clients over
+`forge.ops.eblu.me`. Their web UI on the public internet served **no
+legitimate audience** — only scrapers. So the proxy now returns `403` for
+anything under `/mirrors/`, pointing humans at the tailnet host:
+
+```nginx
+location ^~ /mirrors/ {
+    error_page 403 /naughty.html;  # serve the roll-of-dishonour page; status stays 403
+    return 403;
+}
+```
+
+The `^~` modifier matters: without it, the regex `location` blocks for static
+assets (`*.css`, `*.js`, release downloads) would match first and leak content
+under `/mirrors/`. `^~` tells nginx to stop at the prefix match and skip the
+regex round.
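+
+A hedged sketch of that precedence rule follows. The regex block is a
+hypothetical stand-in for the real static-asset locations in `fly/nginx.conf`,
+and the upstream address and the example path are assumptions made for
+illustration only.
+
+```nginx
+# Illustrative only: not the deployed config.
+
+# Hypothetical regex location standing in for the real static-asset blocks.
+location ~* \.(css|js)$ {
+    proxy_pass http://127.0.0.1:3000;  # assumed upstream address
+}
+
+# Without `^~`: nginx remembers the longest matching prefix, then still
+# checks regex locations in order and lets the first regex match win. A
+# request for /mirrors/example/site.css would be proxied by the block above.
+# location /mirrors/ { return 403; }
+
+# With `^~`: once this is the longest matching prefix, the regex locations
+# are not checked at all, so everything under /mirrors/ gets the 403.
+location ^~ /mirrors/ {
+    return 403;
+}
+```
+
+The ordering of the blocks in the file does not change the outcome here;
+only the modifier does.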
+
+This is config, not bot-fighting — we simply stopped serving an infinite
+tarpit to the world. It removes ~71% of forge egress and a large share of the
+upstream timeouts, with zero impact on any human or tailnet consumer. It
+mirrors the existing tailnet-only blocks for `/api/packages/` and `/swagger`.
+
+The `403` is also a small act of public shaming. Blocked requests are served a
+"roll of dishonour" page (`fly/naughty.html`, status kept at `403` via
+`error_page 403 /naughty.html`) that names the offending operators and their
+share of the stolen bytes, and each of those blocked responses carries an
+`X-Naughty-Scrapers` header:
+
+```
+X-Naughty-Scrapers: OpenAI/GPTBot, Meta/meta-externalagent, Amazonbot, ByteDance/Bytespider — robots.txt ignorers
+```
+
+Petty? A little. But it costs nothing, documents *why* the block exists for the
+next person who hits it, and the page is a few KB versus the megabytes of git
+HTML the crawlers were taking.
+
+**Trade-off accepted:** mirror release-artifact downloads over WAN now also
+`403`. Legitimate consumers already pull these over the tailnet, and the public
+exposure was the same crawl liability, so this is intentional.
+
+### Tier 2 — Defend the repos that *stay* public (planned)
+
+`/eblume/*` is intentionally public (a public profile is a feature). But the
+same git-history endpoints are still a tarpit there, just lower-volume. Two
+layers, in increasing order of effort and effectiveness:
+
+#### 2a. User-agent denylist (cheap, evadable)
+
+Block the declared AI crawlers at the edge regardless of path:
+
+```nginx
+# Illustrative — not yet deployed.
+map $http_user_agent $is_ai_bot {
+    default                 0;
+    "~*meta-externalagent"  1;
+    "~*GPTBot"              1;
+    "~*ClaudeBot"           1;
+    "~*Amazonbot"           1;
+    "~*Bytespider"          1;
+    "~*SemrushBot"          1;
+}
+# in the forge.eblu.me server block:
+if ($is_ai_bot) { return 403; }
+```
+
+This catches ~85% of *current* traffic for a few lines of config. It is
+trivially evadable — a scraper need only spoof a browser UA — so it is a
+speed-bump, not a wall. Keep `robots.txt` too: well-behaved crawlers
+(Googlebot, Bingbot) do honor it, and it documents intent.
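+
+For orientation, here is a minimal sketch of where each piece would sit. It
+is an assumed layout, not the deployed config: the `server` block is reduced
+to the one relevant directive and the pattern list is abbreviated.
+
+```nginx
+# Illustrative layout: not yet deployed.
+http {
+    # `map` is only valid at the http level, outside any server block.
+    map $http_user_agent $is_ai_bot {
+        default     0;
+        "~*GPTBot"  1;  # remaining patterns as in the denylist
+    }
+
+    server {
+        server_name forge.eblu.me;
+
+        # "0" and the empty string are false in an nginx `if`, so only
+        # matched agents are rejected. `return` is safe to use inside `if`.
+        if ($is_ai_bot) { return 403; }
+    }
+}
+```
+
+Under these assumptions, a request sent with a `GPTBot` user-agent should
+receive `403` on any path, while an ordinary browser user-agent passes
+through unchanged.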
+
+#### 2b. Anubis proof-of-work gateway (the real wall)
+
+[Anubis](https://github.com/TecharoHQ/anubis) is a Go reverse proxy that
+weighs each request with a browser-based proof-of-work challenge before passing
+it upstream. It was written for *exactly this scenario* — its author built it
+after Amazon's scraper took down their Git server — and is widely deployed in
+front of Forgejo/Gitea (Codeberg, the UN, etc.). Headless scrapers that can't
+run the challenge JS never reach the application; humans clear it once and
+proceed.
+
+Why it fits BlumeOps better than the alternatives:
+
+- **It attacks cost *and* availability at once.** Bots receive a few-KB
+  challenge page instead of MB of git HTML (egress collapses) and never reach
+  Forgejo (timeouts collapse). No other single lever does both.
+- **It stays in-house.** No third party terminates our TLS or sees our
+  traffic.
+
+Placement options:
+
+| Where | Pros | Cons |
+|-------|------|------|
+| On [[indri]], between [[caddy|Caddy]] and Forgejo | Protects every path and every entry (WAN *and* tailnet); one config | Adds a hop and a service to the indri critical path; the challenge page still tunnels back through Fly for WAN clients (small egress) |
+| On the Fly proxy machine, in front of nginx | Challenge served at the edge — bots never even tunnel to indri | Fly VM is small (512 MB); another moving part in the boot sequence alongside `tailscaled`/nginx/`fail2ban`/Alloy |
+
+Leaning toward Caddy-side on indri for simplicity and uniform coverage, but
+this is the open design question for Tier 2. Anubis is MIT-licensed and the
+author has signalled a future move to an `equi-x`-based challenge, so pin a
+version and track upstream.
+
+### Tier 3 — Move egress off Fly entirely (rejected)
+
+A Cloudflare Tunnel (`cloudflared` on indri → Cloudflare
+edge) would make this a non-problem on the cost axis: Cloudflare does not meter
+proxied bandwidth, and it bundles free AI-bot mitigation (Bot Fight Mode, the
+"block AI scrapers" toggle, Managed Challenge, AI Labyrinth). One move would
+zero the egress bill and add bot defense.
+
+**We are not doing this, on principle.** Cloudflare is a solid platform and a
+defensible engineering choice — but it already sits in front of an enormous
+fraction of the modern web, and routing BlumeOps through it would add one more
+site to the pile of the internet that one company can see and gate. BlumeOps
+deliberately keeps its own backbone ([[expose-service-publicly|Fly + Tailscale
++ Caddy]], DNS at [[gandi|Gandi]] — see the "no Cloudflare dependency" line in
+that doc). This is a values decision, not a technical one: we would rather pay
+a few dollars and run our own mitigation than centralize on Cloudflare.
+
+It is also worth noting that **Tier 3 would not, by itself, fix the upstream
+timeouts** — free egress just means we'd stop *caring* that bots crawl, while
+they continued to hammer Forgejo. Crawl mitigation (Tier 1 + Tier 2) is
+required regardless of where egress is billed.
+
+## Summary
+
+| Tier | Lever | Egress cost | Upstream timeouts | Status |
+|------|-------|-------------|-------------------|--------|
+| 1 | Black-hole `/mirrors/*` at edge | ~71% lower | big drop | **shipped** |
+| 2a | UA denylist on remaining repos | most of the rest removed | further drop | planned |
+| 2b | Anubis PoW gateway | near-total reduction | near-total reduction | planned |
+| 3 | Cloudflare Tunnel | eliminated | unchanged by itself; needs 2b anyway | **rejected (principle)** |
+
+The guiding insight: the cheapest, lowest-risk mitigation is to **not serve an
+infinite-URL surface that has no human audience.** Everything past Tier 1 is
+about defending the surface we *do* want public, in-house, without ceding
+control of our traffic to a third party.
diff --git a/docs/tutorials/expose-service-publicly.md b/docs/tutorials/expose-service-publicly.md
index 886cad4..65af611 100644
--- a/docs/tutorials/expose-service-publicly.md
+++ b/docs/tutorials/expose-service-publicly.md
@@ -376,6 +376,13 @@ Mitigations for dynamic services:
 - fail2ban on indri (see below) can block IPs showing abuse patterns
 - The break-glass shutoff remains the last resort
 
+The most acute version of this in practice has been **AI scrapers**, which
+ignore `robots.txt` and crawl dynamic services (notably [[forgejo|Forgejo]]'s
+infinite git-history URL space) into both a surprise egress bill and an
+effective L7 DoS. See [[ai-scraper-mitigation]] for the incident, the tiered
+defense (mirror black-hole, user-agent denylist, Anubis proof-of-work), and
+why a Cloudflare Tunnel is *not* the chosen answer here.
+
 If a publicly exposed dynamic service attracts targeted attacks or the
 home network bandwidth is impacted, consider migrating to Cloudflare
 Tunnel for enterprise-grade DDoS protection (requires DNS migration;
diff --git a/fly/Dockerfile b/fly/Dockerfile
index d4e7a18..406c849 100644
--- a/fly/Dockerfile
+++ b/fly/Dockerfile
@@ -25,6 +25,7 @@ COPY fail2ban/action.d/nginx-deny.conf /etc/fail2ban/action.d/nginx-deny.conf
 
 COPY nginx.conf /etc/nginx/nginx.conf
 COPY error.html /usr/share/nginx/html/error.html
+COPY naughty.html /usr/share/nginx/html/naughty.html
 COPY alloy.river /etc/alloy/config.alloy
 COPY start.sh /start.sh
 RUN chmod +x /start.sh
diff --git a/fly/naughty.html b/fly/naughty.html
new file mode 100644
index 0000000..d899171
--- /dev/null
+++ b/fly/naughty.html
@@ -0,0 +1,64 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="utf-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1">
+  <meta name="robots" content="noindex, nofollow">
+  <title>403 · Roll of Dishonour</title>
+  <style>
+    :root { color-scheme: dark; }
+    body {
+      margin: 0; min-height: 100vh; display: grid; place-items: center;
+      background: #14110f; color: #e8e2da;
+      font: 16px/1.6 ui-monospace, SFMono-Regular, Menlo, Consolas, monospace;
+    }
+    main { max-width: 44rem; padding: 2.5rem 1.5rem; }
+    h1 { font-size: 1.6rem; margin: 0 0 .25rem; color: #f2c14e; }
+    .sub { color: #9b948b; margin: 0 0 1.75rem; }
+    table { width: 100%; border-collapse: collapse; margin: 1.25rem 0; }
+    th, td { text-align: left; padding: .4rem .6rem; border-bottom: 1px solid #2a2521; }
+    th { color: #9b948b; font-weight: 600; }
+    td.share { color: #f2c14e; text-align: right; font-variant-numeric: tabular-nums; }
+    .name { color: #e8867a; }
+    a { color: #7fb3d5; }
+    footer { margin-top: 2rem; color: #5c574f; font-size: .85rem; }
+  </style>
+</head>
+<body>
+  <main>
+    <h1>🪤 403 — you walked into the scraper trap</h1>
+    <p class="sub">These are mirror repositories. They are tailnet-only.</p>
+
+    <p>
+      This path used to serve the web UI for mirrors of public upstream
+      projects. It exists for supply-chain control, not for crawling. A
+      <code>robots.txt</code> politely disallowed <code>/mirrors/</code>.
+      A pack of AI scrapers ignored it, walked the infinite git-history URL
+      space, and ran up <strong>~1.25&nbsp;TB</strong> of egress and a real
+      money bill in a single month — while timing out the server for everyone
+      else.
+    </p>
+
+    <p>So <code>/mirrors/</code> is closed at the edge now. Roll of dishonour,
+      by share of the bytes they stole:</p>
+
+    <table>
+      <thead><tr><th>Operator</th><th>User-Agent</th><th class="share">Bytes</th></tr></thead>
+      <tbody>
+        <tr><td class="name">Meta</td><td><code>meta-externalagent</code></td><td class="share">66%</td></tr>
+        <tr><td class="name">OpenAI</td><td><code>GPTBot</code></td><td class="share">16%</td></tr>
+        <tr><td class="name">Amazon</td><td><code>Amazonbot</code></td><td class="share">3%</td></tr>
+        <tr><td class="name">ByteDance</td><td><code>Bytespider</code></td><td class="share">1%</td></tr>
+      </tbody>
+    </table>
+
+    <p>
+      If you are a human who actually wanted these mirrors, they are reachable
+      from the tailnet at <code>forge.ops.eblu.me</code>. If you are a crawler:
+      read the <code>robots.txt</code> next time. We left you a header, too.
+    </p>
+
+    <footer>GNU Terry Pratchett</footer>
+  </main>
+</body>
+</html>
diff --git a/fly/nginx.conf b/fly/nginx.conf
index 570e6c9..ec35774 100644
--- a/fly/nginx.conf
+++ b/fly/nginx.conf
@@ -215,6 +215,33 @@ http {
             return 403 "API documentation is only available at forge.ops.eblu.me (tailnet).\n";
         }
 
+        # Black-hole the mirror repositories on WAN. These are mirrors of
+        # already-public upstreams (tailscale, prometheus, mealie, …) kept
+        # for supply-chain control; CI, gilbert, and tailnet clients consume
+        # them via forge.ops.eblu.me. Their web UI served no public purpose
+        # but AI scrapers, which crawled the near-infinite git-history URL
+        # space (src/commit, commits, blame, raw) and drove ~70% of Fly
+        # egress (1.24 TB/30d → a surprise bill) plus enough upstream load to
+        # time out Forgejo. robots.txt already Disallows /mirrors/, but
+        # meta-externalagent and GPTBot ignore it — so enforce at the edge.
+        # `^~` makes this win over the regex locations below (e.g. *.css), so
+        # static assets under /mirrors/ can't leak through. We also name and
+        # shame: blocked requests get a "roll of dishonour" page (403 status
+        # preserved) and an X-Naughty-Scrapers header. See
+        # docs/explanation/ai-scraper-mitigation.md.
+        location ^~ /mirrors/ {
+            error_page 403 /naughty.html;
+            return 403;
+        }
+
+        # Roll of dishonour — served on the /mirrors/ 403, status kept at 403.
+        location = /naughty.html {
+            internal;
+            root /usr/share/nginx/html;
+            add_header X-Naughty-Scrapers "OpenAI/GPTBot, Meta/meta-externalagent, Amazonbot, ByteDance/Bytespider — robots.txt ignorers" always;
+            add_header X-Clacks-Overhead "GNU Terry Pratchett" always;
+        }
+
         # Redirect archive endpoints to tailnet — archive requests generate full
         # git bundles on demand. Unauthenticated crawlers hitting unique commit
         # SHAs cause unbounded CPU and disk usage (DoS vector). Legitimate users
