C0: black-hole /mirrors/* at Fly edge + name-and-shame scrapers

A $29.60 Fly bill traced to ~1.25 TB/30d of egress on forge.eblu.me (99.95% of all proxy egress). About 71% of it came from AI scrapers (Meta meta-externalagent, OpenAI GPTBot, Amazonbot, Bytespider) crawling the public mirror repos' infinite git-history URL space and timing out Forgejo.

robots.txt already disallowed /mirrors/, but those agents ignore it, so enforce at the edge: /mirrors/* now returns 403 from a ^~ prefix location (so it beats the regex asset locations), served as a roll-of-dishonour page with an X-Naughty-Scrapers header. Mirrors stay reachable on the tailnet via forge.ops.eblu.me.

Tier 2 (UA denylist + Anubis) and the Cloudflare rejection are documented in docs/explanation/ai-scraper-mitigation.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
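
The ^~ choice is easier to see with a toy model of nginx's documented location selection order: exact match first, then the longest matching prefix (which stops the search if it carries ^~), then regex locations in config order, then the remembered prefix as a fallback. The sketch below is only an illustration under assumptions: the Location type, select_location, and the asset regex are hypothetical (the real config's regex locations are not shown in this commit), and nested or named locations are ignored.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Location:
    modifier: str  # "" (plain prefix), "=", "^~", "~", or "~*"
    pattern: str


def select_location(locations: List[Location], uri: str) -> Optional[Location]:
    """Simplified model of how nginx picks a location for a URI."""
    # 1. An exact (=) match wins immediately.
    for loc in locations:
        if loc.modifier == "=" and loc.pattern == uri:
            return loc

    # 2. Remember the longest matching prefix location (plain or ^~).
    prefixes = [
        loc for loc in locations
        if loc.modifier in ("", "^~") and uri.startswith(loc.pattern)
    ]
    best = max(prefixes, key=lambda loc: len(loc.pattern), default=None)

    # If that longest prefix is marked ^~, regex locations are never checked.
    if best is not None and best.modifier == "^~":
        return best

    # 3. Otherwise the first matching regex location, in config order, wins.
    for loc in locations:
        if loc.modifier in ("~", "~*"):
            flags = re.IGNORECASE if loc.modifier == "~*" else 0
            if re.search(loc.pattern, uri, flags):
                return loc

    # 4. No regex matched: fall back to the remembered prefix.
    return best


# Hypothetical stand-ins for the real config: the asset regex is an assumption.
ASSETS = Location("~*", r"\.(css|js|png)$")
LOCATIONS = [
    Location("", "/"),
    Location("^~", "/mirrors/"),
    Location("=", "/naughty.html"),
    ASSETS,
]
# Same config, but with a plain prefix instead of ^~ on /mirrors/.
LOCATIONS_PLAIN = [
    Location("", "/"),
    Location("", "/mirrors/"),
    Location("=", "/naughty.html"),
    ASSETS,
]

if __name__ == "__main__":
    uri = "/mirrors/tailscale/assets/site.css"
    print("with ^~:     ", select_location(LOCATIONS, uri))
    print("plain prefix:", select_location(LOCATIONS_PLAIN, uri))
```

In this model a stylesheet request under /mirrors/ lands in the /mirrors/ block only when the prefix is marked ^~; with a plain prefix, the asset regex would take it instead, which is the leak the commit guards against.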
2026-06-01 20:52:20 -07:00 · a36a18aaa6
commit a36a18aaa6
parent e0064de83d
7 changed files with 302 additions and 0 deletions
--- a/fly/Dockerfile
+++ b/fly/Dockerfile
@@ -25,6 +25,7 @@ COPY fail2ban/action.d/nginx-deny.conf /etc/fail2ban/action.d/nginx-deny.conf

 COPY nginx.conf /etc/nginx/nginx.conf
 COPY error.html /usr/share/nginx/html/error.html
+COPY naughty.html /usr/share/nginx/html/naughty.html
 COPY alloy.river /etc/alloy/config.alloy
 COPY start.sh /start.sh
 RUN chmod +x /start.sh
--- /dev/null
+++ b/fly/naughty.html
@@ -0,0 +1,64 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="utf-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1">
+  <meta name="robots" content="noindex, nofollow">
+  <title>403 · Roll of Dishonour</title>
+  <style>
+    :root { color-scheme: dark; }
+    body {
+      margin: 0; min-height: 100vh; display: grid; place-items: center;
+      background: #14110f; color: #e8e2da;
+      font: 16px/1.6 ui-monospace, SFMono-Regular, Menlo, Consolas, monospace;
+    }
+    main { max-width: 44rem; padding: 2.5rem 1.5rem; }
+    h1 { font-size: 1.6rem; margin: 0 0 .25rem; color: #f2c14e; }
+    .sub { color: #9b948b; margin: 0 0 1.75rem; }
+    table { width: 100%; border-collapse: collapse; margin: 1.25rem 0; }
+    th, td { text-align: left; padding: .4rem .6rem; border-bottom: 1px solid #2a2521; }
+    th { color: #9b948b; font-weight: 600; }
+    td.share { color: #f2c14e; text-align: right; font-variant-numeric: tabular-nums; }
+    .name { color: #e8867a; }
+    a { color: #7fb3d5; }
+    footer { margin-top: 2rem; color: #5c574f; font-size: .85rem; }
+  </style>
+</head>
+<body>
+  <main>
+    <h1>🪤 403 — you walked into the scraper trap</h1>
+    <p class="sub">These are mirror repositories. They are tailnet-only.</p>
+
+    <p>
+      This path used to serve the web UI for mirrors of public upstream
+      projects. It exists for supply-chain control, not for crawling. A
+      <code>robots.txt</code> politely disallowed <code>/mirrors/</code>.
+      A pack of AI scrapers ignored it, walked the infinite git-history URL
+      space, and ran up <strong>~1.25&nbsp;TB</strong> of egress and a real
+      money bill in a single month — while timing out the server for everyone
+      else.
+    </p>
+
+    <p>So <code>/mirrors/</code> is closed at the edge now. Roll of dishonour,
+      by share of the bytes they stole:</p>
+
+    <table>
+      <thead><tr><th>Operator</th><th>User-Agent</th><th class="share">Bytes</th></tr></thead>
+      <tbody>
+        <tr><td class="name">Meta</td><td><code>meta-externalagent</code></td><td class="share">66%</td></tr>
+        <tr><td class="name">OpenAI</td><td><code>GPTBot</code></td><td class="share">16%</td></tr>
+        <tr><td class="name">Amazon</td><td><code>Amazonbot</code></td><td class="share">3%</td></tr>
+        <tr><td class="name">ByteDance</td><td><code>Bytespider</code></td><td class="share">1%</td></tr>
+      </tbody>
+    </table>
+
+    <p>
+      If you are a human who actually wanted these mirrors, they are reachable
+      from the tailnet at <code>forge.ops.eblu.me</code>. If you are a crawler:
+      read the <code>robots.txt</code> next time. We left you a header, too.
+    </p>
+
+    <footer>GNU Terry Pratchett</footer>
+  </main>
+</body>
+</html>
--- a/fly/nginx.conf
+++ b/fly/nginx.conf
@@ -215,6 +215,33 @@ http {
            return 403 "API documentation is only available at forge.ops.eblu.me (tailnet).\n";
        }

+        # Black-hole the mirror repositories on WAN. These are mirrors of
+        # already-public upstreams (tailscale, prometheus, mealie, …) kept
+        # for supply-chain control; CI, gilbert, and tailnet clients consume
+        # them via forge.ops.eblu.me. Their web UI served no public purpose
+        # but AI scrapers, which crawled the near-infinite git-history URL
+        # space (src/commit, commits, blame, raw) and drove ~70% of Fly
+        # egress (1.24 TB/30d → a surprise bill) plus enough upstream load to
+        # time out Forgejo. robots.txt already Disallows /mirrors/, but
+        # meta-externalagent and GPTBot ignore it — so enforce at the edge.
+        # `^~` makes this win over the regex locations below (e.g. *.css), so
+        # static assets under /mirrors/ can't leak through. We also name and
+        # shame: blocked requests get a "roll of dishonour" page (403 status
+        # preserved) and an X-Naughty-Scrapers header. See
+        # docs/explanation/ai-scraper-mitigation.md.
+        location ^~ /mirrors/ {
+            error_page 403 /naughty.html;
+            return 403;
+        }
+
+        # Roll of dishonour — served on the /mirrors/ 403, status kept at 403.
+        location = /naughty.html {
+            internal;
+            root /usr/share/nginx/html;
+            add_header X-Naughty-Scrapers "OpenAI/GPTBot, Meta/meta-externalagent, Amazonbot, ByteDance/Bytespider — robots.txt ignorers" always;
+            add_header X-Clacks-Overhead "GNU Terry Pratchett" always;
+        }
+
        # Redirect archive endpoints to tailnet — archive requests generate full
        # git bundles on demand. Unauthenticated crawlers hitting unique commit
        # SHAs cause unbounded CPU and disk usage (DoS vector). Legitimate users