blumeops

Author	SHA1	Message	Date
Erich Blume	292d354902	C1: deploy adelaide-baby-shower-app to ringtail k3s (#349) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m12s Details ## Summary Brings up the Adelaide / Heidi / Addie baby shower app on ringtail k3s with the public/private split that the app's hosting contract calls for: `shower.eblu.me` (public, via Fly proxy) and `shower.ops.eblu.me` (tailnet). App is consumed as a wheel from the Forgejo PyPI index; source lives at [`adelaide-baby-shower-app`](https://forge.eblu.me/eblume/adelaide-baby-shower-app). ### What's included - ArgoCD app + manifests under `argocd/manifests/shower/` (deployment, service, ProxyGroup ingress, ConfigMap for `DJANGO_DEBUG`/`DJANGO_ADMIN_URL`, ExternalSecret for `DJANGO_SECRET_KEY` from 1Password item `Shower (blumeops)`, NFS PV on sifaka, RWX media PVC, RWO local-path data PVC for SQLite). Recreate rollout because SQLite is single-writer. - Public surface (`fly/`): new `shower.eblu.me` server block proxying to `shower.ops.eblu.me`. `/admin/` returns 403 at the edge except `/admin/login/` and `/admin/logout/`, which are rate-limited via a new `shower_auth` zone. `X-Clacks-Overhead` on. GNU Terry Pratchett. - fail2ban filter (`shower-admin-login.conf`) matching 401/403/429 on `/admin/login/` and jail (`shower.conf`) with `maxretry=5/findtime=600/bantime=3600`. The `nginx-deny` action was generalized to take a per-jail `nginx_deny_file` so the shower has its own deny list (forge keeps using the legacy default). - Caddy route on indri (`shower.ops.eblu.me` → `https://shower.tail8d86e.ts.net`). - Pulumi Gandi CNAME `shower.eblu.me → blumeops-proxy.fly.dev.`. - Grafana APM dashboard `configmap-shower-apm.yaml` (request rate, error rate, failed admin login count, latency percentiles, bandwidth, access logs) mirroring `docs-apm.json` with a `host="shower.eblu.me"` filter. - Container `containers/shower/default.nix`: `dockerTools.buildLayeredImage` with a nixpkgs Python and a startup wrapper that creates `/app/data/.venv`, pip-installs `adelaide-baby-shower-app==1.0.0` from the forge PyPI index on first boot, runs migrations + collectstatic, and execs gunicorn. A `local_settings.py` shim pins `DATABASES.NAME`/`MEDIA_ROOT`/`STATIC_ROOT` to absolute paths so they don't end up in site-packages. - Docs runbook at `docs/how-to/operations/shower-app.md` linked from the apps registry, plus changelog fragments. ### Defense layers on the public surface 1. fly nginx geo+fail2ban `$shower_banned` (per-service deny list) 2. fly nginx `limit_req zone=shower_auth` (3 r/s per Fly-Client-IP) 3. django-axes (5 fails / 1h, keyed on username+ip_address) 4. edge `/admin/` block (returns 403 for anything that isn't login/logout) ## Prerequisites for the user to do (NOT in this PR) Halted on these per request; they touch shared/manual systems: - [x] NFS share on sifaka: `/volume1/shower`, NFS rule for ringtail RW, `chown 1000:1000` - [ ] 1Password item `Shower (blumeops)` in the blumeops vault with a freshly minted `secret-key` field (`openssl rand -base64 48`); do NOT reuse anything that has lived in git - [ ] Container build: `mise run container-build-and-release shower`, then update `images[].newTag` in `argocd/manifests/shower/kustomization.yaml` to the resulting `v1.0.0-<sha>-nix` - [x] DNS: `mise run dns-up` after merge - [x] Fly cert: `fly certs add shower.eblu.me -a blumeops-proxy` - [ ] Caddy push: `mise run provision-indri -- --tags caddy` - [ ] Fly redeploy to pick up the new nginx block + fail2ban jail: `mise run fly-deploy` - [ ] ArgoCD sync: `argocd app set shower --revision shower-app-deploy && argocd app sync shower` to test from this branch before merging ## Test plan - [ ] Container builds successfully on nix-container-builder runner - [ ] Pod starts, migrations run, gunicorn answers on :8000 - [ ] `kubectl --context=k3s-ringtail -n shower logs deploy/shower` clean - [ ] `curl -sf https://shower.ops.eblu.me/` returns the splash page (tailnet) - [ ] `curl -I -H "Host: shower.eblu.me" https://blumeops-proxy.fly.dev/` returns 200 (pre-DNS verification) - [ ] `curl -I -H "Host: shower.eblu.me" https://blumeops-proxy.fly.dev/admin/users/` returns 403 (edge block) - [ ] `curl -I -H "Host: shower.eblu.me" https://blumeops-proxy.fly.dev/admin/login/` returns a Django login response - [ ] After DNS is up: `curl -I https://shower.eblu.me/` returns 200 with `X-Clacks-Overhead` - [ ] Grafana dashboard "Shower APM" appears and starts showing traffic - [ ] `mise run services-check` passes Reviewed-on: #349	2026-05-11 13:47:18 -07:00
Erich Blume	1d62653871	Fix forge.eblu.me static assets by adding missing Host header All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m26s Details The static asset cache block (css/js/png/etc) was missing proxy_set_header Host, so Caddy received "forge.eblu.me" instead of "forge.ops.eblu.me" and couldn't route the request. HTML loaded fine because the main location / block had the header. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-18 16:00:56 -07:00
Erich Blume	12b2786ca2	Route Fly proxy through Caddy on indri for direct WireGuard peering All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m59s Details Tailscale Ingress pods in k8s can't establish direct WireGuard connections (stuck behind pod-network NAT → DERP relay → 20s latency). Indri's host-level Tailscale CAN peer directly with Fly. Change all nginx upstreams to route through Caddy on indri instead of per-service Tailscale Ingress endpoints. Tag indri as flyio-target in the Tailscale ACL so the Fly proxy can reach it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-18 09:40:20 -07:00
Erich Blume	fe0e913963	Switch Fly proxy to upstream keepalive pools (#337 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m37s Details ## Summary - Replace per-request DNS resolution (variable-based `proxy_pass`) with static `upstream` blocks and `keepalive` connection pools - Reuses TLS connections through the Tailscale tunnel instead of handshaking per request - Add `mise run fly-reload` for nginx config reload without full redeploy (re-resolves upstream DNS) ## Trade-off DNS is resolved at config load, not per-request. If Tailscale Ingress pods get new IPs (restart, reschedule), `mise run fly-reload` is needed. A Grafana alert will be added to detect this. ## Still TODO on this branch - [ ] Grafana alert for upstream unreachable (triggers fly-reload reminder) - [ ] Docs pass - [ ] Deploy from branch and verify latency improvement - [ ] Changelog fragment 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #337	2026-04-17 16:39:52 -07:00
Erich Blume	54b1cee950	Fix Connection header: only send 'upgrade' for WebSocket requests Some checks failed Deploy Fly.io Proxy / deploy (push) Failing after 1m35s Details Was sending Connection: upgrade on every proxied request, which is semantically wrong for normal HTTP traffic. Use a map to conditionally send 'upgrade' only when the client requests a WebSocket switch, 'close' otherwise. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-17 15:27:40 -07:00
Erich Blume	1631e11137	Add /user/ to forge robots.txt exclusion All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m47s Details Crawlers follow auth redirects to /user/login which is pointless for them. Saves round-trips for both sides. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-17 14:34:24 -07:00
Erich Blume	7a42aeb77c	Mitigate Forgejo archive endpoint DoS from crawler abuse All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m35s Details Crawlers hitting /archive/ endpoints with unique commit SHAs generated 54GB of git bundles in 2 days, pegging Forgejo at 43% CPU. Fix at multiple layers: - Redirect archive requests to tailnet at Fly proxy (302) - Expand robots.txt: block /users/, //archive/, //releases/download/ - Cache release artifact downloads at nginx (immutable, 7d TTL) - Enable [cron.archive_cleanup] with 2h TTL and run-at-start Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-17 14:21:22 -07:00
Erich Blume	7f6bbdc82c	Add robots.txt to forge.eblu.me blocking crawlers from /mirrors/ All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 2m19s Details Facebook has been scraping forge mirror repos at ~3-4 req/s, slowing down the Forgejo instance. Serve robots.txt directly from nginx to disallow /mirrors/ while leaving eblume/* accessible to crawlers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 15:39:48 -07:00
Erich Blume	a75f28e073	Fix fly.io proxy rate limit to key on real client IP All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 2m24s Details The general rate limit zone used $binary_remote_addr (Fly's internal proxy IP), causing all external clients to share one bucket. Switch to $http_fly_client_ip to match forge_auth's correct behavior. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 19:00:33 -07:00
Erich Blume	a87c997ee1	Expose Forgejo publicly at forge.eblu.me (#278) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m28s Details ## Summary Expose Forgejo publicly at `forge.eblu.me` via the Fly.io reverse proxy, the first dynamic, authenticated public-facing service. - Forgejo hardening: Domain changed to forge.eblu.me, SSH stays on forge.ops.eblu.me, reverse proxy trust headers configured, local registration locked to external-only (Authentik SSO) - Tailscale Ingress: ExternalName Service + Ingress in tailscale-operator creates forge.tail8d86e.ts.net endpoint - Fly.io proxy: nginx server block with rate-limited auth endpoints (3r/s), fail2ban with custom nginx-deny action, security headers, /swagger blocked, WebSocket support, 512m body limit - Authentik: OAuth callback updated to forge.eblu.me - DNS/TLS: CNAME record in Pulumi, cert in fly-setup - Rename: ~29 files updated from forge.ops.eblu.me to forge.eblu.me (HTTPS refs only; SSH, container builds, and Caddy table kept as-is) ## Deployment Order 1. `mise run provision-indri -- --tags forgejo` (config changes) 2. Verify forge.ops.eblu.me still works 3. `argocd app set tailscale-operator --revision feature/forge-public && argocd app sync tailscale-operator` 4. Verify `curl https://forge.tail8d86e.ts.net` 5. `cd fly && fly deploy` 6. Verify pre-DNS: `curl -H "Host: forge.eblu.me" https://blumeops-proxy.fly.dev/` 7. `fly certs add forge.eblu.me -a blumeops-proxy` 8. `argocd app set authentik --revision feature/forge-public && argocd app sync authentik` 9. `mise run dns-preview && mise run dns-up` 10. Full verification (see below) 11. Rehearse `mise run fly-shutoff` 12. After merge: reset ArgoCD revisions to main, re-sync ## Verification Checklist - [ ] forge.eblu.me loads, shows public repos - [ ] forge.ops.eblu.me still works from tailnet - [ ] SSH clone via forge.ops.eblu.me:2222 works - [ ] HTTPS clone via forge.eblu.me works - [ ] UI shows forge.eblu.me for HTTPS clone, forge.ops.eblu.me for SSH - [ ] /swagger returns 403 - [ ] Rapid login attempts trigger 429 rate limit - [ ] fail2ban bans after 5 failed logins in 10 minutes - [ ] ArgoCD can still sync (SSH unaffected) - [ ] `mise run fly-shutoff` stops all public traffic - [ ] `mise run services-check` passes Reviewed-on: #278	2026-03-03 08:40:41 -08:00
Erich Blume	9717863f65	Update CV release to v1.0.3, add X-Clacks-Overhead header (#176) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m5s Details ## Summary - Update CV release URL from v1.0.2 to v1.0.3 - Add `X-Clacks-Overhead: GNU Terry Pratchett` header to both `docs.eblu.me` and `cv.eblu.me` server blocks in the Fly.io proxy nginx config ## Deployment and Testing - [ ] Sync CV app: `argocd app sync cv` - [ ] Verify CV is serving v1.0.3 content - [ ] Deploy fly proxy (workflow or `mise run fly-deploy`) - [ ] Verify header: `curl -sI https://docs.eblu.me | grep -i clacks` - [ ] Verify header: `curl -sI https://cv.eblu.me | grep -i clacks` Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/176	2026-02-12 17:08:22 -08:00
Erich Blume	df372fccb6	Expose CV publicly at cv.eblu.me (#173 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m57s Details ## Summary - Add nginx server block for `cv.eblu.me` (static site, same pattern as docs) - Add DNS CNAME record in Pulumi (`cv.eblu.me` → `blumeops-proxy.fly.dev`) - Add `cv.eblu.me` cert to `fly-setup` mise task - Tag CV Tailscale ingress with `tag:flyio-target` for ACL access - Remove `/_error` test endpoint from docs proxy ## Deployment and Testing - [ ] `argocd app set cv --revision cv/public-cv-eblu-me && argocd app sync cv` - [ ] `fly certs add cv.eblu.me -a blumeops-proxy` - [ ] `mise run fly-deploy` - [ ] Verify proxy: `curl -I -H "Host: cv.eblu.me" https://blumeops-proxy.fly.dev/` - [ ] `mise run dns-preview` then `mise run dns-up` - [ ] Verify live: `curl -I https://cv.eblu.me` - [ ] Merge, then `argocd app set cv --revision main && argocd app sync cv` Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/173	2026-02-12 14:05:00 -08:00
Erich Blume	4ee643a81d	Serve friendly error page when Fly.io proxy upstreams are unreachable (#133 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m50s Details ## Summary - Adds a branded 503 error page served when upstreams are unreachable (indri offline, Tailscale tunnel down, emergency shutoff, etc.) - Stale cache is still served first when available (`proxy_cache_use_stale` takes priority) - Test endpoint at `docs.eblu.me/_error` to preview the page without killing upstreams - `proxy_intercept_errors on` also catches error responses returned by the upstream itself ## Files Changed - `fly/error.html` — Self-contained error page (dark theme, links to BlumeOps repo) - `fly/nginx.conf` — `error_page`, `internal` location, `/_error` test location, `proxy_intercept_errors` - `fly/Dockerfile` — COPY error.html into image ## Test Plan - [ ] Deploy to Fly.io - [ ] Visit `docs.eblu.me/_error` to verify the page renders - [ ] Optionally stop indri/Tailscale to confirm the page shows on real 502/503/504 Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/133	2026-02-09 12:01:24 -08:00
Erich Blume	959b6842bc	Zero-downtime Fly.io deploys (#132 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m40s Details ## Summary - Start nginx after Tailscale connects (community best practice for Tailscale sidecars) - Switch to `bluegreen` deploy strategy — old machine serves until new one is healthy - Replace top-level `[checks]` with `[[http_service.checks]]` — only service-level checks gate traffic routing ([confirmed by Fly.io staff](https://community.fly.io/t/clarifying-the-types-of-health-checks/20379)) - Remove sentinel file and nginx if-check (no longer needed) Supersedes the approach in #131 — that helped (502 window dropped from ~30s to ~3s) but couldn't fully eliminate it because top-level checks don't gate routing and Fly.io's proxy sends traffic as soon as the port is reachable. ## Deployment and Testing - [ ] Merge and `fly deploy` from `fly/` directory - [ ] Verify deploy completes with zero 502s (watch `fly logs` and Grafana docs-apm) - [ ] Confirm `fly checks list` shows the new service-level check passing Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/132	2026-02-09 11:34:19 -08:00
Erich Blume	bd61da4f85	Fix 502 errors during Fly.io proxy deploys (#131 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m20s Details ## Summary - Health check (`/healthz`) now returns 503 until Tailscale is connected - `start.sh` creates `/tmp/tailscale-ready` sentinel after `tailscale up` succeeds - Fly.io keeps the old machine serving traffic during the ~7s startup window Previously, nginx passed the health check immediately, Fly.io routed traffic to the new machine, but MagicDNS wasn't available yet — causing upstream DNS timeouts and 502s on every request until Tailscale connected. ## Deployment and Testing - [ ] Merge and `fly deploy` from `fly/` directory - [ ] Verify deploy completes with zero 502s (check Grafana docs-apm dashboard) - [ ] Confirm health check transitions from 503 → 200 in `fly logs` Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/131	2026-02-09 11:07:36 -08:00
Erich Blume	3415cad38c	Log real client IPs via Fly-Client-IP header (#130 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 59s Details ## Summary - Add `client_ip` field to the Fly.io nginx JSON log format, sourced from `Fly-Client-IP` header - Extract `client_ip` in the Alloy pipeline so it's available as a parsed field in Loki - Keeps `remote_addr` (the internal proxy IP) for debugging Fixes: Grafana access logs for docs.eblu.me showing 172.16.11.178 for every request instead of real visitor IPs. ## Deployment and Testing - [ ] Deploy updated fly.io proxy: `fly deploy` from `fly/` directory - [ ] Verify in Grafana that new log lines include `client_ip` with real IPs - [ ] Confirm `remote_addr` still shows the proxy IP (preserved for debugging) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/130	2026-02-09 11:02:06 -08:00
Erich Blume	c6f8fcd346	Fix fly-deploy WARNING by starting nginx before Tailscale (#128 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m3s Details ## Summary - Start nginx before Tailscale in `start.sh` so port 8080 is bound immediately, eliminating the "app is not listening on the expected address" WARNING during `fly deploy` - Switch `proxy_pass` to use a variable with `resolver 100.100.100.100 valid=30s` so nginx can start without resolving MagicDNS names at config load time - DNS results cached 30s per worker — no per-request lookup overhead ## Context The WARNING was a race condition: Fly checks for listeners right after the machine starts, but `start.sh` ran ~5-10s of Tailscale setup before starting nginx. The health check always passed later, but the warning was noisy. ## Test plan - [ ] Merge and let the deploy-fly workflow trigger - [ ] Check runner logs for absence of the WARNING - [ ] Verify `docs.eblu.me` still serves correctly - [ ] Verify `/healthz` still passes Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/128	2026-02-09 07:01:58 -08:00
Erich Blume	cc54b4f565	Add Fly.io proxy observability via embedded Alloy (#123) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m16s Details ## Summary - Embed Grafana Alloy in the Fly.io proxy container to collect nginx JSON access logs (→ Loki) and derive request rate, latency histogram, cache status, and bandwidth metrics (→ Prometheus) - Add nginx `stub_status` endpoint for connection-level metrics (active/reading/writing/waiting) - Create two Grafana dashboards: Docs APM (per-service view filtered by `host="docs.eblu.me"`) and Fly.io Proxy Health (aggregate proxy health across all upstream services) ## Changed Files - `fly/nginx.conf`: Add JSON `log_format` + `access_log`, add `stub_status` endpoint - `fly/Dockerfile`: COPY Alloy binary from `grafana/alloy:v1.5.1`, COPY `alloy.river` config - `fly/alloy.river` (new): Alloy config for log tailing, metric extraction, remote_write - `fly/start.sh`: Start Alloy after Tailscale, before nginx - `argocd/manifests/grafana-config/dashboards/configmap-docs-apm.yaml` (new): Docs APM dashboard - `argocd/manifests/grafana-config/dashboards/configmap-flyio.yaml` (new): Fly.io Proxy Health dashboard - `argocd/manifests/grafana-config/kustomization.yaml`: Register new dashboard configmaps - `docs/reference/services/flyio-proxy.md`: Document observability setup ## Deployment and Testing - [ ] `mise run fly-deploy`: rebuild container with Alloy - [ ] `curl https://docs.eblu.me/`: generate traffic - [ ] `fly logs -a blumeops-proxy`: verify Alloy startup - [ ] Query Prometheus: `flyio_nginx_http_requests_total{instance="flyio-proxy"}` - [ ] Query Loki: `{instance="flyio-proxy", job="flyio-nginx"}` - [ ] `argocd app sync grafana-config`: deploy dashboards - [ ] Verify dashboards show data in Grafana - [ ] `mise run services-check`: no regressions Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/123	2026-02-08 10:05:38 -08:00
Erich Blume	64a78422b1	Add Fly.io public reverse proxy for docs.eblu.me (#120 ) Some checks failed Deploy Fly.io Proxy / deploy (push) Failing after 9s Details ## Summary - Adds a Fly.io reverse proxy (`blumeops-proxy`) that tunnels public traffic to homelab services over Tailscale - First service exposed: `docs.eblu.me` — the Quartz static docs site - Includes Pulumi IaC for Tailscale auth key/ACLs and Gandi DNS CNAME - Adds mise tasks (`fly-deploy`, `fly-setup`, `fly-shutoff`) and Forgejo CI workflow ## Key details - Fly.io Firecracker VMs support TUN devices natively — no userspace networking needed - Tailscale auth key is `preauthorized=True` to avoid device approval hangs on container restarts - nginx caches aggressively for the static site; health check is on the default_server block - ACLs restrict `tag:flyio-proxy` to `tag:k8s` on port 443 only - DNS CNAME deployed and verified: `docs.eblu.me` → `blumeops-proxy.fly.dev` ## Test plan - [x] `curl -sf https://blumeops-proxy.fly.dev/healthz` returns `ok` - [x] `curl -I -H "Host: docs.eblu.me" https://blumeops-proxy.fly.dev/` returns 200 with `X-Cache-Status` - [x] `curl -I https://docs.eblu.me/` returns 200 with valid Let's Encrypt cert - [x] `dig forge.ops.eblu.me` still resolves to 100.98.163.89 (private services unaffected) - [x] Set `FLY_DEPLOY_TOKEN` Forgejo Actions secret for CI auto-deploy 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/120	2026-02-08 02:36:19 -08:00
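
Commit 54b1cee950 describes using a `map` so that `Connection: upgrade` is sent only when the client requests a WebSocket switch, and `close` otherwise. The fragment below is a minimal sketch of that pattern, following the approach nginx documents for WebSocket proxying; it is not the repository's actual `fly/nginx.conf`. The variable name `$connection_upgrade`, the upstream name `example_backend`, and its address are assumptions made for illustration.

```nginx
# Fragment for the http {} context. Names marked "hypothetical" are
# illustrative and do not come from the repository.

# Derive the Connection header sent upstream from the client's Upgrade header:
# an empty Upgrade header (ordinary HTTP) maps to "close", anything else
# (a WebSocket handshake) maps to "upgrade".
map $http_upgrade $connection_upgrade {
    default upgrade;
    ''      close;
}

upstream example_backend {          # hypothetical upstream name
    server 127.0.0.1:9000;          # hypothetical address
}

server {
    listen 8080;

    location / {
        proxy_http_version 1.1;     # HTTP/1.1 is required for protocol upgrades
        proxy_set_header Upgrade    $http_upgrade;
        proxy_set_header Connection $connection_upgrade;
        proxy_pass http://example_backend;
    }
}
```

One design point worth noting: sending `close` ends the upstream connection after every ordinary request. A config that also relies on upstream `keepalive` pools (as in fe0e913963) would typically map the empty case to an empty string instead, so that connections can be reused; the log does not say whether the repository made that change.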

19 commits
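
The edge protections listed for `shower.eblu.me` in 292d354902 (a `shower_auth` rate-limit zone at 3 r/s keyed on `Fly-Client-IP`, and a 403 for `/admin/` except login and logout) could be sketched roughly as follows. This is an illustrative fragment under stated assumptions, not the contents of `fly/nginx.conf`: the zone size, the `burst` value, the use of `limit_req_status 429`, and the upstream name and address are all assumed.

```nginx
# Fragment for the http {} context. Values marked "assumed" or
# "hypothetical" are not taken from the source.

# One bucket per real client IP, read from the Fly-Client-IP request header
# rather than $binary_remote_addr (which would be Fly's internal proxy IP).
limit_req_zone $http_fly_client_ip zone=shower_auth:10m rate=3r/s;  # 10m assumed

upstream example_caddy {            # hypothetical upstream name
    server 127.0.0.1:9080;          # hypothetical address
}

server {
    listen 8080;
    server_name shower.eblu.me;

    add_header X-Clacks-Overhead "GNU Terry Pratchett" always;

    # Exact-match locations take precedence over the /admin/ prefix below,
    # so login and logout stay reachable while being rate-limited.
    location = /admin/login/ {
        limit_req zone=shower_auth burst=5 nodelay;   # burst assumed
        limit_req_status 429;                         # assumed; nginx default is 503
        proxy_set_header Host shower.ops.eblu.me;
        proxy_pass http://example_caddy;
    }

    location = /admin/logout/ {
        limit_req zone=shower_auth burst=5 nodelay;   # burst assumed
        limit_req_status 429;                         # assumed
        proxy_set_header Host shower.ops.eblu.me;
        proxy_pass http://example_caddy;
    }

    # Everything else under /admin/ is refused at the edge.
    location /admin/ {
        return 403;
    }

    location / {
        proxy_set_header Host shower.ops.eblu.me;
        proxy_pass http://example_caddy;
    }
}
```

Every proxied location sets `Host` explicitly, which is the lesson of 1d62653871: a block that omits it forwards the public hostname and the tailnet-side router cannot match the request. Also note that nginx does not account requests whose `limit_req_zone` key is empty, so a request arriving without a `Fly-Client-IP` header would bypass this limit.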