blumeops

Author	SHA1	Message	Date
Erich Blume	fb6067b620	C1: shower-specific rate-limit zone for venue-wifi NAT Default `general` zone (10r/s burst=20) is tuned for internet drive-by traffic. At the party, 30 guests scanning the splash QR from one venue-wifi NAT'd public IP would each fetch HTML + ~5 static assets within a few seconds — easily clearing burst=20, and the second-wave guests would see 503 with no auto-retry. New shower_general zone (50r/s burst=200) absorbs that simultaneous-load spike. Exploit scanners still trip it: the 45.88.138.44 burst we already saw in Loki fired ~30 req in 2s, well above the new sustained 50r/s when extrapolated, and burst=200 is still a hard cap on instantaneous spikes. Self-healing: `limit_req` is a token bucket — no persistent ban, nothing to manually flush. A guest who trips it auto-recovers within ~1s; tuning here is about not tripping it on legit traffic in the first place. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 13:44:22 -07:00
Erich Blume	2d38418e6e	C1: close forge package leak at the fly edge forge.eblu.me's package registry (/api/packages/* and /api/v1/packages/) served anonymous reads to the world even for private-repo releases — Forgejo's per-user visibility treats packages as world-readable when the owner's Visibility is Public, and we keep eblume Public so the profile page stays open. The sdist downloads include full source trees of private repos; that's the leak. The fix is to keep the user public but block /api/packages/ and /api/v1/packages/* at the proxy edge. forge.ops.eblu.me (tailnet) is untouched, so CI workflows + gilbert's uv + the nix-container-builder still work — they just need to use the tailnet hostname. Three consumers updated to forge.ops.eblu.me: - containers/shower/default.nix (the FOD pip --extra-index-url) - ansible/roles/cv/defaults/main.yml (cv_release_url for generic package) - chezmoi-tracked fish dotfiles (devpi.fish + conf.d/pypi.fish) — edited in chezmoi source, user will apply separately The blumeops repo had no other forge-pypi consumers (audited: workers, runner-job-image, ansible roles, container builds). Doc references in changelog fragments + comments left as-is — they describe history. The proper long-term fix is to move private packages to a Limited-visibility Forgejo org instead of relying on a proxy-side block (see queued Todoist for the migration plan). Edge block stays as defense in depth. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 13:06:48 -07:00
Erich Blume	473bc78181	C1: bump shower to v1.0.2 (WhiteNoise upstreamed); cache static on fly App v1.0.2 ships WhiteNoise for /static/ and /media/, so the blumeops-side workaround is no longer needed: - containers/shower/default.nix: drop the WhiteNoise pip dep + the middleware-injection block from local_settings. The shim is back to just path overrides (DATABASES.NAME, MEDIA_ROOT, STATIC_ROOT). - version → 1.0.2, outputHash → fakeHash for re-pinning. - service-versions.yaml mirrored. fly/nginx.conf: cache /static/ (1y) and /media/ (1d) per location for shower.eblu.me. /static/ filenames are content-hashed thanks to CompressedManifestStaticFilesStorage so a year is safe and invalidation is automatic on the next collectstatic. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 12:06:28 -07:00
Erich Blume	702592bcc9	C1: bump shower to v1.0.1; collapse WAN admin to tailnet-only PR review caught that we didn't need an admin login surface on WAN. App v1.0.1 adds DJANGO_PUBLIC_URL_BASE so QR codes generated from /host/ (now tailnet-only) still point at shower.eblu.me for guest phones — that closes the loop and lets us strip the WAN admin surface entirely. Container: - bump version to 1.0.1 - outputHash → fakeHash (build will print the real one) - entrypoint still does migrate + collectstatic before gunicorn — the app is small enough that auto-migration is fine Manifests: - configmap adds DJANGO_PUBLIC_URL_BASE=https://shower.eblu.me Fly nginx (shower.eblu.me): - drop the /admin/(login|logout) carveout - 403 anything under /admin/ AND /host/ with a "tailnet only" pointer - drop the shower_auth limit_req zone and $shower_banned geo - drop the shower-admin-login fail2ban filter + jail - drop the shower-deny.conf touch from start.sh Docs: - rename how-to docs/how-to/operations/shower-app.md → shower-on-ringtail.md (mirrors cv-on-indri / docs-on-indri) - new reference card docs/reference/services/shower-app.md per PR review comment 2 (≈30s read; quick facts + cross-links) - rewrite Defense layers section: collapses to general rate limit + django-axes on the tailnet-side login (the only credential surface) - rewrite the .infra.md changelog fragment to match - add a 'Create the admin user' step (kubectl exec createsuperuser) so first-time deploys aren't locked out The nginx-deny action's per-jail `nginx_deny_file` generalization stays — harmless future-proofing for the next public service. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 10:23:40 -07:00
Erich Blume	6e37abda5d	C1: deploy adelaide-baby-shower-app to ringtail k3s Adds the Adelaide / Heidi / Addie baby shower app — a Django guest splash, raffle picker, and prize-assignment console — on ringtail k3s. Public landing at shower.eblu.me (via fly proxy), tailnet admin at shower.ops.eblu.me. App source: forge.eblu.me/eblume/adelaide-baby-shower-app, wheel-published to the Forgejo Packages PyPI index. Manifests under argocd/manifests/shower/: NFS-backed PVC for /app/media, local-path PVC for SQLite, ExternalSecret pulling DJANGO_SECRET_KEY from 1Password (item "Shower (blumeops)"), Tailscale ProxyGroup ingress. Defense-in-depth for the public surface: - /admin/ blocked at the fly edge except /admin/login/ and /admin/logout/ - shower_auth rate limit on the login path - new fail2ban filter+jail with a per-service shower-deny.conf (nginx-deny action generalized to accept nginx_deny_file) - django-axes (5 / 1h) keyed on (username, ip_address) Plus: Caddy route on indri, Pulumi gandi CNAME, Grafana APM dashboard mirroring docs-apm.json, runbook at how-to/operations/shower-app.md, and a service-versions entry. X-Clacks-Overhead set on the new server block — GNU Terry Pratchett. Build: containers/shower/default.nix uses dockerTools to ship a nixpkgs Python plus a startup wrapper that installs the wheel into /app/data/.venv on first boot and execs gunicorn. Lets the wheel come from forge PyPI without pinning hashes for every transitive dep. Prerequisites tracked in the runbook (not yet executed): - NFS share sifaka:/volume1/shower (manual Synology step) - 1Password item "Shower (blumeops)" with secret-key field - container build via `mise run container-build-and-release shower` - Pulumi dns-up after merge - fly certs add shower.eblu.me Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 08:14:12 -07:00
Erich Blume	f6e392b80c	C1: SHA-pin tooling dependencies (2026-04 cycle) (#344 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m45s Details ## Summary Monthly tooling dependency refresh, with a one-time conversion from version-tag pins (`rev = "vX.Y.Z"`, `image:tag`, `>=`) to SHA / digest pins everywhere. ## Changes - prek hooks: all `rev = "vX.Y.Z"` → commit SHA + `# vX.Y.Z` comment. Bumped trufflehog (3.94.0→3.95.2), kingfisher (1.91.0→1.97.0), ruff (0.15.7→0.15.12), shfmt (3.13.0→3.13.1), prettier (3.8.1→3.8.3), actionlint (1.7.11→1.7.12). - fly/Dockerfile: tag pins → `image@sha256:...` digest pins. Bumped nginx (1.29.6→1.30.0-alpine), tailscale (v1.94.1→v1.94.2 — still inside the safe pre-1.96.5 range), alloy (v1.14.1→v1.16.0). - mise-tasks: PEP 723 inline deps converted from `>=` to `==` (PEP 508 doesn't support hashes inline). All scripts pinned to current latest: rich 15.0.0, typer 0.25.0, pyyaml 6.0.3, httpx 0.28.1. - prek `additional_dependencies`: ansible-lint==26.4.0, ansible-core==2.20.5. - taplo-lint: pass `--no-schema`. Upstream's `--default-schema-catalogs` returns a format taplo v0.9.3 can't parse — we don't validate against TOML schemas anyway, so this turns off the broken catalog fetch. - docs/update-tooling-dependencies: documents the SHA-pin convention, `docker buildx imagetools inspect` for digest lookup, and `prek clean` before re-verifying (cache grows to several GiB). Forgejo workflow `actions/checkout@v6.0.2` was already at the latest SHA — no change. ## Test plan - [x] `prek run --all-files` passes after `prek clean` - [x] `deploy-fly` workflow builds and deploys the new fly image on merge - [x] `fly status -a blumeops-proxy` healthy after deploy - [x] Spot-check a few mise tasks (`mise run blumeops-tasks`, `mise run docs-check-links`) to confirm pinned deps resolve cleanly Reviewed-on: #344	2026-04-30 16:51:43 -07:00
Erich Blume	1d62653871	Fix forge.eblu.me static assets by adding missing Host header All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m26s Details The static asset cache block (css/js/png/etc) was missing proxy_set_header Host, so Caddy received "forge.eblu.me" instead of "forge.ops.eblu.me" and couldn't route the request. HTML loaded fine because the main location / block had the header. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-18 16:00:56 -07:00
Erich Blume	12b2786ca2	Route Fly proxy through Caddy on indri for direct WireGuard peering All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m59s Details Tailscale Ingress pods in k8s can't establish direct WireGuard connections (stuck behind pod-network NAT → DERP relay → 20s latency). Indri's host-level Tailscale CAN peer directly with Fly. Change all nginx upstreams to route through Caddy on indri instead of per-service Tailscale Ingress endpoints. Tag indri as flyio-target in the Tailscale ACL so the Fly proxy can reach it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-18 09:40:20 -07:00
Erich Blume	bca4c2bede	Expose Tailscale WireGuard UDP port on Fly proxy Some checks failed Deploy Fly.io Proxy / deploy (push) Failing after 1m33s Details Enable direct peer-to-peer WireGuard connections by pinning tailscaled to port 41641 and exposing it as a UDP service. Without this, all traffic routes through Tailscale DERP relays causing 20+ second latency. Requires dedicated IPv4 (allocated: 168.220.82.221). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-18 09:17:03 -07:00
Erich Blume	fe0e913963	Switch Fly proxy to upstream keepalive pools (#337 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m37s Details ## Summary - Replace per-request DNS resolution (variable-based `proxy_pass`) with static `upstream` blocks and `keepalive` connection pools - Reuses TLS connections through the Tailscale tunnel instead of handshaking per request - Add `mise run fly-reload` for nginx config reload without full redeploy (re-resolves upstream DNS) ## Trade-off DNS is resolved at config load, not per-request. If Tailscale Ingress pods get new IPs (restart, reschedule), `mise run fly-reload` is needed. A Grafana alert will be added to detect this. ## Still TODO on this branch - [ ] Grafana alert for upstream unreachable (triggers fly-reload reminder) - [ ] Docs pass - [ ] Deploy from branch and verify latency improvement - [ ] Changelog fragment 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #337	2026-04-17 16:39:52 -07:00
Erich Blume	54b1cee950	Fix Connection header: only send 'upgrade' for WebSocket requests Some checks failed Deploy Fly.io Proxy / deploy (push) Failing after 1m35s Details Was sending Connection: upgrade on every proxied request, which is semantically wrong for normal HTTP traffic. Use a map to conditionally send 'upgrade' only when the client requests a WebSocket switch, 'close' otherwise. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-17 15:27:40 -07:00
Erich Blume	d7af004842	Add Forgejo metrics + upstream latency histogram to Fly proxy dashboard All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m53s Details - Enable Forgejo /metrics endpoint (app.ini [metrics] section) - Add Alloy scrape target for Forgejo metrics on indri - Add upstream_response_time histogram to Fly proxy Alloy config - Replace single p95 panel with p50/p90/p99 + upstream breakdown filtered to forge.eblu.me host Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-17 15:05:59 -07:00
Erich Blume	8fccbda573	Extend Fly proxy latency histogram buckets to 60s All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m29s Details Previous max bucket was 10s — all slower requests collapsed into +Inf, making p50/p90/p99 unreadable during the Forgejo archive DoS. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-17 14:50:28 -07:00
Erich Blume	1631e11137	Add /user/ to forge robots.txt exclusion All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m47s Details Crawlers follow auth redirects to /user/login which is pointless for them. Saves round-trips for both sides. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-17 14:34:24 -07:00
Erich Blume	7a42aeb77c	Mitigate Forgejo archive endpoint DoS from crawler abuse All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m35s Details Crawlers hitting /archive/ endpoints with unique commit SHAs generated 54GB of git bundles in 2 days, pegging Forgejo at 43% CPU. Fix at multiple layers: - Redirect archive requests to tailnet at Fly proxy (302) - Expand robots.txt: block /users/, //archive/, //releases/download/ - Cache release artifact downloads at nginx (immutable, 7d TTL) - Enable [cron.archive_cleanup] with 2h TTL and run-at-start Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-17 14:21:22 -07:00
Erich Blume	7f6bbdc82c	Add robots.txt to forge.eblu.me blocking crawlers from /mirrors/ All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 2m19s Details Facebook has been scraping forge mirror repos at ~3-4 req/s, slowing down the Forgejo instance. Serve robots.txt directly from nginx to disallow /mirrors/ while leaving eblume/* accessible to crawlers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 15:39:48 -07:00
Erich Blume	e02305e72d	Pin Fly.io Tailscale to v1.94.1 to fix MagicDNS regression in v1.96.5 All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 2m20s Details Tailscale :stable pulled v1.96.5 during last deploy, which returns SERVFAIL for tailnet DNS names (no upstream resolvers set). This broke all public routing (forge/docs/cv.eblu.me) through the Fly proxy. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 19:32:38 -07:00
Erich Blume	a75f28e073	Fix fly.io proxy rate limit to key on real client IP All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 2m24s Details The general rate limit zone used $binary_remote_addr (Fly's internal proxy IP), causing all external clients to share one bucket. Switch to $http_fly_client_ip to match forge_auth's correct behavior. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 19:00:33 -07:00
Erich Blume	0d422f5234	Update tooling dependencies (March 2026) (#307 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 2m51s Details ## Summary Monthly tooling dependency update per [[update-tooling-dependencies]]. - Prek hooks: trufflehog v3.93.4→v3.94.0, ruff v0.15.2→v0.15.7, shfmt v3.12.0-2→v3.13.0-1, ansible-lint floor→26.3.0, ansible-core floor→2.18 - Fly.io proxy: nginx 1.28.2→1.29.6, Grafana Alloy v1.13.1→v1.14.1 - Forgejo workflows: actions/checkout v4.3.1→v6.0.2 (SHA-pinned across all 5 workflows) - Mise tasks: tightened Python lower bounds — rich≥14.0.0, typer≥0.24.0, httpx≥0.28.1, pyyaml≥6.0.2 ## Test plan - [x] `prek run --all-files` passes - [ ] Verify Fly.io deploy succeeds after merge (nginx minor bump + Alloy bump) - [ ] Spot-check a workflow run with the new actions/checkout v6 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #307	2026-03-24 08:11:46 -07:00
Erich Blume	044ad7dad7	Revert fly/start.sh to polling loop — tailscale wait needs v1.96.2+ All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m33s Details The Fly container pulls from tailscale/tailscale:stable which is still v1.94.2. The `tailscale wait` command doesn't exist until v1.96.2. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-22 19:44:47 -07:00
Erich Blume	2e46f99820	Upgrade Tailscale operator v1.94.2 → v1.96.3 (#304 ) Some checks failed Deploy Fly.io Proxy / deploy (push) Failing after 7m0s Details ## Summary - Bump Tailscale operator, proxy containers, and init containers from v1.94.2 to v1.96.3 across both clusters (indri + ringtail via shared base kustomization) - Replace hand-rolled `until tailscale status` polling loop in `fly/start.sh` with `tailscale wait --timeout 60s` (new in v1.96.2) - Stamp kube-state-metrics review date (already current at v2.18.0) ## Notable upstream changes (v1.94.2 → v1.96.3) - Go upgraded from 1.25 to 1.26 - `tailscale wait` command — blocks until daemon is running + interface has IP - AuthKey policy now applies only when users are not logged in (behavioral change) - Peer Relay improvements (metrics, EC2 IMDS, UDP socket scaling) - UPnP stability fixes ## Deploy plan 1. Merge PR 2. Sync tailscale-operator on indri: `argocd app sync tailscale-operator` 3. Sync tailscale-operator on ringtail: `argocd app sync tailscale-operator-ringtail --server ringtail...` 4. Verify proxy pods roll with new image: `kubectl --context=minikube-indri -n tailscale get pods` 5. Verify ingress connectivity (spot-check a few `*.tail8d86e.ts.net` services) 6. Rebuild + deploy Fly proxy container (separate step, picks up `tailscale wait` change) ## Test plan - [ ] ArgoCD diff looks clean for both apps before sync - [ ] Proxy pods on indri come up healthy with v1.96.3 images - [ ] Proxy pods on ringtail come up healthy with v1.96.3 images - [ ] Tailscale ingress services remain reachable (e.g., grafana, prometheus) - [ ] Fly proxy rebuild deploys successfully with `tailscale wait` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #304	2026-03-22 19:31:22 -07:00
Erich Blume	a87c997ee1	Expose Forgejo publicly at forge.eblu.me (#278 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m28s Details ## Summary Expose Forgejo publicly at `forge.eblu.me` via the Fly.io reverse proxy — the first dynamic, authenticated public-facing service. - Forgejo hardening: Domain changed to forge.eblu.me, SSH stays on forge.ops.eblu.me, reverse proxy trust headers configured, local registration locked to external-only (Authentik SSO) - Tailscale Ingress: ExternalName Service + Ingress in tailscale-operator creates forge.tail8d86e.ts.net endpoint - Fly.io proxy: nginx server block with rate-limited auth endpoints (3r/s), fail2ban with custom nginx-deny action, security headers, /swagger blocked, WebSocket support, 512m body limit - Authentik: OAuth callback updated to forge.eblu.me - DNS/TLS: CNAME record in Pulumi, cert in fly-setup - Rename: ~29 files updated from forge.ops.eblu.me to forge.eblu.me (HTTPS refs only; SSH, container builds, and Caddy table kept as-is) ## Deployment Order 1. `mise run provision-indri -- --tags forgejo` (config changes) 2. Verify forge.ops.eblu.me still works 3. `argocd app set tailscale-operator --revision feature/forge-public && argocd app sync tailscale-operator` 4. Verify `curl https://forge.tail8d86e.ts.net` 5. `cd fly && fly deploy` 6. Verify pre-DNS: `curl -H "Host: forge.eblu.me" https://blumeops-proxy.fly.dev/` 7. `fly certs add forge.eblu.me -a blumeops-proxy` 8. `argocd app set authentik --revision feature/forge-public && argocd app sync authentik` 9. `mise run dns-preview && mise run dns-up` 10. Full verification (see below) 11. Rehearse `mise run fly-shutoff` 12. After merge: reset ArgoCD revisions to main, re-sync ## Verification Checklist - [ ] forge.eblu.me loads, shows public repos - [ ] forge.ops.eblu.me still works from tailnet - [ ] SSH clone via forge.ops.eblu.me:2222 works - [ ] HTTPS clone via forge.eblu.me works - [ ] UI shows forge.eblu.me for HTTPS clone, forge.ops.eblu.me for SSH - [ ] /swagger returns 403 - [ ] Rapid login attempts trigger 429 rate limit - [ ] fail2ban bans after 5 failed logins in 10 minutes - [ ] ArgoCD can still sync (SSH unaffected) - [ ] `mise run fly-shutoff` stops all public traffic - [ ] `mise run services-check` passes Reviewed-on: #278	2026-03-03 08:40:41 -08:00
Erich Blume	cb9a06bb75	Update tooling dependencies (Feb 2026 cycle) (#254 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m30s Details ## Summary Monthly tooling dependency update cycle: - Pre-commit hooks: trufflehog v3.92.5→v3.93.4, ruff v0.14.13→v0.15.2, shellcheck v0.10.0.1→v0.11.0.1, prettier v3.8.0→v3.8.1, actionlint v1.7.10→v1.7.11 - Fly.io Dockerfile: pin nginx to 1.28.2-alpine (was unpinned), bump alloy v1.5.1→v1.13.1 - Mise tasks: normalize httpx lower bound to >=0.28.0 and typer to >=0.15.0 across all scripts - Forgejo workflows: actions/checkout@v4 is current, no changes needed - New how-to doc: [[update-tooling-dependencies]] documenting this monthly cycle ## No changes needed - pre-commit-hooks v6.0.0, yamllint v1.38.0, shfmt v3.12.0-2, taplo v0.9.3, ansible-lint 26.1.1 — all already at latest ## Test plan - [x] `uvx pre-commit run --all-files` — all 24 hooks pass - [ ] Fly.io deploy (triggered automatically on merge to main via deploy-fly workflow) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/254	2026-02-23 13:08:41 -08:00
Erich Blume	9c789a1868	Fix cache hit rate on APM and Fly.io dashboards (#177 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m19s Details ## Summary - Remove `match_all = true` from `flyio_nginx_cache_requests_total` in Alloy so the metric only counts requests that go through the proxy cache (excludes health checks with empty `cache_status`) - Change dashboard queries from `rate(...[5m])` to `increase(...[$__range])` — aggregates over the full dashboard time window instead of a 5-minute sliding window, giving meaningful ratios for low-traffic static sites - Add null/NaN value mapping to show "No traffic" in neutral color instead of blank/red ## Root cause Health check requests from Fly.io hit the default nginx server block (no `proxy_cache`), producing entries with empty `upstream_cache_status`. With `match_all = true`, these were counted in the cache metric, diluting the Fly.io dashboard ratio. For APM dashboards, `rate()[5m]` on low-traffic sites with 24h cache validity almost always returns either all-HITs (100%) or no data (blank → red background). ## Deployment - Fly.io proxy redeploy needed for Alloy config change - ArgoCD sync for dashboard ConfigMap changes ## Test plan - [ ] Redeploy Fly.io proxy - [ ] Sync grafana-config in ArgoCD - [ ] Verify CV APM cache hit ratio shows a real percentage (not 100%) - [ ] Verify Docs APM shows "No traffic" in neutral color when idle, real ratio when visited - [ ] Verify Fly.io proxy dashboard cache ratio excludes health checks Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/177	2026-02-12 18:40:48 -08:00
Erich Blume	9717863f65	Update CV release to v1.0.3, add X-Clacks-Overhead header (#176 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m5s Details ## Summary - Update CV release URL from v1.0.2 to v1.0.3 - Add `X-Clacks-Overhead: GNU Terry Pratchett` header to both `docs.eblu.me` and `cv.eblu.me` server blocks in the Fly.io proxy nginx config ## Deployment and Testing - [ ] Sync CV app: `argocd app sync cv` - [ ] Verify CV is serving v1.0.3 content - [ ] Deploy fly proxy (workflow or `mise run fly-deploy`) - [ ] Verify header: `curl -sI https://docs.eblu.me | grep -i clacks` - [ ] Verify header: `curl -sI https://cv.eblu.me | grep -i clacks` Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/176	2026-02-12 17:08:22 -08:00
Erich Blume	df372fccb6	Expose CV publicly at cv.eblu.me (#173 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m57s Details ## Summary - Add nginx server block for `cv.eblu.me` (static site, same pattern as docs) - Add DNS CNAME record in Pulumi (`cv.eblu.me` → `blumeops-proxy.fly.dev`) - Add `cv.eblu.me` cert to `fly-setup` mise task - Tag CV Tailscale ingress with `tag:flyio-target` for ACL access - Remove `/_error` test endpoint from docs proxy ## Deployment and Testing - [ ] `argocd app set cv --revision cv/public-cv-eblu-me && argocd app sync cv` - [ ] `fly certs add cv.eblu.me -a blumeops-proxy` - [ ] `mise run fly-deploy` - [ ] Verify proxy: `curl -I -H "Host: cv.eblu.me" https://blumeops-proxy.fly.dev/` - [ ] `mise run dns-preview` then `mise run dns-up` - [ ] Verify live: `curl -I https://cv.eblu.me` - [ ] Merge, then `argocd app set cv --revision main && argocd app sync cv` Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/173	2026-02-12 14:05:00 -08:00
Erich Blume	834c9fa57b	Bump Fly.io proxy VM to 512MB, fix TruffleHog scanning (#152 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m37s Details ## Summary - Bump Fly.io proxy VM memory from 256MB to 512MB — Alloy was OOM-killed, causing the Grafana Fly.io dashboard to lose metrics - Fix TruffleHog pre-commit hook to scan only staged changes (`--since-commit HEAD`) instead of full repo history - Sanitize example credential URL in Reolink camera plan doc ## Deployment and Testing - [ ] Fly.io deploy triggers automatically on merge (workflow watches `fly/**`) - [ ] After deploy, verify Alloy is running: `fly ssh console -a blumeops-proxy -C "ps aux"` should show alloy process - [ ] Grafana Fly.io dashboard should start populating within ~1 minute Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/152	2026-02-11 12:03:51 -08:00
Erich Blume	4ee643a81d	Serve friendly error page when Fly.io proxy upstreams are unreachable (#133 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m50s Details ## Summary - Adds a branded 503 error page served when upstreams are unreachable (indri offline, Tailscale tunnel down, emergency shutoff, etc.) - Stale cache is still served first when available (`proxy_cache_use_stale` takes priority) - Test endpoint at `docs.eblu.me/_error` to preview the page without killing upstreams - `proxy_intercept_errors on` also catches error responses returned by the upstream itself ## Files Changed - `fly/error.html` — Self-contained error page (dark theme, links to BlumeOps repo) - `fly/nginx.conf` — `error_page`, `internal` location, `/_error` test location, `proxy_intercept_errors` - `fly/Dockerfile` — COPY error.html into image ## Test Plan - [ ] Deploy to Fly.io - [ ] Visit `docs.eblu.me/_error` to verify the page renders - [ ] Optionally stop indri/Tailscale to confirm the page shows on real 502/503/504 Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/133	2026-02-09 12:01:24 -08:00
Erich Blume	959b6842bc	Zero-downtime Fly.io deploys (#132 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m40s Details ## Summary - Start nginx after Tailscale connects (community best practice for Tailscale sidecars) - Switch to `bluegreen` deploy strategy — old machine serves until new one is healthy - Replace top-level `[checks]` with `[[http_service.checks]]` — only service-level checks gate traffic routing ([confirmed by Fly.io staff](https://community.fly.io/t/clarifying-the-types-of-health-checks/20379)) - Remove sentinel file and nginx if-check (no longer needed) Supersedes the approach in #131 — that helped (502 window dropped from ~30s to ~3s) but couldn't fully eliminate it because top-level checks don't gate routing and Fly.io's proxy sends traffic as soon as the port is reachable. ## Deployment and Testing - [ ] Merge and `fly deploy` from `fly/` directory - [ ] Verify deploy completes with zero 502s (watch `fly logs` and Grafana docs-apm) - [ ] Confirm `fly checks list` shows the new service-level check passing Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/132	2026-02-09 11:34:19 -08:00
Erich Blume	bd61da4f85	Fix 502 errors during Fly.io proxy deploys (#131 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m20s Details ## Summary - Health check (`/healthz`) now returns 503 until Tailscale is connected - `start.sh` creates `/tmp/tailscale-ready` sentinel after `tailscale up` succeeds - Fly.io keeps the old machine serving traffic during the ~7s startup window Previously, nginx passed the health check immediately, Fly.io routed traffic to the new machine, but MagicDNS wasn't available yet — causing upstream DNS timeouts and 502s on every request until Tailscale connected. ## Deployment and Testing - [ ] Merge and `fly deploy` from `fly/` directory - [ ] Verify deploy completes with zero 502s (check Grafana docs-apm dashboard) - [ ] Confirm health check transitions from 503 → 200 in `fly logs` Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/131	2026-02-09 11:07:36 -08:00
Erich Blume	3415cad38c	Log real client IPs via Fly-Client-IP header (#130 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 59s Details ## Summary - Add `client_ip` field to the Fly.io nginx JSON log format, sourced from `Fly-Client-IP` header - Extract `client_ip` in the Alloy pipeline so it's available as a parsed field in Loki - Keeps `remote_addr` (the internal proxy IP) for debugging Fixes: Grafana access logs for docs.eblu.me showing 172.16.11.178 for every request instead of real visitor IPs. ## Deployment and Testing - [ ] Deploy updated fly.io proxy: `fly deploy` from `fly/` directory - [ ] Verify in Grafana that new log lines include `client_ip` with real IPs - [ ] Confirm `remote_addr` still shows the proxy IP (preserved for debugging) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/130	2026-02-09 11:02:06 -08:00
Erich Blume	c6f8fcd346	Fix fly-deploy WARNING by starting nginx before Tailscale (#128 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m3s Details ## Summary - Start nginx before Tailscale in `start.sh` so port 8080 is bound immediately, eliminating the "app is not listening on the expected address" WARNING during `fly deploy` - Switch `proxy_pass` to use a variable with `resolver 100.100.100.100 valid=30s` so nginx can start without resolving MagicDNS names at config load time - DNS results cached 30s per worker — no per-request lookup overhead ## Context The WARNING was a race condition: Fly checks for listeners right after the machine starts, but `start.sh` ran ~5-10s of Tailscale setup before starting nginx. The health check always passed later, but the warning was noisy. ## Test plan - [ ] Merge and let the deploy-fly workflow trigger - [ ] Check runner logs for absence of the WARNING - [ ] Verify `docs.eblu.me` still serves correctly - [ ] Verify `/healthz` still passes Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/128	2026-02-09 07:01:58 -08:00
Erich Blume	e6cf7e47e0	Restrict flyio-proxy ACLs to dedicated tag:flyio-target endpoints (#126 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m8s Details ## Summary - Introduce `tag:flyio-target` so services must explicitly opt in to be reachable by the fly.io proxy - Replace broad `tag:k8s` and `tag:homelab` grants with the new tag in the ACL rule and test - Add `tailscale.com/tags: "tag:k8s,tag:flyio-target"` annotation to docs, loki, and prometheus Ingresses - Switch Alloy push endpoints from `.ops.eblu.me` (Caddy) to `.tail8d86e.ts.net` (Tailscale Ingress) - Update docs: flyio-proxy, caddy, tailscale, forgejo (future public access + security checklist), expose-service-publicly ## Manual step (not in PR) Update the k8s operator OAuth client in the Tailscale admin console to include `tag:flyio-target` in its scope. Without this, the operator cannot assign the new tag to Ingress proxy nodes. ## Deployment order 1. Pulumi ACLs — `mise run tailnet-preview && mise run tailnet-up` 2. OAuth client — Manual update in Tailscale admin console 3. K8s Ingresses — `argocd app sync apps && argocd app sync docs loki prometheus` 4. Fly.io proxy — `mise run fly-deploy` 5. Verify — `mise run services-check`, check Grafana dashboards ## Test plan - [ ] `mise run tailnet-preview` shows clean diff - [ ] `argocd app diff docs`, `argocd app diff loki`, `argocd app diff prometheus` show only annotation additions - [ ] After deploy: Grafana dashboards show continued log/metric flow - [ ] `curl -sf https://docs.eblu.me` returns 200 - [ ] `mise run services-check` passes 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/126	2026-02-08 21:54:18 -08:00
Erich Blume	cc54b4f565	Add Fly.io proxy observability via embedded Alloy (#123) ## Summary - Embed Grafana Alloy in the Fly.io proxy container to collect nginx JSON access logs (→ Loki) and derive request rate, latency histogram, cache status, and bandwidth metrics (→ Prometheus) - Add nginx `stub_status` endpoint for connection-level metrics (active/reading/writing/waiting) - Create two Grafana dashboards: Docs APM (per-service view filtered by `host="docs.eblu.me"`) and Fly.io Proxy Health (aggregate proxy health across all upstream services) ## Changed Files \| File \| Change \| \|------\|--------\| \| `fly/nginx.conf` \| Add JSON `log_format` + `access_log`, add `stub_status` endpoint \| \| `fly/Dockerfile` \| COPY Alloy binary from `grafana/alloy:v1.5.1`, COPY `alloy.river` config \| \| `fly/alloy.river` \| New — Alloy config: log tailing, metric extraction, remote_write \| \| `fly/start.sh` \| Start Alloy after Tailscale, before nginx \| \| `argocd/manifests/grafana-config/dashboards/configmap-docs-apm.yaml` \| New — Docs APM dashboard \| \| `argocd/manifests/grafana-config/dashboards/configmap-flyio.yaml` \| New — Fly.io Proxy Health dashboard \| \| `argocd/manifests/grafana-config/kustomization.yaml` \| Register new dashboard configmaps \| \| `docs/reference/services/flyio-proxy.md` \| Document observability setup \| ## Deployment and Testing - [ ] `mise run fly-deploy` — rebuild container with Alloy - [ ] `curl https://docs.eblu.me/` — generate traffic - [ ] `fly logs -a blumeops-proxy` — verify Alloy startup - [ ] Query Prometheus: `flyio_nginx_http_requests_total{instance="flyio-proxy"}` - [ ] Query Loki: `{instance="flyio-proxy", job="flyio-nginx"}` - [ ] `argocd app sync grafana-config` — deploy dashboards - [ ] Verify dashboards show data in Grafana - [ ] `mise run services-check` — no regressions Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/123	2026-02-08 10:05:38 -08:00
Erich Blume	64a78422b1	Add Fly.io public reverse proxy for docs.eblu.me (#120) ## Summary - Adds a Fly.io reverse proxy (`blumeops-proxy`) that tunnels public traffic to homelab services over Tailscale - First service exposed: `docs.eblu.me` — the Quartz static docs site - Includes Pulumi IaC for Tailscale auth key/ACLs and Gandi DNS CNAME - Adds mise tasks (`fly-deploy`, `fly-setup`, `fly-shutoff`) and Forgejo CI workflow ## Key details - Fly.io Firecracker VMs support TUN devices natively — no userspace networking needed - Tailscale auth key is `preauthorized=True` to avoid device approval hangs on container restarts - nginx caches aggressively for the static site; health check is on the default_server block - ACLs restrict `tag:flyio-proxy` to `tag:k8s` on port 443 only - DNS CNAME deployed and verified: `docs.eblu.me` → `blumeops-proxy.fly.dev` ## Test plan - [x] `curl -sf https://blumeops-proxy.fly.dev/healthz` returns `ok` - [x] `curl -I -H "Host: docs.eblu.me" https://blumeops-proxy.fly.dev/` returns 200 with `X-Cache-Status` - [x] `curl -I https://docs.eblu.me/` returns 200 with valid Let's Encrypt cert - [x] `dig forge.ops.eblu.me` still resolves to 100.98.163.89 (private services unaffected) - [x] Set `FLY_DEPLOY_TOKEN` Forgejo Actions secret for CI auto-deploy 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/120	2026-02-08 02:36:19 -08:00

35 commits