blumeops

Author	SHA1	Message	Date
Erich Blume	903db4079d	Fix upstream keepalive: set proxy_ssl_name for correct SNI With upstream blocks, nginx sends the block name as SNI instead of the actual hostname. The Tailscale Ingress proxy needs the correct SNI to route TLS connections. Add explicit proxy_ssl_name for each upstream, and set Host header for docs/cv backends. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-17 15:51:51 -07:00
Erich Blume	1236d381eb	Wait for MagicDNS readiness before starting nginx Upstream blocks resolve DNS at config load. If MagicDNS isn't ready yet (Tailscale just connected), nginx gets empty resolution and returns 502. Poll nslookup until resolution works before launching nginx. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-17 15:47:21 -07:00
Erich Blume	6a1d9cc0bf	Switch Fly proxy to upstream keepalive pools Replace per-request DNS resolution (variable-based proxy_pass) with static upstream blocks and keepalive connection pools. This reuses TLS connections through the Tailscale tunnel instead of handshaking per request, which should significantly reduce latency at >1 req/s. Trade-off: DNS is resolved at config load, not per-request. If Tailscale Ingress pods get new IPs, run `mise run fly-reload` to re-resolve. Also adds mise-tasks/fly-reload for nginx config reload without full redeploy. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-17 15:42:57 -07:00
Erich Blume	54b1cee950	Fix Connection header: only send 'upgrade' for WebSocket requests Some checks failed Deploy Fly.io Proxy / deploy (push) Failing after 1m35s Details Was sending Connection: upgrade on every proxied request, which is semantically wrong for normal HTTP traffic. Use a map to conditionally send 'upgrade' only when the client requests a WebSocket switch, 'close' otherwise. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-17 15:27:40 -07:00
Erich Blume	d7af004842	Add Forgejo metrics + upstream latency histogram to Fly proxy dashboard All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m53s Details - Enable Forgejo /metrics endpoint (app.ini [metrics] section) - Add Alloy scrape target for Forgejo metrics on indri - Add upstream_response_time histogram to Fly proxy Alloy config - Replace single p95 panel with p50/p90/p99 + upstream breakdown filtered to forge.eblu.me host Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-17 15:05:59 -07:00
Erich Blume	8fccbda573	Extend Fly proxy latency histogram buckets to 60s All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m29s Details Previous max bucket was 10s — all slower requests collapsed into +Inf, making p50/p90/p99 unreadable during the Forgejo archive DoS. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-17 14:50:28 -07:00
Erich Blume	1631e11137	Add /user/ to forge robots.txt exclusion All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m47s Details Crawlers follow auth redirects to /user/login which is pointless for them. Saves round-trips for both sides. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-17 14:34:24 -07:00
Erich Blume	7a42aeb77c	Mitigate Forgejo archive endpoint DoS from crawler abuse All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m35s Details Crawlers hitting /archive/ endpoints with unique commit SHAs generated 54GB of git bundles in 2 days, pegging Forgejo at 43% CPU. Fix at multiple layers: - Redirect archive requests to tailnet at Fly proxy (302) - Expand robots.txt: block /users/, //archive/, //releases/download/ - Cache release artifact downloads at nginx (immutable, 7d TTL) - Enable [cron.archive_cleanup] with 2h TTL and run-at-start Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-17 14:21:22 -07:00
Erich Blume	7f6bbdc82c	Add robots.txt to forge.eblu.me blocking crawlers from /mirrors/ All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 2m19s Details Facebook has been scraping forge mirror repos at ~3-4 req/s, slowing down the Forgejo instance. Serve robots.txt directly from nginx to disallow /mirrors/ while leaving eblume/* accessible to crawlers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 15:39:48 -07:00
Erich Blume	e02305e72d	Pin Fly.io Tailscale to v1.94.1 to fix MagicDNS regression in v1.96.5 All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 2m20s Details Tailscale :stable pulled v1.96.5 during last deploy, which returns SERVFAIL for tailnet DNS names (no upstream resolvers set). This broke all public routing (forge/docs/cv.eblu.me) through the Fly proxy. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 19:32:38 -07:00
Erich Blume	a75f28e073	Fix fly.io proxy rate limit to key on real client IP All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 2m24s Details The general rate limit zone used $binary_remote_addr (Fly's internal proxy IP), causing all external clients to share one bucket. Switch to $http_fly_client_ip to match forge_auth's correct behavior. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 19:00:33 -07:00
Erich Blume	0d422f5234	Update tooling dependencies (March 2026) (#307 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 2m51s Details ## Summary Monthly tooling dependency update per [[update-tooling-dependencies]]. - Prek hooks: trufflehog v3.93.4→v3.94.0, ruff v0.15.2→v0.15.7, shfmt v3.12.0-2→v3.13.0-1, ansible-lint floor→26.3.0, ansible-core floor→2.18 - Fly.io proxy: nginx 1.28.2→1.29.6, Grafana Alloy v1.13.1→v1.14.1 - Forgejo workflows: actions/checkout v4.3.1→v6.0.2 (SHA-pinned across all 5 workflows) - Mise tasks: tightened Python lower bounds — rich≥14.0.0, typer≥0.24.0, httpx≥0.28.1, pyyaml≥6.0.2 ## Test plan - [x] `prek run --all-files` passes - [ ] Verify Fly.io deploy succeeds after merge (nginx minor bump + Alloy bump) - [ ] Spot-check a workflow run with the new actions/checkout v6 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #307	2026-03-24 08:11:46 -07:00
Erich Blume	044ad7dad7	Revert fly/start.sh to polling loop — tailscale wait needs v1.96.2+ All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m33s Details The Fly container pulls from tailscale/tailscale:stable which is still v1.94.2. The `tailscale wait` command doesn't exist until v1.96.2. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-22 19:44:47 -07:00
Erich Blume	2e46f99820	Upgrade Tailscale operator v1.94.2 → v1.96.3 (#304 ) Some checks failed Deploy Fly.io Proxy / deploy (push) Failing after 7m0s Details ## Summary - Bump Tailscale operator, proxy containers, and init containers from v1.94.2 to v1.96.3 across both clusters (indri + ringtail via shared base kustomization) - Replace hand-rolled `until tailscale status` polling loop in `fly/start.sh` with `tailscale wait --timeout 60s` (new in v1.96.2) - Stamp kube-state-metrics review date (already current at v2.18.0) ## Notable upstream changes (v1.94.2 → v1.96.3) - Go upgraded from 1.25 to 1.26 - `tailscale wait` command — blocks until daemon is running + interface has IP - AuthKey policy now applies only when users are not logged in (behavioral change) - Peer Relay improvements (metrics, EC2 IMDS, UDP socket scaling) - UPnP stability fixes ## Deploy plan 1. Merge PR 2. Sync tailscale-operator on indri: `argocd app sync tailscale-operator` 3. Sync tailscale-operator on ringtail: `argocd app sync tailscale-operator-ringtail --server ringtail...` 4. Verify proxy pods roll with new image: `kubectl --context=minikube-indri -n tailscale get pods` 5. Verify ingress connectivity (spot-check a few `*.tail8d86e.ts.net` services) 6. Rebuild + deploy Fly proxy container (separate step, picks up `tailscale wait` change) ## Test plan - [ ] ArgoCD diff looks clean for both apps before sync - [ ] Proxy pods on indri come up healthy with v1.96.3 images - [ ] Proxy pods on ringtail come up healthy with v1.96.3 images - [ ] Tailscale ingress services remain reachable (e.g., grafana, prometheus) - [ ] Fly proxy rebuild deploys successfully with `tailscale wait` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #304	2026-03-22 19:31:22 -07:00
Erich Blume	a87c997ee1	Expose Forgejo publicly at forge.eblu.me (#278 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m28s Details ## Summary Expose Forgejo publicly at `forge.eblu.me` via the Fly.io reverse proxy — the first dynamic, authenticated public-facing service. - Forgejo hardening: Domain changed to forge.eblu.me, SSH stays on forge.ops.eblu.me, reverse proxy trust headers configured, local registration locked to external-only (Authentik SSO) - Tailscale Ingress: ExternalName Service + Ingress in tailscale-operator creates forge.tail8d86e.ts.net endpoint - Fly.io proxy: nginx server block with rate-limited auth endpoints (3r/s), fail2ban with custom nginx-deny action, security headers, /swagger blocked, WebSocket support, 512m body limit - Authentik: OAuth callback updated to forge.eblu.me - DNS/TLS: CNAME record in Pulumi, cert in fly-setup - Rename: ~29 files updated from forge.ops.eblu.me to forge.eblu.me (HTTPS refs only; SSH, container builds, and Caddy table kept as-is) ## Deployment Order 1. `mise run provision-indri -- --tags forgejo` (config changes) 2. Verify forge.ops.eblu.me still works 3. `argocd app set tailscale-operator --revision feature/forge-public && argocd app sync tailscale-operator` 4. Verify `curl https://forge.tail8d86e.ts.net` 5. `cd fly && fly deploy` 6. Verify pre-DNS: `curl -H "Host: forge.eblu.me" https://blumeops-proxy.fly.dev/` 7. `fly certs add forge.eblu.me -a blumeops-proxy` 8. `argocd app set authentik --revision feature/forge-public && argocd app sync authentik` 9. `mise run dns-preview && mise run dns-up` 10. Full verification (see below) 11. Rehearse `mise run fly-shutoff` 12. After merge: reset ArgoCD revisions to main, re-sync ## Verification Checklist - [ ] forge.eblu.me loads, shows public repos - [ ] forge.ops.eblu.me still works from tailnet - [ ] SSH clone via forge.ops.eblu.me:2222 works - [ ] HTTPS clone via forge.eblu.me works - [ ] UI shows forge.eblu.me for HTTPS clone, forge.ops.eblu.me for SSH - [ ] /swagger returns 403 - [ ] Rapid login attempts trigger 429 rate limit - [ ] fail2ban bans after 5 failed logins in 10 minutes - [ ] ArgoCD can still sync (SSH unaffected) - [ ] `mise run fly-shutoff` stops all public traffic - [ ] `mise run services-check` passes Reviewed-on: #278	2026-03-03 08:40:41 -08:00
Erich Blume	cb9a06bb75	Update tooling dependencies (Feb 2026 cycle) (#254 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m30s Details ## Summary Monthly tooling dependency update cycle: - Pre-commit hooks: trufflehog v3.92.5→v3.93.4, ruff v0.14.13→v0.15.2, shellcheck v0.10.0.1→v0.11.0.1, prettier v3.8.0→v3.8.1, actionlint v1.7.10→v1.7.11 - Fly.io Dockerfile: pin nginx to 1.28.2-alpine (was unpinned), bump alloy v1.5.1→v1.13.1 - Mise tasks: normalize httpx lower bound to >=0.28.0 and typer to >=0.15.0 across all scripts - Forgejo workflows: actions/checkout@v4 is current, no changes needed - New how-to doc: [[update-tooling-dependencies]] documenting this monthly cycle ## No changes needed - pre-commit-hooks v6.0.0, yamllint v1.38.0, shfmt v3.12.0-2, taplo v0.9.3, ansible-lint 26.1.1 — all already at latest ## Test plan - [x] `uvx pre-commit run --all-files` — all 24 hooks pass - [ ] Fly.io deploy (triggered automatically on merge to main via deploy-fly workflow) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/254	2026-02-23 13:08:41 -08:00
Erich Blume	9c789a1868	Fix cache hit rate on APM and Fly.io dashboards (#177 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m19s Details ## Summary - Remove `match_all = true` from `flyio_nginx_cache_requests_total` in Alloy so the metric only counts requests that go through the proxy cache (excludes health checks with empty `cache_status`) - Change dashboard queries from `rate(...[5m])` to `increase(...[$__range])` — aggregates over the full dashboard time window instead of a 5-minute sliding window, giving meaningful ratios for low-traffic static sites - Add null/NaN value mapping to show "No traffic" in neutral color instead of blank/red ## Root cause Health check requests from Fly.io hit the default nginx server block (no `proxy_cache`), producing entries with empty `upstream_cache_status`. With `match_all = true`, these were counted in the cache metric, diluting the Fly.io dashboard ratio. For APM dashboards, `rate()[5m]` on low-traffic sites with 24h cache validity almost always returns either all-HITs (100%) or no data (blank → red background). ## Deployment - Fly.io proxy redeploy needed for Alloy config change - ArgoCD sync for dashboard ConfigMap changes ## Test plan - [ ] Redeploy Fly.io proxy - [ ] Sync grafana-config in ArgoCD - [ ] Verify CV APM cache hit ratio shows a real percentage (not 100%) - [ ] Verify Docs APM shows "No traffic" in neutral color when idle, real ratio when visited - [ ] Verify Fly.io proxy dashboard cache ratio excludes health checks Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/177	2026-02-12 18:40:48 -08:00
Erich Blume	9717863f65	Update CV release to v1.0.3, add X-Clacks-Overhead header (#176 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m5s Details ## Summary - Update CV release URL from v1.0.2 to v1.0.3 - Add `X-Clacks-Overhead: GNU Terry Pratchett` header to both `docs.eblu.me` and `cv.eblu.me` server blocks in the Fly.io proxy nginx config ## Deployment and Testing - [ ] Sync CV app: `argocd app sync cv` - [ ] Verify CV is serving v1.0.3 content - [ ] Deploy fly proxy (workflow or `mise run fly-deploy`) - [ ] Verify header: `curl -sI https://docs.eblu.me \| grep -i clacks` - [ ] Verify header: `curl -sI https://cv.eblu.me \| grep -i clacks` Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/176	2026-02-12 17:08:22 -08:00
Erich Blume	df372fccb6	Expose CV publicly at cv.eblu.me (#173 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m57s Details ## Summary - Add nginx server block for `cv.eblu.me` (static site, same pattern as docs) - Add DNS CNAME record in Pulumi (`cv.eblu.me` → `blumeops-proxy.fly.dev`) - Add `cv.eblu.me` cert to `fly-setup` mise task - Tag CV Tailscale ingress with `tag:flyio-target` for ACL access - Remove `/_error` test endpoint from docs proxy ## Deployment and Testing - [ ] `argocd app set cv --revision cv/public-cv-eblu-me && argocd app sync cv` - [ ] `fly certs add cv.eblu.me -a blumeops-proxy` - [ ] `mise run fly-deploy` - [ ] Verify proxy: `curl -I -H "Host: cv.eblu.me" https://blumeops-proxy.fly.dev/` - [ ] `mise run dns-preview` then `mise run dns-up` - [ ] Verify live: `curl -I https://cv.eblu.me` - [ ] Merge, then `argocd app set cv --revision main && argocd app sync cv` Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/173	2026-02-12 14:05:00 -08:00
Erich Blume	834c9fa57b	Bump Fly.io proxy VM to 512MB, fix TruffleHog scanning (#152 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m37s Details ## Summary - Bump Fly.io proxy VM memory from 256MB to 512MB — Alloy was OOM-killed, causing the Grafana Fly.io dashboard to lose metrics - Fix TruffleHog pre-commit hook to scan only staged changes (`--since-commit HEAD`) instead of full repo history - Sanitize example credential URL in Reolink camera plan doc ## Deployment and Testing - [ ] Fly.io deploy triggers automatically on merge (workflow watches `fly/**`) - [ ] After deploy, verify Alloy is running: `fly ssh console -a blumeops-proxy -C "ps aux"` should show alloy process - [ ] Grafana Fly.io dashboard should start populating within ~1 minute Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/152	2026-02-11 12:03:51 -08:00
Erich Blume	4ee643a81d	Serve friendly error page when Fly.io proxy upstreams are unreachable (#133 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m50s Details ## Summary - Adds a branded 503 error page served when upstreams are unreachable (indri offline, Tailscale tunnel down, emergency shutoff, etc.) - Stale cache is still served first when available (`proxy_cache_use_stale` takes priority) - Test endpoint at `docs.eblu.me/_error` to preview the page without killing upstreams - `proxy_intercept_errors on` also catches error responses returned by the upstream itself ## Files Changed - `fly/error.html` — Self-contained error page (dark theme, links to BlumeOps repo) - `fly/nginx.conf` — `error_page`, `internal` location, `/_error` test location, `proxy_intercept_errors` - `fly/Dockerfile` — COPY error.html into image ## Test Plan - [ ] Deploy to Fly.io - [ ] Visit `docs.eblu.me/_error` to verify the page renders - [ ] Optionally stop indri/Tailscale to confirm the page shows on real 502/503/504 Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/133	2026-02-09 12:01:24 -08:00
Erich Blume	959b6842bc	Zero-downtime Fly.io deploys (#132 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m40s Details ## Summary - Start nginx after Tailscale connects (community best practice for Tailscale sidecars) - Switch to `bluegreen` deploy strategy — old machine serves until new one is healthy - Replace top-level `[checks]` with `[[http_service.checks]]` — only service-level checks gate traffic routing ([confirmed by Fly.io staff](https://community.fly.io/t/clarifying-the-types-of-health-checks/20379)) - Remove sentinel file and nginx if-check (no longer needed) Supersedes the approach in #131 — that helped (502 window dropped from ~30s to ~3s) but couldn't fully eliminate it because top-level checks don't gate routing and Fly.io's proxy sends traffic as soon as the port is reachable. ## Deployment and Testing - [ ] Merge and `fly deploy` from `fly/` directory - [ ] Verify deploy completes with zero 502s (watch `fly logs` and Grafana docs-apm) - [ ] Confirm `fly checks list` shows the new service-level check passing Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/132	2026-02-09 11:34:19 -08:00
Erich Blume	bd61da4f85	Fix 502 errors during Fly.io proxy deploys (#131 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m20s Details ## Summary - Health check (`/healthz`) now returns 503 until Tailscale is connected - `start.sh` creates `/tmp/tailscale-ready` sentinel after `tailscale up` succeeds - Fly.io keeps the old machine serving traffic during the ~7s startup window Previously, nginx passed the health check immediately, Fly.io routed traffic to the new machine, but MagicDNS wasn't available yet — causing upstream DNS timeouts and 502s on every request until Tailscale connected. ## Deployment and Testing - [ ] Merge and `fly deploy` from `fly/` directory - [ ] Verify deploy completes with zero 502s (check Grafana docs-apm dashboard) - [ ] Confirm health check transitions from 503 → 200 in `fly logs` Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/131	2026-02-09 11:07:36 -08:00
Erich Blume	3415cad38c	Log real client IPs via Fly-Client-IP header (#130 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 59s Details ## Summary - Add `client_ip` field to the Fly.io nginx JSON log format, sourced from `Fly-Client-IP` header - Extract `client_ip` in the Alloy pipeline so it's available as a parsed field in Loki - Keeps `remote_addr` (the internal proxy IP) for debugging Fixes: Grafana access logs for docs.eblu.me showing 172.16.11.178 for every request instead of real visitor IPs. ## Deployment and Testing - [ ] Deploy updated fly.io proxy: `fly deploy` from `fly/` directory - [ ] Verify in Grafana that new log lines include `client_ip` with real IPs - [ ] Confirm `remote_addr` still shows the proxy IP (preserved for debugging) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/130	2026-02-09 11:02:06 -08:00
Erich Blume	c6f8fcd346	Fix fly-deploy WARNING by starting nginx before Tailscale (#128 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m3s Details ## Summary - Start nginx before Tailscale in `start.sh` so port 8080 is bound immediately, eliminating the "app is not listening on the expected address" WARNING during `fly deploy` - Switch `proxy_pass` to use a variable with `resolver 100.100.100.100 valid=30s` so nginx can start without resolving MagicDNS names at config load time - DNS results cached 30s per worker — no per-request lookup overhead ## Context The WARNING was a race condition: Fly checks for listeners right after the machine starts, but `start.sh` ran ~5-10s of Tailscale setup before starting nginx. The health check always passed later, but the warning was noisy. ## Test plan - [ ] Merge and let the deploy-fly workflow trigger - [ ] Check runner logs for absence of the WARNING - [ ] Verify `docs.eblu.me` still serves correctly - [ ] Verify `/healthz` still passes Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/128	2026-02-09 07:01:58 -08:00
Erich Blume	e6cf7e47e0	Restrict flyio-proxy ACLs to dedicated tag:flyio-target endpoints (#126 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m8s Details ## Summary - Introduce `tag:flyio-target` so services must explicitly opt in to be reachable by the fly.io proxy - Replace broad `tag:k8s` and `tag:homelab` grants with the new tag in the ACL rule and test - Add `tailscale.com/tags: "tag:k8s,tag:flyio-target"` annotation to docs, loki, and prometheus Ingresses - Switch Alloy push endpoints from `.ops.eblu.me` (Caddy) to `.tail8d86e.ts.net` (Tailscale Ingress) - Update docs: flyio-proxy, caddy, tailscale, forgejo (future public access + security checklist), expose-service-publicly ## Manual step (not in PR) Update the k8s operator OAuth client in the Tailscale admin console to include `tag:flyio-target` in its scope. Without this, the operator cannot assign the new tag to Ingress proxy nodes. ## Deployment order 1. Pulumi ACLs — `mise run tailnet-preview && mise run tailnet-up` 2. OAuth client — Manual update in Tailscale admin console 3. K8s Ingresses — `argocd app sync apps && argocd app sync docs loki prometheus` 4. Fly.io proxy — `mise run fly-deploy` 5. Verify — `mise run services-check`, check Grafana dashboards ## Test plan - [ ] `mise run tailnet-preview` shows clean diff - [ ] `argocd app diff docs`, `argocd app diff loki`, `argocd app diff prometheus` show only annotation additions - [ ] After deploy: Grafana dashboards show continued log/metric flow - [ ] `curl -sf https://docs.eblu.me` returns 200 - [ ] `mise run services-check` passes 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/126	2026-02-08 21:54:18 -08:00
Erich Blume	cc54b4f565	Add Fly.io proxy observability via embedded Alloy (#123 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m16s Details ## Summary - Embed Grafana Alloy in the Fly.io proxy container to collect nginx JSON access logs (→ Loki) and derive request rate, latency histogram, cache status, and bandwidth metrics (→ Prometheus) - Add nginx `stub_status` endpoint for connection-level metrics (active/reading/writing/waiting) - Create two Grafana dashboards: Docs APM (per-service view filtered by `host="docs.eblu.me"`) and Fly.io Proxy Health (aggregate proxy health across all upstream services) ## Changed Files \| File \| Change \| \|------\|--------\| \| `fly/nginx.conf` \| Add JSON `log_format` + `access_log`, add `stub_status` endpoint \| \| `fly/Dockerfile` \| COPY Alloy binary from `grafana/alloy:v1.5.1`, COPY `alloy.river` config \| \| `fly/alloy.river` \| New — Alloy config: log tailing, metric extraction, remote_write \| \| `fly/start.sh` \| Start Alloy after Tailscale, before nginx \| \| `argocd/manifests/grafana-config/dashboards/configmap-docs-apm.yaml` \| New — Docs APM dashboard \| \| `argocd/manifests/grafana-config/dashboards/configmap-flyio.yaml` \| New — Fly.io Proxy Health dashboard \| \| `argocd/manifests/grafana-config/kustomization.yaml` \| Register new dashboard configmaps \| \| `docs/reference/services/flyio-proxy.md` \| Document observability setup \| ## Deployment and Testing - [ ] `mise run fly-deploy` — rebuild container with Alloy - [ ] `curl https://docs.eblu.me/` — generate traffic - [ ] `fly logs -a blumeops-proxy` — verify Alloy startup - [ ] Query Prometheus: `flyio_nginx_http_requests_total{instance="flyio-proxy"}` - [ ] Query Loki: `{instance="flyio-proxy", job="flyio-nginx"}` - [ ] `argocd app sync grafana-config` — deploy dashboards - [ ] Verify dashboards show data in Grafana - [ ] `mise run services-check` — no regressions Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/123	2026-02-08 10:05:38 -08:00
Erich Blume	64a78422b1	Add Fly.io public reverse proxy for docs.eblu.me (#120 ) Some checks failed Deploy Fly.io Proxy / deploy (push) Failing after 9s Details ## Summary - Adds a Fly.io reverse proxy (`blumeops-proxy`) that tunnels public traffic to homelab services over Tailscale - First service exposed: `docs.eblu.me` — the Quartz static docs site - Includes Pulumi IaC for Tailscale auth key/ACLs and Gandi DNS CNAME - Adds mise tasks (`fly-deploy`, `fly-setup`, `fly-shutoff`) and Forgejo CI workflow ## Key details - Fly.io Firecracker VMs support TUN devices natively — no userspace networking needed - Tailscale auth key is `preauthorized=True` to avoid device approval hangs on container restarts - nginx caches aggressively for the static site; health check is on the default_server block - ACLs restrict `tag:flyio-proxy` to `tag:k8s` on port 443 only - DNS CNAME deployed and verified: `docs.eblu.me` → `blumeops-proxy.fly.dev` ## Test plan - [x] `curl -sf https://blumeops-proxy.fly.dev/healthz` returns `ok` - [x] `curl -I -H "Host: docs.eblu.me" https://blumeops-proxy.fly.dev/` returns 200 with `X-Cache-Status` - [x] `curl -I https://docs.eblu.me/` returns 200 with valid Let's Encrypt cert - [x] `dig forge.ops.eblu.me` still resolves to 100.98.163.89 (private services unaffected) - [x] Set `FLY_DEPLOY_TOKEN` Forgejo Actions secret for CI auto-deploy 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/120	2026-02-08 02:36:19 -08:00

28 commits