blumeops

Author	SHA1	Message	Date
Erich Blume	4ee643a81d	Serve friendly error page when Fly.io proxy upstreams are unreachable (#133 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m50s Details ## Summary - Adds a branded 503 error page served when upstreams are unreachable (indri offline, Tailscale tunnel down, emergency shutoff, etc.) - Stale cache is still served first when available (`proxy_cache_use_stale` takes priority) - Test endpoint at `docs.eblu.me/_error` to preview the page without killing upstreams - `proxy_intercept_errors on` also catches error responses returned by the upstream itself ## Files Changed - `fly/error.html` — Self-contained error page (dark theme, links to BlumeOps repo) - `fly/nginx.conf` — `error_page`, `internal` location, `/_error` test location, `proxy_intercept_errors` - `fly/Dockerfile` — COPY error.html into image ## Test Plan - [ ] Deploy to Fly.io - [ ] Visit `docs.eblu.me/_error` to verify the page renders - [ ] Optionally stop indri/Tailscale to confirm the page shows on real 502/503/504 Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/133	2026-02-09 12:01:24 -08:00
Erich Blume	959b6842bc	Zero-downtime Fly.io deploys (#132 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m40s Details ## Summary - Start nginx after Tailscale connects (community best practice for Tailscale sidecars) - Switch to `bluegreen` deploy strategy — old machine serves until new one is healthy - Replace top-level `[checks]` with `[[http_service.checks]]` — only service-level checks gate traffic routing ([confirmed by Fly.io staff](https://community.fly.io/t/clarifying-the-types-of-health-checks/20379)) - Remove sentinel file and nginx if-check (no longer needed) Supersedes the approach in #131 — that helped (502 window dropped from ~30s to ~3s) but couldn't fully eliminate it because top-level checks don't gate routing and Fly.io's proxy sends traffic as soon as the port is reachable. ## Deployment and Testing - [ ] Merge and `fly deploy` from `fly/` directory - [ ] Verify deploy completes with zero 502s (watch `fly logs` and Grafana docs-apm) - [ ] Confirm `fly checks list` shows the new service-level check passing Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/132	2026-02-09 11:34:19 -08:00
Erich Blume	bd61da4f85	Fix 502 errors during Fly.io proxy deploys (#131 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m20s Details ## Summary - Health check (`/healthz`) now returns 503 until Tailscale is connected - `start.sh` creates `/tmp/tailscale-ready` sentinel after `tailscale up` succeeds - Fly.io keeps the old machine serving traffic during the ~7s startup window Previously, nginx passed the health check immediately, Fly.io routed traffic to the new machine, but MagicDNS wasn't available yet — causing upstream DNS timeouts and 502s on every request until Tailscale connected. ## Deployment and Testing - [ ] Merge and `fly deploy` from `fly/` directory - [ ] Verify deploy completes with zero 502s (check Grafana docs-apm dashboard) - [ ] Confirm health check transitions from 503 → 200 in `fly logs` Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/131	2026-02-09 11:07:36 -08:00
Erich Blume	3415cad38c	Log real client IPs via Fly-Client-IP header (#130 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 59s Details ## Summary - Add `client_ip` field to the Fly.io nginx JSON log format, sourced from `Fly-Client-IP` header - Extract `client_ip` in the Alloy pipeline so it's available as a parsed field in Loki - Keeps `remote_addr` (the internal proxy IP) for debugging Fixes: Grafana access logs for docs.eblu.me showing 172.16.11.178 for every request instead of real visitor IPs. ## Deployment and Testing - [ ] Deploy updated fly.io proxy: `fly deploy` from `fly/` directory - [ ] Verify in Grafana that new log lines include `client_ip` with real IPs - [ ] Confirm `remote_addr` still shows the proxy IP (preserved for debugging) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/130	2026-02-09 11:02:06 -08:00
Erich Blume	c6f8fcd346	Fix fly-deploy WARNING by starting nginx before Tailscale (#128 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m3s Details ## Summary - Start nginx before Tailscale in `start.sh` so port 8080 is bound immediately, eliminating the "app is not listening on the expected address" WARNING during `fly deploy` - Switch `proxy_pass` to use a variable with `resolver 100.100.100.100 valid=30s` so nginx can start without resolving MagicDNS names at config load time - DNS results cached 30s per worker — no per-request lookup overhead ## Context The WARNING was a race condition: Fly checks for listeners right after the machine starts, but `start.sh` ran ~5-10s of Tailscale setup before starting nginx. The health check always passed later, but the warning was noisy. ## Test plan - [ ] Merge and let the deploy-fly workflow trigger - [ ] Check runner logs for absence of the WARNING - [ ] Verify `docs.eblu.me` still serves correctly - [ ] Verify `/healthz` still passes Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/128	2026-02-09 07:01:58 -08:00
Erich Blume	e6cf7e47e0	Restrict flyio-proxy ACLs to dedicated tag:flyio-target endpoints (#126 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m8s Details ## Summary - Introduce `tag:flyio-target` so services must explicitly opt in to be reachable by the fly.io proxy - Replace broad `tag:k8s` and `tag:homelab` grants with the new tag in the ACL rule and test - Add `tailscale.com/tags: "tag:k8s,tag:flyio-target"` annotation to docs, loki, and prometheus Ingresses - Switch Alloy push endpoints from `.ops.eblu.me` (Caddy) to `.tail8d86e.ts.net` (Tailscale Ingress) - Update docs: flyio-proxy, caddy, tailscale, forgejo (future public access + security checklist), expose-service-publicly ## Manual step (not in PR) Update the k8s operator OAuth client in the Tailscale admin console to include `tag:flyio-target` in its scope. Without this, the operator cannot assign the new tag to Ingress proxy nodes. ## Deployment order 1. Pulumi ACLs — `mise run tailnet-preview && mise run tailnet-up` 2. OAuth client — Manual update in Tailscale admin console 3. K8s Ingresses — `argocd app sync apps && argocd app sync docs loki prometheus` 4. Fly.io proxy — `mise run fly-deploy` 5. Verify — `mise run services-check`, check Grafana dashboards ## Test plan - [ ] `mise run tailnet-preview` shows clean diff - [ ] `argocd app diff docs`, `argocd app diff loki`, `argocd app diff prometheus` show only annotation additions - [ ] After deploy: Grafana dashboards show continued log/metric flow - [ ] `curl -sf https://docs.eblu.me` returns 200 - [ ] `mise run services-check` passes 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/126	2026-02-08 21:54:18 -08:00
Erich Blume	cc54b4f565	Add Fly.io proxy observability via embedded Alloy (#123 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m16s Details ## Summary - Embed Grafana Alloy in the Fly.io proxy container to collect nginx JSON access logs (→ Loki) and derive request rate, latency histogram, cache status, and bandwidth metrics (→ Prometheus) - Add nginx `stub_status` endpoint for connection-level metrics (active/reading/writing/waiting) - Create two Grafana dashboards: Docs APM (per-service view filtered by `host="docs.eblu.me"`) and Fly.io Proxy Health (aggregate proxy health across all upstream services) ## Changed Files \| File \| Change \| \|------\|--------\| \| `fly/nginx.conf` \| Add JSON `log_format` + `access_log`, add `stub_status` endpoint \| \| `fly/Dockerfile` \| COPY Alloy binary from `grafana/alloy:v1.5.1`, COPY `alloy.river` config \| \| `fly/alloy.river` \| New — Alloy config: log tailing, metric extraction, remote_write \| \| `fly/start.sh` \| Start Alloy after Tailscale, before nginx \| \| `argocd/manifests/grafana-config/dashboards/configmap-docs-apm.yaml` \| New — Docs APM dashboard \| \| `argocd/manifests/grafana-config/dashboards/configmap-flyio.yaml` \| New — Fly.io Proxy Health dashboard \| \| `argocd/manifests/grafana-config/kustomization.yaml` \| Register new dashboard configmaps \| \| `docs/reference/services/flyio-proxy.md` \| Document observability setup \| ## Deployment and Testing - [ ] `mise run fly-deploy` — rebuild container with Alloy - [ ] `curl https://docs.eblu.me/` — generate traffic - [ ] `fly logs -a blumeops-proxy` — verify Alloy startup - [ ] Query Prometheus: `flyio_nginx_http_requests_total{instance="flyio-proxy"}` - [ ] Query Loki: `{instance="flyio-proxy", job="flyio-nginx"}` - [ ] `argocd app sync grafana-config` — deploy dashboards - [ ] Verify dashboards show data in Grafana - [ ] `mise run services-check` — no regressions Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/123	2026-02-08 10:05:38 -08:00
Erich Blume	64a78422b1	Add Fly.io public reverse proxy for docs.eblu.me (#120 ) Some checks failed Deploy Fly.io Proxy / deploy (push) Failing after 9s Details ## Summary - Adds a Fly.io reverse proxy (`blumeops-proxy`) that tunnels public traffic to homelab services over Tailscale - First service exposed: `docs.eblu.me` — the Quartz static docs site - Includes Pulumi IaC for Tailscale auth key/ACLs and Gandi DNS CNAME - Adds mise tasks (`fly-deploy`, `fly-setup`, `fly-shutoff`) and Forgejo CI workflow ## Key details - Fly.io Firecracker VMs support TUN devices natively — no userspace networking needed - Tailscale auth key is `preauthorized=True` to avoid device approval hangs on container restarts - nginx caches aggressively for the static site; health check is on the default_server block - ACLs restrict `tag:flyio-proxy` to `tag:k8s` on port 443 only - DNS CNAME deployed and verified: `docs.eblu.me` → `blumeops-proxy.fly.dev` ## Test plan - [x] `curl -sf https://blumeops-proxy.fly.dev/healthz` returns `ok` - [x] `curl -I -H "Host: docs.eblu.me" https://blumeops-proxy.fly.dev/` returns 200 with `X-Cache-Status` - [x] `curl -I https://docs.eblu.me/` returns 200 with valid Let's Encrypt cert - [x] `dig forge.ops.eblu.me` still resolves to 100.98.163.89 (private services unaffected) - [x] Set `FLY_DEPLOY_TOKEN` Forgejo Actions secret for CI auto-deploy 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/120	2026-02-08 02:36:19 -08:00

8 commits