With upstream blocks, nginx sends the block name as SNI instead of
the actual hostname. The Tailscale Ingress proxy needs the correct
SNI to route TLS connections. Add explicit proxy_ssl_name for each
upstream, and set Host header for docs/cv backends.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Upstream blocks resolve DNS at config load. If MagicDNS isn't ready yet
(Tailscale just connected), nginx gets empty resolution and returns 502.
Poll nslookup until resolution works before launching nginx.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace per-request DNS resolution (variable-based proxy_pass) with
static upstream blocks and keepalive connection pools. This reuses
TLS connections through the Tailscale tunnel instead of handshaking
per request, which should significantly reduce latency at >1 req/s.
Trade-off: DNS is resolved at config load, not per-request. If
Tailscale Ingress pods get new IPs, run `mise run fly-reload` to
re-resolve.
Also adds mise-tasks/fly-reload for nginx config reload without
full redeploy.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Was sending Connection: upgrade on every proxied request, which is
semantically wrong for normal HTTP traffic. Use a map to conditionally
send 'upgrade' only when the client requests a WebSocket switch,
'close' otherwise.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previous max bucket was 10s — all slower requests collapsed into +Inf,
making p50/p90/p99 unreadable during the Forgejo archive DoS.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Crawlers follow auth redirects to /user/login which is pointless for them.
Saves round-trips for both sides.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Facebook has been scraping forge mirror repos at ~3-4 req/s, slowing
down the Forgejo instance. Serve robots.txt directly from nginx to
disallow /mirrors/ while leaving eblume/* accessible to crawlers.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tailscale :stable pulled v1.96.5 during last deploy, which returns
SERVFAIL for tailnet DNS names (no upstream resolvers set). This broke
all public routing (forge/docs/cv.eblu.me) through the Fly proxy.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The general rate limit zone used $binary_remote_addr (Fly's internal
proxy IP), causing all external clients to share one bucket. Switch to
$http_fly_client_ip to match forge_auth's correct behavior.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Fly container pulls from tailscale/tailscale:stable which is still
v1.94.2. The `tailscale wait` command doesn't exist until v1.96.2.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary
- Bump Tailscale operator, proxy containers, and init containers from v1.94.2 to v1.96.3 across both clusters (indri + ringtail via shared base kustomization)
- Replace hand-rolled `until tailscale status` polling loop in `fly/start.sh` with `tailscale wait --timeout 60s` (new in v1.96.2)
- Stamp kube-state-metrics review date (already current at v2.18.0)
## Notable upstream changes (v1.94.2 → v1.96.3)
- Go upgraded from 1.25 to 1.26
- `tailscale wait` command — blocks until daemon is running + interface has IP
- AuthKey policy now applies only when users are not logged in (behavioral change)
- Peer Relay improvements (metrics, EC2 IMDS, UDP socket scaling)
- UPnP stability fixes
## Deploy plan
1. Merge PR
2. Sync tailscale-operator on indri: `argocd app sync tailscale-operator`
3. Sync tailscale-operator on ringtail: `argocd app sync tailscale-operator-ringtail --server ringtail...`
4. Verify proxy pods roll with new image: `kubectl --context=minikube-indri -n tailscale get pods`
5. Verify ingress connectivity (spot-check a few `*.tail8d86e.ts.net` services)
6. Rebuild + deploy Fly proxy container (separate step, picks up `tailscale wait` change)
## Test plan
- [ ] ArgoCD diff looks clean for both apps before sync
- [ ] Proxy pods on indri come up healthy with v1.96.3 images
- [ ] Proxy pods on ringtail come up healthy with v1.96.3 images
- [ ] Tailscale ingress services remain reachable (e.g., grafana, prometheus)
- [ ] Fly proxy rebuild deploys successfully with `tailscale wait`
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Reviewed-on: #304
## Summary
- Remove `match_all = true` from `flyio_nginx_cache_requests_total` in Alloy so the metric only counts requests that go through the proxy cache (excludes health checks with empty `cache_status`)
- Change dashboard queries from `rate(...[5m])` to `increase(...[$__range])` — aggregates over the full dashboard time window instead of a 5-minute sliding window, giving meaningful ratios for low-traffic static sites
- Add null/NaN value mapping to show "No traffic" in neutral color instead of blank/red
## Root cause
Health check requests from Fly.io hit the default nginx server block (no `proxy_cache`), producing entries with empty `upstream_cache_status`. With `match_all = true`, these were counted in the cache metric, diluting the Fly.io dashboard ratio. For APM dashboards, `rate()[5m]` on low-traffic sites with 24h cache validity almost always returns either all-HITs (100%) or no data (blank → red background).
## Deployment
- Fly.io proxy redeploy needed for Alloy config change
- ArgoCD sync for dashboard ConfigMap changes
## Test plan
- [ ] Redeploy Fly.io proxy
- [ ] Sync grafana-config in ArgoCD
- [ ] Verify CV APM cache hit ratio shows a real percentage (not 100%)
- [ ] Verify Docs APM shows "No traffic" in neutral color when idle, real ratio when visited
- [ ] Verify Fly.io proxy dashboard cache ratio excludes health checks
Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/177
## Summary
- Bump Fly.io proxy VM memory from 256MB to 512MB — Alloy was OOM-killed, causing the Grafana Fly.io dashboard to lose metrics
- Fix TruffleHog pre-commit hook to scan only staged changes (`--since-commit HEAD`) instead of full repo history
- Sanitize example credential URL in Reolink camera plan doc
## Deployment and Testing
- [ ] Fly.io deploy triggers automatically on merge (workflow watches `fly/**`)
- [ ] After deploy, verify Alloy is running: `fly ssh console -a blumeops-proxy -C "ps aux"` should show alloy process
- [ ] Grafana Fly.io dashboard should start populating within ~1 minute
Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/152
## Summary
- Adds a branded 503 error page served when upstreams are unreachable (indri offline, Tailscale tunnel down, emergency shutoff, etc.)
- Stale cache is still served first when available (`proxy_cache_use_stale` takes priority)
- Test endpoint at `docs.eblu.me/_error` to preview the page without killing upstreams
- `proxy_intercept_errors on` also catches error responses returned by the upstream itself
## Files Changed
- `fly/error.html` — Self-contained error page (dark theme, links to BlumeOps repo)
- `fly/nginx.conf` — `error_page`, `internal` location, `/_error` test location, `proxy_intercept_errors`
- `fly/Dockerfile` — COPY error.html into image
## Test Plan
- [ ] Deploy to Fly.io
- [ ] Visit `docs.eblu.me/_error` to verify the page renders
- [ ] Optionally stop indri/Tailscale to confirm the page shows on real 502/503/504
Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/133
## Summary
- Start nginx after Tailscale connects (community best practice for Tailscale sidecars)
- Switch to `bluegreen` deploy strategy — old machine serves until new one is healthy
- Replace top-level `[checks]` with `[[http_service.checks]]` — only service-level checks gate traffic routing ([confirmed by Fly.io staff](https://community.fly.io/t/clarifying-the-types-of-health-checks/20379))
- Remove sentinel file and nginx if-check (no longer needed)
Supersedes the approach in #131 — that helped (502 window dropped from ~30s to ~3s) but couldn't fully eliminate it because top-level checks don't gate routing and Fly.io's proxy sends traffic as soon as the port is reachable.
## Deployment and Testing
- [ ] Merge and `fly deploy` from `fly/` directory
- [ ] Verify deploy completes with zero 502s (watch `fly logs` and Grafana docs-apm)
- [ ] Confirm `fly checks list` shows the new service-level check passing
Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/132
## Summary
- Health check (`/healthz`) now returns 503 until Tailscale is connected
- `start.sh` creates `/tmp/tailscale-ready` sentinel after `tailscale up` succeeds
- Fly.io keeps the old machine serving traffic during the ~7s startup window
Previously, nginx passed the health check immediately, Fly.io routed traffic to the new machine, but MagicDNS wasn't available yet — causing upstream DNS timeouts and 502s on every request until Tailscale connected.
## Deployment and Testing
- [ ] Merge and `fly deploy` from `fly/` directory
- [ ] Verify deploy completes with zero 502s (check Grafana docs-apm dashboard)
- [ ] Confirm health check transitions from 503 → 200 in `fly logs`
Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/131
## Summary
- Add `client_ip` field to the Fly.io nginx JSON log format, sourced from `Fly-Client-IP` header
- Extract `client_ip` in the Alloy pipeline so it's available as a parsed field in Loki
- Keeps `remote_addr` (the internal proxy IP) for debugging
Fixes: Grafana access logs for docs.eblu.me showing 172.16.11.178 for every request instead of real visitor IPs.
## Deployment and Testing
- [ ] Deploy updated fly.io proxy: `fly deploy` from `fly/` directory
- [ ] Verify in Grafana that new log lines include `client_ip` with real IPs
- [ ] Confirm `remote_addr` still shows the proxy IP (preserved for debugging)
Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/130
## Summary
- Start nginx before Tailscale in `start.sh` so port 8080 is bound immediately, eliminating the "app is not listening on the expected address" WARNING during `fly deploy`
- Switch `proxy_pass` to use a variable with `resolver 100.100.100.100 valid=30s` so nginx can start without resolving MagicDNS names at config load time
- DNS results cached 30s per worker — no per-request lookup overhead
## Context
The WARNING was a race condition: Fly checks for listeners right after the machine starts, but `start.sh` ran ~5-10s of Tailscale setup before starting nginx. The health check always passed later, but the warning was noisy.
## Test plan
- [ ] Merge and let the deploy-fly workflow trigger
- [ ] Check runner logs for absence of the WARNING
- [ ] Verify `docs.eblu.me` still serves correctly
- [ ] Verify `/healthz` still passes
Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/128
## Summary
- Introduce `tag:flyio-target` so services must explicitly opt in to be reachable by the fly.io proxy
- Replace broad `tag:k8s` and `tag:homelab` grants with the new tag in the ACL rule and test
- Add `tailscale.com/tags: "tag:k8s,tag:flyio-target"` annotation to docs, loki, and prometheus Ingresses
- Switch Alloy push endpoints from `*.ops.eblu.me` (Caddy) to `*.tail8d86e.ts.net` (Tailscale Ingress)
- Update docs: flyio-proxy, caddy, tailscale, forgejo (future public access + security checklist), expose-service-publicly
## Manual step (not in PR)
Update the k8s operator OAuth client in the Tailscale admin console to include `tag:flyio-target` in its scope. Without this, the operator cannot assign the new tag to Ingress proxy nodes.
## Deployment order
1. **Pulumi ACLs** — `mise run tailnet-preview && mise run tailnet-up`
2. **OAuth client** — Manual update in Tailscale admin console
3. **K8s Ingresses** — `argocd app sync apps && argocd app sync docs loki prometheus`
4. **Fly.io proxy** — `mise run fly-deploy`
5. **Verify** — `mise run services-check`, check Grafana dashboards
## Test plan
- [ ] `mise run tailnet-preview` shows clean diff
- [ ] `argocd app diff docs`, `argocd app diff loki`, `argocd app diff prometheus` show only annotation additions
- [ ] After deploy: Grafana dashboards show continued log/metric flow
- [ ] `curl -sf https://docs.eblu.me` returns 200
- [ ] `mise run services-check` passes
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/126
## Summary
- Adds a Fly.io reverse proxy (`blumeops-proxy`) that tunnels public traffic to homelab services over Tailscale
- First service exposed: `docs.eblu.me` — the Quartz static docs site
- Includes Pulumi IaC for Tailscale auth key/ACLs and Gandi DNS CNAME
- Adds mise tasks (`fly-deploy`, `fly-setup`, `fly-shutoff`) and Forgejo CI workflow
## Key details
- Fly.io Firecracker VMs support TUN devices natively — no userspace networking needed
- Tailscale auth key is `preauthorized=True` to avoid device approval hangs on container restarts
- nginx caches aggressively for the static site; health check is on the default_server block
- ACLs restrict `tag:flyio-proxy` to `tag:k8s` on port 443 only
- DNS CNAME deployed and verified: `docs.eblu.me` → `blumeops-proxy.fly.dev`
## Test plan
- [x] `curl -sf https://blumeops-proxy.fly.dev/healthz` returns `ok`
- [x] `curl -I -H "Host: docs.eblu.me" https://blumeops-proxy.fly.dev/` returns 200 with `X-Cache-Status`
- [x] `curl -I https://docs.eblu.me/` returns 200 with valid Let's Encrypt cert
- [x] `dig forge.ops.eblu.me` still resolves to 100.98.163.89 (private services unaffected)
- [x] Set `FLY_DEPLOY_TOKEN` Forgejo Actions secret for CI auto-deploy
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/120