Commit graph

17 commits

Author SHA1 Message Date
903db4079d Fix upstream keepalive: set proxy_ssl_name for correct SNI
With upstream blocks, nginx sends the block name as SNI instead of
the actual hostname. The Tailscale Ingress proxy needs the correct
SNI to route TLS connections. Add explicit proxy_ssl_name for each
upstream, and set Host header for docs/cv backends.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 15:51:51 -07:00
6a1d9cc0bf Switch Fly proxy to upstream keepalive pools
Replace per-request DNS resolution (variable-based proxy_pass) with
static upstream blocks and keepalive connection pools. This reuses
TLS connections through the Tailscale tunnel instead of handshaking
per request, which should significantly reduce latency at >1 req/s.

Trade-off: DNS is resolved at config load, not per-request. If
Tailscale Ingress pods get new IPs, run `mise run fly-reload` to
re-resolve.

Also adds mise-tasks/fly-reload for nginx config reload without
full redeploy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 15:42:57 -07:00
54b1cee950 Fix Connection header: only send 'upgrade' for WebSocket requests
Some checks failed
Deploy Fly.io Proxy / deploy (push) Failing after 1m35s
Was sending Connection: upgrade on every proxied request, which is
semantically wrong for normal HTTP traffic. Use a map to conditionally
send 'upgrade' only when the client requests a WebSocket switch,
'close' otherwise.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 15:27:40 -07:00
1631e11137 Add /user/ to forge robots.txt exclusion
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m47s
Crawlers follow auth redirects to /user/login which is pointless for them.
Saves round-trips for both sides.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 14:34:24 -07:00
7a42aeb77c Mitigate Forgejo archive endpoint DoS from crawler abuse
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m35s
Crawlers hitting /archive/ endpoints with unique commit SHAs generated 54GB
of git bundles in 2 days, pegging Forgejo at 43% CPU. Fix at multiple layers:

- Redirect archive requests to tailnet at Fly proxy (302)
- Expand robots.txt: block /users/, /*/archive/, /*/releases/download/
- Cache release artifact downloads at nginx (immutable, 7d TTL)
- Enable [cron.archive_cleanup] with 2h TTL and run-at-start

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 14:21:22 -07:00
7f6bbdc82c Add robots.txt to forge.eblu.me blocking crawlers from /mirrors/
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 2m19s
Facebook has been scraping forge mirror repos at ~3-4 req/s, slowing
down the Forgejo instance. Serve robots.txt directly from nginx to
disallow /mirrors/ while leaving eblume/* accessible to crawlers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 15:39:48 -07:00
a75f28e073 Fix fly.io proxy rate limit to key on real client IP
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 2m24s
The general rate limit zone used $binary_remote_addr (Fly's internal
proxy IP), causing all external clients to share one bucket. Switch to
$http_fly_client_ip to match forge_auth's correct behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 19:00:33 -07:00
a87c997ee1 Expose Forgejo publicly at forge.eblu.me (#278)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m28s
## Summary

Expose Forgejo publicly at `forge.eblu.me` via the Fly.io reverse proxy — the first dynamic, authenticated public-facing service.

- **Forgejo hardening:** Domain changed to forge.eblu.me, SSH stays on forge.ops.eblu.me, reverse proxy trust headers configured, local registration locked to external-only (Authentik SSO)
- **Tailscale Ingress:** ExternalName Service + Ingress in tailscale-operator creates forge.tail8d86e.ts.net endpoint
- **Fly.io proxy:** nginx server block with rate-limited auth endpoints (3r/s), fail2ban with custom nginx-deny action, security headers, /swagger blocked, WebSocket support, 512m body limit
- **Authentik:** OAuth callback updated to forge.eblu.me
- **DNS/TLS:** CNAME record in Pulumi, cert in fly-setup
- **Rename:** ~29 files updated from forge.ops.eblu.me to forge.eblu.me (HTTPS refs only; SSH, container builds, and Caddy table kept as-is)

## Deployment Order

1. `mise run provision-indri -- --tags forgejo` (config changes)
2. Verify forge.ops.eblu.me still works
3. `argocd app set tailscale-operator --revision feature/forge-public && argocd app sync tailscale-operator`
4. Verify `curl https://forge.tail8d86e.ts.net`
5. `cd fly && fly deploy`
6. Verify pre-DNS: `curl -H "Host: forge.eblu.me" https://blumeops-proxy.fly.dev/`
7. `fly certs add forge.eblu.me -a blumeops-proxy`
8. `argocd app set authentik --revision feature/forge-public && argocd app sync authentik`
9. `mise run dns-preview && mise run dns-up`
10. Full verification (see below)
11. Rehearse `mise run fly-shutoff`
12. After merge: reset ArgoCD revisions to main, re-sync

## Verification Checklist

- [ ] forge.eblu.me loads, shows public repos
- [ ] forge.ops.eblu.me still works from tailnet
- [ ] SSH clone via forge.ops.eblu.me:2222 works
- [ ] HTTPS clone via forge.eblu.me works
- [ ] UI shows forge.eblu.me for HTTPS clone, forge.ops.eblu.me for SSH
- [ ] /swagger returns 403
- [ ] Rapid login attempts trigger 429 rate limit
- [ ] fail2ban bans after 5 failed logins in 10 minutes
- [ ] ArgoCD can still sync (SSH unaffected)
- [ ] `mise run fly-shutoff` stops all public traffic
- [ ] `mise run services-check` passes

Reviewed-on: #278
2026-03-03 08:40:41 -08:00
9717863f65 Update CV release to v1.0.3, add X-Clacks-Overhead header (#176)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m5s
## Summary
- Update CV release URL from v1.0.2 to v1.0.3
- Add `X-Clacks-Overhead: GNU Terry Pratchett` header to both `docs.eblu.me` and `cv.eblu.me` server blocks in the Fly.io proxy nginx config

## Deployment and Testing
- [ ] Sync CV app: `argocd app sync cv`
- [ ] Verify CV is serving v1.0.3 content
- [ ] Deploy fly proxy (workflow or `mise run fly-deploy`)
- [ ] Verify header: `curl -sI https://docs.eblu.me | grep -i clacks`
- [ ] Verify header: `curl -sI https://cv.eblu.me | grep -i clacks`

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/176
2026-02-12 17:08:22 -08:00
df372fccb6 Expose CV publicly at cv.eblu.me (#173)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m57s
## Summary
- Add nginx server block for `cv.eblu.me` (static site, same pattern as docs)
- Add DNS CNAME record in Pulumi (`cv.eblu.me` → `blumeops-proxy.fly.dev`)
- Add `cv.eblu.me` cert to `fly-setup` mise task
- Tag CV Tailscale ingress with `tag:flyio-target` for ACL access
- Remove `/_error` test endpoint from docs proxy

## Deployment and Testing
- [ ] `argocd app set cv --revision cv/public-cv-eblu-me && argocd app sync cv`
- [ ] `fly certs add cv.eblu.me -a blumeops-proxy`
- [ ] `mise run fly-deploy`
- [ ] Verify proxy: `curl -I -H "Host: cv.eblu.me" https://blumeops-proxy.fly.dev/`
- [ ] `mise run dns-preview` then `mise run dns-up`
- [ ] Verify live: `curl -I https://cv.eblu.me`
- [ ] Merge, then `argocd app set cv --revision main && argocd app sync cv`

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/173
2026-02-12 14:05:00 -08:00
4ee643a81d Serve friendly error page when Fly.io proxy upstreams are unreachable (#133)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m50s
## Summary
- Adds a branded 503 error page served when upstreams are unreachable (indri offline, Tailscale tunnel down, emergency shutoff, etc.)
- Stale cache is still served first when available (`proxy_cache_use_stale` takes priority)
- Test endpoint at `docs.eblu.me/_error` to preview the page without killing upstreams
- `proxy_intercept_errors on` also catches error responses returned by the upstream itself

## Files Changed
- `fly/error.html` — Self-contained error page (dark theme, links to BlumeOps repo)
- `fly/nginx.conf` — `error_page`, `internal` location, `/_error` test location, `proxy_intercept_errors`
- `fly/Dockerfile` — COPY error.html into image

## Test Plan
- [ ] Deploy to Fly.io
- [ ] Visit `docs.eblu.me/_error` to verify the page renders
- [ ] Optionally stop indri/Tailscale to confirm the page shows on real 502/503/504

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/133
2026-02-09 12:01:24 -08:00
959b6842bc Zero-downtime Fly.io deploys (#132)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m40s
## Summary
- Start nginx after Tailscale connects (community best practice for Tailscale sidecars)
- Switch to `bluegreen` deploy strategy — old machine serves until new one is healthy
- Replace top-level `[checks]` with `[[http_service.checks]]` — only service-level checks gate traffic routing ([confirmed by Fly.io staff](https://community.fly.io/t/clarifying-the-types-of-health-checks/20379))
- Remove sentinel file and nginx if-check (no longer needed)

Supersedes the approach in #131 — that helped (502 window dropped from ~30s to ~3s) but couldn't fully eliminate it because top-level checks don't gate routing and Fly.io's proxy sends traffic as soon as the port is reachable.

## Deployment and Testing
- [ ] Merge and `fly deploy` from `fly/` directory
- [ ] Verify deploy completes with zero 502s (watch `fly logs` and Grafana docs-apm)
- [ ] Confirm `fly checks list` shows the new service-level check passing

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/132
2026-02-09 11:34:19 -08:00
bd61da4f85 Fix 502 errors during Fly.io proxy deploys (#131)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m20s
## Summary
- Health check (`/healthz`) now returns 503 until Tailscale is connected
- `start.sh` creates `/tmp/tailscale-ready` sentinel after `tailscale up` succeeds
- Fly.io keeps the old machine serving traffic during the ~7s startup window

Previously, nginx passed the health check immediately, Fly.io routed traffic to the new machine, but MagicDNS wasn't available yet — causing upstream DNS timeouts and 502s on every request until Tailscale connected.

## Deployment and Testing
- [ ] Merge and `fly deploy` from `fly/` directory
- [ ] Verify deploy completes with zero 502s (check Grafana docs-apm dashboard)
- [ ] Confirm health check transitions from 503 → 200 in `fly logs`

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/131
2026-02-09 11:07:36 -08:00
3415cad38c Log real client IPs via Fly-Client-IP header (#130)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 59s
## Summary
- Add `client_ip` field to the Fly.io nginx JSON log format, sourced from `Fly-Client-IP` header
- Extract `client_ip` in the Alloy pipeline so it's available as a parsed field in Loki
- Keeps `remote_addr` (the internal proxy IP) for debugging

Fixes: Grafana access logs for docs.eblu.me showing 172.16.11.178 for every request instead of real visitor IPs.

## Deployment and Testing
- [ ] Deploy updated fly.io proxy: `fly deploy` from `fly/` directory
- [ ] Verify in Grafana that new log lines include `client_ip` with real IPs
- [ ] Confirm `remote_addr` still shows the proxy IP (preserved for debugging)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/130
2026-02-09 11:02:06 -08:00
c6f8fcd346 Fix fly-deploy WARNING by starting nginx before Tailscale (#128)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m3s
## Summary
- Start nginx before Tailscale in `start.sh` so port 8080 is bound immediately, eliminating the "app is not listening on the expected address" WARNING during `fly deploy`
- Switch `proxy_pass` to use a variable with `resolver 100.100.100.100 valid=30s` so nginx can start without resolving MagicDNS names at config load time
- DNS results cached 30s per worker — no per-request lookup overhead

## Context
The WARNING was a race condition: Fly checks for listeners right after the machine starts, but `start.sh` ran ~5-10s of Tailscale setup before starting nginx. The health check always passed later, but the warning was noisy.

## Test plan
- [ ] Merge and let the deploy-fly workflow trigger
- [ ] Check runner logs for absence of the WARNING
- [ ] Verify `docs.eblu.me` still serves correctly
- [ ] Verify `/healthz` still passes

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/128
2026-02-09 07:01:58 -08:00
cc54b4f565 Add Fly.io proxy observability via embedded Alloy (#123)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m16s
## Summary

- Embed Grafana Alloy in the Fly.io proxy container to collect nginx JSON access logs (→ Loki) and derive request rate, latency histogram, cache status, and bandwidth metrics (→ Prometheus)
- Add nginx `stub_status` endpoint for connection-level metrics (active/reading/writing/waiting)
- Create two Grafana dashboards: **Docs APM** (per-service view filtered by `host="docs.eblu.me"`) and **Fly.io Proxy Health** (aggregate proxy health across all upstream services)

## Changed Files

| File | Change |
|------|--------|
| `fly/nginx.conf` | Add JSON `log_format` + `access_log`, add `stub_status` endpoint |
| `fly/Dockerfile` | COPY Alloy binary from `grafana/alloy:v1.5.1`, COPY `alloy.river` config |
| `fly/alloy.river` | **New** — Alloy config: log tailing, metric extraction, remote_write |
| `fly/start.sh` | Start Alloy after Tailscale, before nginx |
| `argocd/manifests/grafana-config/dashboards/configmap-docs-apm.yaml` | **New** — Docs APM dashboard |
| `argocd/manifests/grafana-config/dashboards/configmap-flyio.yaml` | **New** — Fly.io Proxy Health dashboard |
| `argocd/manifests/grafana-config/kustomization.yaml` | Register new dashboard configmaps |
| `docs/reference/services/flyio-proxy.md` | Document observability setup |

## Deployment and Testing

- [ ] `mise run fly-deploy` — rebuild container with Alloy
- [ ] `curl https://docs.eblu.me/` — generate traffic
- [ ] `fly logs -a blumeops-proxy` — verify Alloy startup
- [ ] Query Prometheus: `flyio_nginx_http_requests_total{instance="flyio-proxy"}`
- [ ] Query Loki: `{instance="flyio-proxy", job="flyio-nginx"}`
- [ ] `argocd app sync grafana-config` — deploy dashboards
- [ ] Verify dashboards show data in Grafana
- [ ] `mise run services-check` — no regressions

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/123
2026-02-08 10:05:38 -08:00
64a78422b1 Add Fly.io public reverse proxy for docs.eblu.me (#120)
Some checks failed
Deploy Fly.io Proxy / deploy (push) Failing after 9s
## Summary

- Adds a Fly.io reverse proxy (`blumeops-proxy`) that tunnels public traffic to homelab services over Tailscale
- First service exposed: `docs.eblu.me` — the Quartz static docs site
- Includes Pulumi IaC for Tailscale auth key/ACLs and Gandi DNS CNAME
- Adds mise tasks (`fly-deploy`, `fly-setup`, `fly-shutoff`) and Forgejo CI workflow

## Key details

- Fly.io Firecracker VMs support TUN devices natively — no userspace networking needed
- Tailscale auth key is `preauthorized=True` to avoid device approval hangs on container restarts
- nginx caches aggressively for the static site; health check is on the default_server block
- ACLs restrict `tag:flyio-proxy` to `tag:k8s` on port 443 only
- DNS CNAME deployed and verified: `docs.eblu.me` → `blumeops-proxy.fly.dev`

## Test plan

- [x] `curl -sf https://blumeops-proxy.fly.dev/healthz` returns `ok`
- [x] `curl -I -H "Host: docs.eblu.me" https://blumeops-proxy.fly.dev/` returns 200 with `X-Cache-Status`
- [x] `curl -I https://docs.eblu.me/` returns 200 with valid Let's Encrypt cert
- [x] `dig forge.ops.eblu.me` still resolves to 100.98.163.89 (private services unaffected)
- [x] Set `FLY_DEPLOY_TOKEN` Forgejo Actions secret for CI auto-deploy

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/120
2026-02-08 02:36:19 -08:00