Commit graph

6 commits

Author SHA1 Message Date
a87c997ee1 Expose Forgejo publicly at forge.eblu.me (#278)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m28s
## Summary

Expose Forgejo publicly at `forge.eblu.me` via the Fly.io reverse proxy — the first dynamic, authenticated public-facing service.

- **Forgejo hardening:** Domain changed to forge.eblu.me, SSH stays on forge.ops.eblu.me, reverse proxy trust headers configured, local registration locked to external-only (Authentik SSO)
- **Tailscale Ingress:** ExternalName Service + Ingress in tailscale-operator creates forge.tail8d86e.ts.net endpoint
- **Fly.io proxy:** nginx server block with rate-limited auth endpoints (3r/s), fail2ban with custom nginx-deny action, security headers, /swagger blocked, WebSocket support, 512m body limit
- **Authentik:** OAuth callback updated to forge.eblu.me
- **DNS/TLS:** CNAME record in Pulumi, cert in fly-setup
- **Rename:** ~29 files updated from forge.ops.eblu.me to forge.eblu.me (HTTPS refs only; SSH, container builds, and Caddy table kept as-is)

## Deployment Order

1. `mise run provision-indri -- --tags forgejo` (config changes)
2. Verify forge.ops.eblu.me still works
3. `argocd app set tailscale-operator --revision feature/forge-public && argocd app sync tailscale-operator`
4. Verify `curl https://forge.tail8d86e.ts.net`
5. `cd fly && fly deploy`
6. Verify pre-DNS: `curl -H "Host: forge.eblu.me" https://blumeops-proxy.fly.dev/`
7. `fly certs add forge.eblu.me -a blumeops-proxy`
8. `argocd app set authentik --revision feature/forge-public && argocd app sync authentik`
9. `mise run dns-preview && mise run dns-up`
10. Full verification (see below)
11. Rehearse `mise run fly-shutoff`
12. After merge: reset ArgoCD revisions to main, re-sync

## Verification Checklist

- [ ] forge.eblu.me loads, shows public repos
- [ ] forge.ops.eblu.me still works from tailnet
- [ ] SSH clone via forge.ops.eblu.me:2222 works
- [ ] HTTPS clone via forge.eblu.me works
- [ ] UI shows forge.eblu.me for HTTPS clone, forge.ops.eblu.me for SSH
- [ ] /swagger returns 403
- [ ] Rapid login attempts trigger 429 rate limit
- [ ] fail2ban bans after 5 failed logins in 10 minutes
- [ ] ArgoCD can still sync (SSH unaffected)
- [ ] `mise run fly-shutoff` stops all public traffic
- [ ] `mise run services-check` passes

Reviewed-on: #278
2026-03-03 08:40:41 -08:00
959b6842bc Zero-downtime Fly.io deploys (#132)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m40s
## Summary
- Start nginx after Tailscale connects (community best practice for Tailscale sidecars)
- Switch to `bluegreen` deploy strategy — old machine serves until new one is healthy
- Replace top-level `[checks]` with `[[http_service.checks]]` — only service-level checks gate traffic routing ([confirmed by Fly.io staff](https://community.fly.io/t/clarifying-the-types-of-health-checks/20379))
- Remove sentinel file and nginx if-check (no longer needed)

Supersedes the approach in #131 — that helped (502 window dropped from ~30s to ~3s) but couldn't fully eliminate it because top-level checks don't gate routing and Fly.io's proxy sends traffic as soon as the port is reachable.

## Deployment and Testing
- [ ] Merge and `fly deploy` from `fly/` directory
- [ ] Verify deploy completes with zero 502s (watch `fly logs` and Grafana docs-apm)
- [ ] Confirm `fly checks list` shows the new service-level check passing

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/132
2026-02-09 11:34:19 -08:00
bd61da4f85 Fix 502 errors during Fly.io proxy deploys (#131)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m20s
## Summary
- Health check (`/healthz`) now returns 503 until Tailscale is connected
- `start.sh` creates `/tmp/tailscale-ready` sentinel after `tailscale up` succeeds
- Fly.io keeps the old machine serving traffic during the ~7s startup window

Previously, nginx passed the health check immediately, Fly.io routed traffic to the new machine, but MagicDNS wasn't available yet — causing upstream DNS timeouts and 502s on every request until Tailscale connected.

## Deployment and Testing
- [ ] Merge and `fly deploy` from `fly/` directory
- [ ] Verify deploy completes with zero 502s (check Grafana docs-apm dashboard)
- [ ] Confirm health check transitions from 503 → 200 in `fly logs`

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/131
2026-02-09 11:07:36 -08:00
c6f8fcd346 Fix fly-deploy WARNING by starting nginx before Tailscale (#128)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m3s
## Summary
- Start nginx before Tailscale in `start.sh` so port 8080 is bound immediately, eliminating the "app is not listening on the expected address" WARNING during `fly deploy`
- Switch `proxy_pass` to use a variable with `resolver 100.100.100.100 valid=30s` so nginx can start without resolving MagicDNS names at config load time
- DNS results cached 30s per worker — no per-request lookup overhead

## Context
The WARNING was a race condition: Fly checks for listeners right after the machine starts, but `start.sh` ran ~5-10s of Tailscale setup before starting nginx. The health check always passed later, but the warning was noisy.

## Test plan
- [ ] Merge and let the deploy-fly workflow trigger
- [ ] Check runner logs for absence of the WARNING
- [ ] Verify `docs.eblu.me` still serves correctly
- [ ] Verify `/healthz` still passes

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/128
2026-02-09 07:01:58 -08:00
cc54b4f565 Add Fly.io proxy observability via embedded Alloy (#123)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m16s
## Summary

- Embed Grafana Alloy in the Fly.io proxy container to collect nginx JSON access logs (→ Loki) and derive request rate, latency histogram, cache status, and bandwidth metrics (→ Prometheus)
- Add nginx `stub_status` endpoint for connection-level metrics (active/reading/writing/waiting)
- Create two Grafana dashboards: **Docs APM** (per-service view filtered by `host="docs.eblu.me"`) and **Fly.io Proxy Health** (aggregate proxy health across all upstream services)

## Changed Files

| File | Change |
|------|--------|
| `fly/nginx.conf` | Add JSON `log_format` + `access_log`, add `stub_status` endpoint |
| `fly/Dockerfile` | COPY Alloy binary from `grafana/alloy:v1.5.1`, COPY `alloy.river` config |
| `fly/alloy.river` | **New** — Alloy config: log tailing, metric extraction, remote_write |
| `fly/start.sh` | Start Alloy after Tailscale, before nginx |
| `argocd/manifests/grafana-config/dashboards/configmap-docs-apm.yaml` | **New** — Docs APM dashboard |
| `argocd/manifests/grafana-config/dashboards/configmap-flyio.yaml` | **New** — Fly.io Proxy Health dashboard |
| `argocd/manifests/grafana-config/kustomization.yaml` | Register new dashboard configmaps |
| `docs/reference/services/flyio-proxy.md` | Document observability setup |

## Deployment and Testing

- [ ] `mise run fly-deploy` — rebuild container with Alloy
- [ ] `curl https://docs.eblu.me/` — generate traffic
- [ ] `fly logs -a blumeops-proxy` — verify Alloy startup
- [ ] Query Prometheus: `flyio_nginx_http_requests_total{instance="flyio-proxy"}`
- [ ] Query Loki: `{instance="flyio-proxy", job="flyio-nginx"}`
- [ ] `argocd app sync grafana-config` — deploy dashboards
- [ ] Verify dashboards show data in Grafana
- [ ] `mise run services-check` — no regressions

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/123
2026-02-08 10:05:38 -08:00
64a78422b1 Add Fly.io public reverse proxy for docs.eblu.me (#120)
Some checks failed
Deploy Fly.io Proxy / deploy (push) Failing after 9s
## Summary

- Adds a Fly.io reverse proxy (`blumeops-proxy`) that tunnels public traffic to homelab services over Tailscale
- First service exposed: `docs.eblu.me` — the Quartz static docs site
- Includes Pulumi IaC for Tailscale auth key/ACLs and Gandi DNS CNAME
- Adds mise tasks (`fly-deploy`, `fly-setup`, `fly-shutoff`) and Forgejo CI workflow

## Key details

- Fly.io Firecracker VMs support TUN devices natively — no userspace networking needed
- Tailscale auth key is `preauthorized=True` to avoid device approval hangs on container restarts
- nginx caches aggressively for the static site; health check is on the default_server block
- ACLs restrict `tag:flyio-proxy` to `tag:k8s` on port 443 only
- DNS CNAME deployed and verified: `docs.eblu.me` → `blumeops-proxy.fly.dev`

## Test plan

- [x] `curl -sf https://blumeops-proxy.fly.dev/healthz` returns `ok`
- [x] `curl -I -H "Host: docs.eblu.me" https://blumeops-proxy.fly.dev/` returns 200 with `X-Cache-Status`
- [x] `curl -I https://docs.eblu.me/` returns 200 with valid Let's Encrypt cert
- [x] `dig forge.ops.eblu.me` still resolves to 100.98.163.89 (private services unaffected)
- [x] Set `FLY_DEPLOY_TOKEN` Forgejo Actions secret for CI auto-deploy

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/120
2026-02-08 02:36:19 -08:00